

An overview of the psych package

William Revelle
Department of Psychology
Northwestern University

January 7, 2017

Contents

0.1 Jump starting the psych package–a guide for the impatient

1 Overview of this and related documents

2 Getting started

3 Basic data analysis
  3.1 Data input from the clipboard
  3.2 Basic descriptive statistics
    3.2.1 Outlier detection using outlier
    3.2.2 Basic data cleaning using scrub
    3.2.3 Recoding categorical variables into dummy coded variables
  3.3 Simple descriptive graphics
    3.3.1 Scatter Plot Matrices
    3.3.2 Density or violin plots
    3.3.3 Means and error bars
    3.3.4 Error bars for tabular data
    3.3.5 Two dimensional displays of means and errors
    3.3.6 Back to back histograms
    3.3.7 Correlational structure
    3.3.8 Heatmap displays of correlational structure
  3.4 Testing correlations
  3.5 Polychoric, tetrachoric, polyserial, and biserial correlations
  3.6 Multiple regression from data or correlation matrices
  3.7 Mediation and Moderation analysis

4 Item and scale analysis
  4.1 Dimension reduction through factor analysis and cluster analysis
    4.1.1 Minimum Residual Factor Analysis
    4.1.2 Principal Axis Factor Analysis
    4.1.3 Weighted Least Squares Factor Analysis
    4.1.4 Principal Components analysis (PCA)
    4.1.5 Hierarchical and bi-factor solutions
    4.1.6 Item Cluster Analysis: iclust
  4.2 Confidence intervals using bootstrapping techniques
  4.3 Comparing factor/component/cluster solutions
  4.4 Determining the number of dimensions to extract
    4.4.1 Very Simple Structure
    4.4.2 Parallel Analysis
  4.5 Factor extension
  4.6 Exploratory Structural Equation Modeling (ESEM)

5 Classical Test Theory and Reliability
  5.1 Reliability of a single scale
  5.2 Using omega to find the reliability of a single scale
  5.3 Estimating ωh using Confirmatory Factor Analysis
    5.3.1 Other estimates of reliability
  5.4 Reliability and correlations of multiple scales within an inventory
    5.4.1 Scoring from raw data
    5.4.2 Forming scales from a correlation matrix
  5.5 Scoring Multiple Choice Items
  5.6 Item analysis
    5.6.1 Exploring the item structure of scales
    5.6.2 Empirical scale construction

6 Item Response Theory analysis
  6.1 Factor analysis and Item Response Theory
  6.2 Speeding up analyses
  6.3 IRT based scoring
    6.3.1 1 versus 2 parameter IRT scoring

7 Multilevel modeling
  7.1 Decomposing data into within and between level correlations using statsBy
  7.2 Generating and displaying multilevel data
  7.3 Factor analysis by groups

8 Set Correlation and Multiple Regression from the correlation matrix

9 Simulation functions

10 Graphical Displays

11 Converting output to APA style tables using LaTeX

12 Miscellaneous functions

13 Data sets

14 Development version and a users guide

15 Psychometric Theory

16 SessionInfo

0.1 Jump starting the psych package–a guide for the impatient

You have installed psych (section 2) and you want to use it without reading much more. What should you do?

1. Activate the psych package:

library(psych)

2. Input your data (section 3.1). Go to your friendly text editor or data manipulation program (e.g., Excel) and copy the data to the clipboard. Include a first line that has the variable labels. Paste it into psych using the read.clipboard.tab command:

myData <- read.clipboard.tab()

3. Make sure that what you just read is right. Describe it (section 3.2) and perhaps look at the first and last few lines:

describe(myData)

headTail(myData)

4. Look at the patterns in the data. If you have fewer than about 10 variables, look at the SPLOM (Scatter Plot Matrix) of the data using pairs.panels (section 3.3.1). Even better, use outlier to detect outliers.

pairs.panels(myData)

outlier(myData)

5. Note that you have some weird subjects, probably due to data entry errors. Either edit the data by hand (use the edit command) or just scrub the data (section 3.2.2).

cleaned <- scrub(myData, max=9)   #e.g., change anything greater than 9 to NA

6. Graph the data with error bars for each variable (section 3.3.3).

error.bars(myData)

7. Find the correlations of all of your data:

• Descriptively (just the values) (section 3.3.7)

r <- lowerCor(myData)

• Graphically (section 3.3.8)

corPlot(r)

• Inferentially (the values, the ns, and the p values) (section 3.4)

corr.test(myData)

8. Test for the number of factors in your data using parallel analysis (fa.parallel, section 4.4.2) or Very Simple Structure (vss, 4.4.1).

fa.parallel(myData)

vss(myData)


9. Factor analyze the data (see section 4.1) with a specified number of factors (the default is 1). The default method is minimum residual; the default rotation for more than one factor is oblimin. There are many more possibilities (see sections 4.1.1-4.1.3). Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm (Revelle, 1979) (see section 4.1.6). Also consider a hierarchical factor solution to find coefficient ω (see 4.1.5).

fa(myData)

iclust(myData)

omega(myData)

If you prefer to do a principal components analysis, you may use the principal function. The default is one component.

principal(myData)

10. Some people like to find coefficient α as an estimate of reliability. This may be done for a single scale using the alpha function (see 5.1). Perhaps more useful is the ability to create several scales as unweighted averages of specified items using the scoreItems function (see 5.4) and to find various estimates of internal consistency for these scales, find their intercorrelations, and find scores for all the subjects.

alpha(myData)   #score all of the items as part of one scale

myKeys <- make.keys(nvar=20, list(first = c(1,-3,5,-7,8:10), second = c(2,4,-6,11:15,-16)))

my.scores <- scoreItems(myKeys, myData)   #form several scales

my.scores   #show the highlights of the results

At this point you have had a chance to see the highlights of the psych package and to do some basic (and advanced) data analysis. You might find reading this entire vignette helpful to get a broader understanding of what can be done in R using the psych package. Remember that the help command (?) is available for every function. Try running the examples for each help page.


1 Overview of this and related documents

The psych package (Revelle, 2015) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, prep), a draft of which is available at http://personality-project.org/r/book/.

Some of the functions (e.g., read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bibars) are useful for basic data entry and descriptive analyses.

Psychometric applications emphasize techniques for dimension reduction including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares and maximum likelihood factor analysis). Principal Components Analysis (PCA) is also available through the use of the principal function. Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP), or parallel analysis (fa.parallel) criteria. These and several other criteria are included in the nfactors function. Two parameter Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa.

Bifactor and hierarchical factor structures may be estimated by using Schmid-Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937).

Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust), and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (corPlot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, structure.diagram and het.diagram, as well as item response characteristics and item and test information characteristic curves using plot.irt and plot.poly.

This vignette is meant to give an overview of the psych package. That is, it is meant to give a summary of the main functions in the psych package with examples of how they are used for data description, dimension reduction, and scale construction. The extended user manual at psych_manual.pdf includes examples of graphic output and more extensive demonstrations than are found in the help menus. (Also available at http://personality-project.org/r/psych_manual.pdf.) The vignette, psych for sem, at psych_for_sem.pdf, discusses how to use psych as a front end to the sem package of John Fox (Fox et al., 2012). (The vignette is also available at http://personality-project.org/r/book/psych_for_sem.pdf.)

For a step by step tutorial in the use of the psych package and the base functions in R for basic personality research, see the guide for using R for personality research at http://personalitytheory.org/r/r.short.html. For an introduction to psychometric theory with applications in R, see the draft chapters at http://personality-project.org/r/book/.

2 Getting started

Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa, factor.minres, factor.pa, factor.wls, or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.

install.packages("ctv")
library(ctv)
task.views("Psychometrics")

The "Psychometrics" task view will install a large number of useful packages. To install the bare minimum for the examples in this vignette, it is necessary to install just a few packages:

install.packages(c("GPArotation", "mnormt"))

Because of the difficulty of installing the package Rgraphviz, alternative graphics have been developed and are available as diagram functions. If Rgraphviz is available, some functions will take advantage of it. An alternative is to use "dot" output of commands for any external graphics package that uses the dot language.


3 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions it is necessary to make the package active by using the library command:

> library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

.First <- function(x) {library(psych)}

3.1 Data input from the clipboard

There are, of course, many ways to enter data into R. Reading from a local file using read.table is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command

> my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.

> my.data <- read.clipboard(sep="\t")    #define the tab option, or
> my.tab.data <- read.clipboard.tab()    #just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column, the last 4 are 3 columns):

> my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))

3.2 Basic descriptive statistics

Once the data are read in, then describe or describeBy will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

describeBy reports descriptive statistics broken down by some categorizing variable (e.g., gender, age, etc.).

> library(psych)
> data(sat.act)
> describe(sat.act)   #basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59
          kurtosis   se
gender       -1.62 0.02
education    -0.07 0.05
age           2.42 0.36
ACT           0.53 0.18
SATV          0.33 4.27
SATQ         -0.02 4.41

These data may then be analyzed by groups defined in a logical statement or by some other variable. E.g., break down the descriptive data for males or females. These descriptive data can also be seen graphically using the error.bars.by function (Figure 5). By setting skew=FALSE and ranges=FALSE, the output is limited to the most basic statistics.
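For example, a minimal sketch of grouping on a logical condition (the age cut point of 25 here is an arbitrary illustration):

> describeBy(sat.act, sat.act$age > 25)   #compare participants older than 25 with the rest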


> #basic descriptive statistics by a grouping variable
> describeBy(sat.act, sat.act$gender, skew=FALSE, ranges=FALSE)

$`1`
          vars   n   mean     sd   se
gender       1 247   1.00   0.00 0.00
education    2 247   3.00   1.54 0.10
age          3 247  25.86   9.74 0.62
ACT          4 247  28.79   5.06 0.32
SATV         5 247 615.11 114.16 7.26
SATQ         6 245 635.87 116.02 7.41

$`2`
          vars   n   mean     sd   se
gender       1 453   2.00   0.00 0.00
education    2 453   3.26   1.35 0.06
age          3 453  25.45   9.37 0.44
ACT          4 453  28.42   4.69 0.22
SATV         5 453 610.66 112.31 5.28
SATQ         6 442 596.00 113.07 5.38

attr(,"call")
by.data.frame(data = x, INDICES = group, FUN = describe, type = type,
    skew = FALSE, ranges = FALSE)

The output from the describeBy function can be forced into a matrix form for easy analysis by other programs. In addition, describeBy can group by several grouping variables at the same time.

> sa.mat <- describeBy(sat.act, list(sat.act$gender, sat.act$education),
+    skew=FALSE, ranges=FALSE, mat=TRUE)
> headTail(sa.mat)

        item group1 group2 vars   n   mean     sd    se
gender1    1      1      0    1  27      1      0     0
gender2    2      2      0    1  30      2      0     0
gender3    3      1      1    1  20      1      0     0
gender4    4      2      1    1  25      2      0     0
...     <NA>   <NA>   <NA>  ... ...    ...    ...   ...
SATQ9     69      1      4    6  51  635.9 104.12 14.58
SATQ10    70      2      4    6  86 597.59 106.24 11.46
SATQ11    71      1      5    6  46 657.83  89.61 13.21
SATQ12    72      2      5    6  93 606.72 105.55 10.95

3.2.1 Outlier detection using outlier

One way to detect unusual data is to consider how far each data point is from the multivariate centroid of the data. That is, find the squared Mahalanobis distance for each data point and then compare these to the expected values of χ2. This produces a Q-Q (quantile-quantile) plot with the n most extreme data points labeled (Figure 1). The outlier values are in the vector d2.


> png('outlier.png')
> d2 <- outlier(sat.act, cex=.8)
> dev.off()

null device

1

Figure 1: Using the outlier function to graphically show outliers. The y axis is the Mahalanobis D2, the X axis is the distribution of χ2 for the same number of degrees of freedom. The outliers detected here may be shown graphically using pairs.panels (see Figure 2) and may be found by sorting d2.


3.2.2 Basic data cleaning using scrub

If, after describing the data, it is apparent that there were data entry errors that need to be globally replaced with NA, or only certain ranges of data will be analyzed, the data can be "cleaned" using the scrub function.

Consider a data set of 12 rows of 10 columns with values from 1-120. All values of columns 3-5 that are less than 30, 40, or 50 respectively, or greater than 70 in any of the three columns, will be replaced with NA. In addition, any value exactly equal to 45 will be set to NA (max and isvalue are set to one value here, but they could be a different value for every column).

> x <- matrix(1:120, ncol=10, byrow=TRUE)
> colnames(x) <- paste("V", 1:10, sep="")
> new.x <- scrub(x, 3:5, min=c(30,40,50), max=70, isvalue=45, newvalue=NA)
> new.x

       V1  V2 V3 V4 V5  V6  V7  V8  V9 V10
 [1,]   1   2 NA NA NA   6   7   8   9  10
 [2,]  11  12 NA NA NA  16  17  18  19  20
 [3,]  21  22 NA NA NA  26  27  28  29  30
 [4,]  31  32 33 NA NA  36  37  38  39  40
 [5,]  41  42 43 44 NA  46  47  48  49  50
 [6,]  51  52 53 54 55  56  57  58  59  60
 [7,]  61  62 63 64 65  66  67  68  69  70
 [8,]  71  72 NA NA NA  76  77  78  79  80
 [9,]  81  82 NA NA NA  86  87  88  89  90
[10,]  91  92 NA NA NA  96  97  98  99 100
[11,] 101 102 NA NA NA 106 107 108 109 110
[12,] 111 112 NA NA NA 116 117 118 119 120

Note that the number of subjects for those columns has decreased, and the minimums have gone up but the maximums down. Data cleaning and examination for outliers should be a routine part of any data analysis.

3.2.3 Recoding categorical variables into dummy coded variables

Sometimes categorical variables (e.g., college major, occupation, ethnicity) are to be analyzed using correlation or regression. To do this, one can form "dummy codes", which are merely binary variables for each category. This may be done using dummy.code. Subsequent analyses using these dummy coded variables may use biserial or point biserial (regular Pearson r) correlations to show effect sizes and may be plotted in e.g., spider plots.
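A minimal sketch, using the sat.act data set (described in section 3.2), where each level of education becomes its own 0/1 variable:

> new <- dummy.code(sat.act$education)   #one binary column per education level
> new.sat <- data.frame(sat.act, new)    #append the dummy codes for later correlation or regression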

3.3 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries. By default, error.bars.by and error.bars will show "cats eyes" to graphically show the confidence limits (Figure 5). This may be turned off by specifying eyes=FALSE. densityBy or violinBy may be used to show the distribution of the data in "violin" plots (Figure 4). (These are sometimes called "lava-lamp" plots.)

3.3.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 2). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.'. (See Figure 2 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic. However, in this figure, to show the outliers, we use colors and a larger plot character.

Another example of pairs.panels is to show differences between experimental groups. Consider the data in the affect data set. The scores reflect post test scores on positive and negative affect and energetic and tense arousal. The colors show the results for four movie conditions: depressing, frightening movie, neutral, and a comedy.

3.3.2 Density or violin plots

Graphical presentation of data may be shown using box plots to show the median and 25th and 75th percentiles. A powerful alternative is to show the density distribution using the violinBy function (Figure 4).


> png('pairspanels.png')
> sat.d2 <- data.frame(sat.act, d2)   #combine the d2 statistics from before with the sat.act data.frame
> pairs.panels(sat.d2, bg=c("yellow","blue")[(d2 > 25)+1], pch=21)
> dev.off()

null device
          1

Figure 2: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. If the plot character were set to a period (pch='.') it would make a cleaner graphic, but in order to show the outliers in color we use the plot characters 21 and 22.


> png('affect.png')
> pairs.panels(affect[14:17], bg=c("red","black","white","blue")[affect$Film], pch=21,
+    main="Affect varies by movies")
> dev.off()

null device
          1

Figure 3: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. The coloring represents four different movie conditions.


> data(sat.act)
> violinBy(sat.act[5:6], sat.act$gender, grp.name=c("M", "F"), main="Density Plot by gender for SAT V and Q")

[Figure: "Density Plot by gender for SAT V and Q": violin plots of SATV and SATQ for males (M) and females (F); y axis: Observed score (200-800)]

Figure 4: Using the violinBy function to show the distribution of SAT V and Q for males and females. The plot shows the medians, the 25th and 75th percentiles, as well as the entire range and the density distribution.


3.3.3 Means and error bars

Additional descriptive graphics include the ability to draw error bars on sets of data, as well as to draw error bars in both the x and y directions for paired data. These are the functions error.bars, error.bars.by, error.bars.tab, and error.crosses.

error.bars shows the 95% confidence intervals for each variable in a data frame or matrix. These errors are based upon normal theory and the standard errors of the mean. Alternative options include +/- one standard deviation or 1 standard error. If the data are repeated measures, the error bars will reflect the between variable correlations. By default, the confidence intervals are displayed using a "cats eyes" plot which emphasizes the distribution of confidence within the confidence interval.

error.bars.by does the same, but grouping the data by some condition.

error.bars.tab draws bar graphs from tabular data with error bars based upon the standard error of proportion (σ_p = √(pq/N)).

error.crosses draws the confidence intervals for an x set and a y set of the same size.

The use of the error.bars.by function allows for graphic comparisons of different groups (see Figure 5). Five personality measures are shown as a function of high versus low scores on a "lie" scale. People with higher lie scores tend to report being more agreeable, conscientious, and less neurotic than people with lower lie scores. The error bars are based upon normal theory and thus are symmetric rather than reflecting any skewing in the data.

Although not recommended, it is possible to use the error.bars function to draw bar graphs with associated error bars. (This kind of "dynamite plot" (Figure 7) can be very misleading in that the scale is arbitrary. Go to a discussion of the problems in presenting data this way at http://emdbolker.wikidot.com/blog:dynamite.) In the example shown, note that the graph starts at 0, although 0 is out of the range. This is a function of using bars, which always are assumed to start at zero. Consider other ways of showing your data.

3.3.4 Error bars for tabular data

However, it is sometimes useful to show error bars for tabular data, either found by the table function or just directly input. These may be found using the error.bars.tab function.


> data(epi.bfi)
> error.bars.by(epi.bfi[6:10], epi.bfi$epilie < 4)

[Figure: ".95 confidence limits": error bars with cats eyes for bfagree, bfcon, bfext, bfneur, and bfopen (Dependent Variable) by lie scale group (Independent Variable)]

Figure 5: Using the error.bars.by function shows that self reported personality scales on the Big Five Inventory vary as a function of the Lie scale on the EPI. The "cats eyes" show the distribution of the confidence.


> error.bars.by(sat.act[5:6], sat.act$gender, bars=TRUE,
+    labels=c("Male","Female"), ylab="SAT score", xlab="")

[Figure: ".95 confidence limits": bar graph of SAT score (200-800) with error bars for Male and Female groups]

Figure 6: A "Dynamite plot" of SAT scores as a function of gender is one way of misleading the reader. By using a bar graph, the range of scores is ignored. Bar graphs start from 0.


> T <- with(sat.act, table(gender, education))
> rownames(T) <- c("M", "F")
> error.bars.tab(T, way="both", ylab="Proportion of Education Level", xlab="Level of Education",
+    main="Proportion of sample by education level")

[Figure: "Proportion of sample by education level": bar graph; y axis: Proportion of Education Level (0.00-0.30), x axis: Level of Education by gender (M 0 through M 5, F 0 through F 5)]

Figure 7: The proportion of each education level that is Male or Female. By using the way="both" option, the percentages and errors are based upon the grand total. Alternatively, way="columns" finds column wise percentages and way="rows" finds rowwise percentages. The data can be converted to percentages (as shown) or by total count (raw=TRUE). The function invisibly returns the probabilities and standard errors. See the help menu for an example of entering the data as a data.frame.


3.3.5 Two dimensional displays of means and errors

Yet another way to display data for different conditions is to use the errorCrosses function. For instance, the effect of various movies on both "Energetic Arousal" and "Tense Arousal" can be seen in one graph and compared to the same movie manipulations on "Positive Affect" and "Negative Affect". Note how Energetic Arousal is increased by three of the movie manipulations, but that Positive Affect increases following the Happy movie only.


> op <- par(mfrow=c(1,2))
> data(affect)
> colors <- c("black","red","white","blue")
> films <- c("Sad","Horror","Neutral","Happy")
> affect.stats <- errorCircles("EA2","TA2",data=affect[-c(1,20)],group="Film",labels=films,
+    xlab="Energetic Arousal", ylab="Tense Arousal",ylim=c(10,22),xlim=c(8,20),pch=16,
+    cex=2,colors=colors, main = "Movies effect on arousal")
> errorCircles("PA2","NA2",data=affect.stats,labels=films,xlab="Positive Affect",
+    ylab="Negative Affect", pch=16,cex=2,colors=colors, main ="Movies effect on affect")
> op <- par(mfrow=c(1,1))

[Figure: two panels: "Movies effect on arousal" (x: Energetic Arousal, y: Tense Arousal) and "Movies effect on affect" (x: Positive Affect, y: Negative Affect), with circles for the Sad, Horror, Neutral, and Happy film groups]

Figure 8: The use of the errorCircles function allows for two dimensional displays of means and error bars. The first call to errorCircles finds descriptive statistics for the affect data.frame based upon the grouping variable of Film. These data are returned and then used by the second call which examines the effect of the same grouping variable upon different measures. The size of the circles represents the relative sample sizes for each group. The data are from the PMC lab and reported in Smillie et al. (2012).


3.3.6 Back to back histograms

The bibars function summarizes the characteristics of two groups (e.g., males and females) on a second variable (e.g., age) by drawing back to back histograms (see Figure 9).

> data(bfi)
> with(bfi, bibars(age, gender, ylab="Age", main="Age by males and females"))

[Figure: "Age by males and females": back to back histogram; x axis: Frequency, y axis: Age]

Figure 9: A bar plot of the age distribution for males and females shows the use of bibars. The data are males and females from 2800 cases collected using the SAPA procedure and are available as part of the bfi data set.


3.3.7 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower, upper)
> round(both, 2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:


> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

3.3.8 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 10), or a simulated data set of 24 variables with a circumplex structure (Figure 11). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

Yet another way to show structure is to use "spider" plots. Particularly if variables are ordered in some meaningful way (e.g., in a circumplex), a spider plot will show this structure easily. This is just a plot of the magnitude of the correlation as a radial line, with length ranging from 0 (for a correlation of -1) to 1 (for a correlation of 1). (See Figure 12.)

3.4 Testing correlations

Correlations are wonderful descriptive statistics of the data, but some people like to test whether these correlations differ from zero or differ from each other. The cor.test function (in the stats package) will test the significance of a single correlation, and the rcorr function in the Hmisc package will do this for many correlations. In the psych package, the corr.test function reports the correlation (Pearson, Spearman, or Kendall) between all variables in either one or two data frames or matrices, as well as the number of observations for each case and the (two-tailed) probability for each correlation. Unfortunately, these probability values have not been corrected for multiple comparisons and so should be taken with a great deal of salt. Thus, in corr.test and corr.p the raw probabilities are reported below the diagonal, and the probabilities adjusted for multiple comparisons using (by default) the Holm correction are reported above the diagonal (Table 1). (See the p.adjust function for a discussion of Holm (1979) and other corrections.)

Testing the difference between any two correlations can be done using the r.test function. The function actually does four different tests (based upon an article by Steiger (1980)),


> png('corplot.png')
> corPlot(Thurstone, numbers=TRUE, upper=FALSE, diag=FALSE, main="9 cognitive variables from Thurstone")
> dev.off()

null device
          1

Figure 10: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well. By default, the complete matrix is shown. Setting upper=FALSE and diag=FALSE shows a cleaner figure.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ, main="24 variables in a circumplex")
> dev.off()

null device
          1

Figure 11: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect. For circumplex structures, it is perhaps useful to show the complete matrix.


> png('spider.png')
> op <- par(mfrow=c(2,2))
> spider(y=c(1,6,12,18), x=1:24, data=r.circ, fill=TRUE, main="Spider plot of 24 circumplex variables")
> op <- par(mfrow=c(1,1))
> dev.off()

null device
          1

Figure 12: A spider plot can show circumplex structure very clearly. Circumplex structures are common in the study of affect.


Table 1: The corr.test function reports correlations, cell sizes, and raw and adjusted probability values. corr.p reports the probability values for a correlation matrix. By default, the adjustment used is that of Holm (1979).

> corr.test(sat.act)

Call:corr.test(x = sat.act)

Correlation matrix
          gender education   age   ACT  SATV  SATQ
gender      1.00      0.09 -0.02 -0.04 -0.02 -0.17
education   0.09      1.00  0.55  0.15  0.05  0.03
age        -0.02      0.55  1.00  0.11 -0.04 -0.03
ACT        -0.04      0.15  0.11  1.00  0.56  0.59
SATV       -0.02      0.05 -0.04  0.56  1.00  0.64
SATQ       -0.17      0.03 -0.03  0.59  0.64  1.00

Sample Size
          gender education age ACT SATV SATQ
gender       700       700 700 700  700  687
education    700       700 700 700  700  687
age          700       700 700 700  700  687
ACT          700       700 700 700  700  687
SATV         700       700 700 700  700  687
SATQ         687       687 687 687  687  687

Probability values (Entries above the diagonal are adjusted for multiple tests.)
          gender education  age  ACT SATV SATQ
gender      0.00      0.17 1.00 1.00 1.00 0.00
education   0.02      0.00 0.00 0.00 1.00 1.00
age         0.58      0.00 0.00 0.03 1.00 1.00
ACT         0.33      0.00 0.00 0.00 0.00 0.00
SATV        0.62      0.22 0.26 0.00 0.00 0.00
SATQ        0.00      0.36 0.37 0.00 0.00 0.00

To see confidence intervals of the correlations, print with the short=FALSE option:
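That is:

> print(corr.test(sat.act), short=FALSE)   #adds the confidence intervals to the output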


depending upon the input:

1) For a sample size n, find the t and p value for a single correlation as well as the confidence interval.

> r.test(50, .3)

Correlation tests
Call:r.test(n = 50, r12 = 0.3)
Test of significance of a correlation
t value 2.18 with probability < 0.034
and confidence interval 0.02 0.53

2) For sample sizes of n and n2 (n2 = n if not specified), find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores.

> r.test(30, .4, .6)

Correlation tests
Call:r.test(n = 30, r12 = 0.4, r34 = 0.6)
Test of difference between two independent correlations
z value 0.99 with probability 0.32

3) For sample size n, and correlations ra = r12, rb = r23, and r13 specified, test for the difference of two dependent correlations (Steiger case A).

> r.test(103, .4, .5, .1)

Correlation tests
Call:[1] "r.test(n = 103, r12 = 0.4, r23 = 0.1, r13 = 0.5)"
Test of difference between two correlated correlations
t value -0.89 with probability < 0.37

4) For sample size n, test for the difference between two dependent correlations involving different variables (Steiger case B).

> r.test(103, .5, .6, .7, .5, .5, .8)   #Steiger Case B

Correlation tests
Call:r.test(n = 103, r12 = 0.5, r34 = 0.6, r23 = 0.7, r13 = 0.5, r14 = 0.5,
    r24 = 0.8)
Test of difference between two dependent correlations
z value -1.2 with probability 0.23

To test whether a matrix of correlations differs from what would be expected if the population correlations were all zero, the function cortest follows Steiger (1980), who pointed out that the sum of the squared elements of a correlation matrix, or the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix). This is particularly useful for examining whether correlations in a single matrix differ from zero or for comparing two matrices. Although obvious, cortest can be used to test whether the sat.act data matrix produces non-zero correlations (it does). This is a much more appropriate test when testing whether a residual matrix differs from zero.

> cortest(sat.act)


Tests of correlation matrices
Call:cortest(R1 = sat.act)
Chi Square value 1325.42 with df = 15 with probability < 1.8e-273

3.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process (Figure 13). This is also shown in terms of dichotomizing the bivariate normal density function using the draw.cor function (Figure 14). A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.
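A minimal sketch of the tetrachoric function (the dichotomous items here are simulated at random, so the estimated latent correlations will be near zero):

> x <- matrix(sample(0:1, 400, replace=TRUE), ncol=4)   #hypothetical dichotomous items
> tetrachoric(x)$rho   #the estimated latent correlations; $tau contains the thresholds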

If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will sometimes happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
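A minimal sketch, using the burt correlation matrix supplied with psych:

> data(burt)
> burt.s <- cor.smooth(burt)       #warns that smoothing was done
> min(eigen(burt.s)$values)        #the smallest eigen value is now positive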

3.6 Multiple regression from data or correlation matrices

The typical application of the lm function is to do a linear model of one Y variable as a function of multiple X variables. Because lm is designed to analyze complex interactions, it requires raw data as input. It is, however, sometimes convenient to do multiple regression from a correlation or covariance matrix. The setCor function will do this, taking a set of y variables predicted from a set of x variables, perhaps with a set of z covariates removed from both x and y. Consider the Thurstone correlation matrix and find the multiple correlation of the last five variables as a function of the first 4.

> setCor(y = 5:9, x = 1:4, data=Thurstone)


> draw.tetra()

[Figure: a bivariate normal distribution with rho = .5 cut at τ (on X) and Τ (on Y), showing the four quadrants (X > τ, Y > Τ; X < τ, Y > Τ; X > τ, Y < Τ; X < τ, Y < Τ), the marginal normal densities, and the resulting phi = .33]

Figure 13: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values.


> draw.cor(expand=20, cuts=c(0,0))

[Figure: "Bivariate density, rho = .5": a bivariate normal density surface dichotomized at cuts of (0, 0)]

Figure 14: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values. It is found (laboriously) by optimizing the fit of the bivariate normal for various values of the correlation to the observed cell frequencies.


Call: setCor(y = 5:9, x = 1:4, data = Thurstone)

Multiple Regression from matrix input

Beta weights
                  Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sentences                      0.09     0.07          0.25      0.21         0.20
Vocabulary                     0.09     0.17          0.09      0.16        -0.02
Sent.Completion                0.02     0.05          0.04      0.21         0.08
First.Letters                  0.58     0.45          0.21      0.08         0.31

Multiple R
Four.Letter.Words   Suffixes   Letter.Series   Pedigrees   Letter.Group
             0.69       0.63            0.50        0.58           0.48

multiple R2
Four.Letter.Words   Suffixes   Letter.Series   Pedigrees   Letter.Group
             0.48       0.40            0.25        0.34           0.23

Unweighted multiple R
Four.Letter.Words   Suffixes   Letter.Series   Pedigrees   Letter.Group
             0.59       0.58            0.49        0.58           0.45

Unweighted multiple R2
Four.Letter.Words   Suffixes   Letter.Series   Pedigrees   Letter.Group
             0.34       0.34            0.24        0.33           0.20

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.6280 0.1478 0.0076 0.0049
Average squared canonical correlation = 0.2
Cohen's Set Correlation R2 = 0.69
Unweighted correlation between the two sets = 0.73

By specifying the number of subjects in the correlation matrix, appropriate estimates of standard errors, t-values, and probabilities are also found. The next example finds the regressions with variables 1 and 2 used as covariates. The β weights for variables 3 and 4 do not change, but the multiple correlation is much less. It also shows how to find the residual correlations between variables 5-9 with variables 1-4 removed.

> sc <- setCor(y = 5:9, x = 3:4, data=Thurstone, z=1:2)

Call: setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)

Multiple Regression from matrix input

Beta weights
                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31

Multiple R
Four.Letter.Words   Suffixes   Letter.Series   Pedigrees   Letter.Group
             0.58       0.46            0.21        0.18           0.30

multiple R2
Four.Letter.Words   Suffixes   Letter.Series   Pedigrees   Letter.Group
            0.331      0.210           0.043       0.032          0.092

Unweighted multiple R
Four.Letter.Words   Suffixes   Letter.Series   Pedigrees   Letter.Group
             0.44       0.35            0.17        0.14           0.26

Unweighted multiple R2
Four.Letter.Words   Suffixes   Letter.Series   Pedigrees   Letter.Group
             0.19       0.12            0.03        0.02           0.07

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.405 0.023
Average squared canonical correlation = 0.21
Cohen's Set Correlation R2 = 0.42
Unweighted correlation between the two sets = 0.48

> round(sc$residual, 2)

                  Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Four.Letter.Words              0.52     0.11          0.09      0.06         0.13
Suffixes                       0.11     0.60         -0.01      0.01         0.03
Letter.Series                  0.09    -0.01          0.75      0.28         0.37
Pedigrees                      0.06     0.01          0.28      0.66         0.20
Letter.Group                   0.13     0.03          0.37      0.20         0.77

3.7 Mediation and Moderation analysis

Although multiple regression is a straightforward method for determining the effect of multiple predictors (x1, x2, ... xi) on a criterion variable y, some prefer to think of the effect of one predictor, x, as mediated by another variable, m (Preacher and Hayes, 2004). Thus, we may find the indirect path from x to m, and then from m to y, as well as the direct path from x to y. Call these paths a, b, and c, respectively. Then the indirect effect of x on y through m is just ab, and the direct effect is c. Statistical tests of the ab effect are best done by bootstrapping.

Consider the example from Preacher and Hayes (2004) as analyzed using the mediate function and the subsequent graphic from mediate.diagram. The data are found in the example for mediate.

Call: mediate(y = 1, x = 2, m = 3, data = sobel)

The DV (Y) was SATIS. The IV (X) was THERAPY. The mediating variable(s) = ATTRIB.

Total Direct effect (c) of THERAPY on SATIS = 0.76   S.E. = 0.31   t direct = 2.5   with probability = 0.019
Direct effect (c') of THERAPY on SATIS removing ATTRIB = 0.43   S.E. = 0.32   t direct = 1.35   with probability = 0.19
Indirect effect (ab) of THERAPY on SATIS through ATTRIB = 0.33
Mean bootstrapped indirect effect = 0.33 with standard error = 0.17   Lower CI = 0.04   Upper CI = 0.71
R2 of model = 0.31

To see the longer output, specify short = FALSE in the print statement.

Full output

Total effect estimates (c)
        SATIS   se   t   Prob
THERAPY  0.76 0.31 2.5 0.0186

Direct effect estimates (c')
        SATIS   se    t  Prob
THERAPY  0.43 0.32 1.35 0.190
ATTRIB   0.40 0.18 2.23 0.034

'a' effect estimates
       THERAPY  se    t   Prob
ATTRIB    0.82 0.3 2.74 0.0106

'b' effect estimates
       SATIS   se    t  Prob
ATTRIB   0.4 0.18 2.23 0.034

'ab' effect estimates
        SATIS boot   sd lower upper
THERAPY  0.33 0.33 0.17  0.04  0.71

4 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.


> mediate.diagram(preacher)

[Figure: "Mediation model" path diagram: THERAPY to ATTRIB (0.82), ATTRIB to SATIS (0.4), THERAPY to SATIS (c = 0.76, c' = 0.43)]

Figure 15: A mediated model taken from Preacher and Hayes, 2004, and solved using the mediate function. The direct path from Therapy to Satisfaction has an effect of .76, while the indirect path through Attribution has an effect of .33. Compare this to the normal regression graphic created by setCor.diagram.


> preacher <- setCor(1, c(2,3), sobel, std=FALSE)
> setCor.diagram(preacher)

[Figure: "Regression Models" path diagram: THERAPY (0.43) and ATTRIB (0.4) predicting SATIS, with 0.21 between the predictors; unweighted matrix correlation = 0.56]

Figure 16: The conventional regression model for the Preacher and Hayes, 2004 data set, solved using the setCor function. Compare this to the previous figure.


4.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

¹ Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

² Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will be eventually phased out.

fa.poly is useful when finding the factor structure of categorical items. fa.poly first finds the tetrachoric or polychoric correlations between the categorical variables and then proceeds to do a normal factor analysis. By setting the n.iter option to be greater than 1, it will also find confidence intervals for the factor solution. Warning: finding polychoric correlations is very slow, so think carefully before doing so.

factor.minres (deprecated) Minimum residual factor analysis is a least squares, iterative solution to the factor problem. minres attempts to minimize the residual (off-diagonal) correlation matrix. It produces solutions similar to maximum likelihood solutions, but will work even if the matrix is singular.

factor.pa (deprecated) Principal Axis factor analysis is a least squares, iterative solution to the factor problem. PA will work for cases where maximum likelihood techniques (factanal) will not work. The original communality estimates are either the squared multiple correlations (smc) for each item or 1.

factor.wls (deprecated) Weighted least squares factor analysis is a least squares, iterative solution to the factor problem. It minimizes the (weighted) squared residual matrix. The weights are based upon the independent contribution of each variable.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values. Note that PCA is not the same as factor analysis and the two should not be confused.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss (Very Simple Structure) (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

nfactors combines VSS, MAP, and a number of other fit statistics. The depressing reality is that frequently these conventional fit estimates of the number of factors do not agree.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

4.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix _nR_n be approximated by the product of a factor matrix _nF_k and its transpose plus a diagonal matrix of uniquenesses?

R = FF' + U^2   (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
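A minimal sketch of equation 1 in action, using the $loadings and $uniquenesses components of the fa output (a one factor solution, so no factor intercorrelations are involved):

> f1 <- fa(Thurstone, nfactors=1, n.obs=213)
> R.model <- f1$loadings %*% t(f1$loadings) + diag(f1$uniquenesses)
> round(Thurstone - R.model, 2)   #the residuals; compare with f1$residual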

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 2). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", "varimin" and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

Note that if extracting more than one factor and doing any oblique rotation, it is necessary to have the GPArotation package installed. This is checked for in the appropriate functions.

4.1.2 Principal Axis Factor Analysis

An alternative, least squares algorithm (included in fa with the fm="pa" option, or as a standalone function (factor.pa)) does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the maxiter parameter to 1 produces the SAS solution.

The solutions from the fa, the factor.minres and factor.pa, as well as the principal functions, can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax, quartimax, varimin, and bifactor. Oblique transformations include oblimin, quartimin, biquartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 3).

4.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 4).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 17). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 17) or by a structure diagram (Figure 18).

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but that the minres solution is slightly better.

Table 2: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> if(!require(GPArotation)) {stop("GPArotation must be installed to do rotations")} else {
+   f3t <- fa(Thurstone,3,n.obs=213)
+   f3t }

Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    MR1   MR2   MR3   h2   u2 com
Sentences          0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary         0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion    0.83  0.04  0.00 0.73 0.27 1.0
First.Letters      0.00  0.86  0.00 0.73 0.27 1.0
Four.Letter.Words -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes           0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series      0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees          0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group      -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> if(!require(GPArotation)) {stop("GPArotation must be installed to do rotations")} else {
+   f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
+   f3o <- target.rot(f3)
+   f3o }

Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                    PA1   PA2   PA3   h2   u2
Sentences          0.89 -0.03  0.07 0.81 0.19
Vocabulary         0.89  0.07  0.00 0.80 0.20
Sent.Completion    0.83  0.04  0.03 0.70 0.30
First.Letters     -0.02  0.85 -0.01 0.73 0.27
Four.Letter.Words -0.05  0.74  0.09 0.57 0.43
Suffixes           0.17  0.63 -0.09 0.43 0.57
Letter.Series     -0.06 -0.08  0.84 0.69 0.31
Pedigrees          0.33 -0.09  0.48 0.37 0.63
Letter.Group      -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


Table 4: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)

Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                     WLS1   WLS2   WLS3    h2    u2  com
Sentences           0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary          0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion     0.833  0.034  0.007 0.735 0.265 1.00
First.Letters      -0.002  0.855  0.003 0.731 0.269 1.00
Four.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes            0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series       0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees           0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group       -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627


> plot(f3t)

[Figure 17 here: a SPLOM of the loadings on MR1, MR2, and MR3, each axis running from 0.0 to 0.8, titled "Factor Analysis"]

Figure 17: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.


> fa.diagram(f3t)

[Figure 18 here: a path diagram titled "Factor Analysis"; MR1 points to Sentences, Vocabulary, and Sent.Completion (0.9, 0.9, 0.8), MR2 to First.Letters, Four.Letter.Words, and Suffixes (0.9, 0.7, 0.6), MR3 to Letter.Series, Letter.Group, and Pedigrees (0.8, 0.6, 0.5), with interfactor correlations of 0.6, 0.5, and 0.5]

Figure 18: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

4.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Some psychologists use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 5 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than the fa solution.

4.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 18) can themselves be factored. This results in a higher order factor model (Figure 19). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 20). Yet another solution is to use structural equation modeling to directly solve for the general and group factors.


Table 5: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 3. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p

Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    RC1   RC2   RC3   h2   u2 com
Sentences          0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary         0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion    0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters      0.01  0.84  0.07 0.78 0.22 1.0
Four.Letter.Words -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes           0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series      0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees          0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group      -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))

[Figure 19 here: an "Omega" path diagram; group factors F1, F2, and F3 load on the 9 variables, and a higher order g loads on F1, F2, and F3 (0.8, 0.8, 0.7)]

Figure 19: A higher order factor solution to the Thurstone 9 variable problem.


> om <- omega(Thurstone,n.obs=213)

[Figure 20 here: an "Omega" path diagram of the Schmid-Leiman solution; g loads directly on all 9 variables (0.5 to 0.7), with residualized group factors F1, F2, and F3]

Figure 20: A bifactor factor solution to the Thurstone 9 variable problem.


Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011).
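A minimal sketch, assuming the same Thurstone data; "bifactor" (orthogonal) and "biquartimin" (oblique) are among the rotate options of fa:

f.bi <- fa(Thurstone, 3, n.obs = 213, rotate = "bifactor")  # Jennrich-Bentler bifactor rotation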

4.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known, somewhat pretentiously, as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or, in iclust, if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.


iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 21). Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 8). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[Figure 21 here: the ICLUST dendrogram showing how the 25 bfi items join into clusters C1-C21, each labeled with its α and β]

Figure 21: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.


Table 6: The summary statistics from an iclust analysis show three large clusters and one smaller cluster.

> summary(ic)  # show the results

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


The previous analysis (Figure 21) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 22). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working, using progressBar.

Table 7: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

4.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 9). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

Bootstrapped confidence intervals can also be found for the loadings of a factoring of a polychoric matrix. fa.poly will find the polychoric correlation matrix and, if the n.iter option is greater than 1, will then randomly resample the data (case wise) to give bootstrapped samples. This will take a long time for a large number of items or iterations.
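A minimal sketch of such a bootstrapped polychoric analysis (20 iterations here, as in Table 9; 1000 would be more conventional):

fp <- fa.poly(bfi[1:10], nfactors = 2, n.iter = 20)  # factor the polychorics, resampling cases 20 times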

4.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of Burt's congruence coefficient (also known as Tucker's coefficient), which is just the cosine of the angle between the dimensions:

c_{fi,fj} = Σ_{k=1..n} f_ik f_jk / √(Σ f_ik² Σ f_jk²)

> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[Figure 22 here: the ICLUST dendrogram based on polychoric correlations, clusters C1-C23 with their α and β]

Figure 22: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 21), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[Figure 23 here: the ICLUST dendrogram based on polychoric correlations, with the solution forced to 5 clusters]

Figure 23: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 22), which was done without specifying the number of clusters, and to the next one (Figure 24), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")

[Figure 24 here: the ICLUST dendrogram based on polychoric correlations, with beta.size=3]

Figure 24: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 21, 22, 23).


Table 8: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic,cut=.3)

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P   C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL


Table 9: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)

Factor Analysis with confidence intervals using method = fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with Likelihood Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.006 and the 90 % confidence intervals are 0.006 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.44 -0.40 -0.37
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.07 -0.03  0.00  0.73  0.77  0.80
A4  0.10  0.15  0.20  0.40  0.44  0.48
A5 -0.02  0.02  0.06  0.57  0.62  0.67
C1  0.54  0.57  0.60 -0.09 -0.05 -0.02
C2  0.58  0.62  0.67 -0.04 -0.01  0.02
C3  0.48  0.54  0.58  0.00  0.03  0.08
C4 -0.69 -0.66 -0.60 -0.03  0.01  0.04
C5 -0.62 -0.57 -0.52 -0.08 -0.05 -0.02

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.27     0.32  0.37



Consider the case of a four factor solution and a four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 10).

Table 10: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))

     MR1  MR2  MR3  PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01 1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00 0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52 0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51 0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06 1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02 0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

4.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).


Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigenvalues of the real data are less than the corresponding eigenvalues of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigenvalues and applying the scree test (a sudden drop in eigenvalues analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigenvalue < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size, in that for large samples the eigenvalues of random factors will all tend towards 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigators creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigenvalue of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
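Several of these criteria can be examined in one pass; a sketch, assuming the nfactors function (which runs vss, the MAP criterion, and a series of fit statistics for 1 ... n factors):

nf <- nfactors(bfi[1:25], n = 8)  # compare solutions with 1 to 8 factors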

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix, after the factors of interest are extracted, is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.


4.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Average Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 25). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[Figure 25 here: a plot of Very Simple Structure Fit (0 to 1) against the number of factors (1 to 8), with separate curves for complexity 1 through 4]

Figure 25: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors
The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit  RMSEA  BIC   SABIC complex
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.0150 9648 10522.1     1.0
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.0100 5287  6084.5     1.2
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.0075 3200  3924.3     1.3
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.0055 1731  2385.1     1.4
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.0030  281   869.3     1.6
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.0018 -296   228.5     1.7
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.0013 -463     1.2     1.9
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.0010 -524  -117.6     1.9

  eChisq  SRMR eCRMS  eBIC
1  23881 0.119 0.125 21698
2  12432 0.086 0.094 10440
3   7232 0.066 0.075  5422
4   3750 0.047 0.057  2115
5   1495 0.030 0.038    27
6    670 0.020 0.027  -639
7    448 0.016 0.023  -711
8    289 0.013 0.020  -727

4.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data, as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 26). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial, minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly. By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
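A minimal sketch (slow for many items; n.iter sets the number of replications):

fa.parallel.poly(bfi[1:25], n.iter = 20)  # parallel analysis of the polychoric correlations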


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Figure 26 here: a scree-style plot of the eigenvalues of principal components and factor analysis against the factor/component number, with curves for PC and FA solutions from the actual, simulated, and resampled data]

Figure 26: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors, and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


4.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created to represent a two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 27).

Another way to examine the overlap between two sets is the use of set correlation, found by setCor (discussed later).
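As a sketch of that alternative (using the sat.act data set from psych for illustration; columns 1-3 are the demographic variables, columns 4-6 the test scores):

sc <- setCor(y = 4:6, x = 1:3, data = sat.act)  # how much do the two sets of variables overlap?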

4.6 Exploratory Structural Equation Modeling (ESEM)

Generalizing the procedures of factor extension, we can do Exploratory Structural Equation Modeling (ESEM). Traditional Exploratory Factor Analysis (EFA) examines how latent variables can account for the correlations within a data set. All loadings and cross loadings are found, and rotation is done to some approximation of simple structure. Traditional Confirmatory Factor Analysis (CFA) tests such models by fitting just a limited number of loadings and typically does not allow any (or many) cross loadings. Structural Equation Modeling then applies two such measurement models, one to a set of X variables, another to a set of Y variables, and then tries to estimate the correlation between these two sets of latent variables. (Some SEM procedures estimate all the parameters from the same model, thus making the loadings in set Y affect those in set X.) It is possible to do a similar exploratory modeling (ESEM) by conducting two Exploratory Factor Analyses, one in set X, one in set Y, and then finding the correlations of the X factors with the Y factors, as well as the correlations of the Y variables with the X factors and the X variables with the Y factors.

Consider the simulated data set of three ability variables, two motivational variables, and three outcome variables:

Call: sim.structural(fx = fx, Phi = Phi, fy = fy)

$model (Population correlation matrix)
        V    Q     A  nach   Anx   gpa   Pre    MA
V    1.00 0.72  0.54  0.00  0.00  0.38  0.32  0.25
Q    0.72 1.00  0.48  0.00  0.00  0.34  0.28  0.22
A    0.54 0.48  1.00  0.48 -0.42  0.50  0.42  0.34
nach 0.00 0.00  0.48  1.00 -0.56  0.34  0.28  0.22
Anx  0.00 0.00 -0.42 -0.56  1.00 -0.29 -0.24 -0.20
gpa  0.38 0.34  0.50  0.34 -0.29  1.00  0.30  0.24
Pre  0.32 0.28  0.42  0.28 -0.24  0.30  1.00  0.20
MA   0.25 0.22  0.34  0.22 -0.20  0.24  0.20  1.00

$reliability (population reliability)
   V    Q    A nach  Anx  gpa  Pre   MA
0.81 0.64 0.72 0.64 0.49 0.36 0.25 0.16

> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Figure 27 here: "Factor analysis and extension" — MR1 and MR2 are defined by the odd-numbered items (loadings of about ±0.5 to 0.7) and extended to the even-numbered items (loadings of about ±0.5 to 0.6)]

Figure 27: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.



We can fit this by using the esem function and then draw the solution (see Figure 28) using the esem.diagram function (which is normally called automatically by esem).

Exploratory Structural Equation Modeling Analysis using method = minres
Call: esem(r = gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
    n.obs = 1000, plot = FALSE)

For the X set:
       MR1   MR2
V     0.91 -0.06
Q     0.81 -0.05
A     0.53  0.57
nach -0.10  0.81
Anx   0.08 -0.71

For the Y set:
    MR1
gpa 0.6
Pre 0.5
MA  0.4

Correlations between the X and Y sets.
     X1   X2   Y1
X1 1.00 0.19 0.68
X2 0.19 1.00 0.67
Y1 0.68 0.67 1.00

The degrees of freedom for the null model are 56 and the empirical chi square function was 6930.29
The degrees of freedom for the model are 7 and the empirical chi square function was 21.83
with prob < 0.0027
The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.04

with the empirical chi square 21.83 with prob < 0.0027
The total number of observations was 1000 with fitted Chi Square = 2175.06 with prob < 0

Empirical BIC = -26.53
ESABIC = -4.29
Fit based upon off diagonal values = 1

To see the item loadings for the X and Y sets combined, and the associated fa output, print with short=FALSE.

[Figure 28 here: "Exploratory Structural Model" — the X-set factors MR1 (V, Q: 0.9, 0.8) and MR2 (A, nach, Anx: 0.6, 0.8, -0.7) and the Y-set factor MR1 (gpa, Pre, MA: 0.6, 0.5, 0.4), connected by structural paths of 0.7 and 0.7]

Figure 28: An example of an Exploratory Structure Equation Model.


5 Classical Test Theory and Reliability

Surprisingly, 110 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

λ3 = n/(n−1) · (1 − tr(Vx)/Vx) = n/(n−1) · (Vx − tr(Vx))/Vx = α

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. Alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e²j, and is

λ6 = 1 − (Σ e²j)/Vx = 1 − (Σ (1 − r²smc))/Vx

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
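A quick way to see these lower bounds (smc returns the squared multiple correlation of each variable with all the others):

round(smc(Thurstone), 2)  # smcs of the 9 Thurstone variables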

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal, or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.


More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha  Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman  Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega  Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor  Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems  Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha, coefficient G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items (a short sketch of keys-based scoring follows this list).

score.multiple.choice  Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
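As promised above, a short sketch of scoring with a keys matrix; the five-scale keys.list for the bfi follows the psych documentation examples, and make.keys turns it into the -1/0/1 matrix that scoreItems expects:

keys.list <- list(agree = c("-A1","A2","A3","A4","A5"),
                  conscientious = c("C1","C2","C3","-C4","-C5"),
                  extraversion = c("-E1","-E2","E3","E4","E5"),
                  neuroticism = c("N1","N2","N3","N4","N5"),
                  openness = c("O1","-O2","O3","O4","-O5"))
keys <- make.keys(bfi, keys.list)   # items prefixed with "-" are reverse keyed
scores <- scoreItems(keys, bfi)     # scale scores, alpha, G6, and scale intercorrelations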

5.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as when each item is dropped in turn.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that items 2 and 6 (complaints and critical) are to be scored (incorrectly) negatively.

> alpha(attitude,keys=c("complaints","critical"))

Reliability analysis
Call: alpha(x = attitude, keys = c("complaints", "critical"))

  raw_alpha std.alpha G6(smc) average_r  S/N  ase mean  sd
       0.18      0.27    0.66      0.05 0.37 0.19   53 4.7

 lower alpha upper     95% confidence boundaries
 -0.18  0.18  0.55

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r    S/N alpha se
rating         -0.023     0.090    0.52    0.0162  0.099    0.265
complaints-     0.689     0.666    0.76    0.2496  1.995    0.078
privileges     -0.133     0.021    0.58    0.0036  0.022    0.282
learning       -0.459    -0.251    0.41   -0.0346 -0.201    0.363
raises         -0.178    -0.062    0.47   -0.0098 -0.058    0.295
critical-       0.364     0.473    0.76    0.1299  0.896    0.139
advance        -0.137    -0.016    0.52   -0.0026 -0.016    0.258

 Item statistics
             n  raw.r  std.r r.cor r.drop mean   sd
rating      30  0.600  0.601  0.63   0.28   65 12.2
complaints- 30 -0.542 -0.559 -0.78  -0.75   50 13.3
privileges  30  0.679  0.663  0.57   0.39   53 12.2
learning    30  0.857  0.853  0.90   0.70   56 11.7
raises      30  0.719  0.730  0.77   0.50   65 10.4
critical-   30  0.015  0.036 -0.29  -0.27   42  9.9
advance     30  0.688  0.694  0.66   0.46   43 10.3


Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if items 2 and 6 are dropped, then the reliability is improved substantially. This suggests that items 2 and 6 were incorrectly scored. Doing the analysis again with the items positively scored produces much more favorable results.

> alpha(attitude)

Reliability analysis
Call: alpha(x = attitude)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE)  # 500 responses to 4 discrete items
> alpha(items$observed)  # item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0


5.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 29) or omegaSem for a confirmatory analysis using the sem package, based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
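A sketch of those steps done by hand (omega automates them); schmid performs the Schmid-Leiman transformation of an oblique solution:

f3 <- fa(r9, 3, rotate = "oblimin")      # oblique factoring of the simulated data below
sl <- schmid(cor(r9), 3, fm = "minres")  # general + residualized group factor loadings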

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal).

For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be √rab, where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al., 2007. To do this in omega, add the option="first" or option="second" to the call.
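A sketch of the two factor case (option may be "equal", "first", or "second"; r9 is the simulated data set used below):

om2 <- omega(r9, nfactors = 2, option = "first")  # equate g with the first group factor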

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness (u²) from factor analysis to find e²j. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is h²j = c²j + Σ f²ij, and the unique variance for the item, u²j = σ²j(1 − h²j), may be used to estimate the test reliability. That is, if h²j is the communality of item j, based upon general as well as group factors, then for standardized items, e²j = 1 − h²j and

ωt = (1′cc′1 + 1′AA′1)/Vx = 1 − Σ(1 − h²j)/Vx = 1 − (Σ u²)/Vx

Because h²j ≥ r²smc, ωt ≥ λ6.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999: ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

ωh = 1′cc′1/Vx

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

ω∞ = 1′cc′1/(1′cc′1 + 1′AA′1)

> om9 <- omega(r9,title="9 simulated variables")

[Figure 29 here: an "Omega" bifactor diagram titled "9 simulated variables": g loads 0.3 to 0.7 on all items, with group factors F1 (V1-V3), F2 (V4-V6), and F3 (V7-V9)]

Figure 29: A bifactor solution for 9 simulated variables with a hierarchical structure.

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.


> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max 4.76   max/min = 1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.001 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09
RMSEA index = 0.01 and the 90 % confidence intervals are 0.01 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19


5.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9, n.obs=500, lavaan=FALSE)
Call: omegaSem(m = r9, n.obs = 500, lavaan = FALSE)
Omega 
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip, 
    digits = digits, title = title, sl = sl, labels = labels, 
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8 
G.6:                   0.8 
Omega Hierarchical:    0.69 
Omega H asymptotic:    0.82 
Omega Total            0.83 

Schmid Leiman Factor loadings greater than  0.2 
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.72  0.45             0.72 0.28 0.71
V2 0.58  0.37             0.47 0.53 0.71
V3 0.57  0.41             0.49 0.51 0.66
V4 0.57              0.41 0.50 0.50 0.66
V5 0.55              0.32 0.41 0.59 0.73
V6 0.43              0.27 0.28 0.72 0.66
V7 0.47        0.47       0.44 0.56 0.50
V8 0.43        0.42       0.36 0.64 0.51
V9 0.32        0.23       0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3* 
2.48 0.52 0.47 0.36 

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13 
Explained Common Variance of the general factor =  0.65 

The degrees of freedom are 12  and the fit is  0.03 
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02 
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.001  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32 
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08 
The df corrected root mean square of the residuals is  0.09 
RMSEA index =  0.01  and the 90 % confidence intervals are  0.01 0.114
BIC =  -7.97 

Measures of factor score adequacy 
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 The following analyses were done using the  sem  package 

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83 
With loadings of 
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.74  0.40             0.71 0.29 0.77
V2 0.58  0.36             0.47 0.53 0.72
V3 0.56  0.42             0.50 0.50 0.63
V4 0.57              0.45 0.53 0.47 0.61
V5 0.58              0.25 0.40 0.60 0.84
V6 0.43              0.26 0.26 0.74 0.71
V7 0.49        0.38       0.38 0.62 0.63
V8 0.44        0.45       0.39 0.61 0.50
V9 0.34        0.24       0.17 0.83 0.68

With eigenvalues of:
   g  F1*  F2*  F3* 
2.60 0.47 0.40 0.33 

general/max  5.51   max/min =   1.41
mean percent general =  0.68    with sd =  0.1 and cv of  0.15 
Explained Common Variance of the general factor =  0.68 

Measures of factor score adequacy 
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.87  0.59  0.59  0.55
Multiple R square of scores with factors      0.75  0.35  0.34  0.30
Minimum correlation of factor score estimates 0.50 -0.31 -0.31 -0.39

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.57 0.65
Omega general for total scores and subscales  0.72 0.57 0.33 0.48
Omega group for total scores and subscales    0.11 0.22 0.24 0.18

To get the standard sem fit statistics, ask for summary on the fitted object.

5.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf and guttman functions. These are described in more detail in Revelle and Zinbarg (2009) and in Revelle and Condon (2014). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)
Split half reliabilities 
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72

5.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

5.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function and a list of items to be scored on each scale (a keys.list). Items may be listed by location (convenient but dangerous) or name (probably safer).

Make a keys.list by specifying the items for each scale, preceding items to be negatively keyed with a - sign.

> #the newer way is probably preferred 
> 
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+       conscientious=c("C1","C2","C2","-C4","-C5"),
+       extraversion=c("-E1","-E2","E3","E4","E5"),
+       neuroticism=c("N1","N2","N3","N4","N5"),
+       openness = c("O1","-O2","O3","O4","-O5")) 
> #this can also be done by location-- 
> keys.list <- list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+       Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+       Openness = c(21,-22,23,24,-25))
> #These two approaches can be mixed if desired 
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),conscientious=c("C1","C2","C2","-C4","-C5"),
+       extraversion=c("-E1","-E2","E3","E4","E5"),
+       neuroticism=c(16:20),openness = c(21,-22,23,24,-25)) 
> keys.list
$agree
[1] "-A1" "A2"  "A3"  "A4"  "A5" 

$conscientious
[1] "C1"  "C2"  "C2"  "-C4" "-C5"

$extraversion
[1] "-E1" "-E2" "E3"  "E4"  "E5" 

$neuroticism
[1] 16 17 18 19 20

$openness
[1] 21 -22 23 24 -25

In the past (prior to version 1.6.9) the keys.list was then converted to a keys matrix using the helper function make.keys. This is no longer necessary. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
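As a minimal illustration of the reversal rule, assuming a hypothetical item x scored on a 1 to 6 scale:

x <- c(1, 4, 6)        # hypothetical responses on a 1-6 item
x.rev <- (6 + 1) - x   # reverse scored by subtracting from max + min, giving 6, 3, 1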

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
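A sketch of requesting totals rather than means (totals is a documented scoreItems argument):

scores.total <- scoreItems(keys.list, bfi, totals=TRUE)   # sum the items instead of averaging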

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R). Although not normally necessary, show the keys to understand what is happening. There are two ways to make up the keys. You can specify the items by location (the old way) or by name (the newer and probably preferred way). To use the newer way, you must specify the file on which you will use the keys. The example below shows how to construct keys either way.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. This is done automatically in the new way of forming keys, but if using the older way, the number must be specified. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

Then use this keys list to score the items:

> scores <- scoreItems(keys.list, bfi)
> scores
Call: scoreItems(keys = keys.list, items = bfi)

(Unstandardized) Alpha:
      agree conscientious extraversion neuroticism openness
alpha   0.7          0.69         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      agree conscientious extraversion neuroticism openness
ASE   0.014         0.016        0.013       0.011    0.017

Average item correlation:
          agree conscientious extraversion neuroticism openness
average.r  0.32          0.35         0.39        0.46     0.23

Guttman 6* reliability:
         agree conscientious extraversion neuroticism openness
Lambda.6   0.7          0.69         0.76        0.81      0.6

Signal/Noise based upon av.r:
             agree conscientious extraversion neuroticism openness
Signal/Noise   2.3           2.2          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation 
raw correlations below the diagonal, alpha on the diagonal 
corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.70          0.36         0.63      -0.245     0.23
conscientious  0.25          0.69         0.37      -0.334     0.33
extraversion   0.46          0.27         0.76      -0.284     0.32
neuroticism   -0.18         -0.25        -0.22       0.812    -0.12
openness       0.15          0.21         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE. To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores (see Figure 30).
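A minimal sketch of getting at this additional output (headTail is the psych helper that shows the first and last rows):

print(scores, short=FALSE)   # item by scale correlations and response frequencies
headTail(scores$scores)      # the individual scale scores themselves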

5.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations and then use scoreItems:

> r.bfi <- cor(bfi, use="pairwise")
> scales <- scoreItems(keys.list, r.bfi)
> summary(scales)
Call: scoreItems(keys = keys.list, items = r.bfi)

Scale intercorrelations corrected for attenuation 
raw correlations below the diagonal, (standardized) alpha on the diagonal 
corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.71          0.35         0.64      -0.242     0.25
conscientious  0.24          0.69         0.38      -0.314     0.33
extraversion   0.47          0.27         0.76      -0.278     0.35
neuroticism   -0.18         -0.24        -0.22       0.815    -0.11
openness       0.16          0.22         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.

> png('scores.png')
> pairs.panels(scores$scores, pch='.', jiggle=TRUE)
> dev.off()
pdf 
  2 

Figure 30: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.

5.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys, iqitems)
Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics 
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false 
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)   #compare to previous results
          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
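A sketch of this step, under the assumption that we treat the item types as subscales (the keys list here is hypothetical, made up purely for illustration):

iq.keys.list <- list(reasoning=c("reason.4","reason.16","reason.17","reason.19"),
                     letters=c("letter.7","letter.33","letter.34","letter.58"),
                     spatial=c("rotate.3","rotate.4","rotate.6","rotate.8"))
iq.scales <- scoreItems(iq.keys.list, iq.tf)   # score the 0/1 items into three scales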


5.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches. A sketch of the start of such a pipeline appears below.
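A minimal sketch, using the five Agreeableness items of the bfi (the choice of items is just for illustration):

describe(bfi[1:5])                  # basic descriptive statistics
f1 <- fa(bfi[1:5])                  # one factor solution for these items
alpha(bfi[1:5], check.keys=TRUE)    # alpha, reversing negatively keyed items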

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 31).

5.6.1 Exploring the item structure of scales

The Big Five scales found above can be understood in terms of the item - whole correlations, but it is also useful to think of the endorsement frequency of the items. The item.lookup function will sort items by their factor loading/item-whole correlation, and then resort those above a certain threshold in terms of the item means. Item content is shown by using the dictionary developed for those items. This allows one to see the structure of each scale in terms of its endorsement range. This is a simple way of thinking of items that is also possible to do using the various IRT approaches discussed later.

> m <- colMeans(bfi, na.rm=TRUE)
> item.lookup(scales$item.corrected[,1:3], m, dictionary=bfi.dictionary[1:2])

     agree conscientious extraversion means ItemLabel
A1   -0.40         -0.06        -0.10  2.41     q_146
E1   -0.31         -0.07        -0.59  2.97     q_712
E2   -0.39         -0.26        -0.70  3.14     q_901
E3    0.45          0.21         0.61  4.00    q_1205
E4    0.51          0.24         0.68  4.42    q_1410
E5    0.35          0.40         0.56  4.42    q_1768
A5    0.62          0.22         0.55  4.56    q_1419
A3    0.70          0.22         0.48  4.60    q_1206
A4    0.49          0.30         0.30  4.70    q_1364
A2    0.67          0.21         0.40  4.80    q_1162
C4   -0.23         -0.67        -0.23  2.55     q_626
N4   -0.22         -0.31        -0.39  3.19    q_1479
C5   -0.26         -0.58        -0.29  3.30    q_1949
C3    0.21          0.56         0.15  4.30     q_619
C2    0.21          0.61         0.18  4.37     q_530
E5.1  0.35          0.40         0.56  4.42    q_1768
C1    0.13          0.54         0.20  4.50     q_124
E1.1 -0.31         -0.07        -0.59  2.97     q_712
E2.1 -0.39         -0.26        -0.70  3.14     q_901
N4.1 -0.22         -0.31        -0.39  3.19    q_1479
E3.1  0.45          0.21         0.61  4.00    q_1205
E4.1  0.51          0.24         0.68  4.42    q_1410


> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))   #set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)

[Four panels (reason.4, reason.16, reason.17, reason.19), each plotting P(response) against theta for response alternatives 0 through 6.]

Figure 31: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


E5.2  0.35          0.40         0.56  4.42    q_1768
O3    0.26          0.22         0.43  4.44     q_492
A5.1  0.62          0.22         0.55  4.56    q_1419
A3.1  0.70          0.22         0.48  4.60    q_1206
A2.1  0.67          0.21         0.40  4.80    q_1162
O1    0.17          0.21         0.33  4.82     q_128

     Item
A1   Am indifferent to the feelings of others.
E1   Don't talk a lot.
E2   Find it difficult to approach others.
E3   Know how to captivate people.
E4   Make friends easily.
E5   Take charge.
A5   Make people feel at ease.
A3   Know how to comfort others.
A4   Love children.
A2   Inquire about others' well-being.
C4   Do things in a half-way manner.
N4   Often feel blue.
C5   Waste my time.
C3   Do things according to a plan.
C2   Continue until everything is perfect.
E5.1 Take charge.
C1   Am exacting in my work.
E1.1 Don't talk a lot.
E2.1 Find it difficult to approach others.
N4.1 Often feel blue.
E3.1 Know how to captivate people.
E4.1 Make friends easily.
E5.2 Take charge.
O3   Carry the conversation to a higher level.
A5.1 Make people feel at ease.
A3.1 Know how to comfort others.
A2.1 Inquire about others' well-being.
O1   Am full of ideas.

5.6.2 Empirical scale construction

There are some situations where one wants to identify those items that most relate to a particular criterion. Although this will capitalize on chance and the results should be interpreted cautiously, it does give a feel for what is being measured. Consider the following example from the bfi data set. The items that best predicted gender, education, and age may be found using the bestScales function. This also shows the use of a dictionary that has the item content.

> data(bfi)
> bestScales(bfi, criteria=c("gender","education","age"), cut=.1, dictionary=bfi.dictionary[1:3])
The items most correlated with the criteria yield r's of 
          correlation n.items
gender           0.32       9
education        0.14       1
age              0.24       9

The best items, their correlations and content are 
$gender
  Rownames     gender ItemLabel                                       Item     Giant3
1       N5  0.2106171    q_1505                              Panic easily.  Stability
2       A2  0.1820202    q_1162          Inquire about others' well-being.   Cohesion
3       A1 -0.1571373     q_146  Am indifferent to the feelings of others.   Cohesion
4       A3  0.1401320    q_1206                Know how to comfort others.   Cohesion
5       A4  0.1261747    q_1364                             Love children.   Cohesion
6       E1 -0.1261018     q_712                          Don't talk a lot. Plasticity
7       N3  0.1207718    q_1099                 Have frequent mood swings.  Stability
8       O1 -0.1031059     q_128                          Am full of ideas. Plasticity
9       A5  0.1007091    q_1419                  Make people feel at ease.   Cohesion

$education
  Rownames  education ItemLabel                                       Item   Giant3
1       A1 -0.1415734     q_146  Am indifferent to the feelings of others. Cohesion

$age
  Rownames        age ItemLabel                                       Item     Giant3
1       A1 -0.1609637     q_146  Am indifferent to the feelings of others.   Cohesion
2       C4 -0.1482659     q_626             Do things in a half-way manner.  Stability
3       A4  0.1442473    q_1364                              Love children.   Cohesion
4       A5  0.1290935    q_1419                   Make people feel at ease.   Cohesion
5       E5  0.1146922    q_1768                                Take charge. Plasticity
6       A2  0.1142767    q_1162           Inquire about others' well-being.   Cohesion
7       N3 -0.1108648    q_1099                  Have frequent mood swings.  Stability
8       E2 -0.1051612     q_901       Find it difficult to approach others. Plasticity
9       N5 -0.1043322    q_1505                               Panic easily.  Stability

6 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 32).


6.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, $\delta_j$, and item discrimination, $\alpha_j$, may be found from factor analysis of the tetrachoric correlations where $\lambda_j$ is just the factor loading on the first factor and $\tau_j$ is the normal threshold reported by the tetrachoric function:

$$\delta_j = \frac{D\tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \qquad (2)$$

where D is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings ($\lambda_j$) and item thresholds ($\tau_j$) are just

$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}.$$

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty:

> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")   #dichotomous items
> test <- irt.fa(d9$items)
> test 
Item Response Analysis using Factor Analysis 

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor =  1 
                -3    -2    -1     0    1    2     3
V1            0.48  0.59  0.24  0.06 0.01 0.00  0.00
V2            0.30  0.68  0.45  0.12 0.02 0.00  0.00
V3            0.12  0.50  0.74  0.29 0.06 0.01  0.00
V4            0.05  0.26  0.74  0.55 0.14 0.03  0.00
V5            0.01  0.07  0.44  1.05 0.41 0.06  0.01
V6            0.00  0.03  0.15  0.60 0.74 0.24  0.04
V7            0.00  0.01  0.04  0.22 0.73 0.63  0.16
V8            0.00  0.00  0.02  0.12 0.45 0.69  0.31
V9            0.00  0.01  0.02  0.08 0.25 0.47  0.36
Test Info     0.98  2.14  2.85  3.09 2.81 2.14  0.89
SEM           1.01  0.68  0.59  0.57 0.60 0.68  1.06
Reliability  -0.02  0.53  0.65  0.68 0.64 0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2 
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09 
The df corrected root mean square of the residuals is  0.1 

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.043  and the 90 % confidence intervals are  0.043 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis 

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor =  1 
                -3    -2    -1     0    1    2    3
E1            0.37  0.58  0.75  0.76 0.61 0.39 0.21
E2            0.48  0.89  1.18  1.24 0.99 0.59 0.25
E3            0.34  0.49  0.58  0.57 0.49 0.36 0.23
E4            0.63  0.98  1.11  0.96 0.67 0.37 0.16
E5            0.33  0.42  0.47  0.45 0.37 0.27 0.17
Test Info     2.15  3.36  4.09  3.98 3.13 1.97 1.02
SEM           0.68  0.55  0.49  0.50 0.56 0.71 0.99
Reliability   0.53  0.70  0.76  0.75 0.68 0.49 0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05 
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04 
The df corrected root mean square of the residuals is  0.06 

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.009  and the 90 % confidence intervals are  0.009 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 33).

These procedures can be generalized to more than one factor by specifying the number of factors in irt.fa.


> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))

[Three panels plotted against the Latent Trait (normal scale) for items V1-V9: "Item parameters from factor analysis" (probability of response), "Item information from factor analysis" (item information), and "Test information -- item parameters from factor analysis" (test information and reliability).]

Figure 32: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt, type="IIC")

[One panel: "Item information from factor analysis for factor 1", plotting Item Information against the Latent Trait (normal scale) for -E1, -E2, E3, E4, and E5.]

Figure 33: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info, sort=TRUE)
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor =  1 
                -3    -2    -1     0    1    2    3
E2            0.48  0.89  1.18  1.24 0.99 0.59 0.25
E4            0.63  0.98  1.11  0.96 0.67 0.37 0.16
E1            0.37  0.58  0.75  0.76 0.61 0.39 0.21
E3            0.34  0.49  0.58  0.57 0.49 0.36 0.23
E5            0.33  0.42  0.47  0.45 0.37 0.27 0.17
Test Info     2.15  3.36  4.09  3.98 3.13 1.97 1.02
SEM           0.68  0.55  0.49  0.50 0.56 0.71 0.99
Reliability   0.53  0.70  0.76  0.75 0.68 0.49 0.02

More extensive IRT packages include ltm and eRm, and they should be used for serious Item Response Theory analysis.

6.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

In addition, recent releases of psych take advantage of the parallel package and use multi-cores. The default for Macs and Unix machines is to use two cores, but this can be increased using the options command. The biggest step up in improvement is from 1 to 2 cores, but for large problems using polychoric correlations, the more cores available, the better.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 34) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 36). We can also show the total test information (merely the sum of the item information; Figure 35). This shows that even with just 16 items, the test is very reliable for most of the range of ability. The irt.fa function saves the correlation matrix and item statistics so that they can be redrawn with other options.

library(parallel)       #the parallel package supplies detectCores
detectCores()           #how many are available
options("mc.cores")     #how many have been set to be used
options("mc.cores"=4)   #set to use 4 cores

> iq.irt
Item Response Analysis using Factor Analysis 

Call: irt.fa(x = ability)
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor =  1 
               -3    -2    -1     0    1    2     3
reason.4     0.05  0.24  0.64  0.53 0.16 0.03  0.01
reason.16    0.08  0.22  0.38  0.31 0.14 0.05  0.01
reason.17    0.08  0.33  0.69  0.42 0.11 0.02  0.00
reason.19    0.06  0.17  0.35  0.36 0.19 0.07  0.02
letter.7     0.05  0.18  0.41  0.44 0.20 0.06  0.02
letter.33    0.05  0.15  0.31  0.36 0.20 0.08  0.02
letter.34    0.05  0.19  0.45  0.46 0.20 0.06  0.01
letter.58    0.02  0.09  0.30  0.53 0.35 0.12  0.03
matrix.45    0.05  0.11  0.19  0.23 0.17 0.09  0.04
matrix.46    0.05  0.12  0.22  0.26 0.18 0.09  0.04
matrix.47    0.06  0.17  0.33  0.34 0.19 0.07  0.02
matrix.55    0.04  0.08  0.13  0.17 0.16 0.11  0.06
rotate.3     0.00  0.01  0.06  0.30 0.75 0.49  0.12
rotate.4     0.00  0.01  0.05  0.31 1.00 0.56  0.10
rotate.6     0.01  0.03  0.15  0.53 0.69 0.25  0.05
rotate.8     0.00  0.02  0.08  0.29 0.59 0.41  0.13
Test Info    0.67  2.11  4.73  5.83 5.28 2.55  0.69
SEM          1.22  0.69  0.46  0.41 0.44 0.63  1.20
Reliability -0.49  0.53  0.79  0.83 0.81 0.61 -0.45

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.91 
The number of observations was  1525  with Chi Square =  2893.45  with prob <  0

The root mean square of the residuals (RMSA) is  0.08 
The df corrected root mean square of the residuals is  0.09 

Tucker Lewis Index of factoring reliability =  0.723
RMSEA index =  0.018  and the 90 % confidence intervals are  0.018 0.137
BIC =  2131.15

6.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for scoreIrt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty.


> iq.irt <- irt.fa(ability)

[One panel: "Item information from factor analysis", plotting Item Information against the Latent Trait (normal scale) for the 16 ability items (reason.4 through rotate.8).]

Figure 34: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


> plot(iq.irt, type="test")

[One panel: "Test information -- item parameters from factor analysis", plotting Test Information (and Reliability) against the Latent Trait (normal scale).]

Figure 35: A graphic analysis of 16 ability items sampled from the SAPA project. The total test information at all levels of difficulty may be shown by specifying the type='test' option in the plot function.


> om <- omega(iq.irt$rho, 4)

[Omega path diagram: the general factor g and four group factors F1-F4 with their loadings on the 16 ability items.]

Figure 36: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.


The scoreIrt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9, 1000, -2., 2., mod="normal")   #dichotomous items
> items <- v9$items 
> test <- irt.fa(items)
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- scoreIrt(test, items)
> unitweighted <- scoreIrt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 37

6.3.1 1 versus 2 parameter IRT scoring

In Item Response Theory, items can be assumed to be equally discriminating but to differ in their difficulty (the Rasch model), or to vary in their discriminability. Two functions (scoreIrt.1pl and scoreIrt.2pl) are meant to find multiple IRT based scales using the Rasch model or the 2 parameter model. Both allow for negatively keyed as well as positively keyed items. Consider the bfi data set with scoring keys keys.list and items listed as an item.list. (This is the same as the keys.list, but with the negative signs removed.)

> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+    conscientious=c("C1","C2","C3","-C4","-C5"),
+    extraversion=c("-E1","-E2","E3","E4","E5"),
+    neuroticism=c("N1","N2","N3","N4","N5"),
+    openness = c("O1","-O2","O3","O4","-O5")) 
> item.list <- list(agree=c("A1","A2","A3","A4","A5"),
+    conscientious=c("C1","C2","C3","C4","C5"),
+    extraversion=c("E1","E2","E3","E4","E5"),
+    neuroticism=c("N1","N2","N3","N4","N5"),
+    openness = c("O1","O2","O3","O4","O5")) 
> bfi.1pl <- scoreIrt.1pl(keys.list, bfi)    #the one parameter solution
> bfi.2pl <- scoreIrt.2pl(item.list, bfi)    #the two parameter solution
> bfi.ctt <- scoreFast(keys.list, bfi)       #fast scoring function


> pairs.panels(scores.df, pch='.', gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring', line=3)

[pairs.panels plot comparing True theta, irt theta, total, fit, rasch, total, and fit scores.]

Figure 37: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.


We can compare these three ways of doing the analysis using the cor2 function, which correlates two separate data frames. All three models produce very similar results for the case of almost complete data. It is when we have massively missing completely at random data (MMCAR) that the results show the superiority of the irt scoring.

> #compare the solutions using the cor2 function
> cor2(bfi.1pl, bfi.ctt)
              agree conscientious extraversion neuroticism openness
agree          0.93          0.27         0.44       -0.20     0.19
conscientious  0.25          0.97         0.25       -0.22     0.17
extraversion   0.43          0.25         0.94       -0.23     0.23
neuroticism   -0.19         -0.23        -0.22        1.00    -0.09
openness       0.12          0.19         0.19       -0.07     0.95
> cor2(bfi.2pl, bfi.ctt)
              agree conscientious extraversion neuroticism openness
agree          0.98          0.25         0.49       -0.17     0.15
conscientious  0.26          0.97         0.25       -0.23     0.18
extraversion   0.43          0.24         0.94       -0.25     0.21
neuroticism   -0.19         -0.22        -0.18        0.98    -0.08
openness       0.17          0.21         0.27       -0.11     0.98
> cor2(bfi.2pl, bfi.1pl)
              agree conscientious extraversion neuroticism openness
agree          0.88          0.25         0.45       -0.17     0.12
conscientious  0.26          1.00         0.24       -0.23     0.16
extraversion   0.43          0.23         0.99       -0.25     0.20
neuroticism   -0.21         -0.21        -0.19        0.98    -0.07
openness       0.20          0.19         0.28       -0.11     0.92

7 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.


7.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

$$r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \qquad (3)$$

where $r_{xy}$ is the normal correlation, which may be decomposed into within group and between group correlations $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and η (eta) is the correlation of the data with the within group values, or the group means.
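A minimal sketch of extracting the two pieces of this decomposition with statsBy, using the sat.act data grouped by education:

sb <- statsBy(sat.act, group="education", cors=TRUE)
sb$rwg   # pooled within group correlations
sb$rbg   # between group correlations (correlations of the group means)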

7.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).


7.3 Factor analysis by groups

Confirmatory factor analysis comparing the structures in multiple groups can be done in the lavaan package. However, for exploratory analyses of the structure within each of multiple groups, the faBy function may be used in combination with the statsBy function. First run statsBy with the correlation option set to TRUE, and then run faBy on the resulting output.

sb <- statsBy(bfi[c(1:25,27)], group="education", cors=TRUE)
faBy(sb, nfactors=5)   #find the 5 factor solution for each education level

8 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

$$R^2 = 1 - \prod_{i=1}^{n}(1-\lambda_i)$$

where $\lambda_i$ is the ith eigen value of the eigen value decomposition of the matrix

$$R = R_{xx}^{-1}R_{xy}R_{yy}^{-1}R_{yx}.$$

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized or raw β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.2458  -3.2133   0.7769   3.5921   9.2630 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110    
education    0.47890    0.15235   3.143  0.00174 ** 
age          0.01623    0.02278   0.712  0.47650    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272,  Adjusted R-squared:  0.02301 
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setCor:

> #compare with setCor
> setCor(c(4:6), c(1:3), C, n.obs=700)

Call: setCor(y = c(4:6), x = c(1:3), data = C, n.obs = 700)

Multiple Regression from matrix input 

 Beta weights 
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

 Multiple R 
 ACT SATV SATQ 
0.16 0.10 0.19 

 multiple R2 
   ACT   SATV   SATQ 
0.0272 0.0096 0.0359 

 Unweighted multiple R 
 ACT SATV SATQ 
0.15 0.05 0.11 

 Unweighted multiple R2 
 ACT SATV SATQ 
0.02 0.00 0.01 

 SE of Beta weights 
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

 t of Beta Weights 
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

 Probability of t < 
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

 Shrunken R2 
   ACT   SATV   SATQ 
0.0230 0.0054 0.0317 

 Standard Error of R2 
   ACT   SATV   SATQ 
0.0120 0.0073 0.0137 

 F 
 ACT SATV SATQ 
6.49 2.26 8.63 

 Probability of F < 
     ACT     SATV     SATQ 
2.48e-04 8.08e-02 1.24e-05 

 degrees of freedom of regression 
[1]   3 696

 Various estimates of between set correlations
Squared Canonical Correlations 
[1] 0.050 0.033 0.008

 Chisq of canonical correlations 
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03 
 Cohen's Set Correlation R2 =  0.09 
 Shrunken Set Correlation R2 =  0.08 
 F and df of Cohen's Set Correlation  7.26  9  1681.86 
 Unweighted correlation between the two sets =  0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.


9 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette psych forsem

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
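A minimal sketch, assuming the defaults described above:

set.seed(42)                   # any seed; only for reproducibility
new.data <- sim.structure()    # 4 factors, 12 variables, simplex structure between factors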

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and $r^n$ for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
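A base R sketch of the implied correlation pattern (not the psych simulation itself, just the $r^n$ structure it targets):

r <- 0.8
R.simplex <- r^abs(outer(1:6, 1:6, "-"))   # r^|i-j|: 0.8 adjacent, 0.64 two apart, ...
round(R.simplex, 2)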

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four irt simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

$$P(x|\theta_i,\delta_j,\gamma_j,\zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \qquad (4)$$

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), $\alpha_j$ is item discrimination, and $\delta_j$ is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.
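A direct base R transcription of equation 4 (the function name here is made up for illustration):

irt.p <- function(theta, delta, alpha=1, gamma=0, zeta=1)
  gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
curve(irt.p(x, delta=0), -3, 3)   # a Rasch item of difficulty 0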

(Graphics of these may be seen in the demonstrations for the logistic function)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same. With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.

10 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)

c <- iclust(Thurstone)
plot(c)                #a pretty boring plot
iclust.diagram(c)      #a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)               #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)

ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.

11 Converting output to APA style tables using LaTeX

Although for most purposes using the Sweave or KnitR packages produces clean output, some prefer output pre-formatted for APA style tables. This can be done using the xtable package for almost anything, but there are a few simple functions in psych for the most common tables. fa2latex will convert a factor analysis or components analysis output to a LaTeX table, cor2latex will take a correlation matrix and show the lower (or upper) diagonal, irt2latex converts the item statistics from the irt.fa function to more convenient LaTeX output, and finally, df2latex converts a generic data frame to LaTeX.

An example of converting the output from fa to LaTeX appears in Table 11.
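A minimal sketch (the Thurstone correlation matrix ships with psych; fa2latex writes the LaTeX source to the console):

f3 <- fa(Thurstone, 3)
fa2latex(f3)   # LaTeX source for a table like Table 11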


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1,9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1,3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9,3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9,9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3,6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7,6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1,1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9,1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double", -1)   #for rectangles, specify where to point 
> dia.curved.arrow(mr, ur, "up")               #but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$top, scale=-1)
> dia.curved.arrow(mr, lr$top)

[The resulting "Demonstration of dia functions" plot shows the labeled rectangles and ellipses connected by the straight, curved, and double-headed arrows drawn above.]

Figure 38: The basic graphic capabilities of the dia functions are shown in this figure.

Table 11: fa2latex. A factor analysis table from the psych package in R.

Variable          MR1    MR2    MR3    h2    u2   com
Sentences        0.91  -0.04   0.04  0.82  0.18  1.01
Vocabulary       0.89   0.06  -0.03  0.84  0.16  1.01
Sent.Completion  0.83   0.04   0.00  0.73  0.27  1.00
First.Letters    0.00   0.86   0.00  0.73  0.27  1.00
4.Letter.Words  -0.01   0.74   0.10  0.63  0.37  1.04
Suffixes         0.18   0.63  -0.08  0.50  0.50  1.20
Letter.Series    0.03  -0.01   0.84  0.72  0.28  1.00
Pedigrees        0.37  -0.05   0.47  0.50  0.50  1.93
Letter.Group    -0.06   0.21   0.64  0.53  0.47  1.23

SS loadings      2.64   1.86   1.5

MR1              1.00   0.59   0.54
MR2              0.59   1.00   0.52
MR3              0.54   0.52   1.00

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.


geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also set.cor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems (see the sketch below).
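To illustrate two of these helpers, the following minimal sketch (with made-up toy values) contrasts a circular mean with an arithmetic mean and then builds a small super matrix:

> # times of day (in hours) clustered around midnight
> hours <- c(1, 23, 2, 22, 0.5)
> mean(hours)            # the arithmetic mean, 9.7, is far from every observation
> circadian.mean(hours)  # the circular mean is close to midnight (about 0.1)
>
> # combine two keying matrices into one super matrix
> A <- matrix(1, 2, 2)
> B <- matrix(2, 3, 3)
> superMatrix(A, B)      # A on the top left, B on the lower right, 0s elsewhere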

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.


Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1984), this data set allows for examples of basic scaling techniques.
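To explore any of these data sets, simply load and describe it; a minimal sketch using bfi:

> data(bfi)
> dim(bfi)            # 2800 rows: the 25 items plus gender, education, and age
> describe(bfi[1:5])  # basic descriptive statistics for the first five items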


14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib/ and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.
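From within R itself, the development version may also be installed directly from that repository; a sketch, assuming the repository layout described above:

> install.packages("psych", repos = "http://personality-project.org/r", type = "source")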

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.5.0", package="psych")

15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book), An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html: A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()
R Under development (unstable) (2017-01-04 r71889)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GPArotation_2014.11-1 psych_1.6.12

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      lattice_0.20-34  matrixcalc_1.0-3 MASS_7.3-45
 [5] grid_3.4.0       arm_1.8-6        nlme_3.1-128     stats4_3.4.0
 [9] coda_0.18-1      minqa_1.2.4      mi_1.0           nloptr_1.0.4
[13] Matrix_1.2-7.1   boot_1.3-18      splines_3.4.0    lme4_1.1-12
[17] sem_3.1-7        tools_3.4.0      foreign_0.8-67   abind_1.4-3
[21] parallel_3.4.0   compiler_3.4.0   mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel modeling in R (2.3): a brief introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2012). sem: Structural Equation Models.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, pages 1–13. 10.1007/s11336-011-9218-4.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1984). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

Preacher, K. J. and Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4):717–731.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2015). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. R package version 1.5.8.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W. and Condon, D. M. (2014). Reliability. In Irwing, P., Booth, T., and Hughes, D., editors, Wiley-Blackwell Handbook of Psychometric Testing. Wiley-Blackwell (in press).

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Smillie, L. D., Cooper, A., Wilt, J., and Revelle, W. (2012). Do extraverts get more bang for the buck? Refining the affective-reactivity hypothesis of extraversion. Journal of Personality and Social Psychology, 103(2):306–326.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2):245–251.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.


Page 2: An overview of the psych package - The Personality Project

4 Item and scale analysis 3641 Dimension reduction through factor analysis and cluster analysis 39

411 Minimum Residual Factor Analysis 41412 Principal Axis Factor Analysis 42413 Weighted Least Squares Factor Analysis 42414 Principal Components analysis (PCA) 48415 Hierarchical and bi-factor solutions 48416 Item Cluster Analysis iclust 52

42 Confidence intervals using bootstrapping techniques 5543 Comparing factorcomponentcluster solutions 5544 Determining the number of dimensions to extract 61

441 Very Simple Structure 63442 Parallel Analysis 64

45 Factor extension 6646 Exploratory Structural Equation Modeling (ESEM) 66

5 Classical Test Theory and Reliability 7051 Reliability of a single scale 7152 Using omega to find the reliability of a single scale 7553 Estimating ωh using Confirmatory Factor Analysis 79

531 Other estimates of reliability 8054 Reliability and correlations of multiple scales within an inventory 81

541 Scoring from raw data 81542 Forming scales from a correlation matrix 83

55 Scoring Multiple Choice Items 8556 Item analysis 87

561 Exploring the item structure of scales 87562 Empirical scale construction 89

6 Item Response Theory analysis 9061 Factor analysis and Item Response Theory 9162 Speeding up analyses 9563 IRT based scoring 96

631 1 versus 2 parameter IRT scoring 100

7 Multilevel modeling 10271 Decomposing data into within and between level correlations using statsBy 10372 Generating and displaying multilevel data 10373 Factor analysis by groups 104

8 Set Correlation and Multiple Regression from the correlation matrix 104

2

9 Simulation functions 107

10 Graphical Displays 109

11 Converting output to APA style tables using LATEX 109

12 Miscellaneous functions 111

13 Data sets 112

14 Development version and a users guide 114

15 Psychometric Theory 114

16 SessionInfo 114

3

01 Jump starting the psych packagendasha guide for the impatient

You have installed psych (section 2) and you want to use it without reading much moreWhat should you do

1 Activate the psych package

library(psych)

2 Input your data (section 31) Go to your friendly text editor or data manipulationprogram (eg Excel) and copy the data to the clipboard Include a first line that hasthe variable labels Paste it into psych using the readclipboardtab command

myData lt- readclipboardtab()

3 Make sure that what you just read is right Describe it (section 32) and perhapslook at the first and last few lines

describe(myData)

headTail(myData)

4 Look at the patterns in the data If you have fewer than about 10 variables lookat the SPLOM (Scatter Plot Matrix) of the data using pairspanels (section 331)Even better use the outlier to detect outliers

pairspanels(myData)

outlier(myData)

5 Note that you have some weird subjects probably due to data entry errors Eitheredit the data by hand (use the edit command) or just scrub the data (section 322)

cleaned lt- scrub(myData max=9) eg change anything great than 9 to NA

6 Graph the data with error bars for each variable (section 333)

errorbars(myData)

7 Find the correlations of all of your data

bull Descriptively (just the values) (section 337)

r lt- lowerCor(myData)

bull Graphically (section 338)

corPlot(r)

bull Inferentially (the values the ns and the p values) (section 34)

corrtest(myData)

8 Test for the number of factors in your data using parallel analysis (faparallelsection 442) or Very Simple Structure (vss 441)

faparallel(myData)

vss(myData)

4

9 Factor analyze (see section 41) the data with a specified number of factors (thedefault is 1) the default method is minimum residual the default rotation for morethan one factor is oblimin There are many more possibilities (see sections 411-413)Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm(Revelle 1979) (see section 416) Also consider a hierarchical factor solution to findcoefficient ω (see 415)

fa(myData)

iclust(myData)

omega(myData)

If you prefer to do a principal components analysis you may use the principalfunction The default is one component

principal(myData)

10 Some people like to find coefficient α as an estimate of reliability This may be donefor a single scale using the alpha function (see 51) Perhaps more useful is theability to create several scales as unweighted averages of specified items using thescoreItems function (see 54) and to find various estimates of internal consistencyfor these scales find their intercorrelations and find scores for all the subjects

alpha(myData) score all of the items as part of one scale

myKeys lt- makekeys(nvar=20list(first = c(1-35-7810)second=c(24-61115-16)))

myscores lt- scoreItems(myKeysmyData) form several scales

myscores show the highlights of the results

At this point you have had a chance to see the highlights of the psych package and todo some basic (and advanced) data analysis You might find reading this entire vignettehelpful to get a broader understanding of what can be done in R using the psych Rememberthat the help command () is available for every function Try running the examples foreach help page

5

1 Overview of this and related documents

The psych package (Revelle 2015) has been developed at Northwestern University since2005 to include functions most useful for personality psychometric and psychological re-search The package is also meant to supplement a text on psychometric theory (Revelleprep) a draft of which is available at httppersonality-projectorgrbook

Some of the functions (eg readclipboard describe pairspanels scatterhisterrorbars multihist bibars) are useful for basic data entry and descriptive analy-ses

Psychometric applications emphasize techniques for dimension reduction including factoranalysis cluster analysis and principal components analysis The fa function includesfive methods of factor analysis (minimum residual principal axis weighted least squaresgeneralized least squares and maximum likelihood factor analysis) Principal ComponentsAnalysis (PCA) is also available through the use of the principal function Determiningthe number of factors or components to extract may be done by using the Very SimpleStructure (Revelle and Rocklin 1979) (vss) Minimum Average Partial correlation (Velicer1976) (MAP) or parallel analysis (faparallel) criteria These and several other criteriaare included in the nfactors function Two parameter Item Response Theory (IRT)models for dichotomous or polytomous items may be found by factoring tetrachoric

or polychoric correlation matrices and expressing the resulting parameters in terms oflocation and discrimination using irtfa

Bifactor and hierarchical factor structures may be estimated by using Schmid Leimantransformations (Schmid and Leiman 1957) (schmid) to transform a hierarchical factorstructure into a bifactor solution (Holzinger and Swineford 1937)

Scale construction can be done using the Item Cluster Analysis (Revelle 1979) (iclust)function to determine the structure and to calculate reliability coefficients α (Cronbach1951)(alpha scoreItems scoremultiplechoice) β (Revelle 1979 Revelle and Zin-barg 2009) (iclust) and McDonaldrsquos ωh and ωt (McDonald 1999) (omega) Guttmanrsquos sixestimates of internal consistency reliability (Guttman (1945) as well as additional estimates(Revelle and Zinbarg 2009) are in the guttman function The six measures of Intraclasscorrelation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairspanels cor-relation ldquoheat mapsrdquo (corPlot) factor cluster and structural diagrams using fadiagramiclustdiagram structurediagram and hetdiagram as well as item response charac-teristics and item and test information characteristic curves plotirt and plotpoly

This vignette is meant to give an overview of the psych package That is it is meantto give a summary of the main functions in the psych package with examples of how

6

they are used for data description dimension reduction and scale construction The ex-tended user manual at psych_manualpdf includes examples of graphic output and moreextensive demonstrations than are found in the help menus (Also available at http

personality-projectorgrpsych_manualpdf) The vignette psych for sem atpsych_for_sempdf discusses how to use psych as a front end to the sem package of JohnFox (Fox et al 2012) (The vignette is also available at httppersonality-project

orgrbookpsych_for_sempdf)

For a step by step tutorial in the use of the psych package and the base functions inR for basic personality research see the guide for using R for personality research athttppersonalitytheoryorgrrshorthtml For an introduction to psychometrictheory with applications in R see the draft chapters at httppersonality-project

orgrbook)

2 Getting started

Some of the functions described in this overview require other packages Particularlyuseful for rotating the results of factor analyses (from eg fa factorminres factorpafactorwls or principal) or hierarchical factor models using omega or schmid is theGPArotation package These and other useful packages may be installed by first installingand then using the task views (ctv) package to install the ldquoPsychometricsrdquo task view butdoing it this way is not necessary

installpackages(ctv)

library(ctv)

taskviews(Psychometrics)

The ldquoPsychometricsrdquo task view will install a large number of useful packages To installthe bare minimum for the examples in this vignette it is necessary to install just 3 pack-ages

installpackages(list(c(GPArotationmnormt)

Because of the difficulty of installing the package Rgraphviz alternative graphics have beendeveloped and are available as diagram functions If Rgraphviz is available some functionswill take advantage of it An alternative is to useldquodotrdquooutput of commands for any externalgraphics package that uses the dot language

7

3 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptivestatistics

Remember to run any of the psych functions it is necessary to make the package activeby using the library command

gt library(psych)

The other packages once installed will be called automatically by psych

It is possible to automatically load psych and other functions by creating and then savinga ldquoFirstrdquo function eg

First lt- function(x) library(psych)

31 Data input from the clipboard

There are of course many ways to enter data into R Reading from a local file usingreadtable is perhaps the most preferred However many users will enter their datain a text editor or spreadsheet program and then want to copy and paste into R Thismay be done by using readtable and specifying the input file as ldquoclipboardrdquo (PCs) orldquopipe(pbpaste)rdquo (Macs) Alternatively the readclipboard set of functions are perhapsmore user friendly

readclipboard is the base function for reading data from the clipboard

readclipboardcsv for reading text that is comma delimited

readclipboardtab for reading text that is tab delimited (eg copied directly from anExcel file)

readclipboardlower for reading input of a lower triangular matrix with or without adiagonal The resulting object is a square matrix

readclipboardupper for reading input of an upper triangular matrix

readclipboardfwf for reading in fixed width fields (some very old data sets)

For example given a data set copied to the clipboard from a spreadsheet just enter thecommand

gt mydata lt- readclipboard()

This will work if every data field has a value and even missing data are given some values(eg NA or -999) If the data were entered in a spreadsheet and the missing values

8

were just empty cells then the data should be read in as a tab delimited or by using thereadclipboardtab function

gt mydata lt- readclipboard(sep=t) define the tab option or

gt mytabdata lt- readclipboardtab() just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format)copy to the clipboard and then specify the width of each field (in the example below thefirst variable is 5 columns the second is 2 columns the next 5 are 1 column the last 4 are3 columns)

gt mydata lt- readclipboardfwf(widths=c(52rep(15)rep(34))

32 Basic descriptive statistics

Once the data are read in then describe or describeBy will provide basic descriptivestatistics arranged in a data frame format Consider the data set satact which in-cludes data from 700 web based participants on 3 demographic variables and 3 abilitymeasures

describe reports means standard deviations medians min max range skew kurtosisand standard errors for integer or real data Non-numeric data although the statisticsare meaningless will be treated as if numeric (based upon the categorical coding ofthe data) and will be flagged with an

describeBy reports descriptive statistics broken down by some categorizing variable (eggender age etc)

gt library(psych)

gt data(satact)

gt describe(satact) basic descriptive statistics

vars n mean sd median trimmed mad min max range skew

gender 1 700 165 048 2 168 000 1 2 1 -061

education 2 700 316 143 3 331 148 0 5 5 -068

age 3 700 2559 950 22 2386 593 13 65 52 164

ACT 4 700 2855 482 29 2884 445 3 36 33 -066

SATV 5 700 61223 11290 620 61945 11861 200 800 600 -064

SATQ 6 687 61022 11564 620 61725 11861 200 800 600 -059

kurtosis se

gender -162 002

education -007 005

age 242 036

ACT 053 018

SATV 033 427

SATQ -002 441

These data may then be analyzed by groups defined in a logical statement or by some othervariable Eg break down the descriptive data for males or females These descriptivedata can also be seen graphically using the errorbarsby function (Figure 5) By settingskew=FALSE and ranges=FALSE the output is limited to the most basic statistics

9

gt basic descriptive statistics by a grouping variable

gt describeBy(satactsatact$genderskew=FALSEranges=FALSE)

$`1`vars n mean sd se

gender 1 247 100 000 000

education 2 247 300 154 010

age 3 247 2586 974 062

ACT 4 247 2879 506 032

SATV 5 247 61511 11416 726

SATQ 6 245 63587 11602 741

$`2`vars n mean sd se

gender 1 453 200 000 000

education 2 453 326 135 006

age 3 453 2545 937 044

ACT 4 453 2842 469 022

SATV 5 453 61066 11231 528

SATQ 6 442 59600 11307 538

attr(call)

bydataframe(data = x INDICES = group FUN = describe type = type

skew = FALSE ranges = FALSE)

The output from the describeBy function can be forced into a matrix form for easy analysisby other programs In addition describeBy can group by several grouping variables at thesame time

gt samat lt- describeBy(satactlist(satact$gendersatact$education)

+ skew=FALSEranges=FALSEmat=TRUE)

gt headTail(samat)

item group1 group2 vars n mean sd se

gender1 1 1 0 1 27 1 0 0

gender2 2 2 0 1 30 2 0 0

gender3 3 1 1 1 20 1 0 0

gender4 4 2 1 1 25 2 0 0

ltNAgt ltNAgt ltNAgt

SATQ9 69 1 4 6 51 6359 10412 1458

SATQ10 70 2 4 6 86 59759 10624 1146

SATQ11 71 1 5 6 46 65783 8961 1321

SATQ12 72 2 5 6 93 60672 10555 1095

321 Outlier detection using outlier

One way to detect unusual data is to consider how far each data point is from the mul-tivariate centroid of the data That is find the squared Mahalanobis distance for eachdata point and then compare these to the expected values of χ2 This produces a Q-Q(quantle-quantile) plot with the n most extreme data points labeled (Figure 1) The outliervalues are in the vector d2

10

gt png( outlierpng )

gt d2 lt- outlier(satactcex=8)

gt devoff()

null device

1

Figure 1 Using the outlier function to graphically show outliers The y axis is theMahalanobis D2 the X axis is the distribution of χ2 for the same number of degrees offreedom The outliers detected here may be shown graphically using pairspanels (see2 and may be found by sorting d2

11

322 Basic data cleaning using scrub

If after describing the data it is apparent that there were data entry errors that need tobe globally replaced with NA or only certain ranges of data will be analyzed the data canbe ldquocleanedrdquo using the scrub function

Consider a data set of 10 rows of 12 columns with values from 1 - 120 All values of columns3 - 5 that are less than 30 40 or 50 respectively or greater than 70 in any of the threecolumns will be replaced with NA In addition any value exactly equal to 45 will be setto NA (max and isvalue are set to one value here but they could be a different value forevery column)

gt x lt- matrix(1120ncol=10byrow=TRUE)

gt colnames(x) lt- paste(V110sep=)gt newx lt- scrub(x35min=c(304050)max=70isvalue=45newvalue=NA)

gt newx

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10

[1] 1 2 NA NA NA 6 7 8 9 10

[2] 11 12 NA NA NA 16 17 18 19 20

[3] 21 22 NA NA NA 26 27 28 29 30

[4] 31 32 33 NA NA 36 37 38 39 40

[5] 41 42 43 44 NA 46 47 48 49 50

[6] 51 52 53 54 55 56 57 58 59 60

[7] 61 62 63 64 65 66 67 68 69 70

[8] 71 72 NA NA NA 76 77 78 79 80

[9] 81 82 NA NA NA 86 87 88 89 90

[10] 91 92 NA NA NA 96 97 98 99 100

[11] 101 102 NA NA NA 106 107 108 109 110

[12] 111 112 NA NA NA 116 117 118 119 120

Note that the number of subjects for those columns has decreased and the minimums havegone up but the maximums down Data cleaning and examination for outliers should be aroutine part of any data analysis

323 Recoding categorical variables into dummy coded variables

Sometimes categorical variables (eg college major occupation ethnicity) are to be ana-lyzed using correlation or regression To do this one can form ldquodummy codesrdquo which aremerely binary variables for each category This may be done using dummycode Subse-quent analyses using these dummy coded variables may be using biserial or point biserial(regular Pearson r) to show effect sizes and may be plotted in eg spider plots

33 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well ascommunicating important results Scatter Plot Matrices (SPLOMS) using the pairspanels

12

function are useful ways to look for strange effects involving outliers and non-linearitieserrorbarsby will show group means with 95 confidence boundaries By default er-rorbarsby and errorbars will show ldquocats eyesrdquo to graphically show the confidencelimits (Figure 5) This may be turned off by specifying eyes=FALSE densityBy or vio-

linBy may be used to show the distribution of the data in ldquoviolinrdquo plots (Figure 4) (Theseare sometimes called ldquolava-lamprdquo plots)

331 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data The pairspanelsfunction adapted from the help menu for the pairs function produces xy scatter plots ofeach pair of variables below the diagonal shows the histogram of each variable on thediagonal and shows the lowess locally fit regression line as well An ellipse around themean with the axis length reflecting one standard deviation of the x and y variables is alsodrawn The x axis in each scatter plot represents the column variable the y axis the rowvariable (Figure 2) When plotting many subjects it is both faster and cleaner to set theplot character (pch) to be rsquorsquo (See Figure 2 for an example)

pairspanels will show the pairwise scatter plots of all the variables as well as his-tograms locally smoothed regressions and the Pearson correlation When plottingmany data points (as in the case of the satact data it is possible to specify that theplot character is a period to get a somewhat cleaner graphic However in this figureto show the outliers we use colors and a larger plot character

Another example of pairspanels is to show differences between experimental groupsConsider the data in the affect data set The scores reflect post test scores on positiveand negative affect and energetic and tense arousal The colors show the results for fourmovie conditions depressing frightening movie neutral and a comedy

332 Density or violin plots

Graphical presentation of data may be shown using box plots to show the median and 25thand 75th percentiles A powerful alternative is to show the density distribution using theviolinBy function (Figure 4)

13

gt png( pairspanelspng )

gt satd2 lt- dataframe(satactd2) combine the d2 statistics from before with the satact dataframe

gt pairspanels(satd2bg=c(yellowblue)[(d2 gt 25)+1]pch=21)

gt devoff()

null device

1

Figure 2 Using the pairspanels function to graphically show relationships The x axisin each scatter plot represents the column variable the y axis the row variable Note theextreme outlier for the ACT If the plot character were set to a period (pch=rsquorsquo) it wouldmake a cleaner graphic but in to show the outliers in color we use the plot characters 21and 22

14

gt png(affectpng)gt pairspanels(affect[1417]bg=c(redblackwhiteblue)[affect$Film]pch=21

+ main=Affect varies by movies )

gt devoff()

null device

1

Figure 3 Using the pairspanels function to graphically show relationships The x axis ineach scatter plot represents the column variable the y axis the row variable The coloringrepresent four different movie conditions

15

gt data(satact)

gt violinBy(satact[56]satact$gendergrpname=c(M F)main=Density Plot by gender for SAT V and Q)

Density Plot by gender for SAT V and Q

Obs

erve

d

SATV M SATV F SATQ M SATQ F

200

300

400

500

600

700

800

Figure 4 Using the violinBy function to show the distribution of SAT V and Q for malesand females The plot shows the medians and 25th and 75th percentiles as well as theentire range and the density distribution

16

333 Means and error bars

Additional descriptive graphics include the ability to draw error bars on sets of data aswell as to draw error bars in both the x and y directions for paired data These are thefunctions errorbars errorbarsby errorbarstab and errorcrosses

errorbars show the 95 confidence intervals for each variable in a data frame or ma-trix These errors are based upon normal theory and the standard errors of the meanAlternative options include +- one standard deviation or 1 standard error If thedata are repeated measures the error bars will be reflect the between variable cor-relations By default the confidence intervals are displayed using a ldquocats eyesrdquo plotwhich emphasizes the distribution of confidence within the confidence interval

errorbarsby does the same but grouping the data by some condition

errorbarstab draws bar graphs from tabular data with error bars based upon thestandard error of proportion (σp =

radicpqN)

errorcrosses draw the confidence intervals for an x set and a y set of the same size

The use of the errorbarsby function allows for graphic comparisons of different groups(see Figure 5) Five personality measures are shown as a function of high versus low scoreson a ldquolierdquo scale People with higher lie scores tend to report being more agreeable consci-entious and less neurotic than people with lower lie scores The error bars are based uponnormal theory and thus are symmetric rather than reflect any skewing in the data

Although not recommended it is possible to use the errorbars function to draw bargraphs with associated error bars (This kind of dynamite plot (Figure 7) can be verymisleading in that the scale is arbitrary Go to a discussion of the problems in presentingdata this way at httpemdbolkerwikidotcomblogdynamite In the example shownnote that the graph starts at 0 although is out of the range This is a function of usingbars which always are assumed to start at zero Consider other ways of showing yourdata

334 Error bars for tabular data

However it is sometimes useful to show error bars for tabular data either found by thetable function or just directly input These may be found using the errorbarstab

function

17

gt data(epibfi)

gt errorbarsby(epibfi[610]epibfi$epilielt4)

095 confidence limits

Independent Variable

Dep

ende

nt V

aria

ble

bfagree bfcon bfext bfneur bfopen

050

100

150

Figure 5 Using the errorbarsby function shows that self reported personality scales onthe Big Five Inventory vary as a function of the Lie scale on the EPI The ldquocats eyesrdquo showthe distribution of the confidence

18

gt errorbarsby(satact[56]satact$genderbars=TRUE

+ labels=c(MaleFemale)ylab=SAT scorexlab=)

Male Female

095 confidence limits

SAT

sco

re

200

300

400

500

600

700

800

200

300

400

500

600

700

800

Figure 6 A ldquoDynamite plotrdquo of SAT scores as a function of gender is one way of misleadingthe reader By using a bar graph the range of scores is ignored Bar graphs start from 0

19

gt T lt- with(satacttable(gendereducation))

gt rownames(T) lt- c(MF)

gt errorbarstab(Tway=bothylab=Proportion of Education Levelxlab=Level of Education

+ main=Proportion of sample by education level)

Proportion of sample by education level

Level of Education

Pro

port

ion

of E

duca

tion

Leve

l

000

005

010

015

020

025

030

M 0 M 1 M 2 M 3 M 4 M 5

000

005

010

015

020

025

030

Figure 7 The proportion of each education level that is Male or Female By using theway=rdquobothrdquo option the percentages and errors are based upon the grand total Alterna-tively way=rdquocolumnsrdquo finds column wise percentages way=rdquorowsrdquo finds rowwise percent-ages The data can be converted to percentages (as shown) or by total count (raw=TRUE)The function invisibly returns the probabilities and standard errors See the help menu foran example of entering the data as a dataframe

20

335 Two dimensional displays of means and errors

Yet another way to display data for different conditions is to use the errorCrosses func-tion For instance the effect of various movies on both ldquoEnergetic Arousalrdquo and ldquoTenseArousalrdquo can be seen in one graph and compared to the same movie manipulations onldquoPositive Affectrdquo and ldquoNegative Affectrdquo Note how Energetic Arousal is increased by threeof the movie manipulations but that Positive Affect increases following the Happy movieonly

21

gt op lt- par(mfrow=c(12))

gt data(affect)

gt colors lt- c(blackredwhiteblue)

gt films lt- c(SadHorrorNeutralHappy)

gt affectstats lt- errorCircles(EA2TA2data=affect[-c(120)]group=Filmlabels=films

+ xlab=Energetic Arousal ylab=Tense Arousalylim=c(1022)xlim=c(820)pch=16

+ cex=2colors=colors main = Movies effect on arousal)gt errorCircles(PA2NA2data=affectstatslabels=filmsxlab=Positive Affect

+ ylab=Negative Affect pch=16cex=2colors=colors main =Movies effect on affect)

gt op lt- par(mfrow=c(11))

8 12 16 20

1012

1416

1820

22

Movies effect on arousal

Energetic Arousal

Tens

e A

rous

al

SadHorror

NeutralHappy

6 8 10 12

24

68

10

Movies effect on affect

Positive Affect

Neg

ativ

e A

ffect

Sad

Horror

NeutralHappy

Figure 8 The use of the errorCircles function allows for two dimensional displays ofmeans and error bars The first call to errorCircles finds descriptive statistics for theaffect dataframe based upon the grouping variable of Film These data are returned andthen used by the second call which examines the effect of the same grouping variable upondifferent measures The size of the circles represent the relative sample sizes for each groupThe data are from the PMC lab and reported in Smillie et al (2012)

22

336 Back to back histograms

The bibars function summarize the characteristics of two groups (eg males and females)on a second variable (eg age) by drawing back to back histograms (see Figure 9)

gt data(bfi)

gt with(bfibibars(agegenderylab=Agemain=Age by males and females))

Age by males and females

Frequency

Age

Age by males and females

0

20

40

60

80

100

100 50 0 50 100

Figure 9 A bar plot of the age distribution for males and females shows the use of bibarsThe data are males and females from 2800 cases collected using the SAPA procedure andare available as part of the bfi data set

23

337 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.
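For example, a minimal sketch of using lowerMat on its own (sat.act is the same example data set used just below):

> R <- cor(sat.act, use="pairwise")  #the full, rectangular correlation matrix
> lowerMat(R)                        #round to 2 digits and show the lower off diagonal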

> lowerCor(sat.act)
          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower, upper)
> round(both, 2)

          education   age   ACT  SATV  SATQ
education        NA  0.52  0.16  0.07  0.03
age            0.61    NA  0.08 -0.03 -0.09
ACT            0.16  0.15    NA  0.53  0.58
SATV           0.02 -0.06  0.61    NA  0.63
SATQ           0.08  0.04  0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:


> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)

          education   age   ACT  SATV  SATQ
education        NA  0.09  0.00 -0.05  0.05
age            0.61    NA  0.07 -0.03  0.13
ACT            0.16  0.15    NA  0.08  0.02
SATV           0.02 -0.06  0.61    NA  0.05
SATQ           0.08  0.04  0.60  0.68    NA

3.3.8 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 10), or a simulated data set of 24 variables with a circumplex structure (Figure 11). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

Yet another way to show structure is to use "spider" plots. Particularly if variables are ordered in some meaningful way (e.g., in a circumplex), a spider plot will show this structure easily. This is just a plot of the magnitude of the correlation as a radial line, with length ranging from 0 (for a correlation of -1) to 1 (for a correlation of 1). (See Figure 12.)

3.4 Testing correlations

Correlations are wonderful descriptive statistics of the data, but some people like to test whether these correlations differ from zero or differ from each other. The cor.test function (in the stats package) will test the significance of a single correlation, and the rcorr function in the Hmisc package will do this for many correlations. In the psych package, the corr.test function reports the correlation (Pearson, Spearman, or Kendall) between all variables in either one or two data frames or matrices, as well as the number of observations for each case and the (two-tailed) probability for each correlation. Unfortunately, these probability values have not been corrected for multiple comparisons and so should be taken with a great deal of salt. Thus, in corr.test and corr.p the raw probabilities are reported below the diagonal, and the probabilities adjusted for multiple comparisons using (by default) the Holm correction are reported above the diagonal (Table 1). (See the p.adjust function for a discussion of Holm (1979) and other corrections.)
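A sketch of how corr.p may be applied to an already computed correlation matrix (the sample size of 700 is that of the sat.act example shown in Table 1):

> rs <- corr.test(sat.act)   #find the correlations, sample sizes, and probabilities
> corr.p(rs$r, n = 700)      #adjusted probabilities from the correlation matrix alone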

Testing the difference between any two correlations can be done using the r.test function. The function actually does four different tests, based upon an article by Steiger (1980),


> png('corplot.png')
> corPlot(Thurstone, numbers=TRUE, upper=FALSE, diag=FALSE,
+     main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 10: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and the correlations are then shown by color. By using the 'numbers' option, the values are displayed as well. By default, the complete matrix is shown. Setting upper=FALSE and diag=FALSE shows a cleaner figure.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ, main="24 variables in a circumplex")
> dev.off()
null device
          1

Figure 11: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect. For circumplex structures, it is perhaps useful to show the complete matrix.


> png('spider.png')
> op <- par(mfrow=c(2,2))
> spider(y=c(1,6,12,18), x=1:24, data=r.circ, fill=TRUE, main="Spider plot of 24 circumplex variables")
> op <- par(mfrow=c(1,1))
> dev.off()
null device
          1

Figure 12: A spider plot can show circumplex structure very clearly. Circumplex structures are common in the study of affect.


Table 1: The corr.test function reports correlations, cell sizes, and raw and adjusted probability values. corr.p reports the probability values for a correlation matrix. By default, the adjustment used is that of Holm (1979).
> corr.test(sat.act)

Call:corr.test(x = sat.act)
Correlation matrix
          gender education   age   ACT  SATV  SATQ
gender      1.00      0.09 -0.02 -0.04 -0.02 -0.17
education   0.09      1.00  0.55  0.15  0.05  0.03
age        -0.02      0.55  1.00  0.11 -0.04 -0.03
ACT        -0.04      0.15  0.11  1.00  0.56  0.59
SATV       -0.02      0.05 -0.04  0.56  1.00  0.64
SATQ       -0.17      0.03 -0.03  0.59  0.64  1.00
Sample Size
          gender education age ACT SATV SATQ
gender       700       700 700 700  700  687
education    700       700 700 700  700  687
age          700       700 700 700  700  687
ACT          700       700 700 700  700  687
SATV         700       700 700 700  700  687
SATQ         687       687 687 687  687  687
Probability values (Entries above the diagonal are adjusted for multiple tests.)
          gender education  age  ACT SATV SATQ
gender      0.00      0.17 1.00 1.00    1    0
education   0.02      0.00 0.00 0.00    1    1
age         0.58      0.00 0.00 0.03    1    1
ACT         0.33      0.00 0.00 0.00    0    0
SATV        0.62      0.22 0.26 0.00    0    0
SATQ        0.00      0.36 0.37 0.00    0    0

To see confidence intervals of the correlations, print with the short=FALSE option.


depending upon the input:

1) For a sample size n, find the t and p value for a single correlation as well as the confidence interval.

> r.test(50, .3)
Correlation tests
Call:r.test(n = 50, r12 = 0.3)
Test of significance of a correlation
t value 2.18 with probability < 0.034
and confidence interval 0.02 0.53

2) For sample sizes of n and n2 (n2 = n if not specified), find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores.

> r.test(30, .4, .6)
Correlation tests
Call:r.test(n = 30, r12 = 0.4, r34 = 0.6)
Test of difference between two independent correlations
z value 0.99 with probability 0.32

3) For sample size n, and correlations ra = r12, rb = r23 and r13 specified, test for the difference of two dependent correlations (Steiger case A).

> r.test(103, .4, .5, .1)
Correlation tests
Call:[1] "r.test(n = 103, r12 = 0.4, r23 = 0.1, r13 = 0.5)"
Test of difference between two correlated correlations
t value -0.89 with probability < 0.37

4) For sample size n, test for the difference between two dependent correlations involving different variables (Steiger case B).

> r.test(103, .5, .6, .7, .5, .5, .8)   #Steiger Case B
Correlation tests
Call:r.test(n = 103, r12 = 0.5, r34 = 0.6, r23 = 0.7, r13 = 0.5, r14 = 0.5,
    r24 = 0.8)
Test of difference between two dependent correlations
z value -1.2 with probability 0.23

To test whether a matrix of correlations differs from what would be expected if the population correlations were all zero, the function cortest follows Steiger (1980), who pointed out that the sum of the squared elements of a correlation matrix, or the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix). This is particularly useful for examining whether correlations in a single matrix differ from zero or for comparing two matrices. Although obvious, cortest can be used to test whether the sat.act data matrix produces non-zero correlations (it does). This is a much more appropriate test when testing whether a residual matrix differs from zero.

> cortest(sat.act)


Tests of correlation matrices
Call:cortest(R1 = sat.act)
Chi Square value 1325.42 with df = 15   with probability < 1.8e-273

3.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process (Figure 13). This is also shown in terms of dichotomizing the bivariate normal density function using the draw.cor function (Figure 14). A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

The correlation matrix resulting from a number of tetrachoric or polychoric correlations will sometimes not be positive semi-definite. This will sometimes happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the data set burt, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
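A minimal sketch of this smoothing process, using the burt data set just mentioned (the claim that its smallest eigen value is slightly negative may be checked directly):

> data(burt)
> round(eigen(burt)$values, 3)        #note the slightly negative smallest eigen value
> burt.smoothed <- cor.smooth(burt)   #adjust and rescale the eigen values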

3.6 Multiple regression from data or correlation matrices

The typical application of the lm function is to do a linear model of one Y variable as a function of multiple X variables. Because lm is designed to analyze complex interactions, it requires raw data as input. It is, however, sometimes convenient to do multiple regression from a correlation or covariance matrix. The setCor function will do this, taking a set of y variables predicted from a set of x variables, perhaps with a set of z covariates removed from both x and y. Consider the Thurstone correlation matrix and find the multiple correlation of the last five variables as a function of the first four.

> setCor(y = 5:9, x = 1:4, data = Thurstone)


> draw.tetra()

[Figure: a bivariate normal distribution with ρ = 0.5, cut at τ for X and Τ for Y; the four resulting cells (X < τ, Y < Τ, etc.) give an observed φ = 0.33.]

Figure 13: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values.


> draw.cor(expand=20, cuts=c(0,0))

[Figure: bivariate density surface with ρ = 0.5.]

Figure 14: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values. It is found (laboriously) by optimizing the fit of the bivariate normal for various values of the correlation to the observed cell frequencies.


Call: setCor(y = 5:9, x = 1:4, data = Thurstone)
Multiple Regression from matrix input
Beta weights
                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sentences                    0.09     0.07          0.25      0.21         0.20
Vocabulary                   0.09     0.17          0.09      0.16        -0.02
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31
Multiple R
Four.Letter.Words  Suffixes  Letter.Series  Pedigrees  Letter.Group
             0.69      0.63           0.50       0.58          0.48
multiple R2
Four.Letter.Words  Suffixes  Letter.Series  Pedigrees  Letter.Group
             0.48      0.40           0.25       0.34          0.23
Unweighted multiple R
Four.Letter.Words  Suffixes  Letter.Series  Pedigrees  Letter.Group
             0.59      0.58           0.49       0.58          0.45
Unweighted multiple R2
Four.Letter.Words  Suffixes  Letter.Series  Pedigrees  Letter.Group
             0.34      0.34           0.24       0.33          0.20
Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.6280 0.1478 0.0076 0.0049
Average squared canonical correlation = 0.2
Cohen's Set Correlation R2 = 0.69
Unweighted correlation between the two sets = 0.73

By specifying the number of subjects in the correlation matrix, appropriate estimates of standard errors, t-values, and probabilities are also found. The next example finds the regressions with variables 1 and 2 used as covariates. The β weights for variables 3 and 4 do not change, but the multiple correlation is much less. It also shows how to find the residual correlations between variables 5-9 with variables 1-4 removed.

> sc <- setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)

Call: setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)
Multiple Regression from matrix input
Beta weights
                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31
Multiple R
Four.Letter.Words  Suffixes  Letter.Series  Pedigrees  Letter.Group
             0.58      0.46           0.21       0.18          0.30
multiple R2
Four.Letter.Words  Suffixes  Letter.Series  Pedigrees  Letter.Group
            0.331     0.210          0.043      0.032         0.092
Unweighted multiple R
Four.Letter.Words  Suffixes  Letter.Series  Pedigrees  Letter.Group
             0.44      0.35           0.17       0.14          0.26
Unweighted multiple R2
Four.Letter.Words  Suffixes  Letter.Series  Pedigrees  Letter.Group
             0.19      0.12           0.03       0.02          0.07
Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.405 0.023
Average squared canonical correlation = 0.21
Cohen's Set Correlation R2 = 0.42
Unweighted correlation between the two sets = 0.48

> round(sc$residual, 2)

                  Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Four.Letter.Words              0.52     0.11          0.09      0.06         0.13
Suffixes                       0.11     0.60         -0.01      0.01         0.03
Letter.Series                  0.09    -0.01          0.75      0.28         0.37
Pedigrees                      0.06     0.01          0.28      0.66         0.20
Letter.Group                   0.13     0.03          0.37      0.20         0.77

3.7 Mediation and Moderation analysis

Although multiple regression is a straightforward method for determining the effect of multiple predictors (x1, x2, ..., xi) on a criterion variable y, some prefer to think of the effect of one predictor, x, as mediated by another variable, m (Preacher and Hayes, 2004). Thus, we may find the indirect path from x to m and then from m to y, as well as the direct path from x to y. Call these paths a, b, and c, respectively. Then the indirect effect of x on y through m is just ab, and the direct effect is c. Statistical tests of the ab effect are best done by bootstrapping.

Consider the example from Preacher and Hayes (2004) as analyzed using the mediate function and the subsequent graphic from mediate.diagram. The data are found in the example for mediate.
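The call that produced the following output (reconstructed from the Call line echoed below) was:

> mediate(y = 1, x = 2, m = 3, data = sobel)   #the ab effect is bootstrapped by default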

Call: mediate(y = 1, x = 2, m = 3, data = sobel)
The DV (Y) was SATIS. The IV (X) was THERAPY. The mediating variable(s) = ATTRIB.
Total Direct effect (c) of THERAPY on SATIS = 0.76   S.E. = 0.31  t direct = 2.5  with probability = 0.019
Direct effect (c') of THERAPY on SATIS removing ATTRIB = 0.43   S.E. = 0.32  t direct = 1.35  with probability = 0.19
Indirect effect (ab) of THERAPY on SATIS through ATTRIB = 0.33
Mean bootstrapped indirect effect = 0.33 with standard error = 0.17  Lower CI = 0.04  Upper CI = 0.71
R2 of model = 0.31
To see the longer output, specify short = FALSE in the print statement
Full output
Total effect estimates (c)
        SATIS   se   t   Prob
THERAPY  0.76 0.31 2.5 0.0186
Direct effect estimates (c')
        SATIS   se    t  Prob
THERAPY  0.43 0.32 1.35 0.190
ATTRIB   0.40 0.18 2.23 0.034
'a' effect estimates
       THERAPY  se    t   Prob
ATTRIB    0.82 0.3 2.74 0.0106
'b' effect estimates
       SATIS   se    t  Prob
ATTRIB   0.4 0.18 2.23 0.034
'ab' effect estimates
        SATIS boot   sd lower upper
THERAPY  0.33 0.33 0.17  0.04  0.71

4 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.


> mediate.diagram(preacher)


Figure 15: A mediated model taken from Preacher and Hayes (2004) and solved using the mediate function. The direct path from Therapy to Satisfaction has an effect of 0.76, while the indirect path through Attribution has an effect of 0.33. Compare this to the normal regression graphic created by setCor.diagram.


> preacher <- setCor(1, c(2,3), sobel, std=FALSE)
> setCor.diagram(preacher)

[Figure: regression model with paths 0.43, 0.4, and 0.21; unweighted matrix correlation = 0.56.]

Figure 16: The conventional regression model for the Preacher and Hayes (2004) data set solved using the setCor function. Compare this to the previous figure.


4.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity [1]. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs [2]. At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented, and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

fa.poly is useful when finding the factor structure of categorical items. fa.poly first finds the tetrachoric or polychoric correlations between the categorical variables and then proceeds to do a normal factor analysis. By setting the n.iter option to be greater

[1] Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

[2] Cattell (1978), as well as MacCallum et al. (2007), argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.


than 1, it will also find confidence intervals for the factor solution. Warning: finding polychoric correlations is very slow, so think carefully before doing so.

factor.minres (deprecated) Minimum residual factor analysis is a least squares, iterative solution to the factor problem. minres attempts to minimize the residual (off-diagonal) correlation matrix. It produces solutions similar to maximum likelihood solutions, but will work even if the matrix is singular.

factor.pa (deprecated) Principal Axis factor analysis is a least squares, iterative solution to the factor problem. PA will work for cases where maximum likelihood techniques (factanal) will not work. The original communality estimates are either the squared multiple correlations (smc) for each item or 1.

factor.wls (deprecated) Weighted least squares factor analysis is a least squares, iterative solution to the factor problem. It minimizes the (weighted) squared residual matrix. The weights are based upon the independent contribution of each variable.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values. Note that PCA is not the same as factor analysis and the two should not be confused.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

nfactors combines VSS, MAP, and a number of other fit statistics. The depressing reality is that frequently these conventional fit estimates of the number of factors do not agree.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fails to increase.

4.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the n × n correlation matrix R be approximated by the product of an n × k factor matrix F and its transpose, plus a diagonal matrix of uniquenesses:

R = FF′ + U²     (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 2). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", "varimin" and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ" and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation. The measures of factor adequacy reflect the


multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

Note that if extracting more than one factor and doing any oblique rotation, it is necessary to have the GPArotation package installed. This is checked for in the appropriate functions.

4.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm (included in fa with the fm="pa" option, or as a standalone function, factor.pa) does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.

The solutions from the fa, the factor.minres and factor.pa, as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax, quartimax, varimin and bifactor. Oblique transformations include oblimin, quartimin, biquartimin and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 3).

4.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 4).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 17). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 17) or by a structure diagram (Figure 18).

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution, and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 2: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> if(!require(GPArotation)) {stop("GPArotation must be installed to do rotations")} else {
+   f3t <- fa(Thurstone, 3, n.obs=213)
+   f3t }

Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    MR1   MR2   MR3   h2   u2 com
Sentences          0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary         0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion    0.83  0.04  0.00 0.73 0.27 1.0
First.Letters      0.00  0.86  0.00 0.73 0.27 1.0
Four.Letter.Words -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes           0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series      0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees          0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group      -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.
The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01
The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01
The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.82 with prob < 1
Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> if(!require(GPArotation)) {stop("GPArotation must be installed to do rotations")} else {
+   f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
+   f3o <- target.rot(f3)
+   f3o }

Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                    PA1   PA2   PA3   h2   u2
Sentences          0.89 -0.03  0.07 0.81 0.19
Vocabulary         0.89  0.07  0.00 0.80 0.20
Sent.Completion    0.83  0.04  0.03 0.70 0.30
First.Letters     -0.02  0.85 -0.01 0.73 0.27
Four.Letter.Words -0.05  0.74  0.09 0.57 0.43
Suffixes           0.17  0.63 -0.09 0.43 0.57
Letter.Series     -0.06 -0.08  0.84 0.69 0.31
Pedigrees          0.33 -0.09  0.48 0.37 0.63
Letter.Group      -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


Table 4: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone, 3, n.obs = 213, fm="wls")
> print(f3w, cut=0, digits=3)

Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                    WLS1   WLS2   WLS3    h2    u2  com
Sentences          0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary         0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion    0.833  0.034  0.007 0.735 0.265 1.00
First.Letters     -0.002  0.855  0.003 0.731 0.269 1.00
Four.Letter.Words -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes           0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series      0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees          0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group      -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.
The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014
The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01
The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.886 with prob < 0.996
Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                                WLS1  WLS2  WLS3
Correlation of scores with factors             0.964 0.923 0.902
Multiple R square of scores with factors       0.929 0.853 0.814
Minimum correlation of possible factor scores  0.858 0.706 0.627


> plot(f3t)


Figure 17: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.


> fa.diagram(f3t)


Figure 18: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

4.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Some psychologists use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 5 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.

4.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 18) can themselves be factored. This results in a higher order factor model (Figure 19). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 20). Yet another solution is to use structural equation modeling to directly solve for the general and group factors.


Table 5: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 3. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone, 3, n.obs = 213, rotate="Promax")
> p3p

Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    RC1   RC2   RC3   h2   u2 com
Sentences          0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary         0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion    0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters      0.01  0.84  0.07 0.78 0.22 1.0
Four.Letter.Words -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes           0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series      0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees          0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group      -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.
The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 56.17 with prob < 1.1e-07
Fit based upon off diagonal values = 0.98


> om.h <- omega(Thurstone, n.obs=213, sl=FALSE)
> op <- par(mfrow=c(1,1))


Figure 19: A higher order factor solution to the Thurstone 9 variable problem.


> om <- omega(Thurstone, n.obs=213)


Figure 20: A bifactor solution to the Thurstone 9 variable problem.


Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011).
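A minimal sketch, assuming the rotate option of fa accepts "bifactor" (the orthogonal bifactor criterion):

> f3b <- fa(Thurstone, 3, n.obs=213, rotate="bifactor")   #assumes rotate="bifactor" is available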

4.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust, if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.


iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 21). Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 8). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])


Figure 21: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.


Table 6: The summary statistics from an iclust analysis show three large clusters and one smaller cluster.
> summary(ic)  #show the results

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST
Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61
Guttman Lambda6:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61
Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27
Cluster size:
C20 C16 C15 C21
 10   5   5   5
Purified scale intercorrelations, reliabilities on diagonal,
correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


The previous analysis (Figure 21) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 22). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working, using progressBar.

Table 7: The polychoric and the tetrachoric functions can take a long time to finish, and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)
> r.poly <- polychoric(bfi[1:25])  #the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

4.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 9). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

Bootstrapped confidence intervals can also be found for the loadings of a factoring of a polychoric matrix. fa.poly will find the polychoric correlation matrix and, if the n.iter option is greater than 1, will then randomly resample the data (case wise) to give bootstrapped samples. This will take a long time for large numbers of items or iterations.
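A minimal sketch (the choice of 10 items, 2 factors, and 20 iterations is purely illustrative, and the run will be slow):

> fp <- fa.poly(bfi[1:10], 2, n.iter=20)   #bootstrapped factoring of polychoric correlations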

4.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of Burt's congruence coefficient (also known as Tucker's


> ic.poly <- iclust(r.poly$rho, title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)


Figure 22: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 21), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho, 5, title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)


Figure 23: ICLUST of the BFI data set using polychoric correlations, with the solution set to 5 clusters. Compare this solution to the previous one (Figure 22), which was done without specifying the number of clusters, and to the next one (Figure 24), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho, beta.size=3, title="ICLUST, beta.size=3")


Figure 24: ICLUST of the BFI data set using polychoric correlations, with the beta.size criterion set to 3. Compare this solution to the previous three (Figures 21, 22, 23).


Table 8: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.0.
> print(ic, cut=.3)

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61
G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38
Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27
Cluster size:
C20 C16 C15 C21
 10   5   5   5
Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1  C20 C20
A2  C20 C20  0.59
A3  C20 C20  0.65
A4  C20 C20  0.43
A5  C20 C20  0.65
C1  C15 C15               0.54
C2  C15 C15               0.62
C3  C15 C15               0.54
C4  C15 C15        0.31  -0.66
C5  C15 C15 -0.30  0.36  -0.59
E1  C20 C20 -0.50
E2  C20 C20 -0.61  0.34
E3  C20 C20  0.59               -0.39
E4  C20 C20  0.66
E5  C20 C20  0.50         0.40  -0.32
N1  C16 C16         0.76
N2  C16 C16         0.75
N3  C16 C16         0.74
N4  C16 C16 -0.34   0.62
N5  C16 C16         0.55
O1  C20 C21                     -0.53
O2  C21 C21                      0.44
O3  C20 C21  0.39               -0.62
O4  C21 C21                     -0.33
O5  C21 C21                      0.53
With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5
Purified scale intercorrelations, reliabilities on diagonal,
correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61
Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL


Table 9: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.
> fa(bfi[1:10], 2, n.iter=20)

Factor Analysis with confidence intervals using method = fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17
The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05
The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with Likelihood Chi Square = 464.04 with prob < 9.2e-82
Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.006 and the 90% confidence intervals are 0.006 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.44 -0.40 -0.37
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.07 -0.03  0.00  0.73  0.77  0.80
A4  0.10  0.15  0.20  0.40  0.44  0.48
A5 -0.02  0.02  0.06  0.57  0.62  0.67
C1  0.54  0.57  0.60 -0.09 -0.05 -0.02
C2  0.58  0.62  0.67 -0.04 -0.01  0.02
C3  0.48  0.54  0.58  0.00  0.03  0.08
C4 -0.69 -0.66 -0.60 -0.03  0.01  0.04
C5 -0.62 -0.57 -0.52 -0.08 -0.05 -0.02

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.27     0.32  0.37


coefficient), which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}

Consider the case of a four factor solution and four cluster solution to the Big Five problem:

> f4 <- fa(bfi[1:25], 4, fm="pa")
> factor.congruence(f4, ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor, and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 10).

Table 10: Congruence coefficients for oblique factor, bifactor, and component solutions for the Thurstone problem.
> factor.congruence(list(f3t, f3o, om, p3p))

     MR1  MR2  MR3  PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2   RC3
MR1 1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
MR2 0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
MR3 0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
PA1 1.00 0.03 0.01 1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99  0.05
PA3 0.13 0.06 1.00 0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01  0.99
g   0.72 0.60 0.52 0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58  0.50
F1  1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
F2  0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
F3  0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
h2  0.74 0.57 0.51 0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56  0.49
RC1 1.00 0.04 0.06 1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06  0.00
RC2 0.08 0.99 0.02 0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00  0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

4.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).


Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis), fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects, and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size, in that for large samples the eigen values of random factors will all tend towards 1. The scree test is quite appealing, but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors, and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
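A sketch of how this may be explored by simulation (the numbers of variables, factors, and subjects are arbitrary, and it is assumed that sim.minor returns the simulated raw data in its observed element when n > 0):

> set.seed(17)                                     #arbitrary seed, for reproducibility
> sim <- sim.minor(nvar = 24, nfact = 4, n = 400)  #4 major factors plus many minor factors
> fa.parallel(sim$observed)                        #how many factors are suggested?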


4.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Average Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 25). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[Figure 25 here: a plot titled "Very Simple Structure of a Big 5 inventory", showing the Very Simple Structure Fit (0 to 1) against the Number of Factors (1 to 8), with separate curves for complexity 1, 2, 3, and 4.]

Figure 25: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory

Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof  chisq     prob sqresid  fit  RMSEA  BIC   SABIC complex
1 0.49 0.00 0.024 275  11831  0.0e+00    26.0 0.49 0.0150 9648 10522.1     1.0
2 0.54 0.63 0.018 251   7279  0.0e+00    18.9 0.63 0.0100 5287  6084.5     1.2
3 0.57 0.69 0.017 228   5010  0.0e+00    14.8 0.71 0.0075 3200  3924.3     1.3
4 0.58 0.74 0.015 206   3366  0.0e+00    11.7 0.77 0.0055 1731  2385.1     1.4
5 0.53 0.73 0.015 185   1750 1.4e-252     9.5 0.81 0.0030  281   869.3     1.6
6 0.54 0.72 0.016 165   1014 4.4e-122     8.4 0.84 0.0018 -296   228.5     1.7
7 0.52 0.70 0.019 146    696  1.4e-72     7.9 0.84 0.0013 -463     1.2     1.9
8 0.52 0.69 0.022 128    492  4.7e-44     7.4 0.85 0.0010 -524  -117.6     1.9
  eChisq  SRMR eCRMS  eBIC
1  23881 0.119 0.125 21698
2  12432 0.086 0.094 10440
3   7232 0.066 0.075  5422
4   3750 0.047 0.057  2115
5   1495 0.030 0.038    27
6    670 0.020 0.027  -639
7    448 0.016 0.023  -711
8    289 0.013 0.020  -727

4.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data, as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 26). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly. By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices, as sketched below.
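A minimal sketch of such an analysis, using the first five bfi items as stand-in polytomous data (the choice of items and the explicit n.iter are assumptions for illustration; the computation is slow):

> #parallel analysis based upon polychoric rather than Pearson correlations
> fa.parallel.poly(bfi[1:5],n.iter=20)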


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")

Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Figure 26 here: a plot titled "Parallel Analysis of a Big 5 inventory", showing eigen values of principal components and factor analysis against the Factor/Component Number, with curves for PC and FA eigen values from the actual, simulated, and resampled data.]

Figure 26: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


4.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 27).

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).

4.6 Exploratory Structural Equation Modeling (ESEM)

Generalizing the procedures of factor extension, we can do Exploratory Structural Equation Modeling (ESEM). Traditional Exploratory Factor Analysis (EFA) examines how latent variables can account for the correlations within a data set. All loadings and cross loadings are found and rotation is done to some approximation of simple structure. Traditional Confirmatory Factor Analysis (CFA) tests such models by fitting just a limited number of loadings and typically does not allow any (or many) cross loadings. Structural Equation Modeling then applies two such measurement models, one to a set of X variables, another to a set of Y variables, and then tries to estimate the correlation between these two sets of latent variables. (Some SEM procedures estimate all the parameters from the same model, thus making the loadings in set Y affect those in set X.) It is possible to do a similar exploratory modeling (ESEM) by conducting two Exploratory Factor Analyses, one in set X, one in set Y, and then finding the correlations of the X factors with the Y factors, as well as the correlations of the Y variables with the X factors and the X variables with the Y factors.

Consider the simulated data set of three ability variables, two motivational variables, and three outcome variables:
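The data set whose population model is printed below may be generated along the following lines (a sketch adapted from the sim.structural documentation; the loadings and the gre.gpa name are inferred from the printed calls and matrix, and the exact naming code is an assumption):

> fx <- matrix(c(.9,.8,.6,rep(0,4),.6,.8,-.7),ncol=2)  #ability and motivation loadings
> rownames(fx) <- c("V","Q","A","nach","Anx")
> fy <- matrix(c(.6,.5,.4),ncol=1)  #outcome loadings
> rownames(fy) <- c("gpa","Pre","MA")
> Phi <- matrix(c(1,0,.7, 0,1,.7, .7,.7,1),ncol=3)  #correlations among the latent factors
> gre.gpa <- sim.structural(fx,Phi,fy)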

Call: sim.structural(fx = fx, Phi = Phi, fy = fy)

$model (Population correlation matrix)

        V    Q    A  nach   Anx   gpa   Pre    MA
V    1.00 0.72 0.54  0.00  0.00  0.38  0.32  0.25
Q    0.72 1.00 0.48  0.00  0.00  0.34  0.28  0.22
A    0.54 0.48 1.00  0.48 -0.42  0.50  0.42  0.34
nach 0.00 0.00 0.48  1.00 -0.56  0.34  0.28  0.22
Anx  0.00 0.00 -0.42 -0.56  1.00 -0.29 -0.24 -0.20
gpa  0.38 0.34 0.50  0.34 -0.29  1.00  0.30  0.24
Pre  0.32 0.28 0.42  0.28 -0.24  0.30  1.00  0.20
MA   0.25 0.22 0.34  0.22 -0.20  0.24  0.20  1.00


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Figure 27 here: a path diagram titled "Factor analysis and extension", showing factors MR1 and MR2 defined by the odd-numbered variables (loadings roughly -0.7 to 0.6) and extended to the even-numbered variables.]

Figure 27: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.



$reliability (population reliability)
     V    Q    A nach  Anx  gpa  Pre   MA
  0.81 0.64 0.72 0.64 0.49 0.36 0.25 0.16

We can fit this by using the esem function and then draw the solution (see Figure 28) using the esem.diagram function (which is normally called automatically by esem).
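The call that produces the output below is along these lines (a sketch; gre.gpa is the simulated structure shown above, and the object name is an assumption):

> esem.example <- esem(gre.gpa$model,varsX=1:5,varsY=6:8,nfX=2,nfY=1,n.obs=1000,plot=FALSE)
> esem.example  #print the solution
> esem.diagram(esem.example)  #draw Figure 28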

Exploratory Structural Equation Modeling Analysis using method = minres
Call: esem(r = gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
    n.obs = 1000, plot = FALSE)

For the X set:
       MR1   MR2
V     0.91 -0.06
Q     0.81 -0.05
A     0.53  0.57
nach -0.10  0.81
Anx   0.08 -0.71

For the Y set:
    MR1
gpa 0.6
Pre 0.5
MA  0.4

Correlations between the X and Y sets.
     X1   X2   Y1
X1 1.00 0.19 0.68
X2 0.19 1.00 0.67
Y1 0.68 0.67 1.00

The degrees of freedom for the null model are 56 and the empirical chi square function was 6930.29
The degrees of freedom for the model are 7 and the empirical chi square function was 21.83
with prob < 0.0027

The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.04
with the empirical chi square 21.83 with prob < 0.0027
The total number of observations was 1000 with fitted Chi Square = 2175.06 with prob < 0

Empirical BIC = -26.53
ESABIC = -4.29
Fit based upon off diagonal values = 1
To see the item loadings for the X and Y sets combined, and the associated fa output, print with short=FALSE.

[Figure 28 here: a path diagram titled "Exploratory Structural Model", showing V, Q, and A loading on MR1 (0.9, 0.8, 0.6), A, nach, and Anx on MR2 (0.6, 0.8, -0.7), gpa, Pre, and MA on the Y-set factor (0.6, 0.5, 0.4), and the paths (0.7, 0.7) between the X and Y factors.]

Figure 28: An example of an Exploratory Structural Equation Model.


5 Classical Test Theory and Reliability

Surprisingly, 110 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and over estimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V}_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha$$

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. Alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or, more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}$$

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6, and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
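The distinction can be seen numerically; a quick sketch using the attitude data set analyzed below ($total is the summary component of the alpha output):

> a <- alpha(attitude)
> a$total$raw_alpha  #based upon the item variances and covariances
> a$total$std.alpha  #based upon the item correlations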


More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys) and an n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

5.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the reliability estimates if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)
     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00
> alpha(r9)

Reliability analysis

Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reversed scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that items 2 and 6 (complaints and critical) are to be scored (incorrectly) negatively.

> alpha(attitude,keys=c("complaints","critical"))

Reliability analysis

Call: alpha(x = attitude, keys = c("complaints", "critical"))

  raw_alpha std.alpha G6(smc) average_r  S/N  ase mean  sd
       0.18      0.27    0.66      0.05 0.37 0.19   53 4.7

 lower alpha upper     95% confidence boundaries
 -0.18  0.18  0.55

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r    S/N alpha se
rating         -0.023     0.090    0.52    0.0162  0.099    0.265
complaints-     0.689     0.666    0.76    0.2496  1.995    0.078
privileges     -0.133     0.021    0.58    0.0036  0.022    0.282
learning       -0.459    -0.251    0.41   -0.0346 -0.201    0.363
raises         -0.178    -0.062    0.47   -0.0098 -0.058    0.295
critical-       0.364     0.473    0.76    0.1299  0.896    0.139
advance        -0.137    -0.016    0.52   -0.0026 -0.016    0.258

 Item statistics
             n  raw.r  std.r r.cor r.drop mean   sd
rating      30  0.600  0.601  0.63   0.28   65 12.2
complaints- 30 -0.542 -0.559 -0.78  -0.75   50 13.3
privileges  30  0.679  0.663  0.57   0.39   53 12.2
learning    30  0.857  0.853  0.90   0.70   56 11.7
raises      30  0.719  0.730  0.77   0.50   65 10.4
critical-   30  0.015  0.036 -0.29  -0.27   42  9.9
advance     30  0.688  0.694  0.66   0.46   43 10.3


Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if items 2 and 6 are dropped, then the reliability is improved substantially. This suggests that items 2 and 6 were incorrectly scored. Doing the analysis again with the items positively scored produces much more favorable results.

> alpha(attitude)

Reliability analysis

Call: alpha(x = attitude)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0


5.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 29) or omegaSem for a confirmatory analysis using the sem package based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal).

For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ωh, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded regardless of the method chosen to estimate ωh is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the $\sqrt{r_{ab}}$ where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call, as sketched below.
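For example, a two group-factor solution for the 9 simulated variables analyzed below might be requested as follows (a sketch; r9 is the simulated data set defined in section 5.1):

> om2 <- omega(r9,nfactors=2,option="first")  #equate g with the first group factor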

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness $u_j^2$ from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, $V_x$, into four parts: that due to a general factor, $\vec{g}$, that due to a set of group factors, $\vec{f}$ (factors common to some but not all of the items), specific factors, $\vec{s}$, unique to each item, and $\vec{e}$, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting $\vec{x} = \vec{c}g + \vec{A}\vec{f} + \vec{D}\vec{s} + \vec{e}$, then the communality of item j, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$, and the unique variance for the item, $u_j^2 = \sigma_j^2(1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u_j^2}{V_x}$$

Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

$$\omega_h = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{\infty} = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}$$

> om9 <- omega(r9,title="9 simulated variables")

[Figure 29 here: a bifactor diagram titled "9 simulated variables", showing all nine variables loading on a general factor g (loadings 0.3 to 0.7) and each on one of three group factors F1, F2, F3 (loadings 0.2 to 0.5).]

Figure 29: A bifactor solution for 9 simulated variables with a hierarchical structure.

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.


> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max 4.76   max/min = 1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.001 and the 90% confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09
RMSEA index = 0.01 and the 90% confidence intervals are 0.01 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19


5.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500,lavaan=FALSE)

Call: omegaSem(m = r9, n.obs = 500, lavaan = FALSE)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max 4.76   max/min = 1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.001 and the 90% confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09
RMSEA index = 0.01 and the 90% confidence intervals are 0.01 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

The following analyses were done using the sem package

Omega Hierarchical from a confirmatory model using sem = 0.72
Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g   F1   F2   F3   h2   u2   p2
V1 0.74 0.40           0.71 0.29 0.77
V2 0.58 0.36           0.47 0.53 0.72
V3 0.56 0.42           0.50 0.50 0.63
V4 0.57      0.45      0.53 0.47 0.61
V5 0.58      0.25      0.40 0.60 0.84
V6 0.43      0.26      0.26 0.74 0.71
V7 0.49           0.38 0.38 0.62 0.63
V8 0.44           0.45 0.39 0.61 0.50
V9 0.34           0.24 0.17 0.83 0.68

With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33

general/max 5.51   max/min = 1.41
mean percent general = 0.68   with sd = 0.1 and cv of 0.15
Explained Common Variance of the general factor = 0.68

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.87  0.59  0.59  0.55
Multiple R square of scores with factors      0.75  0.35  0.34  0.30
Minimum correlation of factor score estimates 0.50 -0.31 -0.31 -0.39

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.57 0.65
Omega general for total scores and subscales  0.72 0.57 0.33 0.48
Omega group for total scores and subscales    0.11 0.22 0.24 0.18

To get the standard sem fit statistics, ask for summary on the fitted object.

5.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf and guttman functions. These are described in more detail in Revelle and Zinbarg (2009) and in Revelle and Condon (2014). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72

5.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

5.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function and a list of items to be scored on each scale (a keys.list). Items may be listed by location (convenient, but dangerous) or name (probably safer).

Make a keys.list by specifying the items for each scale, preceding items to be negatively keyed with a - sign.

> #the newer way is probably preferred
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+   conscientious=c("C1","C2","C2","-C4","-C5"),
+   extraversion=c("-E1","-E2","E3","E4","E5"),
+   neuroticism=c("N1","N2","N3","N4","N5"),
+   openness = c("O1","-O2","O3","O4","-O5"))
> #this can also be done by location--
> keys.list <- list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+   Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+   Openness = c(21,-22,23,24,-25))
> #These two approaches can be mixed if desired
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),conscientious=c("C1","C2","C2","-C4","-C5"),
+   extraversion=c("-E1","-E2","E3","E4","E5"),
+   neuroticism=c(16:20),openness = c(21,-22,23,24,-25))
> keys.list

$agree
[1] "-A1" "A2"  "A3"  "A4"  "A5"

$conscientious
[1] "C1"  "C2"  "C2"  "-C4" "-C5"

$extraversion
[1] "-E1" "-E2" "E3"  "E4"  "E5"

$neuroticism
[1] 16 17 18 19 20

$openness
[1] 21 -22 23 24 -25

In the past (prior to version 1.6.9) the keys.list was then converted to a keys matrix using the helper function make.keys. This is no longer necessary. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well (see the sketch below). Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
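A sketch of the two conventions, using the keys.list defined above (the object names are illustrative):

> scores.avg <- scoreItems(keys.list,bfi)             #default: item averages
> scores.tot <- scoreItems(keys.list,bfi,totals=TRUE) #total item scores instead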

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening. There are two ways to make up the keys. You can specify the items by location (the old way) or by name (the newer and probably preferred way). To use the newer way, you must specify the file on which you will use the keys. The example below shows how to construct keys either way.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. This is done automatically in the new way of forming keys, but if using the older way, the number must be specified. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

Then use this keys list to score the items

> scores <- scoreItems(keys.list,bfi)
> scores

Call: scoreItems(keys = keys.list, items = bfi)

(Unstandardized) Alpha:
      agree conscientious extraversion neuroticism openness
alpha   0.7          0.69         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      agree conscientious extraversion neuroticism openness
ASE   0.014         0.016        0.013       0.011    0.017

Average item correlation:
          agree conscientious extraversion neuroticism openness
average.r  0.32          0.35         0.39        0.46     0.23

Guttman 6* reliability:
         agree conscientious extraversion neuroticism openness
Lambda.6   0.7          0.69         0.76        0.81      0.6

Signal/Noise based upon av.r:
             agree conscientious extraversion neuroticism openness
Signal/Noise   2.3           2.2          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
(raw correlations below the diagonal, alpha on the diagonal,
corrected correlations above the diagonal):
              agree conscientious extraversion neuroticism openness
agree          0.70          0.36         0.63      -0.245     0.23
conscientious  0.25          0.69         0.37      -0.334     0.33
extraversion   0.46          0.27         0.76      -0.284     0.32
neuroticism   -0.18         -0.25        -0.22       0.812    -0.12
openness       0.15          0.21         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data
print with the short option = FALSE

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name, as sketched below. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores (see Figure 30).
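For example (a minimal sketch; the component names follow from the scoreItems output object):

> names(scores)          #list the components of the scoreItems object
> headTail(scores$scores) #the first and last few rows of the scale scores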

5.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations and then use scoreItems.

> r.bfi <- cor(bfi,use="pairwise")
> scales <- scoreItems(keys.list,r.bfi)
> summary(scales)

Call: scoreItems(keys = keys.list, items = r.bfi)

Scale intercorrelations corrected for attenuation
(raw correlations below the diagonal, (standardized) alpha on the diagonal,
corrected correlations above the diagonal):
              agree conscientious extraversion neuroticism openness
agree          0.71          0.35         0.64      -0.242     0.25
conscientious  0.24          0.69         0.38      -0.314     0.33
extraversion   0.47          0.27         0.76      -0.278     0.35
neuroticism   -0.18         -0.24        -0.22       0.815    -0.11
openness       0.16          0.22         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function.


> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
pdf
  2

Figure 30: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems. A sketch of the cluster.loadings approach follows.
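A sketch of finding the structure and pattern matrices directly from a correlation matrix (make.keys converts the keys.list into the keys matrix that cluster.loadings expects; the object names are illustrative):

> keys <- make.keys(r.bfi,keys.list)    #convert the list to a keys matrix
> cl <- cluster.loadings(keys,r.bfi)    #structure and pattern matrices plus scale statistics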

5.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19
            sd
reason.4  0.48
reason.16 0.46
reason.17 0.46
reason.19 0.49
letter.7  0.49
letter.33 0.50
letter.34 0.49
letter.58 0.50
matrix.45 0.50
matrix.46 0.50
matrix.47 0.49
matrix.55 0.48
rotate.3  0.40
rotate.4  0.41
rotate.6  0.46
rotate.8  0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63
            se
reason.4  0.01
reason.16 0.01
reason.17 0.01
reason.19 0.01
letter.7  0.01
letter.33 0.01
letter.34 0.01
letter.58 0.01
matrix.45 0.01
matrix.46 0.01
matrix.47 0.01
matrix.55 0.01
rotate.3  0.01
rotate.4  0.01
rotate.6  0.01
rotate.8  0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.


5.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 31).

5.6.1 Exploring the item structure of scales

The Big Five scales found above can be understood in terms of the item-whole correlations, but it is also useful to think of the endorsement frequency of the items. The item.lookup function will sort items by their factor loading/item-whole correlation and then resort those above a certain threshold in terms of the item means. Item content is shown by using the dictionary developed for those items. This allows one to see the structure of each scale in terms of its endorsement range. This is a simple way of thinking of items that is also possible to do using the various IRT approaches discussed later.

> m <- colMeans(bfi,na.rm=TRUE)
> item.lookup(scales$item.corrected[,1:3],m,dictionary=bfi.dictionary[1:2])

     agree conscientious extraversion means ItemLabel
A1   -0.40         -0.06        -0.10  2.41     q_146
E1   -0.31         -0.07        -0.59  2.97     q_712
E2   -0.39         -0.26        -0.70  3.14     q_901
E3    0.45          0.21         0.61  4.00    q_1205
E4    0.51          0.24         0.68  4.42    q_1410
E5    0.35          0.40         0.56  4.42    q_1768
A5    0.62          0.22         0.55  4.56    q_1419
A3    0.70          0.22         0.48  4.60    q_1206
A4    0.49          0.30         0.30  4.70    q_1364
A2    0.67          0.21         0.40  4.80    q_1162
C4   -0.23         -0.67        -0.23  2.55     q_626
N4   -0.22         -0.31        -0.39  3.19    q_1479
C5   -0.26         -0.58        -0.29  3.30    q_1949
C3    0.21          0.56         0.15  4.30     q_619
C2    0.21          0.61         0.18  4.37     q_530
E5.1  0.35          0.40         0.56  4.42    q_1768
C1    0.13          0.54         0.20  4.50     q_124
E1.1 -0.31         -0.07        -0.59  2.97     q_712
E2.1 -0.39         -0.26        -0.70  3.14     q_901
N4.1 -0.22         -0.31        -0.39  3.19    q_1479
E3.1  0.45          0.21         0.61  4.00    q_1205
E4.1  0.51          0.24         0.68  4.42    q_1410


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[Figure 31 here: four panels (reason.4, reason.16, reason.17, reason.19), each plotting the probability of each response alternative, P(response), against theta, the underlying latent trait.]

Figure 31: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


E5.2  0.35          0.40         0.56  4.42    q_1768
O3    0.26          0.22         0.43  4.44     q_492
A5.1  0.62          0.22         0.55  4.56    q_1419
A3.1  0.70          0.22         0.48  4.60    q_1206
A2.1  0.67          0.21         0.40  4.80    q_1162
O1    0.17          0.21         0.33  4.82     q_128
     Item
A1   Am indifferent to the feelings of others.
E1   Don't talk a lot.
E2   Find it difficult to approach others.
E3   Know how to captivate people.
E4   Make friends easily.
E5   Take charge.
A5   Make people feel at ease.
A3   Know how to comfort others.
A4   Love children.
A2   Inquire about others' well-being.
C4   Do things in a half-way manner.
N4   Often feel blue.
C5   Waste my time.
C3   Do things according to a plan.
C2   Continue until everything is perfect.
E5.1 Take charge.
C1   Am exacting in my work.
E1.1 Don't talk a lot.
E2.1 Find it difficult to approach others.
N4.1 Often feel blue.
E3.1 Know how to captivate people.
E4.1 Make friends easily.
E5.2 Take charge.
O3   Carry the conversation to a higher level.
A5.1 Make people feel at ease.
A3.1 Know how to comfort others.
A2.1 Inquire about others' well-being.
O1   Am full of ideas.

5.6.2 Empirical scale construction

There are some situations where one wants to identify those items that most relate to a particular criterion. Although this will capitalize on chance and the results should be interpreted cautiously, it does give a feel for what is being measured. Consider the following example from the bfi data set. The items that best predicted gender, education, and age may be found using the bestScales function. This also shows the use of a dictionary that has the item content.

> data(bfi)
> bestScales(bfi,criteria=c("gender","education","age"),cut=.1,dictionary=bfi.dictionary[1:3])

The items most correlated with the criteria yield r's of
          correlation n.items
gender           0.32       9
education        0.14       1
age              0.24       9

The best items, their correlations, and content are:

$gender
  Rownames     gender ItemLabel                                      Item
1       N5  0.2106171    q_1505                              Panic easily.
2       A2  0.1820202    q_1162          Inquire about others' well-being.
3       A1 -0.1571373     q_146 Am indifferent to the feelings of others.
4       A3  0.1401320    q_1206               Know how to comfort others.
5       A4  0.1261747    q_1364                             Love children.
6       E1 -0.1261018     q_712                          Don't talk a lot.
7       N3  0.1207718    q_1099                 Have frequent mood swings.
8       O1 -0.1031059     q_128                          Am full of ideas.
9       A5  0.1007091    q_1419                  Make people feel at ease.
      Giant3
1  Stability
2   Cohesion
3   Cohesion
4   Cohesion
5   Cohesion
6 Plasticity
7  Stability
8 Plasticity
9   Cohesion

$education
  Rownames  education ItemLabel                                      Item
1       A1 -0.1415734     q_146 Am indifferent to the feelings of others.
    Giant3
1 Cohesion

$age
  Rownames        age ItemLabel                                      Item
1       A1 -0.1609637     q_146 Am indifferent to the feelings of others.
2       C4 -0.1482659     q_626            Do things in a half-way manner.
3       A4  0.1442473    q_1364                             Love children.
4       A5  0.1290935    q_1419                  Make people feel at ease.
5       E5  0.1146922    q_1768                              Take charge.
6       A2  0.1142767    q_1162          Inquire about others' well-being.
7       N3 -0.1108648    q_1099                 Have frequent mood swings.
8       E2 -0.1051612     q_901      Find it difficult to approach others.
9       N5 -0.1043322    q_1505                              Panic easily.
      Giant3
1   Cohesion
2  Stability
3   Cohesion
4   Cohesion
5 Plasticity
6   Cohesion
7  Stability
8 Plasticity
9  Stability

6 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 32).


6.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

$$\delta_j = \frac{D\tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \qquad (2)$$

where D is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}$$
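A quick numeric check of equation 2 (a sketch; the loading and threshold values are made up for illustration):

> lambda <- .6; tau <- .3  #a hypothetical tetrachoric loading and threshold
> lambda/sqrt(1-lambda^2)  #item discrimination alpha_j: 0.6/0.8 = 0.75
> tau/sqrt(1-lambda^2)     #item difficulty delta_j (normal metric, D = 1): 0.3/0.8 = 0.375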

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)
> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
               -3    -2    -1     0     1     2     3
V1           0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2           0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3           0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4           0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5           0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6           0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7           0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8           0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9           0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info    0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM          1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02  0.53  0.65  0.68  0.64  0.53 -0.12

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.043 and the 90% confidence intervals are 0.043 0.218
BIC = 1009.07

Similar analyses can be done for polytomous items such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.009 and the 90% confidence intervals are 0.009 0.111
BIC = 96.23

The item information functions show that not all of the items are equally good (Figure 33):

These procedures can be generalized to more than one factor by specifying the number of factors in irt.fa.

> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[Figure 32 here: three panels plotted against the Latent Trait (normal scale) for items V1 to V9 -- "Item parameters from factor analysis" (probability of response), "Item information from factor analysis", and "Test information -- item parameters from factor analysis" (with a reliability axis).]

Figure 32: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt,type="IIC")

[Figure 33 here: a plot titled "Item information from factor analysis for factor 1", showing item information curves for -E1, -E2, E3, E4, and E5 against the Latent Trait (normal scale).]

Figure 33: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm and should be used for serious Item Response Theory analysis.

6.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis, as sketched below.
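A sketch of this reuse, based upon the iq.irt analysis shown below (rho is the saved tetrachoric/polychoric matrix in the irt.fa output; the number of omega factors is an illustrative choice):

> iq.irt <- irt.fa(ability)  #slow: finds the tetrachoric correlations
> om <- omega(iq.irt$rho,4)  #fast: reuses the saved correlation matrix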

In addition, recent releases of psych take advantage of the parallel package and use multiple cores. The default for Macs and Unix machines is to use two cores, but this can be increased using the options command. The biggest improvement is from 1 to 2 cores, but for large problems using polychoric correlations, the more cores available, the better.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 34) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 36). We can also show the total test information (merely the sum of the item information; Figure 35). This shows that even with just 16 items, the test is very reliable for most of the range of ability. The fa.irt function saves the correlation matrix and item statistics so that they can be redrawn with other options.

detectCores()        # how many cores are available
options("mc.cores")  # how many have been set to be used
options(mc.cores=4)  # set to use 4 cores

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = ability)

Item Response Analysis using Factor Analysis

Summary information by factor and item

Factor = 1

-3 -2 -1 0 1 2 3

reason.4    0.05  0.24  0.64 0.53 0.16 0.03  0.01
reason.16   0.08  0.22  0.38 0.31 0.14 0.05  0.01
reason.17   0.08  0.33  0.69 0.42 0.11 0.02  0.00
reason.19   0.06  0.17  0.35 0.36 0.19 0.07  0.02
letter.7    0.05  0.18  0.41 0.44 0.20 0.06  0.02
letter.33   0.05  0.15  0.31 0.36 0.20 0.08  0.02
letter.34   0.05  0.19  0.45 0.46 0.20 0.06  0.01
letter.58   0.02  0.09  0.30 0.53 0.35 0.12  0.03
matrix.45   0.05  0.11  0.19 0.23 0.17 0.09  0.04
matrix.46   0.05  0.12  0.22 0.26 0.18 0.09  0.04
matrix.47   0.06  0.17  0.33 0.34 0.19 0.07  0.02
matrix.55   0.04  0.08  0.13 0.17 0.16 0.11  0.06
rotate.3    0.00  0.01  0.06 0.30 0.75 0.49  0.12
rotate.4    0.00  0.01  0.05 0.31 1.00 0.56  0.10
rotate.6    0.01  0.03  0.15 0.53 0.69 0.25  0.05
rotate.8    0.00  0.02  0.08 0.29 0.59 0.41  0.13
Test Info   0.67  2.11  4.73 5.83 5.28 2.55  0.69
SEM         1.22  0.69  0.46 0.41 0.44 0.63  1.20
Reliability -0.49 0.53  0.79 0.83 0.81 0.61 -0.45

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient

The degrees of freedom for the model is 104 and the objective function was 1.91

The number of observations was 1525 with Chi Square = 2893.45 with prob < 0

The root mean square of the residuals (RMSA) is 0.08

The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.723

RMSEA index = 0.018 and the 90% confidence intervals are 0.018 0.137

BIC = 2131.15

6.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for scoreIrt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The scoreIrt function applies basic IRT to this problem.


> iq.irt <- irt.fa(ability)


Figure 34: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


> plot(iq.irt, type="test")


Figure 35: A graphic analysis of 16 ability items sampled from the SAPA project. The total test information at all levels of difficulty may be shown by specifying the type='test' option in the plot function.


> om <- omega(iq.irt$rho, 4)


Figure 36: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.



Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.

> v9 <- sim.irt(9, 1000, -2, 2, mod="normal")  # dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> # now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- scoreIrt(test, items)
> unitweighted <- scoreIrt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 37

6.3.1 1 versus 2 parameter IRT scoring

In Item Response Theory, items can be assumed to be equally discriminating but to differ in their difficulty (the Rasch model) or to vary in their discriminability. Two functions (scoreIrt.1pl and scoreIrt.2pl) are meant to find multiple IRT based scales using the Rasch model or the 2 parameter model. Both allow for negatively keyed as well as positively keyed items. Consider the bfi data set with scoring keys keys.list and items listed as an item.list. (This is the same as the keys.list, but with the negative signs removed.)

> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+    conscientious=c("C1","C2","C3","-C4","-C5"),
+    extraversion=c("-E1","-E2","E3","E4","E5"),
+    neuroticism=c("N1","N2","N3","N4","N5"),
+    openness = c("O1","-O2","O3","O4","-O5"))
> item.list <- list(agree=c("A1","A2","A3","A4","A5"),
+    conscientious=c("C1","C2","C3","C4","C5"),
+    extraversion=c("E1","E2","E3","E4","E5"),
+    neuroticism=c("N1","N2","N3","N4","N5"),
+    openness = c("O1","O2","O3","O4","O5"))
> bfi.1pl <- scoreIrt.1pl(keys.list, bfi)  # the one parameter solution
> bfi.2pl <- scoreIrt.2pl(item.list, bfi)  # the two parameter solution
> bfi.ctt <- scoreFast(keys.list, bfi)     # fast scoring function


Figure 37: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df, pch=".", gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)



We can compare these three ways of doing the analysis using the cor2 function, which correlates two separate data frames. All three models produce very similar results for the case of almost complete data. It is when we have massively missing completely at random data (MMCAR) that the results show the superiority of the irt scoring.

> # compare the solutions using the cor2 function

> cor2(bfi.1pl, bfi.ctt)

              agree conscientious extraversion neuroticism openness
agree          0.93          0.27         0.44       -0.20     0.19
conscientious  0.25          0.97         0.25       -0.22     0.17
extraversion   0.43          0.25         0.94       -0.23     0.23
neuroticism   -0.19         -0.23        -0.22        1.00    -0.09
openness       0.12          0.19         0.19       -0.07     0.95

> cor2(bfi.2pl, bfi.ctt)

              agree conscientious extraversion neuroticism openness
agree          0.98          0.25         0.49       -0.17     0.15
conscientious  0.26          0.97         0.25       -0.23     0.18
extraversion   0.43          0.24         0.94       -0.25     0.21
neuroticism   -0.19         -0.22        -0.18        0.98    -0.08
openness       0.17          0.21         0.27       -0.11     0.98

> cor2(bfi.2pl, bfi.1pl)

              agree conscientious extraversion neuroticism openness
agree          0.88          0.25         0.45       -0.17     0.12
conscientious  0.26          1.00         0.24       -0.23     0.16
extraversion   0.43          0.23         0.99       -0.25     0.20
neuroticism   -0.21         -0.21        -0.19        0.98    -0.07
openness       0.20          0.19         0.28       -0.11     0.92

7 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.


7.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package.

r_xy = η_x_wg · η_y_wg · r_xy_wg + η_x_bg · η_y_bg · r_xy_bg    (3)

where r_xy is the normal correlation, which may be decomposed into a within group and a between group correlation, r_xy_wg and r_xy_bg, and η (eta) is the correlation of the data with the within group values or the group means.

7.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
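A minimal sketch of the first case (rwg and rbg are the within and between group correlations named in equation 3; the assumption here is that statsBy returns them under those names when cors=TRUE):

sb <- statsBy(sat.act, group="education", cors=TRUE)
sb$rwg  # the pooled within group correlations
sb$rbg  # the between group correlations (of the group means)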


7.3 Factor analysis by groups

Confirmatory factor analysis comparing the structures in multiple groups can be done in the lavaan package. However, for exploratory analyses of the structure within each of multiple groups, the faBy function may be used in combination with the statsBy function. First run statsBy with the correlation option set to TRUE, and then run faBy on the resulting output.

sb <- statsBy(bfi[c(1:25,27)], group="education", cors=TRUE)

faBy(sb, nfactors=5)  # find the 5 factor solution for each education level

8 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

R^2 = 1 − ∏_{i=1}^{n} (1 − λ_i)

where λ_i is the ith eigenvalue of the eigenvalue decomposition of the matrix

R = R_xx^{-1} R_xy R_yy^{-1} R_yx

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. Although the set correlation can then be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized or raw β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.



For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom

Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301

F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476

Compare this with the output from setCor.

> # compare with setCor

> setCor(y = c(4:6), x = c(1:3), data = C, n.obs = 700)

Call: setCor(y = c(4:6), x = c(1:3), data = C, n.obs = 700)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01


SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1] 3 696

Various estimates of between set correlations

Squared Canonical Correlations
[1] 0.050 0.033 0.008

Chisq of canonical correlations
[1] 35.8 23.1 5.6

Average squared canonical correlation = 0.03

Cohens Set Correlation R2 = 0.09

Shrunken Set Correlation R2 = 0.08

F and df of Cohens Set Correlation: 7.26 9 1681.86

Unweighted correlation between the two sets = 0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
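To check the symmetry claim, the two sets can be swapped (a sketch reusing the covariance matrix C from above):

setCor(y = c(1:3), x = c(4:6), data = C, n.obs = 700)  # the set correlation R2 should again be 0.09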


9 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.


sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette psych forsem

The default for sim.structure is to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
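A sketch of that default (assuming, as with the other simulation functions, that the raw data are returned in the observed element when n > 0):

set.seed(42)                # for a reproducible example
sx <- sim.structure(n=500)  # defaults: 12 variables, 4 factors, simplex structure
fa.parallel(sx$observed)    # does parallel analysis recover the four factors?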

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four irt simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

P(x | θ_i, δ_j, γ_j, ζ_j) = γ_j + (ζ_j − γ_j) / (1 + e^{α_j(δ_j − θ_i)})    (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
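Equation 4 is easy to evaluate directly. A sketch (the function name p4pl is ours, not part of psych) that shows the role of each parameter:

p4pl <- function(theta, delta, alpha = 1, gamma = 0, zeta = 1) {
  # endorsement probability for ability theta, difficulty delta,
  # discrimination alpha, guessing gamma, and upper asymptote zeta
  gamma + (zeta - gamma)/(1 + exp(alpha * (delta - theta)))
}
p4pl(theta = 0, delta = 0)                # 0.5: ability equals difficulty
p4pl(theta = 0, delta = 0, gamma = 0.25)  # 0.625: guessing raises the floor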

(Graphics of these may be seen in the demonstrations for the logistic function)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same.


With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
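A quick numerical check of that claim in base R:

theta <- seq(-3, 3, 0.01)
max(abs(pnorm(theta) - plogis(1.702 * theta)))  # less than 0.01: the two curves nearly coincide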

10 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as by specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)

c <- iclust(Thurstone)
plot(c)            # a pretty boring plot
iclust.diagram(c)  # a better diagram

c3 <- iclust(Thurstone, 3)
plot(c3)           # a more interesting plot

data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)

ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.

11 Converting output to APA style tables using LaTeX

Although for most purposes using the Sweave or KnitR packages produces clean output, some prefer output pre-formatted for APA style tables. This can be done using the xtable package for almost anything, but there are a few simple functions in psych for the most common tables. fa2latex will convert a factor analysis or components analysis output to a LaTeX table, cor2latex will take a correlation matrix and show the lower (or upper) diagonal, irt2latex converts the item statistics from the irt.fa function to more convenient LaTeX output, and finally, df2latex converts a generic data frame to LaTeX.

An example of converting the output from fa to LaTeX appears in Table 11.
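Table 11 was produced by a call of roughly this form (a sketch; the heading argument is assumed here, and the heading text simply matches the table caption):

f3 <- fa(Thurstone, 3)
fa2latex(f3, heading = "A factor analysis table from the psych package in R")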


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double", -1)  # for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")              # but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$top, scale=-1)
> dia.curved.arrow(mr, lr$top)


Figure 38: The basic graphic capabilities of the dia functions are shown in this figure.

Table 11: fa2latex. A factor analysis table from the psych package in R.

Variable          MR1   MR2   MR3   h2   u2  com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.01
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.01
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.00
First.Letters    0.00  0.86  0.00 0.73 0.27 1.00
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.04
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.20
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.00
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.93
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.23

SS loadings      2.64  1.86  1.50

MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.


geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates the effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also set.cor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
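A few of these helpers in use (a sketch):

fisherz(0.5)                   # 0.55: the Fisher r to z transformation
geometric.mean(c(1, 10, 100))  # 10: the appropriate average for ratio data
headtail(sat.act)              # the first and last few rows of the data frame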

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Also included are personality item data representing five personality factors on 25 items (bfi), 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.


Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1984), this data set allows for examples of basic scaling techniques.


14 Development version and a users guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g. ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.5.0", package="psych")

15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings

> sessionInfo()

R Under development (unstable) (2017-01-04 r71889)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] GPArotation_2014.11-1 psych_1.6.12

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4     lattice_0.20-34 matrixcalc_1.0-3 MASS_7.3-45
 [5] grid_3.4.0      arm_1.8-6       nlme_3.1-128     stats4_3.4.0
 [9] coda_0.18-1     minqa_1.2.4     mi_1.0           nloptr_1.0.4
[13] Matrix_1.2-7.1  boot_1.3-18     splines_3.4.0    lme4_1.1-12
[17] sem_3.1-7       tools_3.4.0     foreign_0.8-67   abind_1.4-3
[21] parallel_3.4.0  compiler_3.4.0  mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel modeling in R (2.3): a brief introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2012). sem: Structural Equation Models.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, pages 1–13. 10.1007/s11336-011-9218-4.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1984). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

Preacher, K. J. and Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4):717–731.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2015). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. R package version 1.5.8.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W. and Condon, D. M. (2014). Reliability. In Irwing, P., Booth, T., and Hughes, D., editors, Wiley-Blackwell Handbook of Psychometric Testing. Wiley-Blackwell. (in press).

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Smillie, L. D., Cooper, A., Wilt, J., and Revelle, W. (2012). Do extraverts get more bang for the buck? Refining the affective-reactivity hypothesis of extraversion. Journal of Personality and Social Psychology, 103(2):306–326.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2):245–251.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components – an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.




01 Jump starting the psych packagendasha guide for the impatient

You have installed psych (section 2) and you want to use it without reading much moreWhat should you do

1 Activate the psych package

library(psych)

2 Input your data (section 31) Go to your friendly text editor or data manipulationprogram (eg Excel) and copy the data to the clipboard Include a first line that hasthe variable labels Paste it into psych using the readclipboardtab command

myData lt- readclipboardtab()

3 Make sure that what you just read is right Describe it (section 32) and perhapslook at the first and last few lines

describe(myData)

headTail(myData)

4 Look at the patterns in the data If you have fewer than about 10 variables lookat the SPLOM (Scatter Plot Matrix) of the data using pairspanels (section 331)Even better use the outlier to detect outliers

pairspanels(myData)

outlier(myData)

5 Note that you have some weird subjects probably due to data entry errors Eitheredit the data by hand (use the edit command) or just scrub the data (section 322)

cleaned lt- scrub(myData max=9) eg change anything great than 9 to NA

6 Graph the data with error bars for each variable (section 333)

errorbars(myData)

7 Find the correlations of all of your data

bull Descriptively (just the values) (section 337)

r lt- lowerCor(myData)

bull Graphically (section 338)

corPlot(r)

bull Inferentially (the values the ns and the p values) (section 34)

corrtest(myData)

8 Test for the number of factors in your data using parallel analysis (faparallelsection 442) or Very Simple Structure (vss 441)

faparallel(myData)

vss(myData)

4

9 Factor analyze (see section 41) the data with a specified number of factors (thedefault is 1) the default method is minimum residual the default rotation for morethan one factor is oblimin There are many more possibilities (see sections 411-413)Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm(Revelle 1979) (see section 416) Also consider a hierarchical factor solution to findcoefficient ω (see 415)

fa(myData)

iclust(myData)

omega(myData)

If you prefer to do a principal components analysis, you may use the principal function. The default is one component:

principal(myData)

10. Some people like to find coefficient α as an estimate of reliability. This may be done for a single scale using the alpha function (see 5.1). Perhaps more useful is the ability to create several scales as unweighted averages of specified items using the scoreItems function (see 5.4), and to find various estimates of internal consistency for these scales, find their intercorrelations, and find scores for all the subjects:

alpha(myData)   #score all of the items as part of one scale

myKeys <- make.keys(nvar=20, list(first = c(1,-3,5,-7,8:10), second=c(2,4,-6,11:15,-16)))

my.scores <- scoreItems(myKeys, myData)   #form several scales

my.scores   #show the highlights of the results

At this point you have had a chance to see the highlights of the psych package and to do some basic (and advanced) data analysis. You might find reading this entire vignette helpful to get a broader understanding of what can be done in R using psych. Remember that the help command (?) is available for every function. Try running the examples for each help page.


1 Overview of this and related documents

The psych package (Revelle, 2015) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, prep), a draft of which is available at http://personality-project.org/r/book/.

Some of the functions (e.g., read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive analyses.

Psychometric applications emphasize techniques for dimension reduction including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares and maximum likelihood factor analysis). Principal Components Analysis (PCA) is also available through the use of the principal function. Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP), or parallel analysis (fa.parallel) criteria. These and several other criteria are included in the nfactors function. Two parameter Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa.

Bifactor and hierarchical factor structures may be estimated by using Schmid Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937).

Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust), and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (corPlot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, structure.diagram and het.diagram, as well as item response characteristics and item and test information characteristic curves plot.irt and plot.poly.

This vignette is meant to give an overview of the psych package. That is, it is meant to give a summary of the main functions in the psych package with examples of how


they are used for data description, dimension reduction, and scale construction. The extended user manual at psych_manual.pdf includes examples of graphic output and more extensive demonstrations than are found in the help menus. (Also available at http://personality-project.org/r/psych_manual.pdf.) The vignette, psych for sem, at psych_for_sem.pdf, discusses how to use psych as a front end to the sem package of John Fox (Fox et al., 2012). (The vignette is also available at http://personality-project.org/r/book/psych_for_sem.pdf.)

For a step by step tutorial in the use of the psych package and the base functions in R for basic personality research, see the guide for using R for personality research at http://personalitytheory.org/r/r.short.html. For an introduction to psychometric theory with applications in R, see the draft chapters at http://personality-project.org/r/book.

2 Getting started

Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa, factor.minres, factor.pa, factor.wls, or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.

install.packages("ctv")

library(ctv)

task.views("Psychometrics")

The "Psychometrics" task view will install a large number of useful packages. To install the bare minimum for the examples in this vignette, it is necessary to install just 3 packages:

install.packages(c("GPArotation", "mnormt"))

Because of the difficulty of installing the package Rgraphviz, alternative graphics have been developed and are available as diagram functions. If Rgraphviz is available, some functions will take advantage of it. An alternative is to use "dot" output of commands for any external graphics package that uses the dot language.


3 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions, it is necessary to make the package active by using the library command:

> library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function: e.g.,

.First <- function(x) {library(psych)}

3.1 Data input from the clipboard

There are, of course, many ways to enter data into R. Reading from a local file using read.table is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command:

> my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function:

> my.data <- read.clipboard(sep="\t")   #define the tab option, or

> my.tab.data <- read.clipboard.tab()   #just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column, and the last 4 are 3 columns):

> my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))

3.2 Basic descriptive statistics

Once the data are read in, then describe or describeBy will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act, which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data), and will be flagged with an *.

describeBy reports descriptive statistics broken down by some categorizing variable (e.g., gender, age, etc.).

> library(psych)

> data(sat.act)

> describe(sat.act)   #basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59
          kurtosis   se
gender       -1.62 0.02
education    -0.07 0.05
age           2.42 0.36
ACT           0.53 0.18
SATV          0.33 4.27
SATQ         -0.02 4.41

These data may then be analyzed by groups defined in a logical statement or by some other variable. E.g., break down the descriptive data for males or females. These descriptive data can also be seen graphically using the error.bars.by function (Figure 5). By setting skew=FALSE and ranges=FALSE, the output is limited to the most basic statistics.


> #basic descriptive statistics by a grouping variable

> describeBy(sat.act, sat.act$gender, skew=FALSE, ranges=FALSE)

$`1`
          vars   n   mean     sd   se
gender       1 247   1.00   0.00 0.00
education    2 247   3.00   1.54 0.10
age          3 247  25.86   9.74 0.62
ACT          4 247  28.79   5.06 0.32
SATV         5 247 615.11 114.16 7.26
SATQ         6 245 635.87 116.02 7.41

$`2`
          vars   n   mean     sd   se
gender       1 453   2.00   0.00 0.00
education    2 453   3.26   1.35 0.06
age          3 453  25.45   9.37 0.44
ACT          4 453  28.42   4.69 0.22
SATV         5 453 610.66 112.31 5.28
SATQ         6 442 596.00 113.07 5.38

attr(,"call")
by.data.frame(data = x, INDICES = group, FUN = describe, type = type,
    skew = FALSE, ranges = FALSE)

The output from the describeBy function can be forced into a matrix form for easy analysis by other programs. In addition, describeBy can group by several grouping variables at the same time.

> sa.mat <- describeBy(sat.act, list(sat.act$gender, sat.act$education),
+    skew=FALSE, ranges=FALSE, mat=TRUE)
> headTail(sa.mat)

        item group1 group2 vars   n   mean     sd    se
gender1    1      1      0    1  27      1      0     0
gender2    2      2      0    1  30      2      0     0
gender3    3      1      1    1  20      1      0     0
gender4    4      2      1    1  25      2      0     0
...      ...   <NA>   <NA>  ... ...    ...    ...   ...
SATQ9     69      1      4    6  51  635.9 104.12 14.58
SATQ10    70      2      4    6  86 597.59 106.24 11.46
SATQ11    71      1      5    6  46 657.83  89.61 13.21
SATQ12    72      2      5    6  93 606.72 105.55 10.95

3.2.1 Outlier detection using outlier

One way to detect unusual data is to consider how far each data point is from the multivariate centroid of the data. That is, find the squared Mahalanobis distance for each data point and then compare these to the expected values of χ2. This produces a Q-Q (quantile-quantile) plot with the n most extreme data points labeled (Figure 1). The outlier values are in the vector d2.
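The underlying computation is easy to reproduce with base R. The following is a minimal sketch of the idea (not the outlier function itself, and assuming complete cases):

x <- na.omit(sat.act)                       #Mahalanobis distance requires complete data
d2 <- mahalanobis(x, colMeans(x), cov(x))   #squared distances from the centroid
#compare the observed distances to the expected chi-square quantiles
qqplot(qchisq(ppoints(nrow(x)), df = ncol(x)), d2)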


> png("outlier.png")
> d2 <- outlier(sat.act, cex=.8)
> dev.off()

null device

1

Figure 1: Using the outlier function to graphically show outliers. The y axis is the Mahalanobis D2, the x axis is the distribution of χ2 for the same number of degrees of freedom. The outliers detected here may be shown graphically using pairs.panels (see Figure 2), and may be found by sorting d2.


3.2.2 Basic data cleaning using scrub

If, after describing the data, it is apparent that there were data entry errors that need to be globally replaced with NA, or only certain ranges of data will be analyzed, the data can be "cleaned" using the scrub function.

Consider a data set of 12 rows of 10 columns with values from 1 - 120. All values of columns 3 - 5 that are less than 30, 40, or 50 respectively, or greater than 70 in any of the three columns, will be replaced with NA. In addition, any value exactly equal to 45 will be set to NA. (max and isvalue are set to one value here, but they could be a different value for every column.)

> x <- matrix(1:120, ncol=10, byrow=TRUE)
> colnames(x) <- paste("V", 1:10, sep="")
> new.x <- scrub(x, 3:5, min=c(30,40,50), max=70, isvalue=45, newvalue=NA)
> new.x

       V1  V2 V3 V4 V5  V6  V7  V8  V9 V10
 [1,]   1   2 NA NA NA   6   7   8   9  10
 [2,]  11  12 NA NA NA  16  17  18  19  20
 [3,]  21  22 NA NA NA  26  27  28  29  30
 [4,]  31  32 33 NA NA  36  37  38  39  40
 [5,]  41  42 43 44 NA  46  47  48  49  50
 [6,]  51  52 53 54 55  56  57  58  59  60
 [7,]  61  62 63 64 65  66  67  68  69  70
 [8,]  71  72 NA NA NA  76  77  78  79  80
 [9,]  81  82 NA NA NA  86  87  88  89  90
[10,]  91  92 NA NA NA  96  97  98  99 100
[11,] 101 102 NA NA NA 106 107 108 109 110
[12,] 111 112 NA NA NA 116 117 118 119 120

Note that the number of subjects for those columns has decreased, and the minimums have gone up but the maximums down. Data cleaning and examination for outliers should be a routine part of any data analysis.

3.2.3 Recoding categorical variables into dummy coded variables

Sometimes categorical variables (e.g., college major, occupation, ethnicity) are to be analyzed using correlation or regression. To do this, one can form "dummy codes", which are merely binary variables for each category. This may be done using dummy.code. Subsequent analyses using these dummy coded variables may be using biserial or point biserial (regular Pearson r) correlations to show effect sizes, and may be plotted in, e.g., spider plots.
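A minimal sketch of this workflow, treating the education variable from sat.act as categorical (the choice of variable here is purely for illustration):

edu <- dummy.code(sat.act$education)           #one binary column per education level
lowerCor(data.frame(edu, ACT = sat.act$ACT))   #point biserial correlations with ACT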

3.3 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries. By default, error.bars.by and error.bars will show "cats eyes" to graphically show the confidence limits (Figure 5). This may be turned off by specifying eyes=FALSE. densityBy or violinBy may be used to show the distribution of the data in "violin" plots (Figure 4). (These are sometimes called "lava-lamp" plots.)

3.3.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 2). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.'. (See Figure 2 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic. However, in this figure, to show the outliers, we use colors and a larger plot character.

Another example of pairs.panels is to show differences between experimental groups. Consider the data in the affect data set. The scores reflect post test scores on positive and negative affect and energetic and tense arousal. The colors show the results for four movie conditions: depressing, frightening movie, neutral, and a comedy.

3.3.2 Density or violin plots

Graphical presentation of data may be shown using box plots to show the median and 25th and 75th percentiles. A powerful alternative is to show the density distribution using the violinBy function (Figure 4).


> png("pairspanels.png")
> sat.d2 <- data.frame(sat.act, d2)   #combine the d2 statistics from before with the sat.act data.frame
> pairs.panels(sat.d2, bg=c("yellow","blue")[(d2 > 25)+1], pch=21)
> dev.off()

null device

1

Figure 2: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. If the plot character were set to a period (pch='.'), it would make a cleaner graphic, but in order to show the outliers in color we use the plot characters 21 and 22.


> png("affect.png")
> pairs.panels(affect[14:17], bg=c("red","black","white","blue")[affect$Film], pch=21,
+    main="Affect varies by movies")
> dev.off()

null device

1

Figure 3: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. The colors represent four different movie conditions.


> data(sat.act)
> violinBy(sat.act[5:6], sat.act$gender, grp.name=c("M","F"), main="Density Plot by gender for SAT V and Q")


Figure 4: Using the violinBy function to show the distribution of SAT V and Q for males and females. The plot shows the medians, and the 25th and 75th percentiles, as well as the entire range and the density distribution.


3.3.3 Means and error bars

Additional descriptive graphics include the ability to draw error bars on sets of data, as well as to draw error bars in both the x and y directions for paired data. These are the functions error.bars, error.bars.by, error.bars.tab, and error.crosses.

error.bars show the 95% confidence intervals for each variable in a data frame or matrix. These errors are based upon normal theory and the standard errors of the mean. Alternative options include +/- one standard deviation or 1 standard error. If the data are repeated measures, the error bars will reflect the between variable correlations. By default, the confidence intervals are displayed using a "cats eyes" plot which emphasizes the distribution of confidence within the confidence interval.

error.bars.by does the same, but grouping the data by some condition.

error.bars.tab draws bar graphs from tabular data with error bars based upon the standard error of a proportion (σp = √(pq/N)); a small numeric check of this formula follows this list.

error.crosses draw the confidence intervals for an x set and a y set of the same size.
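As promised above, the standard error of a proportion is a one-line computation. The values here are made up purely for illustration:

p <- 0.30; N <- 100                #a hypothetical proportion and sample size
sigma.p <- sqrt(p * (1 - p) / N)   #about 0.046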

The use of the error.bars.by function allows for graphic comparisons of different groups (see Figure 5). Five personality measures are shown as a function of high versus low scores on a "lie" scale. People with higher lie scores tend to report being more agreeable, conscientious, and less neurotic than people with lower lie scores. The error bars are based upon normal theory and thus are symmetric rather than reflect any skewing in the data.

Although not recommended, it is possible to use the error.bars function to draw bar graphs with associated error bars. (This kind of dynamite plot (Figure 7) can be very misleading in that the scale is arbitrary. Go to a discussion of the problems in presenting data this way at http://emdbolker.wikidot.com/blog:dynamite.) In the example shown, note that the graph starts at 0, although 0 is out of the range of the data. This is a function of using bars, which always are assumed to start at zero. Consider other ways of showing your data.

3.3.4 Error bars for tabular data

However, it is sometimes useful to show error bars for tabular data, either found by the table function or just directly input. These may be found using the error.bars.tab function.


> data(epi.bfi)
> error.bars.by(epi.bfi[6:10], epi.bfi$epilie<4)


Figure 5: Using the error.bars.by function shows that self reported personality scales on the Big Five Inventory vary as a function of the Lie scale on the EPI. The "cats eyes" show the distribution of the confidence.


> error.bars.by(sat.act[5:6], sat.act$gender, bars=TRUE,
+    labels=c("Male","Female"), ylab="SAT score", xlab="")


Figure 6: A "Dynamite plot" of SAT scores as a function of gender is one way of misleading the reader. By using a bar graph, the range of scores is ignored. Bar graphs start from 0.


> T <- with(sat.act, table(gender, education))
> rownames(T) <- c("M","F")
> error.bars.tab(T, way="both", ylab="Proportion of Education Level", xlab="Level of Education",
+    main="Proportion of sample by education level")


Figure 7: The proportion of each education level that is Male or Female. By using the way="both" option, the percentages and errors are based upon the grand total. Alternatively, way="columns" finds column wise percentages, and way="rows" finds rowwise percentages. The data can be converted to percentages (as shown) or by total count (raw=TRUE). The function invisibly returns the probabilities and standard errors. See the help menu for an example of entering the data as a data.frame.


3.3.5 Two dimensional displays of means and errors

Yet another way to display data for different conditions is to use the errorCircles function. For instance, the effect of various movies on both "Energetic Arousal" and "Tense Arousal" can be seen in one graph and compared to the same movie manipulations on "Positive Affect" and "Negative Affect". Note how Energetic Arousal is increased by three of the movie manipulations, but that Positive Affect increases following the Happy movie only.


> op <- par(mfrow=c(1,2))
> data(affect)
> colors <- c("black","red","white","blue")
> films <- c("Sad","Horror","Neutral","Happy")
> affect.stats <- errorCircles("EA2","TA2", data=affect[-c(1,20)], group="Film", labels=films,
+    xlab="Energetic Arousal", ylab="Tense Arousal", ylim=c(10,22), xlim=c(8,20), pch=16,
+    cex=2, colors=colors, main = "Movies effect on arousal")
> errorCircles("PA2","NA2", data=affect.stats, labels=films, xlab="Positive Affect",
+    ylab="Negative Affect", pch=16, cex=2, colors=colors, main = "Movies effect on affect")
> op <- par(mfrow=c(1,1))


Figure 8: The use of the errorCircles function allows for two dimensional displays of means and error bars. The first call to errorCircles finds descriptive statistics for the affect data frame based upon the grouping variable of Film. These data are returned and then used by the second call which examines the effect of the same grouping variable upon different measures. The size of the circles represent the relative sample sizes for each group. The data are from the PMC lab and reported in Smillie et al. (2012).


3.3.6 Back to back histograms

The bi.bars function summarizes the characteristics of two groups (e.g., males and females) on a second variable (e.g., age) by drawing back to back histograms (see Figure 9).

> data(bfi)
> with(bfi, bi.bars(age, gender, ylab="Age", main="Age by males and females"))


Figure 9: A bar plot of the age distribution for males and females shows the use of bi.bars. The data are males and females from 2800 cases collected using the SAPA procedure, and are available as part of the bfi data set.


3.3.7 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display as a lower off diagonal matrix. lowerCor calls cor with use="pairwise", method="pearson" as default values, and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix:

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower, upper)
> round(both, 2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:


> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

3.3.8 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set which has a clear 3 factor solution (Figure 10), or a simulated data set of 24 variables with a circumplex structure (Figure 11). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

Yet another way to show structure is to use "spider" plots. Particularly if variables are ordered in some meaningful way (e.g., in a circumplex), a spider plot will show this structure easily. This is just a plot of the magnitude of the correlation as a radial line, with length ranging from 0 (for a correlation of -1) to 1 (for a correlation of 1). (See Figure 12.)

3.4 Testing correlations

Correlations are wonderful descriptive statistics of the data, but some people like to test whether these correlations differ from zero or differ from each other. The cor.test function (in the stats package) will test the significance of a single correlation, and the rcorr function in the Hmisc package will do this for many correlations. In the psych package, the corr.test function reports the correlation (Pearson, Spearman, or Kendall) between all variables in either one or two data frames or matrices, as well as the number of observations for each case, and the (two-tailed) probability for each correlation. Unfortunately, these probability values have not been corrected for multiple comparisons and so should be taken with a great deal of salt. Thus, in corr.test and corr.p the raw probabilities are reported below the diagonal, and the probabilities adjusted for multiple comparisons using (by default) the Holm correction are reported above the diagonal (Table 1). (See the p.adjust function for a discussion of Holm (1979) and other corrections.)

Testing the difference between any two correlations can be done using the r.test function. The function actually does four different tests (based upon an article by Steiger (1980)),


> png("corplot.png")
> corPlot(Thurstone, numbers=TRUE, upper=FALSE, diag=FALSE, main="9 cognitive variables from Thurstone")
> dev.off()

null device

1

Figure 10: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well. By default, the complete matrix is shown. Setting upper=FALSE and diag=FALSE shows a cleaner figure.


> png("circplot.png")
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ, main="24 variables in a circumplex")
> dev.off()

null device

1

Figure 11: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect. For circumplex structures, it is perhaps useful to show the complete matrix.


> png("spider.png")
> op <- par(mfrow=c(2,2))
> spider(y=c(1,6,12,18), x=1:24, data=r.circ, fill=TRUE, main="Spider plot of 24 circumplex variables")
> op <- par(mfrow=c(1,1))
> dev.off()

null device

1

Figure 12: A spider plot can show circumplex structure very clearly. Circumplex structures are common in the study of affect.


Table 1: The corr.test function reports correlations, cell sizes, and raw and adjusted probability values. corr.p reports the probability values for a correlation matrix. By default, the adjustment used is that of Holm (1979).

> corr.test(sat.act)

Call:corr.test(x = sat.act)

Correlation matrix

          gender education   age   ACT  SATV  SATQ
gender      1.00      0.09 -0.02 -0.04 -0.02 -0.17
education   0.09      1.00  0.55  0.15  0.05  0.03
age        -0.02      0.55  1.00  0.11 -0.04 -0.03
ACT        -0.04      0.15  0.11  1.00  0.56  0.59
SATV       -0.02      0.05 -0.04  0.56  1.00  0.64
SATQ       -0.17      0.03 -0.03  0.59  0.64  1.00

Sample Size

          gender education age ACT SATV SATQ
gender       700       700 700 700  700  687
education    700       700 700 700  700  687
age          700       700 700 700  700  687
ACT          700       700 700 700  700  687
SATV         700       700 700 700  700  687
SATQ         687       687 687 687  687  687

Probability values (Entries above the diagonal are adjusted for multiple tests)

          gender education  age  ACT SATV SATQ
gender      0.00      0.17 1.00 1.00    1    0
education   0.02      0.00 0.00 0.00    1    1
age         0.58      0.00 0.00 0.03    1    1
ACT         0.33      0.00 0.00 0.00    0    0
SATV        0.62      0.22 0.26 0.00    0    0
SATQ        0.00      0.36 0.37 0.00    0    0

To see confidence intervals of the correlations, print with the short=FALSE option.


depending upon the input:

1) For a sample size n, find the t and p value for a single correlation as well as the confidence interval.

> r.test(50, .3)

Correlation tests

Call:r.test(n = 50, r12 = 0.3)

Test of significance of a correlation

t value 2.18 with probability < 0.034

and confidence interval 0.02 0.53

2) For sample sizes of n and n2 (n2 = n if not specified), find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores.

> r.test(30, .4, .6)

Correlation tests

Call:r.test(n = 30, r12 = 0.4, r34 = 0.6)

Test of difference between two independent correlations

z value 0.99 with probability 0.32

3) For sample size n, and correlations ra = r12, rb = r23 and r13 specified, test for the difference of two dependent correlations (Steiger case A).

> r.test(103, .4, .5, .1)

Correlation tests

Call:[1] "r.test(n = 103, r12 = 0.4, r23 = 0.1, r13 = 0.5)"

Test of difference between two correlated correlations

t value -0.89 with probability < 0.37

4) For sample size n, test for the difference between two dependent correlations involving different variables (Steiger case B).

> r.test(103, .5, .6, .7, .5, .5, .8)   #steiger Case B

Correlation tests

Call:r.test(n = 103, r12 = 0.5, r34 = 0.6, r23 = 0.7, r13 = 0.5, r14 = 0.5,
    r24 = 0.8)

Test of difference between two dependent correlations

z value -1.2 with probability 0.23

To test whether a matrix of correlations differs from what would be expected if the population correlations were all zero, the function cortest follows Steiger (1980), who pointed out that the sum of the squared elements of a correlation matrix, or the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix). This is particularly useful for examining whether correlations in a single matrix differ from zero or for comparing two matrices. Although obvious, cortest can be used to test whether the sat.act data matrix produces non-zero correlations (it does). This is a much more appropriate test when testing whether a residual matrix differs from zero.

> cortest(sat.act)


Tests of correlation matrices

Call:cortest(R1 = sat.act)

Chi Square value 1325.42 with df = 15 with probability < 1.8e-273

3.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process (Figure 13). This is also shown in terms of dichotomizing the bivariate normal density function using the draw.cor function (Figure 14). A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will sometimes happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
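A minimal sketch using the burt data set mentioned above (the negative eigenvalue is the symptom of the problem):

data(burt)
min(eigen(burt)$values)      #slightly negative: not positive semi-definite
burt.s <- cor.smooth(burt)   #adjust and rescale the eigenvalues
min(eigen(burt.s)$values)    #now non-negative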

3.6 Multiple regression from data or correlation matrices

The typical application of the lm function is to do a linear model of one Y variable as a function of multiple X variables. Because lm is designed to analyze complex interactions, it requires raw data as input. It is, however, sometimes convenient to do multiple regression from a correlation or covariance matrix. The setCor function will do this, taking a set of y variables predicted from a set of x variables, perhaps with a set of z covariates removed from both x and y. Consider the Thurstone correlation matrix and find the multiple correlation of the last five variables as a function of the first 4:

> setCor(y = 5:9, x = 1:4, data=Thurstone)


> draw.tetra()


Figure 13: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values.


> draw.cor(expand=20, cuts=c(0,0))


Figure 14: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values. It is found (laboriously) by optimizing the fit of the bivariate normal for various values of the correlation to the observed cell frequencies.


Call: setCor(y = 5:9, x = 1:4, data = Thurstone)

Multiple Regression from matrix input

Beta weights
                FourLetterWords Suffixes LetterSeries Pedigrees LetterGroup
Sentences                  0.09     0.07         0.25      0.21        0.20
Vocabulary                 0.09     0.17         0.09      0.16       -0.02
SentCompletion             0.02     0.05         0.04      0.21        0.08
FirstLetters               0.58     0.45         0.21      0.08        0.31

Multiple R
FourLetterWords        Suffixes    LetterSeries       Pedigrees     LetterGroup
           0.69            0.63            0.50            0.58            0.48

multiple R2
FourLetterWords        Suffixes    LetterSeries       Pedigrees     LetterGroup
           0.48            0.40            0.25            0.34            0.23

Unweighted multiple R
FourLetterWords        Suffixes    LetterSeries       Pedigrees     LetterGroup
           0.59            0.58            0.49            0.58            0.45

Unweighted multiple R2
FourLetterWords        Suffixes    LetterSeries       Pedigrees     LetterGroup
           0.34            0.34            0.24            0.33            0.20

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.6280 0.1478 0.0076 0.0049
Average squared canonical correlation = 0.2
Cohens Set Correlation R2 = 0.69
Unweighted correlation between the two sets = 0.73

By specifying the number of subjects in the correlation matrix, appropriate estimates of standard errors, t-values, and probabilities are also found. The next example finds the regressions with variables 1 and 2 used as covariates. The β weights for variables 3 and 4 do not change, but the multiple correlation is much less. It also shows how to find the residual correlations between variables 5-9 with variables 1-4 removed.

> sc <- setCor(y = 5:9, x = 3:4, data=Thurstone, z=1:2)

Call: setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)

Multiple Regression from matrix input

Beta weights
                FourLetterWords Suffixes LetterSeries Pedigrees LetterGroup
SentCompletion             0.02     0.05         0.04      0.21        0.08
FirstLetters               0.58     0.45         0.21      0.08        0.31

Multiple R
FourLetterWords        Suffixes    LetterSeries       Pedigrees     LetterGroup
           0.58            0.46            0.21            0.18            0.30

multiple R2
FourLetterWords        Suffixes    LetterSeries       Pedigrees     LetterGroup
          0.331           0.210           0.043           0.032           0.092

Unweighted multiple R
FourLetterWords        Suffixes    LetterSeries       Pedigrees     LetterGroup
           0.44            0.35            0.17            0.14            0.26

Unweighted multiple R2
FourLetterWords        Suffixes    LetterSeries       Pedigrees     LetterGroup
           0.19            0.12            0.03            0.02            0.07

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.405 0.023
Average squared canonical correlation = 0.21
Cohens Set Correlation R2 = 0.42
Unweighted correlation between the two sets = 0.48

> round(sc$residual, 2)

                FourLetterWords Suffixes LetterSeries Pedigrees LetterGroup
FourLetterWords            0.52     0.11         0.09      0.06        0.13
Suffixes                   0.11     0.60        -0.01      0.01        0.03
LetterSeries               0.09    -0.01         0.75      0.28        0.37
Pedigrees                  0.06     0.01         0.28      0.66        0.20
LetterGroup                0.13     0.03         0.37      0.20        0.77

3.7 Mediation and Moderation analysis

Although multiple regression is a straightforward method for determining the effect of multiple predictors (x1,2,...,i) on a criterion variable y, some prefer to think of the effect of one predictor, x, as mediated by another variable, m (Preacher and Hayes, 2004). Thus, we may find the indirect path from x to m, and then from m to y, as well as the direct path from x to y. Call these paths a, b, and c, respectively. Then the indirect effect of x on y through m is just ab, and the direct effect is c. Statistical tests of the ab effect are best done by bootstrapping.

Consider the example from Preacher and Hayes (2004) as analyzed using the mediate function, and the subsequent graphic from mediate.diagram. The data are found in the example for mediate.

Call: mediate(y = 1, x = 2, m = 3, data = sobel)

The DV (Y) was SATIS. The IV (X) was THERAPY. The mediating variable(s) = ATTRIB.

Total Direct effect(c) of THERAPY on SATIS = 0.76  S.E. = 0.31  t direct = 2.5  with probability = 0.019
Direct effect (c') of THERAPY on SATIS removing ATTRIB = 0.43  S.E. = 0.32  t direct = 1.35  with probability = 0.19
Indirect effect (ab) of THERAPY on SATIS through ATTRIB = 0.33
Mean bootstrapped indirect effect = 0.33 with standard error = 0.17  Lower CI = 0.04  Upper CI = 0.71
R2 of model = 0.31

To see the longer output, specify short = FALSE in the print statement

Full output

Total effect estimates (c)
        SATIS   se   t   Prob
THERAPY  0.76 0.31 2.5 0.0186

Direct effect estimates (c')
        SATIS   se    t  Prob
THERAPY  0.43 0.32 1.35 0.190
ATTRIB   0.40 0.18 2.23 0.034

'a' effect estimates
       THERAPY  se    t   Prob
ATTRIB    0.82 0.3 2.74 0.0106

'b' effect estimates
       SATIS   se    t  Prob
ATTRIB   0.4 0.18 2.23 0.034

'ab' effect estimates
        SATIS boot   sd lower upper
THERAPY  0.33 0.33 0.17  0.04  0.71

4 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.


> mediate.diagram(preacher)


Figure 15: A mediated model taken from Preacher and Hayes (2004) and solved using the mediate function. The direct path from Therapy to Satisfaction has an effect of .76, while the indirect path through Attribution has an effect of .33. Compare this to the normal regression graphic created by setCor.diagram.


> preacher <- setCor(1, c(2,3), sobel, std=FALSE)
> setCor.diagram(preacher)


Figure 16: The conventional regression model for the Preacher and Hayes (2004) data set solved using the setCor function. Compare this to the previous figure.


4.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented, and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
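The equivalence between the SVD and the principal components of the correlation matrix can be sketched directly in base R (a toy check using the sat.act data, not a psych function):

X <- scale(na.omit(sat.act))   #standardize so that X'X/(n-1) is the correlation matrix
sv <- svd(X)
#principal component loadings recovered from the singular value decomposition
pc.load <- sv$v %*% diag(sv$d) / sqrt(nrow(X) - 1)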

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will be eventually phased out.

fa.poly is useful when finding the factor structure of categorical items. fa.poly first finds the tetrachoric or polychoric correlations between the categorical variables and then proceeds to do a normal factor analysis. By setting the n.iter option to be greater than 1, it will also find confidence intervals for the factor solution. Warning: Finding polychoric correlations is very slow, so think carefully before doing so.

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

2 Cattell (1978), as well as MacCallum et al. (2007), argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

factor.minres (deprecated) Minimum residual factor analysis is a least squares, iterative solution to the factor problem. minres attempts to minimize the residual (off-diagonal) correlation matrix. It produces solutions similar to maximum likelihood solutions, but will work even if the matrix is singular.

factor.pa (deprecated) Principal Axis factor analysis is a least squares, iterative solution to the factor problem. PA will work for cases where maximum likelihood techniques (factanal) will not work. The original communality estimates are either the squared multiple correlations (smc) for each item or 1.

factor.wls (deprecated) Weighted least squares factor analysis is a least squares, iterative solution to the factor problem. It minimizes the (weighted) squared residual matrix. The weights are based upon the independent contribution of each variable.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values. Note that PCA is not the same as factor analysis, and the two should not be confused.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings. (A one-line version of this computation is sketched after this list.)

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

nfactors combines VSS, MAP, and a number of other fit statistics. The depressing reality is that frequently these conventional fit estimates of the number of factors do not agree.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
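As promised above, the congruence coefficient for two columns of loadings is a one-liner. The helper below is written here purely for illustration; use factor.congruence for real work:

congruence <- function(f1, f2) sum(f1 * f2) / sqrt(sum(f1^2) * sum(f2^2))
congruence(c(.8, .7, .6), c(.7, .6, .5))   #close to 1: nearly the same direction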

4.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix nRn be approximated by the product of a factor matrix nFk and its transpose plus a diagonal matrix of uniqueness:

R = FF′ + U2                                             (1)
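For an oblique solution, equation 1 generalizes to R = FΦF′ + U2, where Φ is the matrix of factor correlations. A minimal numeric check in R (assuming the GPArotation package is installed, so that the default oblimin transformation is used):

f3 <- fa(Thurstone, 3, n.obs = 213)                  #oblique by default
common <- f3$loadings %*% f3$Phi %*% t(f3$loadings)  #the common part, F Phi F'
round(Thurstone - common, 2)   #off diagonal near 0; diagonal approximates U2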

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 2). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", "varimin" and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ" and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation. The measures of factor adequacy reflect the


multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

Note that if extracting more than one factor and doing any oblique rotation, it is necessary to have the GPArotation package installed. This is checked for in the appropriate functions.

4.1.2 Principal Axis Factor Analysis

An alternative, least squares algorithm (included in fa with the fm="pa" option, or as a standalone function (factor.pa)), does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the maxiter parameter to 1 produces the SAS solution.

The solutions from the fa, the factor.minres and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax, quartimax, varimin, and bifactor. Oblique transformations include oblimin, quartimin, biquartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 3).

4.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 4).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 17). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 17) or by a structure diagram (Figure 18).

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 2: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> if(!require(GPArotation)) {stop("GPArotation must be installed to do rotations")} else {
+   f3t <- fa(Thurstone, 3, n.obs=213)
+   f3t }

Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
SentCompletion   0.83  0.04  0.00 0.73 0.27 1.0
FirstLetters     0.00  0.86  0.00 0.73 0.27 1.0
FourLetterWords -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
LetterSeries     0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
LetterGroup     -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> if(!require(GPArotation)) {stop("GPArotation must be installed to do rotations")} else {
+   f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
+   f3o <- target.rot(f3)
+   f3o }

Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
SentCompletion   0.83  0.04  0.03 0.70 0.30
FirstLetters    -0.02  0.85 -0.01 0.73 0.27
FourLetterWords -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
LetterSeries    -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
LetterGroup     -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

    PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


Table 4: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone, 3, n.obs = 213, fm="wls")
> print(f3w, cut=0, digits=3)

Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
SentCompletion   0.833  0.034  0.007 0.735 0.265 1.00
FirstLetters    -0.002  0.855  0.003 0.731 0.269 1.00
FourLetterWords -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
LetterSeries     0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
LetterGroup     -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627


> plot(f3t)

[Figure 17 about here: a "Factor Analysis" scatter plot matrix of the MR1, MR2 and MR3 factor loadings, with axes from 0.0 to 0.8]

Figure 17: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.


> fa.diagram(f3t)

[Figure 18 about here: a "Factor Analysis" path diagram: MR1 to Sentences (0.9), Vocabulary (0.9) and Sent.Completion (0.8); MR2 to First.Letters (0.9), Four.Letter.Words (0.7) and Suffixes (0.6); MR3 to Letter.Series (0.8), Letter.Group (0.6) and Pedigrees (0.5); factor intercorrelations of 0.5 to 0.6]

Figure 18: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

4.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Some psychologists use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 5 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.

4.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 18) can themselves be factored. This results in a higher order factor model (Figure 19). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 20). Yet another solution is to use structural equation modeling to directly solve for the general and group factors.


Table 5: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 3. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p

Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    RC1   RC2   RC3   h2   u2 com
Sentences          0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary         0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion    0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters      0.01  0.84  0.07 0.78 0.22 1.0
Four.Letter.Words -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes           0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series      0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees          0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group      -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2

Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))

[Figure 19 about here: an "Omega" higher order path diagram in which the group factors F1, F2 and F3 (defined by the nine Thurstone variables) load 0.8, 0.8 and 0.7 on the general factor g]

Figure 19: A higher order factor solution to the Thurstone 9 variable problem.


> om <- omega(Thurstone,n.obs=213)

[Figure 20 about here: an "Omega" bifactor diagram in which g loads directly on all nine Thurstone variables (0.5 to 0.7), with residualized group factors F1, F2 and F3]

Figure 20: A bifactor factor solution to the Thurstone 9 variable problem.


Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011).

4.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by e.g. (Tryon, 1935; Loevinger et al., 1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) reviews why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979), which proceeds as follows (a minimal sketch of the loop follows the steps):

1. Find the proximity (e.g., correlation) matrix,

2. Identify the most similar pair of items,

3. Combine this most similar pair of items to form a new variable (cluster),

4. Find the similarity of this cluster to all other items and clusters,

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or in iclust if there is a failure to increase reliability coefficients α or β),

6. Purify the solution by reassigning items to the most similar cluster center.
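To make steps 1 to 5 concrete, the following is a minimal sketch of the agglomerative loop in plain R. It is not the iclust algorithm itself: the α and β stopping rules, the reversal of negatively correlated items, and the purification step (step 6) are all omitted, and a fixed number of clusters is used instead; the function name simple.iclust and its arguments are purely illustrative.

simple.iclust <- function(items, n.clusters = 2) {
  scores <- as.matrix(items)             # each column starts as its own cluster
  while (ncol(scores) > n.clusters) {
    R <- cor(scores, use = "pairwise")   # step 1: the proximity (correlation) matrix
    diag(R) <- -Inf                      # ignore self-similarities
    ij <- which(R == max(R, na.rm = TRUE), arr.ind = TRUE)[1, ]  # step 2: most similar pair
    new.cluster <- rowSums(scores[, ij], na.rm = TRUE)           # step 3: combine the pair
    scores <- cbind(scores[, -ij, drop = FALSE], new.cluster)    # step 4: rescored next pass
  }
  scores                                 # the surviving cluster scores
}
# e.g., head(simple.iclust(bfi[1:25], n.clusters = 4))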


iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 21). Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 8). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[Figure 21 about here: an "ICLUST" dendrogram of the 25 bfi items, with clusters C20 (α = 0.81, β = 0.63), C16 (α = 0.81, β = 0.76), C15 (α = 0.73, β = 0.67) and C21 (α = 0.41, β = 0.27) and their item loadings]

Figure 21: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.


Table 6: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.

> summary(ic)  #show the results

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


The previous analysis (Figure 21) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 22). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using progressBar.

Table 7: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25]) #the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

4.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 9). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

Bootstrapped confidence intervals can also be found for the loadings of a factoring of a polychoric matrix. fa.poly will find the polychoric correlation matrix and, if the n.iter option is greater than 1, will then randomly resample the data (case wise) to give bootstrapped samples. This will take a long time for large numbers of items or iterations.
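For example, a sketch (the iteration count is kept unrealistically small here only so that the example runs quickly):

> fp <- fa.poly(bfi[1:10],2,n.iter=5)  #a polychoric factoring with 5 bootstrap samples
> fp                                   #print the loadings and the confidence intervals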

4.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of Burt's congruence coefficient (also known as Tucker's


> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[Figure 22 about here: the "ICLUST using polychoric correlations" dendrogram, with top-level clusters C23 (α = 0.83, β = 0.58) and C22 (α = 0.81, β = 0.29)]

Figure 22: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 21), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[Figure 23 about here: the "ICLUST using polychoric correlations for nclusters=5" dendrogram; clusters C20, C18, C16, C15 and C7, with O4 left unassigned]

Figure 23: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 22), which was done without specifying the number of clusters, and to the next one (Figure 24), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")

[Figure 24 about here: the "ICLUST beta.size=3" dendrogram, essentially the same structure as Figure 22]

Figure 24: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 21, 22, 23).


Table 8: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic,cut=.3)

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
    O    P     C20   C16   C15   C21
A1  C20  C20
A2  C20  C20  0.59
A3  C20  C20  0.65
A4  C20  C20  0.43
A5  C20  C20  0.65
C1  C15  C15              0.54
C2  C15  C15              0.62
C3  C15  C15              0.54
C4  C15  C15        0.31 -0.66
C5  C15  C15 -0.30  0.36 -0.59
E1  C20  C20 -0.50
E2  C20  C20 -0.61  0.34
E3  C20  C20  0.59             -0.39
E4  C20  C20  0.66
E5  C20  C20  0.50  0.40 -0.32
N1  C16  C16        0.76
N2  C16  C16        0.75
N3  C16  C16        0.74
N4  C16  C16 -0.34  0.62
N5  C16  C16        0.55
O1  C20  C21                   -0.53
O2  C21  C21                    0.44
O3  C20  C21  0.39             -0.62
O4  C21  C21                   -0.33
O5  C21  C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL


Table 9: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)

Factor Analysis with confidence intervals using method = fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1

Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with Likelihood Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.006 and the 90 % confidence intervals are 0.006 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.44 -0.40 -0.37
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.07 -0.03  0.00  0.73  0.77  0.80
A4  0.10  0.15  0.20  0.40  0.44  0.48
A5 -0.02  0.02  0.06  0.57  0.62  0.67
C1  0.54  0.57  0.60 -0.09 -0.05 -0.02
C2  0.58  0.62  0.67 -0.04 -0.01  0.02
C3  0.48  0.54  0.58  0.00  0.03  0.08
C4 -0.69 -0.66 -0.60 -0.03  0.01  0.04
C5 -0.62 -0.57 -0.52 -0.08 -0.05 -0.02

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.27     0.32  0.37
>


coefficient), which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}.$$
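As a sketch of the computation (the helper congruence below is ours, not part of psych; psych's factor.congruence does the same calculation, with additional error checking):

congruence <- function(f, g) {
  f <- as.matrix(f)   # loadings, variables by factors
  g <- as.matrix(g)
  # cross products sum(f_ik * g_jk), scaled by the factor (column) norms
  t(f) %*% g / sqrt(outer(colSums(f^2), colSums(g^2)))
}
# round(congruence(f3t$loadings, p3p$loadings), 2) should agree with
# factor.congruence(f3t, p3p)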

Consider the case of a four factor solution and four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 10).

Table 10: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))

     MR1  MR2  MR3  PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01 1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00 0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52 0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51 0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06 1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02 0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

4.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" Horn and Engstrom (1979).


Techniques most commonly used include (several of these may be tried at once, as in the sketch following this list):

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.
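A sketch trying several of these criteria at once, assuming the nfactors convenience function available in recent versions of psych (it reports VSS, MAP, and the fit statistics for a range of solutions):

> nf <- nfactors(bfi[1:25], n = 8)   #compare solutions with 1 to 8 factors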

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples the eigen values of random factors will all tend towards 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigators creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2). The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest but that there are also numerous minor factors of no substantive interest but that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
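A sketch of that later discussion, assuming that sim.minor returns the simulated raw data in its $observed element when n > 0:

> set.seed(42)
> minor <- sim.minor(nvar = 12, nfact = 3, n = 500)  #3 major factors plus nuisance minor factors
> fa.parallel(minor$observed)  #how does a number-of-factors rule react?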


4.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity Revelle and Rocklin (1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 25). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[Figure 25 about here: the "Very Simple Structure of a Big 5 inventory" fit plot; the complexity 1 and 2 curves peak at four factors]

Figure 25: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors
The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit  RMSEA  BIC   SABIC complex
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.0150 9648 10522.1     1.0
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.0100 5287  6084.5     1.2
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.0075 3200  3924.3     1.3
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.0055 1731  2385.1     1.4
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.0030  281   869.3     1.6
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.0018 -296   228.5     1.7
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.0013 -463     1.2     1.9
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.0010 -524  -117.6     1.9

  eChisq  SRMR eCRMS  eBIC
1  23881 0.119 0.125 21698
2  12432 0.086 0.094 10440
3   7232 0.066 0.075  5422
4   3750 0.047 0.057  2115
5   1495 0.030 0.038    27
6    670 0.020 0.027  -639
7    448 0.016 0.023  -711
8    289 0.013 0.020  -727

4.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 26). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly. By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")

Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Figure 26 about here: the "Parallel Analysis of a Big 5 inventory" plot of eigen values of principal components and factor analysis for the actual, simulated, and resampled data]

Figure 26: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


4.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by (Dwyer, 1937). Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 27).

Another way to examine the overlap between two sets is the use of set correlation found by set.cor (discussed later).

4.6 Exploratory Structural Equation Modeling (ESEM)

Generalizing the procedures of factor extension, we can do Exploratory Structural Equation Modeling (ESEM). Traditional Exploratory Factor Analysis (EFA) examines how latent variables can account for the correlations within a data set. All loadings and cross loadings are found, and rotation is done to some approximation of simple structure. Traditional Confirmatory Factor Analysis (CFA) tests such models by fitting just a limited number of loadings and typically does not allow any (or many) cross loadings. Structural Equation Modeling then applies two such measurement models, one to a set of X variables, another to a set of Y variables, and then tries to estimate the correlation between these two sets of latent variables. (Some SEM procedures estimate all the parameters from the same model, thus making the loadings in set Y affect those in set X.) It is possible to do a similar exploratory modeling (ESEM) by conducting two Exploratory Factor Analyses, one in set X, one in set Y, and then finding the correlations of the X factors with the Y factors, as well as the correlations of the Y variables with the X factors and the X variables with the Y factors.

Consider the simulated data set of three ability variables, two motivational variables, and three outcome variables:
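The code that generated this data set is not echoed in the text; the following sketch reproduces the population matrix shown below (the particular fx, fy and Phi values are inferred from that matrix, not taken from the text):

> fx <- matrix(c(.9,.8,.6,0,0,  0,0,.6,.8,-.7),ncol=2)   #loadings of the X (ability, motivation) set
> fy <- matrix(c(.6,.5,.4),ncol=1)                       #loadings of the Y (outcome) set
> rownames(fx) <- c("V","Q","A","nach","Anx")
> rownames(fy) <- c("gpa","Pre","MA")
> Phi <- matrix(c(1,0,.7,  0,1,.7,  .7,.7,1),ncol=3)     #factor intercorrelations
> gre.gpa <- sim.structural(fx = fx, Phi = Phi, fy = fy)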

Call: sim.structural(fx = fx, Phi = Phi, fy = fy)

$model (Population correlation matrix)
        V    Q     A  nach   Anx   gpa   Pre    MA
V    1.00 0.72  0.54  0.00  0.00  0.38  0.32  0.25
Q    0.72 1.00  0.48  0.00  0.00  0.34  0.28  0.22
A    0.54 0.48  1.00  0.48 -0.42  0.50  0.42  0.34
nach 0.00 0.00  0.48  1.00 -0.56  0.34  0.28  0.22


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Figure 27 about here: the "Factor analysis and extension" diagram; MR1 and MR2 are defined by the odd numbered variables (loadings about |0.5| to |0.7|) and extended to the even numbered variables (loadings about |0.5| to |0.6|)]

Figure 27: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


Anx  0.00 0.00 -0.42 -0.56  1.00 -0.29 -0.24 -0.20
gpa  0.38 0.34  0.50  0.34 -0.29  1.00  0.30  0.24
Pre  0.32 0.28  0.42  0.28 -0.24  0.30  1.00  0.20
MA   0.25 0.22  0.34  0.22 -0.20  0.24  0.20  1.00

$reliability (population reliability)
   V    Q    A nach  Anx  gpa  Pre   MA
0.81 0.64 0.72 0.64 0.49 0.36 0.25 0.16

We can fit this by using the esem function and then draw the solution (see Figure 28) using the esem.diagram function (which is normally called automatically by esem).
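A sketch of the call (inferred from the Call: line in the output that follows):

> esem.gre <- esem(gre.gpa$model, varsX = 1:5, varsY = 6:8,
+                  nfX = 2, nfY = 1, n.obs = 1000)
> esem.diagram(esem.gre)   #draws Figure 28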

Exploratory Structural Equation Modeling Analysis using method = minres
Call: esem(r = gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
    n.obs = 1000, plot = FALSE)

For the X set:
       MR1   MR2
V     0.91 -0.06
Q     0.81 -0.05
A     0.53  0.57
nach -0.10  0.81
Anx   0.08 -0.71

For the Y set:
    MR1
gpa 0.6
Pre 0.5
MA  0.4

Correlations between the X and Y sets.
     X1   X2   Y1
X1 1.00 0.19 0.68
X2 0.19 1.00 0.67
Y1 0.68 0.67 1.00

The degrees of freedom for the null model are 56 and the empirical chi square function was 6930.29
The degrees of freedom for the model are 7 and the empirical chi square function was 21.83
with prob < 0.0027

The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.04


with the empirical chi square 21.83 with prob < 0.0027
The total number of observations was 1000 with fitted Chi Square = 2175.06 with prob < 0

Empirical BIC = -26.53
ESABIC = -4.29
Fit based upon off diagonal values = 1
To see the item loadings for the X and Y sets combined, and the associated fa output, print with short=FALSE.

[Figure 28 about here: the "Exploratory Structural Model" diagram; the X factors MR1 (V 0.9, Q 0.8) and MR2 (nach 0.8, Anx -0.7, A 0.6) each have a path of 0.7 to the Y factor MR1 (gpa 0.6, Pre 0.5, MA 0.4)]

Figure 28: An example of an Exploratory Structure Equation Model.


5 Classical Test Theory and Reliability

Surprisingly, 110 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and over estimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V}_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha$$
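The formula is easily verified directly; a sketch (the helper alpha.raw is ours, assuming items is a data.frame or matrix of numeric item scores; the alpha function reports the same value as raw_alpha):

alpha.raw <- function(items) {
  Vx <- cov(items, use = "pairwise")           # item variances and covariances
  n <- ncol(Vx)
  (n/(n - 1)) * (1 - sum(diag(Vx))/sum(Vx))    # lambda_3 = alpha
}
# e.g., alpha.raw(attitude) should match alpha(attitude)$total$raw_alpha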

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. Alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j^2, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum(1 - r_{smc}^2)}{V_x}.$$
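Again a direct translation, as a sketch (smc is psych's squared multiple correlation function; for standardized items, Vx is the sum of the correlation matrix R):

lambda6 <- function(R) {          # R: a correlation matrix
  1 - sum(1 - smc(R))/sum(R)      # 1 - sum(e_j^2)/Vx
}
# e.g., lambda6(cor(attitude)) should match the G6(smc) reported by alpha(attitude)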

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal, or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.


More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 . . . μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

5.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β,


McDonald's ωh and ωt. Consider a simulated data set representing 9 items with a hierarchical structure, and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be


reversed scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that items 2 and 6 (complaints and critical) are to be scored (incorrectly) negatively.

> alpha(attitude,keys=c("complaints","critical"))

Reliability analysis
Call: alpha(x = attitude, keys = c("complaints", "critical"))

  raw_alpha std.alpha G6(smc) average_r  S/N  ase mean  sd
       0.18      0.27    0.66      0.05 0.37 0.19   53 4.7

 lower alpha upper     95% confidence boundaries
-0.18 0.18 0.55

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r    S/N alpha se
rating         -0.023     0.090    0.52    0.0162  0.099    0.265
complaints-     0.689     0.666    0.76    0.2496  1.995    0.078
privileges     -0.133     0.021    0.58    0.0036  0.022    0.282
learning       -0.459    -0.251    0.41   -0.0346 -0.201    0.363
raises         -0.178    -0.062    0.47   -0.0098 -0.058    0.295
critical-       0.364     0.473    0.76    0.1299  0.896    0.139
advance        -0.137    -0.016    0.52   -0.0026 -0.016    0.258

 Item statistics
             n  raw.r  std.r r.cor r.drop mean   sd
rating      30  0.600  0.601  0.63   0.28   65 12.2
complaints- 30 -0.542 -0.559 -0.78  -0.75   50 13.3
privileges  30  0.679  0.663  0.57   0.39   53 12.2
learning    30  0.857  0.853  0.90   0.70   56 11.7
raises      30  0.719  0.730  0.77   0.50   65 10.4
critical-   30  0.015  0.036 -0.29  -0.27   42  9.9
advance     30  0.688  0.694  0.66   0.46   43 10.3
>

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if items 2 and 6 are dropped, then the reliability is improved substantially. This suggests that items 2 and 6 were incorrectly scored. Doing the analysis again with the items positively scored produces much more favorable results.

> alpha(attitude)

Reliability analysis
Call: alpha(x = attitude)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
0.76 0.84 0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE) #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
0.68 0.72 0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0


5.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 29) or omegaSem for a confirmatory analysis using sem, based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf )

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal).

For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general


factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √(r_ab), where r_ab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. 2007. To do this in omega, add the option="first" or option="second" to the call.
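For example, a sketch using the r9 simulated data introduced in section 5.1:

> om2 <- omega(r9, nfactors = 2, option = "first")  #equate g with the first group factor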

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness u^2 from factor analysis to find e_j^2. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error).

Letting $\vec{x} = \vec{c}g + \vec{A}\vec{f} + \vec{D}\vec{s} + \vec{e}$, then the communality of item j, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$, and the unique variance for the item, $u_j^2 = \sigma_j^2(1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}$$

Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings


on the general factor,

$$\omega_h = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{V_x}.$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{\infty} = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}.$$

> om9 <- omega(r9,title="9 simulated variables")

[Figure 29 about here: the "9 simulated variables" bifactor diagram; g loads on all nine variables (0.3 to 0.7), with group factors F1, F2 and F3]

Figure 29: A bifactor solution for 9 simulated variables with a hierarchical structure.

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.


> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max 4.76   max/min = 1.46
mean percent general = 0.64 with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03

RMSEA index = 0.001 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.01 and the 90 % confidence intervals are 0.01 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19


5.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500,lavaan=FALSE)

Call: omegaSem(m = r9, n.obs = 500, lavaan = FALSE)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max 4.76   max/min = 1.46
mean percent general = 0.64 with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03

RMSEA index = 0.001 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09

RMSEA index = 0.01 and the 90 % confidence intervals are 0.01 0.114
BIC = -7.97


Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

The following analyses were done using the sem package

Omega Hierarchical from a confirmatory model using sem = 0.72
Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g   F1   F2   F3   h2   u2   p2
V1 0.74 0.40           0.71 0.29 0.77
V2 0.58 0.36           0.47 0.53 0.72
V3 0.56 0.42           0.50 0.50 0.63
V4 0.57      0.45      0.53 0.47 0.61
V5 0.58      0.25      0.40 0.60 0.84
V6 0.43      0.26      0.26 0.74 0.71
V7 0.49           0.38 0.38 0.62 0.63
V8 0.44           0.45 0.39 0.61 0.50
V9 0.34           0.24 0.17 0.83 0.68

With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33

general/max 5.51   max/min = 1.41
mean percent general = 0.68 with sd = 0.1 and cv of 0.15
Explained Common Variance of the general factor = 0.68

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.87  0.59  0.59  0.55
Multiple R square of scores with factors      0.75  0.35  0.34  0.30
Minimum correlation of factor score estimates 0.50 -0.31 -0.31 -0.39

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.57 0.65
Omega general for total scores and subscales  0.72 0.57 0.33 0.48
Omega group for total scores and subscales    0.11 0.22 0.24 0.18

To get the standard sem fit statistics ask for summary on the fitted object

5.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf and guttman functions. These are described in more detail in Revelle and Zinbarg (2009) and in Revelle and Condon


(2014). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72

5.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

5.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function and a list of items to be scored on each scale (a keys.list). Items may be listed by location (convenient but dangerous) or name (probably safer).

Make a keys.list by specifying the items for each scale, preceding items to be negatively keyed with a - sign.

> #the newer way is probably preferred
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+    conscientious=c("C1","C2","C2","-C4","-C5"),
+    extraversion=c("-E1","-E2","E3","E4","E5"),
+    neuroticism=c("N1","N2","N3","N4","N5"),
+    openness = c("O1","-O2","O3","O4","-O5"))
> #this can also be done by location--
> keys.list <- list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25))
> #These two approaches can be mixed if desired
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),conscientious=c("C1","C2","C2","-C4","-C5"),
+    extraversion=c("-E1","-E2","E3","E4","E5"),
+    neuroticism=c(16:20),openness = c(21,-22,23,24,-25))
> keys.list

$agree
[1] "-A1" "A2"  "A3"  "A4"  "A5"

$conscientious
[1] "C1"  "C2"  "C2"  "-C4" "-C5"

$extraversion
[1] "-E1" "-E2" "E3"  "E4"  "E5"

$neuroticism
[1] 16 17 18 19 20

$openness
[1] 21 -22 23 24 -25

In the past (prior to version 1.6.9) the keys.list was then converted to a keys matrix using the helper function make.keys. This is no longer necessary. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.
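For example, a minimal sketch of reverse scoring (assuming the bfi items, which are answered on a 1 to 6 scale; reverse.code is the psych helper function):

x <- bfi$A1                                      #A1 is negatively keyed
x.rev <- (6 + 1) - x                             #subtract the item from max + min
x.rev2 <- reverse.code(-1,bfi$A1,mini=1,maxi=6)  #the helper function does the same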

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copied into R.) Although not normally necessary, show the keys to understand what is happening. There are two ways to make up the keys. You can specify the items by location (the old way) or by name (the newer and probably preferred way). To use the newer way you must specify the file on which you will use the keys. The example below shows how to construct keys either way.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. This is done automatically in the new way of forming keys, but if using the older way, the number must be specified. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

Then use this keys list to score the items:

> scores <- scoreItems(keys.list,bfi)
> scores

Call: scoreItems(keys = keys.list, items = bfi)

(Unstandardized) Alpha:
      agree conscientious extraversion neuroticism openness
alpha   0.7          0.69         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    agree conscientious extraversion neuroticism openness
ASE 0.014         0.016        0.013       0.011    0.017

Average item correlation:
          agree conscientious extraversion neuroticism openness
average.r  0.32          0.35         0.39        0.46     0.23

Guttman 6* reliability:
         agree conscientious extraversion neuroticism openness
Lambda.6   0.7          0.69         0.76        0.81      0.6

Signal/Noise based upon av.r:
             agree conscientious extraversion neuroticism openness
Signal/Noise   2.3           2.2          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.70          0.36         0.63      -0.245     0.23
conscientious  0.25          0.69         0.37      -0.334     0.33
extraversion   0.46          0.27         0.76      -0.284     0.32
neuroticism   -0.18         -0.25        -0.22       0.812    -0.12
openness       0.15          0.21         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data
print with the short option = FALSE

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the 'scores' values of scores. (See Figure 30.)

5.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use scoreItems:

> r.bfi <- cor(bfi,use="pairwise")
> scales <- scoreItems(keys.list,r.bfi)
> summary(scales)

Call: scoreItems(keys = keys.list, items = r.bfi)

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.71          0.35         0.64      -0.242     0.25
conscientious  0.24          0.69         0.38      -0.314     0.33
extraversion   0.47          0.27         0.76      -0.278     0.35
neuroticism   -0.18         -0.24        -0.22       0.815    -0.11
openness       0.16          0.22         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the

> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()
pdf
  2

Figure 30: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.

5.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project, http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean
reason4     4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64
reason16    4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70
reason17    4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70
reason19    6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62
letter7     6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60
letter33    3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57
letter34    4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61
letter58    4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44
matrix45    5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53
matrix46    2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55
matrix47    2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61
matrix55    4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37
rotate3     3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19
rotate4     2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21
rotate6     6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30
rotate8     7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19
           sd
reason4  0.48
reason16 0.46
reason17 0.46
reason19 0.49
letter7  0.49
letter33 0.50
letter34 0.49
letter58 0.50
matrix45 0.50
matrix46 0.50
matrix47 0.49
matrix55 0.48
rotate3  0.40
rotate4  0.41
rotate6  0.46
rotate8  0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf) #compare to previous results

         vars    n mean   sd median trimmed mad min max range  skew kurtosis
reason4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66
reason16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26
reason17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26
reason19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78
letter7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84
letter33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92
letter34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79
letter58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95
matrix45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99
matrix46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96
matrix47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78
matrix55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73
rotate3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40
rotate4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03
rotate6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24
rotate8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63
           se
reason4  0.01
reason16 0.01
reason17 0.01
reason19 0.01
letter7  0.01
letter33 0.01
letter34 0.01
letter58 0.01
matrix45 0.01
matrix46 0.01
matrix47 0.01
matrix55 0.01
rotate3  0.01
rotate4  0.01
rotate6  0.01
rotate8  0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.


5.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or for many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
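A minimal sketch of this sequence (assuming the bfi items and the keys.list defined earlier):

describe(bfi[1:25])                  #basic descriptive statistics
fa.parallel(bfi[1:25])               #suggests the number of dimensions
f5 <- fa(bfi[1:25],5)                #a five factor solution
scales <- scoreItems(keys.list,bfi)  #item-whole statistics for all five scales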

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 31).

5.6.1 Exploring the item structure of scales

The Big Five scales found above can be understood in terms of the item-whole correlations, but it is also useful to think of the endorsement frequency of the items. The item.lookup function will sort items by their factor loading/item-whole correlation, and then resort those above a certain threshold in terms of the item means. Item content is shown by using the dictionary developed for those items. This allows one to see the structure of each scale in terms of its endorsement range. This is a simple way of thinking of items that is also possible to do using the various IRT approaches discussed later.

> m <- colMeans(bfi,na.rm=TRUE)
> item.lookup(scales$item.corrected[,1:3],m,dictionary=bfi.dictionary[1:2])

     agree conscientious extraversion means ItemLabel
A1   -0.40         -0.06        -0.10  2.41     q_146
E1   -0.31         -0.07        -0.59  2.97     q_712
E2   -0.39         -0.26        -0.70  3.14     q_901
E3    0.45          0.21         0.61  4.00    q_1205
E4    0.51          0.24         0.68  4.42    q_1410
E5    0.35          0.40         0.56  4.42    q_1768
A5    0.62          0.22         0.55  4.56    q_1419
A3    0.70          0.22         0.48  4.60    q_1206
A4    0.49          0.30         0.30  4.70    q_1364
A2    0.67          0.21         0.40  4.80    q_1162
C4   -0.23         -0.67        -0.23  2.55     q_626
N4   -0.22         -0.31        -0.39  3.19    q_1479
C5   -0.26         -0.58        -0.29  3.30    q_1949
C3    0.21          0.56         0.15  4.30     q_619
C2    0.21          0.61         0.18  4.37     q_530
E5.1  0.35          0.40         0.56  4.42    q_1768
C1    0.13          0.54         0.20  4.50     q_124
E1.1 -0.31         -0.07        -0.59  2.97     q_712
E2.1 -0.39         -0.26        -0.70  3.14     q_901
N4.1 -0.22         -0.31        -0.39  3.19    q_1479
E3.1  0.45          0.21         0.61  4.00    q_1205
E4.1  0.51          0.24         0.68  4.42    q_1410

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2)) #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[Figure: four panels plotting P(response) against theta for items reason4, reason16, reason17, and reason19, with one curve for each response alternative (0-6).]

Figure 31: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


E5.2  0.35          0.40         0.56  4.42    q_1768
O3    0.26          0.22         0.43  4.44     q_492
A5.1  0.62          0.22         0.55  4.56    q_1419
A3.1  0.70          0.22         0.48  4.60    q_1206
A2.1  0.67          0.21         0.40  4.80    q_1162
O1    0.17          0.21         0.33  4.82     q_128
     Item
A1   Am indifferent to the feelings of others.
E1   Don't talk a lot.
E2   Find it difficult to approach others.
E3   Know how to captivate people.
E4   Make friends easily.
E5   Take charge.
A5   Make people feel at ease.
A3   Know how to comfort others.
A4   Love children.
A2   Inquire about others' well-being.
C4   Do things in a half-way manner.
N4   Often feel blue.
C5   Waste my time.
C3   Do things according to a plan.
C2   Continue until everything is perfect.
E5.1 Take charge.
C1   Am exacting in my work.
E1.1 Don't talk a lot.
E2.1 Find it difficult to approach others.
N4.1 Often feel blue.
E3.1 Know how to captivate people.
E4.1 Make friends easily.
E5.2 Take charge.
O3   Carry the conversation to a higher level.
A5.1 Make people feel at ease.
A3.1 Know how to comfort others.
A2.1 Inquire about others' well-being.
O1   Am full of ideas.

5.6.2 Empirical scale construction

There are some situations where one wants to identify those items that most relate to a particular criterion. Although this will capitalize on chance and the results should be interpreted cautiously, it does give a feel for what is being measured. Consider the following example from the bfi data set. The items that best predicted gender, education, and age may be found using the bestScales function. This also shows the use of a dictionary that has the item content.

> data(bfi)
> bestScales(bfi,criteria=c("gender","education","age"),cut=.1,dictionary=bfi.dictionary[,1:3])

The items most correlated with the criteria yield r's of:
          correlation n.items
gender           0.32       9
education        0.14       1
age              0.24       9

The best items, their correlations and content are:

$gender
  Rownames     gender ItemLabel                                      Item
1       N5  0.2106171    q_1505                             Panic easily.
2       A2  0.1820202    q_1162         Inquire about others' well-being.
3       A1 -0.1571373     q_146 Am indifferent to the feelings of others.
4       A3  0.1401320    q_1206               Know how to comfort others.
5       A4  0.1261747    q_1364                            Love children.
6       E1 -0.1261018     q_712                         Don't talk a lot.
7       N3  0.1207718    q_1099                Have frequent mood swings.
8       O1 -0.1031059     q_128                         Am full of ideas.
9       A5  0.1007091    q_1419                 Make people feel at ease.
      Giant3
1  Stability
2   Cohesion
3   Cohesion
4   Cohesion
5   Cohesion
6 Plasticity
7  Stability
8 Plasticity
9   Cohesion

$education
  Rownames  education ItemLabel                                      Item
1       A1 -0.1415734     q_146 Am indifferent to the feelings of others.
    Giant3
1 Cohesion

$age
  Rownames        age ItemLabel                                      Item
1       A1 -0.1609637     q_146 Am indifferent to the feelings of others.
2       C4 -0.1482659     q_626           Do things in a half-way manner.
3       A4  0.1442473    q_1364                            Love children.
4       A5  0.1290935    q_1419                 Make people feel at ease.
5       E5  0.1146922    q_1768                              Take charge.
6       A2  0.1142767    q_1162         Inquire about others' well-being.
7       N3 -0.1108648    q_1099                Have frequent mood swings.
8       E2 -0.1051612     q_901     Find it difficult to approach others.
9       N5 -0.1043322    q_1505                             Panic easily.
      Giant3
1   Cohesion
2  Stability
3   Cohesion
4   Cohesion
5 Plasticity
6   Cohesion
7  Stability
8 Plasticity
9  Stability

6 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 32).


6.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δ_j, and item discrimination, α_j, may be found from factor analysis of the tetrachoric correlations, where λ_j is just the factor loading on the first factor and τ_j is the normal threshold reported by the tetrachoric function.

\[
\delta_j = \frac{D\tau_j}{\sqrt{1-\lambda_j^2}} \qquad\qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \tag{2}
\]

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λ_j) and item thresholds (τ_j) are just

\[
\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}} \qquad\qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}
\]
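These conversions are easy to verify directly. A minimal sketch (the helper names are hypothetical, shown only to illustrate the equations; irt.fa does this internally):

fa2irt.demo <- function(lambda,tau,D=1)   #equation 2
   list(delta = D*tau/sqrt(1-lambda^2), alpha = lambda/sqrt(1-lambda^2))
irt2fa.demo <- function(alpha,delta)      #the inverse, for the normal model
   list(lambda = alpha/sqrt(1+alpha^2), tau = delta/sqrt(1+alpha^2))
fa2irt.demo(lambda=.8,tau=.5)             #delta = 0.83, alpha = 1.33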

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty:

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
> test <- irt.fa(d9$items)
> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
               -3    -2    -1     0     1     2     3
V1           0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2           0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3           0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4           0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5           0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6           0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7           0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8           0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9           0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info    0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM          1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02  0.53  0.65  0.68  0.64  0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.043 and the 90 % confidence intervals are 0.043 0.218
BIC = 1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.009 and the 90 % confidence intervals are 0.009 0.111
BIC = 96.23

The item information functions show that not all of the items are equally good (Figure 33):

These procedures can be generalized to more than one factor by specifying the number of


> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[Figure: three panels plotted against the latent trait (normal scale): item characteristic curves (probability of response) for V1-V9, item information curves for V1-V9, and the total test information curve with its associated reliability.]

Figure 32: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt,type="IIC")

[Figure: item information curves for -E1, -E2, E3, E4, and E5, plotted against the latent trait (normal scale).]

Figure 33: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.

6.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
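For example (a minimal sketch, assuming the ability data set used below):

iq.irt <- irt.fa(ability)   #slow: finds and saves the tetrachoric correlations
om <- omega(iq.irt$rho,4)   #fast: reuses the saved correlation matrix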

In addition, recent releases of psych take advantage of the parallel package and use multiple cores. The default for Macs and Unix machines is to use two cores, but this can be increased using the options command. The biggest step up in improvement is from 1 to 2 cores, but for large problems using polychoric correlations, the more cores available, the better.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 36) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items. We can also show the total test information (merely the sum of the item information). This shows that even with just 16 items, the test is very reliable for most of the range of ability. The irt.fa function saves the correlation matrix and item statistics so that they can be redrawn with other options.

detectCores()          #how many are available
options("mc.cores")    #how many have been set to be used
options("mc.cores"=4)  #set to use 4 cores

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = ability)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
               -3    -2    -1     0     1     2     3
reason4      0.05  0.24  0.64  0.53  0.16  0.03  0.01
reason16     0.08  0.22  0.38  0.31  0.14  0.05  0.01
reason17     0.08  0.33  0.69  0.42  0.11  0.02  0.00
reason19     0.06  0.17  0.35  0.36  0.19  0.07  0.02
letter7      0.05  0.18  0.41  0.44  0.20  0.06  0.02
letter33     0.05  0.15  0.31  0.36  0.20  0.08  0.02
letter34     0.05  0.19  0.45  0.46  0.20  0.06  0.01
letter58     0.02  0.09  0.30  0.53  0.35  0.12  0.03
matrix45     0.05  0.11  0.19  0.23  0.17  0.09  0.04
matrix46     0.05  0.12  0.22  0.26  0.18  0.09  0.04
matrix47     0.06  0.17  0.33  0.34  0.19  0.07  0.02
matrix55     0.04  0.08  0.13  0.17  0.16  0.11  0.06
rotate3      0.00  0.01  0.06  0.30  0.75  0.49  0.12
rotate4      0.00  0.01  0.05  0.31  1.00  0.56  0.10
rotate6      0.01  0.03  0.15  0.53  0.69  0.25  0.05
rotate8      0.00  0.02  0.08  0.29  0.59  0.41  0.13
Test Info    0.67  2.11  4.73  5.83  5.28  2.55  0.69
SEM          1.22  0.69  0.46  0.41  0.44  0.63  1.20
Reliability -0.49  0.53  0.79  0.83  0.81  0.61 -0.45

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.91
The number of observations was 1525 with Chi Square = 2893.45 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.723
RMSEA index = 0.018 and the 90 % confidence intervals are 0.018 0.137
BIC = 2131.15

6.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and those based upon IRT are practically identical (this may be seen in the examples for scoreIrt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the


> iq.irt <- irt.fa(ability)

[Figure: item information curves for the 16 ability items (reason, letter, matrix, and rotate items), plotted against the latent trait (normal scale).]

Figure 34: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


> plot(iq.irt,type="test")

[Figure: the total test information curve, plotted against the latent trait (normal scale), with the corresponding reliability values (0.31 to 0.83) marked on the right hand axis.]

Figure 35: A graphic analysis of 16 ability items sampled from the SAPA project. The total test information at all levels of difficulty may be shown by specifying the type='test' option in the plot function.


> om <- omega(iq.irt$rho,4)

[Figure: an omega diagram with a general factor g loading on all 16 ability items (loadings of roughly 0.4 to 0.7) and four group factors corresponding approximately to the rotate, letter, matrix, and reason items.]

Figure 36: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.


average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The scoreIrt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9,1000,-2.,2.,mod="normal") #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- scoreIrt(test,items)
> unitweighted <- scoreIrt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")

These results are seen in Figure 37.

6.3.1 1 versus 2 parameter IRT scoring

In Item Response Theory, items can be assumed to be equally discriminating but to differ in their difficulty (the Rasch model) or to vary in their discriminability. Two functions (scoreIrt.1pl and scoreIrt.2pl) are meant to find multiple IRT based scales using the Rasch model or the 2 parameter model. Both allow for negatively keyed as well as positively keyed items. Consider the bfi data set with scoring keys keys.list and items listed as an item.list. (This is the same as the keys.list, but with the negative signs removed.)

> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+    conscientious=c("C1","C2","C3","-C4","-C5"),
+    extraversion=c("-E1","-E2","E3","E4","E5"),
+    neuroticism=c("N1","N2","N3","N4","N5"),
+    openness = c("O1","-O2","O3","O4","-O5"))
> item.list <- list(agree=c("A1","A2","A3","A4","A5"),
+    conscientious=c("C1","C2","C3","C4","C5"),
+    extraversion=c("E1","E2","E3","E4","E5"),
+    neuroticism=c("N1","N2","N3","N4","N5"),
+    openness = c("O1","O2","O3","O4","O5"))
> bfi.1pl <- scoreIrt.1pl(keys.list,bfi) #the one parameter solution
> bfi.2pl <- scoreIrt.2pl(item.list,bfi) #the two parameter solution
> bfi.ctt <- scoreFast(keys.list,bfi) #fast scoring function
>


> pairs.panels(scores.df,pch='.',gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring',line=3)

Figure 37: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.

[Figure: a pairs.panels display, titled "Comparing true theta for IRT, Rasch and classically based scoring", of True theta, irt theta, total, fit, rasch, total, and fit; True theta correlates 0.83 with the IRT theta estimate and 0.40 with the simple total score.]


We can compare these three ways of doing the analysis using the cor2 function, which correlates two separate data frames. All three models produce very similar results for the case of almost complete data. It is when we have massively missing completely at random data (MMCAR) that the results show the superiority of the IRT scoring.

> #compare the solutions using the cor2 function
> cor2(bfi.1pl,bfi.ctt)
              agree conscientious extraversion neuroticism openness
agree          0.93          0.27         0.44       -0.20     0.19
conscientious  0.25          0.97         0.25       -0.22     0.17
extraversion   0.43          0.25         0.94       -0.23     0.23
neuroticism   -0.19         -0.23        -0.22        1.00    -0.09
openness       0.12          0.19         0.19       -0.07     0.95
> cor2(bfi.2pl,bfi.ctt)
              agree conscientious extraversion neuroticism openness
agree          0.98          0.25         0.49       -0.17     0.15
conscientious  0.26          0.97         0.25       -0.23     0.18
extraversion   0.43          0.24         0.94       -0.25     0.21
neuroticism   -0.19         -0.22        -0.18        0.98    -0.08
openness       0.17          0.21         0.27       -0.11     0.98
> cor2(bfi.2pl,bfi.1pl)
              agree conscientious extraversion neuroticism openness
agree          0.88          0.25         0.45       -0.17     0.12
conscientious  0.26          1.00         0.24       -0.23     0.16
extraversion   0.43          0.23         0.99       -0.25     0.20
neuroticism   -0.21         -0.21        -0.19        0.98    -0.07
openness       0.20          0.19         0.28       -0.11     0.92

7 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.


7.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

\[
r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \tag{3}
\]

where \(r_{xy}\) is the normal correlation, which may be decomposed into within group and between group correlations \(r_{xy_{wg}}\) and \(r_{xy_{bg}}\), and \(\eta\) (eta) is the correlation of the data with the within group values, or the group means.

7.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
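A minimal sketch of the first of these (assuming the rwg and rbg elements hold the pooled within group and between group correlations when cors=TRUE):

sb <- statsBy(sat.act,group="education",cors=TRUE)
sb$rwg   #the pooled within group correlations
sb$rbg   #the correlations of the group means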


7.3 Factor analysis by groups

Confirmatory factor analysis comparing the structures in multiple groups can be done in the lavaan package. However, for exploratory analyses of the structure within each of multiple groups, the faBy function may be used in combination with the statsBy function. First run statsBy with the correlation option set to TRUE, and then run faBy on the resulting output.

sb <- statsBy(bfi[c(1:25,27)], group="education",cors=TRUE)
faBy(sb,nfactors=5) #find the 5 factor solution for each education level

8 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

\[
R^2 = 1 - \prod_{i=1}^{n} (1 - \lambda_i)
\]

where \(\lambda_i\) is the \(i\)th eigenvalue of the eigenvalue decomposition of the matrix

\[
R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}.
\]
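A minimal sketch of this calculation from a correlation matrix (the index sets are chosen to match the sat.act example below; setCor reports the result as Cohen's Set Correlation R2):

R <- cor(sat.act,use="pairwise")
Rxx <- R[1:3,1:3]    #gender, education, age
Ryy <- R[4:6,4:6]    #ACT, SATV, SATQ
Rxy <- R[1:3,4:6]
lambda <- Re(eigen(solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy))$values)
1 - prod(1 - lambda) #about .09 for these data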

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized or raw β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272, Adjusted R-squared:  0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setCor:

> #compare with setCor
> setCor(c(4:6),c(1:3),C, n.obs=700)

Call: setCor(y = c(4:6), x = c(1:3), data = C, n.obs = 700)

Multiple Regression from matrix input

Beta weights
            ACT   SATV  SATQ
gender    -0.05  -0.03 -0.18
education  0.14   0.10  0.10
age        0.03  -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations

Squared Canonical Correlations
[1] 0.050 0.033 0.008

Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation = 0.03
 Cohen's Set Correlation R2 = 0.09
 Shrunken Set Correlation R2 = 0.08
 F and df of Cohen's Set Correlation 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.


9 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple, independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models), depending upon the specification of the model.

The logistic model is

\[
P(x | \theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \tag{4}
\]

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.

(Graphics of these may be seen in the demonstrations for the logistic function)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same.


With the a = α parameter set to 1.702 in the logistic model, the two models are practically identical.
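A minimal sketch of equation (4) and of this equivalence (p4pl is a hypothetical helper, not a psych function):

p4pl <- function(theta,alpha=1,delta=0,gamma=0,zeta=1)  #equation 4
   gamma + (zeta - gamma)/(1 + exp(alpha*(delta - theta)))
theta <- seq(-3,3,.1)
max(abs(p4pl(theta,alpha=1.702) - pnorm(theta)))        #less than .01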

10 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)

c <- iclust(Thurstone)
plot(c)           #a pretty boring plot
iclust.diagram(c) #a better diagram

c3 <- iclust(Thurstone,3)
plot(c3)          #a more interesting plot

data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)

ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.

11 Converting output to APA style tables using LaTeX

Although for most purposes using the Sweave or KnitR packages produces clean output, some prefer output pre-formatted for APA style tables. This can be done using the xtable package for almost anything, but there are a few simple functions in psych for the most common tables. fa2latex will convert a factor analysis or components analysis output to a LaTeX table, cor2latex will take a correlation matrix and show the lower (or upper) diagonal, irt2latex converts the item statistics from the irt.fa function to more convenient LaTeX output, and finally, df2latex converts a generic data frame to LaTeX.

An example of converting the output from fa to LaTeX appears in Table 11.
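A minimal sketch (the functions write LaTeX source to the console unless a file is specified):

f3 <- fa(Thurstone,3)
fa2latex(f3)         #a table like Table 11
cor2latex(Thurstone) #the lower diagonal of the correlation matrix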


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels ="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels ="left to right")
> dia.curve(ll$top,ul$bottom,"double",-1) #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up") #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$top,scale=-1)
> dia.curved.arrow(mr,lr$top)

[Figure: the diagram produced by the dia commands above: labeled rectangles and ellipses connected by straight and curved arrows.]

Figure 38: The basic graphic capabilities of the dia functions are shown in this figure.

Table 11: fa2latex. A factor analysis table from the psych package in R.

Variable         MR1   MR2   MR3   h2   u2  com
Sentences       0.91 -0.04  0.04 0.82 0.18 1.01
Vocabulary      0.89  0.06 -0.03 0.84 0.16 1.01
Sent.Completion 0.83  0.04  0.00 0.73 0.27 1.00
First.Letters   0.00  0.86  0.00 0.73 0.27 1.00
4.Letter.Words -0.01  0.74  0.10 0.63 0.37 1.04
Suffixes        0.18  0.63 -0.08 0.50 0.50 1.20
Letter.Series   0.03 -0.01  0.84 0.72 0.28 1.00
Pedigrees       0.37 -0.05  0.47 0.50 0.50 1.93
Letter.Group   -0.06  0.21  0.64 0.53 0.47 1.23

SS loadings     2.64  1.86  1.50

MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, a matrix, or a data.frame.

p.rep finds the probability of replication for an F, t, or r and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
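A few of these in action (a minimal sketch):

fisherz(.5)                #0.55, the Fisher z transform of r = .5
geometric.mean(c(1,2,4,8)) #2.83
headtail(bfi)              #the first and last lines of the bfi data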

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.


Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtholdt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1984), this data set allows for examples of basic scaling techniques.


14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

gt news(Version gt 150package=psych)

15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book), An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings

> sessionInfo()
R Under development (unstable) (2017-01-04 r71889)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GPArotation_2014.11-1 psych_1.6.12

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      lattice_0.20-34  matrixcalc_1.0-3 MASS_7.3-45
 [5] grid_3.4.0       arm_1.8-6        nlme_3.1-128     stats4_3.4.0
 [9] coda_0.18-1      minqa_1.2.4      mi_1.0           nloptr_1.0.4
[13] Matrix_1.2-7.1   boot_1.3-18      splines_3.4.0    lme4_1.1-12
[17] sem_3.1-7        tools_3.4.0      foreign_0.8-67   abind_1.4-3
[21] parallel_3.4.0   compiler_3.4.0   mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel modeling in R (2.3): a brief introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2012). sem: Structural Equation Models.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, pages 1–13. 10.1007/s11336-011-9218-4.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1984). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

Preacher, K. J. and Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4):717–731.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2015). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. R package version 1.5.8.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W. and Condon, D. M. (2014). Reliability. In Irwing, P., Booth, T., and Hughes, D., editors, Wiley-Blackwell Handbook of Psychometric Testing. Wiley-Blackwell (in press).

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory, and Executive Control, chapter 2, pages 27–49. Springer.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Smillie, L. D., Cooper, A., Wilt, J., and Revelle, W. (2012). Do extraverts get more bang for the buck? Refining the affective-reactivity hypothesis of extraversion. Journal of Personality and Social Psychology, 103(2):306–326.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2):245–251.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.



0.1 Jump starting the psych package – a guide for the impatient

You have installed psych (section 2) and you want to use it without reading much more. What should you do?

1. Activate the psych package:

library(psych)

2. Input your data (section 3.1). Go to your friendly text editor or data manipulation program (e.g., Excel) and copy the data to the clipboard. Include a first line that has the variable labels. Paste it into psych using the read.clipboard.tab command:

myData <- read.clipboard.tab()

3. Make sure that what you just read is right. Describe it (section 3.2) and perhaps look at the first and last few lines:

describe(myData)

headTail(myData)

4. Look at the patterns in the data. If you have fewer than about 10 variables, look at the SPLOM (Scatter Plot Matrix) of the data using pairs.panels (section 3.3.1). Even better, use the outlier function to detect outliers:

pairs.panels(myData)

outlier(myData)

5. Note that you have some weird subjects, probably due to data entry errors. Either edit the data by hand (use the edit command) or just scrub the data (section 3.2.2):

cleaned <- scrub(myData, max=9)  # e.g., change anything greater than 9 to NA

6. Graph the data with error bars for each variable (section 3.3.3):

error.bars(myData)

7. Find the correlations of all of your data:

• Descriptively (just the values) (section 3.3.7):

r <- lowerCor(myData)

• Graphically (section 3.3.8):

corPlot(r)

• Inferentially (the values, the ns, and the p values) (section 3.4):

corr.test(myData)

8. Test for the number of factors in your data using parallel analysis (fa.parallel, section 4.4.2) or Very Simple Structure (vss, 4.4.1):

fa.parallel(myData)

vss(myData)

9. Factor analyze (see section 4.1) the data with a specified number of factors (the default is 1), the default method is minimum residual, the default rotation for more than one factor is oblimin. There are many more possibilities (see sections 4.1.1-4.1.3). Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm (Revelle, 1979) (see section 4.1.6). Also consider a hierarchical factor solution to find coefficient ω (see 4.1.5):

fa(myData)

iclust(myData)

omega(myData)

If you prefer to do a principal components analysis, you may use the principal function. The default is one component:

principal(myData)

10. Some people like to find coefficient α as an estimate of reliability. This may be done for a single scale using the alpha function (see 5.1). Perhaps more useful is the ability to create several scales as unweighted averages of specified items using the scoreItems function (see 5.4) and to find various estimates of internal consistency for these scales, find their intercorrelations, and find scores for all the subjects:

alpha(myData)  # score all of the items as part of one scale

myKeys <- make.keys(nvar=20, list(first = c(1,-3,5,-7,8,10), second = c(2,4,-6,11,15,-16)))

my.scores <- scoreItems(myKeys, myData)  # form several scales

my.scores  # show the highlights of the results

At this point you have had a chance to see the highlights of the psych package and to do some basic (and advanced) data analysis. You might find reading this entire vignette helpful to get a broader understanding of what can be done in R using the psych package. Remember that the help command (?) is available for every function. Try running the examples for each help page.

1 Overview of this and related documents

The psych package (Revelle, 2015) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, prep), a draft of which is available at http://personality-project.org/r/book.

Some of the functions (e.g., read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bibars) are useful for basic data entry and descriptive analyses.

Psychometric applications emphasize techniques for dimension reduction including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares, and maximum likelihood factor analysis). Principal Components Analysis (PCA) is also available through the use of the principal function. Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP), or parallel analysis (fa.parallel) criteria. These and several other criteria are included in the nfactors function. Two parameter Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa.

Bifactor and hierarchical factor structures may be estimated by using Schmid-Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937).

Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust) and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (corPlot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, structure.diagram and het.diagram, as well as item response characteristics and item and test information characteristic curves plot.irt and plot.poly.

This vignette is meant to give an overview of the psych package. That is, it is meant to give a summary of the main functions in the psych package with examples of how they are used for data description, dimension reduction, and scale construction. The extended user manual at psych_manual.pdf includes examples of graphic output and more extensive demonstrations than are found in the help menus. (Also available at http://personality-project.org/r/psych_manual.pdf.) The vignette, psych for sem, at psych_for_sem.pdf, discusses how to use psych as a front end to the sem package of John Fox (Fox et al., 2012). (The vignette is also available at http://personality-project.org/r/book/psych_for_sem.pdf.)

For a step by step tutorial in the use of the psych package and the base functions in R for basic personality research, see the guide for using R for personality research at http://personalitytheory.org/r/r.short.html. For an introduction to psychometric theory with applications in R, see the draft chapters at http://personality-project.org/r/book.

2 Getting started

Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa, factor.minres, factor.pa, factor.wls, or principal) or hierarchical factor models using omega or schmid, is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.

install.packages("ctv")
library(ctv)
task.views("Psychometrics")

The "Psychometrics" task view will install a large number of useful packages. To install the bare minimum for the examples in this vignette, it is necessary to install just a couple of packages:

install.packages(c("GPArotation", "mnormt"))

Because of the difficulty of installing the package Rgraphviz, alternative graphics have been developed and are available as diagram functions. If Rgraphviz is available, some functions will take advantage of it. An alternative is to use "dot" output of commands for any external graphics package that uses the dot language.

3 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions, it is necessary to make the package active by using the library command:

> library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

.First <- function(x) {library(psych)}

3.1 Data input from the clipboard

There are, of course, many ways to enter data into R. Reading from a local file using read.table is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command:

> my.data <- read.clipboard()

This will work if every data field has a value, and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function:

> my.data <- read.clipboard(sep="\t")  # define the tab option, or

> my.tab.data <- read.clipboard.tab()  # just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column, and the last 4 are 3 columns):

> my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))

3.2 Basic descriptive statistics

Once the data are read in, then describe or describeBy will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis, and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

describeBy reports descriptive statistics broken down by some categorizing variable (e.g., gender, age, etc.).

> library(psych)
> data(sat.act)
> describe(sat.act)  # basic descriptive statistics
          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41

These data may then be analyzed by groups defined in a logical statement or by some other variable. E.g., break down the descriptive data for males or females. These descriptive data can also be seen graphically using the error.bars.by function (Figure 5). By setting skew=FALSE and ranges=FALSE, the output is limited to the most basic statistics.

> # basic descriptive statistics by a grouping variable
> describeBy(sat.act, sat.act$gender, skew=FALSE, ranges=FALSE)
$`1`
          vars   n   mean     sd   se
gender       1 247   1.00   0.00 0.00
education    2 247   3.00   1.54 0.10
age          3 247  25.86   9.74 0.62
ACT          4 247  28.79   5.06 0.32
SATV         5 247 615.11 114.16 7.26
SATQ         6 245 635.87 116.02 7.41

$`2`
          vars   n   mean     sd   se
gender       1 453   2.00   0.00 0.00
education    2 453   3.26   1.35 0.06
age          3 453  25.45   9.37 0.44
ACT          4 453  28.42   4.69 0.22
SATV         5 453 610.66 112.31 5.28
SATQ         6 442 596.00 113.07 5.38

attr(,"call")
by.data.frame(data = x, INDICES = group, FUN = describe, type = type,
    skew = FALSE, ranges = FALSE)

The output from the describeBy function can be forced into a matrix form for easy analysis by other programs. In addition, describeBy can group by several grouping variables at the same time.

> sa.mat <- describeBy(sat.act, list(sat.act$gender, sat.act$education),
+                      skew=FALSE, ranges=FALSE, mat=TRUE)
> headTail(sa.mat)
        item group1 group2 vars   n   mean     sd    se
gender1    1      1      0    1  27      1      0     0
gender2    2      2      0    1  30      2      0     0
gender3    3      1      1    1  20      1      0     0
gender4    4      2      1    1  25      2      0     0
...      ...   <NA>   <NA>  ... ...    ...    ...   ...
SATQ9     69      1      4    6  51  635.9 104.12 14.58
SATQ10    70      2      4    6  86 597.59 106.24 11.46
SATQ11    71      1      5    6  46 657.83  89.61 13.21
SATQ12    72      2      5    6  93 606.72 105.55 10.95

3.2.1 Outlier detection using outlier

One way to detect unusual data is to consider how far each data point is from the multivariate centroid of the data. That is, find the squared Mahalanobis distance for each data point and then compare these to the expected values of χ2. This produces a Q-Q (quantile-quantile) plot with the n most extreme data points labeled (Figure 1). The outlier values are in the vector d2.

> png('outlier.png')
> d2 <- outlier(sat.act, cex=.8)
> dev.off()
null device
          1

Figure 1: Using the outlier function to graphically show outliers. The y axis is the Mahalanobis D2, the X axis is the distribution of χ2 for the same number of degrees of freedom. The outliers detected here may be shown graphically using pairs.panels (see Figure 2), and may be found by sorting d2.

3.2.2 Basic data cleaning using scrub

If, after describing the data, it is apparent that there were data entry errors that need to be globally replaced with NA, or only certain ranges of data will be analyzed, the data can be "cleaned" using the scrub function.

Consider a data set of 12 rows and 10 columns with values from 1 - 120. All values of columns 3 - 5 that are less than 30, 40, or 50, respectively, or greater than 70 in any of the three columns, will be replaced with NA. In addition, any value exactly equal to 45 will be set to NA. (max and isvalue are set to one value here, but they could be a different value for every column.)

> x <- matrix(1:120, ncol=10, byrow=TRUE)
> colnames(x) <- paste("V", 1:10, sep="")
> new.x <- scrub(x, 3:5, min=c(30,40,50), max=70, isvalue=45, newvalue=NA)
> new.x

       V1  V2 V3 V4 V5  V6  V7  V8  V9 V10
 [1,]   1   2 NA NA NA   6   7   8   9  10
 [2,]  11  12 NA NA NA  16  17  18  19  20
 [3,]  21  22 NA NA NA  26  27  28  29  30
 [4,]  31  32 33 NA NA  36  37  38  39  40
 [5,]  41  42 43 44 NA  46  47  48  49  50
 [6,]  51  52 53 54 55  56  57  58  59  60
 [7,]  61  62 63 64 65  66  67  68  69  70
 [8,]  71  72 NA NA NA  76  77  78  79  80
 [9,]  81  82 NA NA NA  86  87  88  89  90
[10,]  91  92 NA NA NA  96  97  98  99 100
[11,] 101 102 NA NA NA 106 107 108 109 110
[12,] 111 112 NA NA NA 116 117 118 119 120

Note that the number of subjects for those columns has decreased, and the minimums have gone up but the maximums down. Data cleaning and examination for outliers should be a routine part of any data analysis.
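One way to verify this is to describe the scrubbed matrix (a minimal sketch reusing the describe function from above):

> describe(new.x)  # note the reduced n and the changed min and max for V3 - V5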

3.2.3 Recoding categorical variables into dummy coded variables

Sometimes categorical variables (e.g., college major, occupation, ethnicity) are to be analyzed using correlation or regression. To do this, one can form "dummy codes" which are merely binary variables for each category. This may be done using dummy.code. Subsequent analyses using these dummy coded variables may use biserial or point biserial (regular Pearson r) correlations to show effect sizes, and may be plotted in, e.g., spider plots.
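A minimal sketch of such dummy coding (the major vector here is hypothetical, invented for illustration; dummy.code is the psych function named above):

> major <- c("psych", "bio", "psych", "chem", "bio")  # hypothetical categorical variable
> dummy.code(major)  # a matrix with one binary (0/1) column per category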

3.3 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries. By default, error.bars.by and error.bars will show "cats eyes" to graphically show the confidence limits (Figure 5). This may be turned off by specifying eyes=FALSE. densityBy or violinBy may be used to show the distribution of the data in "violin" plots (Figure 4). (These are sometimes called "lava-lamp" plots.)

3.3.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 2). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.'. (See Figure 2 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic. However, in this figure, to show the outliers, we use colors and a larger plot character.

Another example of pairs.panels is to show differences between experimental groups. Consider the data in the affect data set. The scores reflect post test scores on positive and negative affect and energetic and tense arousal. The colors show the results for four movie conditions: a depressing movie, a frightening movie, a neutral movie, and a comedy.

3.3.2 Density or violin plots

Graphical presentation of data may be shown using box plots to show the median and 25th and 75th percentiles. A powerful alternative is to show the density distribution using the violinBy function (Figure 4).

> png('pairs.panels.png')
> sat.d2 <- data.frame(sat.act, d2)  # combine the d2 statistics from before with the sat.act data.frame
> pairs.panels(sat.d2, bg=c("yellow","blue")[(d2 > 25)+1], pch=21)
> dev.off()
null device
          1

Figure 2: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. If the plot character were set to a period (pch='.') it would make a cleaner graphic, but in order to show the outliers in color we use the plot characters 21 and 22.

> png('affect.png')
> pairs.panels(affect[14:17], bg=c("red","black","white","blue")[affect$Film], pch=21,
+              main="Affect varies by movies")
> dev.off()
null device
          1

Figure 3: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. The coloring represents four different movie conditions.

> data(sat.act)
> violinBy(sat.act[5:6], sat.act$gender, grp.name=c("M", "F"), main="Density Plot by gender for SAT V and Q")

[Figure: "Density Plot by gender for SAT V and Q"; y axis: Observed (200 to 800); x axis: SATV M, SATV F, SATQ M, SATQ F]

Figure 4: Using the violinBy function to show the distribution of SAT V and Q for males and females. The plot shows the medians, and the 25th and 75th percentiles, as well as the entire range and the density distribution.

3.3.3 Means and error bars

Additional descriptive graphics include the ability to draw error bars on sets of data, as well as to draw error bars in both the x and y directions for paired data. These are the functions error.bars, error.bars.by, error.bars.tab, and error.crosses.

error.bars shows the 95% confidence intervals for each variable in a data frame or matrix. These errors are based upon normal theory and the standard errors of the mean. Alternative options include +/- one standard deviation or 1 standard error. If the data are repeated measures, the error bars will reflect the between variable correlations. By default, the confidence intervals are displayed using a "cats eyes" plot which emphasizes the distribution of confidence within the confidence interval.

error.bars.by does the same, but grouping the data by some condition.

error.bars.tab draws bar graphs from tabular data with error bars based upon the standard error of proportion (σp = √(pq/N)).

error.crosses draws the confidence intervals for an x set and a y set of the same size.

The use of the error.bars.by function allows for graphic comparisons of different groups (see Figure 5). Five personality measures are shown as a function of high versus low scores on a "lie" scale. People with higher lie scores tend to report being more agreeable, conscientious and less neurotic than people with lower lie scores. The error bars are based upon normal theory and thus are symmetric rather than reflect any skewing in the data.

Although not recommended, it is possible to use the error.bars function to draw bar graphs with associated error bars. (This kind of dynamite plot (Figure 7) can be very misleading in that the scale is arbitrary. Go to a discussion of the problems in presenting data this way at http://emdbolker.wikidot.com/blog:dynamite.) In the example shown, note that the graph starts at 0, although 0 is out of the range of the data. This is a function of using bars, which always are assumed to start at zero. Consider other ways of showing your data.

3.3.4 Error bars for tabular data

However, it is sometimes useful to show error bars for tabular data, either found by the table function or just directly input. These may be found using the error.bars.tab function.

> data(epi.bfi)
> error.bars.by(epi.bfi[6:10], epi.bfi$epilie < 4)

[Figure: 0.95 confidence limits for bfagree, bfcon, bfext, bfneur, and bfopen; x axis: Independent Variable; y axis: Dependent Variable]

Figure 5: Using the error.bars.by function shows that self reported personality scales on the Big Five Inventory vary as a function of the Lie scale on the EPI. The "cats eyes" show the distribution of the confidence.

> error.bars.by(sat.act[5:6], sat.act$gender, bars=TRUE,
+               labels=c("Male","Female"), ylab="SAT score", xlab="")

[Figure: bar graph of SAT score (200 to 800) for Male and Female, with 0.95 confidence limits]

Figure 6: A "Dynamite plot" of SAT scores as a function of gender is one way of misleading the reader. By using a bar graph, the range of scores is ignored. Bar graphs start from 0.

> T <- with(sat.act, table(gender, education))
> rownames(T) <- c("M", "F")
> error.bars.tab(T, way="both", ylab="Proportion of Education Level", xlab="Level of Education",
+                main="Proportion of sample by education level")

[Figure: "Proportion of sample by education level"; x axis: Level of Education (M 0 ... M 5); y axis: Proportion of Education Level (0.00 to 0.30)]

Figure 7: The proportion of each education level that is Male or Female. By using the way="both" option, the percentages and errors are based upon the grand total. Alternatively, way="columns" finds column wise percentages, and way="rows" finds rowwise percentages. The data can be converted to percentages (as shown) or by total count (raw=TRUE). The function invisibly returns the probabilities and standard errors. See the help menu for an example of entering the data as a data.frame.

3.3.5 Two dimensional displays of means and errors

Yet another way to display data for different conditions is to use the errorCircles function. For instance, the effect of various movies on both "Energetic Arousal" and "Tense Arousal" can be seen in one graph and compared to the same movie manipulations on "Positive Affect" and "Negative Affect". Note how Energetic Arousal is increased by three of the movie manipulations, but that Positive Affect increases following the Happy movie only.

> op <- par(mfrow=c(1,2))
> data(affect)
> colors <- c("black", "red", "white", "blue")
> films <- c("Sad", "Horror", "Neutral", "Happy")
> affect.stats <- errorCircles("EA2", "TA2", data=affect[-c(1,20)], group="Film", labels=films,
+     xlab="Energetic Arousal", ylab="Tense Arousal", ylim=c(10,22), xlim=c(8,20), pch=16,
+     cex=2, colors=colors, main="Movies effect on arousal")
> errorCircles("PA2", "NA2", data=affect.stats, labels=films, xlab="Positive Affect",
+     ylab="Negative Affect", pch=16, cex=2, colors=colors, main="Movies effect on affect")
> op <- par(mfrow=c(1,1))

[Figure: two panels from errorCircles. Left: "Movies effect on arousal" (x: Energetic Arousal, y: Tense Arousal). Right: "Movies effect on affect" (x: Positive Affect, y: Negative Affect). Groups: Sad, Horror, Neutral, Happy]

Figure 8: The use of the errorCircles function allows for two dimensional displays of means and error bars. The first call to errorCircles finds descriptive statistics for the affect data.frame based upon the grouping variable of Film. These data are returned and then used by the second call, which examines the effect of the same grouping variable upon different measures. The size of the circles represents the relative sample sizes for each group. The data are from the PMC lab and reported in Smillie et al. (2012).

3.3.6 Back to back histograms

The bibars function summarizes the characteristics of two groups (e.g., males and females) on a second variable (e.g., age) by drawing back to back histograms (see Figure 9).

> data(bfi)
> with(bfi, bibars(age, gender, ylab="Age", main="Age by males and females"))

[Figure: "Age by males and females"; back to back histogram of Frequency by Age (0 to 100)]

Figure 9: A bar plot of the age distribution for males and females shows the use of bibars. The data are males and females from 2800 cases collected using the SAPA procedure and are available as part of the bfi data set.

3.3.7 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.

> lowerCor(sat.act)
          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00
> upper <- lowerCor(female[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower, upper)
> round(both, 2)
          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:

> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)
          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

3.3.8 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set which has a clear 3 factor solution (Figure 10) or a simulated data set of 24 variables with a circumplex structure (Figure 11). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

Yet another way to show structure is to use "spider" plots. Particularly if variables are ordered in some meaningful way (e.g., in a circumplex), a spider plot will show this structure easily. This is just a plot of the magnitude of the correlation as a radial line, with length ranging from 0 (for a correlation of -1) to 1 (for a correlation of 1). (See Figure 12.)

3.4 Testing correlations

Correlations are wonderful descriptive statistics of the data, but some people like to test whether these correlations differ from zero or differ from each other. The cor.test function (in the stats package) will test the significance of a single correlation, and the rcorr function in the Hmisc package will do this for many correlations. In the psych package, the corr.test function reports the correlation (Pearson, Spearman, or Kendall) between all variables in either one or two data frames or matrices, as well as the number of observations for each case and the (two-tailed) probability for each correlation. Unfortunately, these probability values have not been corrected for multiple comparisons and so should be taken with a great deal of salt. Thus, in corr.test and corr.p the raw probabilities are reported below the diagonal, and the probabilities adjusted for multiple comparisons using (by default) the Holm correction are reported above the diagonal (Table 1). (See the p.adjust function for a discussion of Holm (1979) and other corrections.)
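To see what the Holm adjustment does on its own, p.adjust (from base R's stats package) can be applied to a vector of made up probability values, a minimal sketch:

> p.adjust(c(.01, .02, .30), method="holm")  # each p is multiplied by a rank-based factor, with monotonicity enforced
[1] 0.03 0.04 0.30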

Testing the difference between any two correlations can be done using the r.test function. The function actually does four different tests (based upon an article by Steiger (1980)),

> png('corplot.png')
> corPlot(Thurstone, numbers=TRUE, upper=FALSE, diag=FALSE, main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 10: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well. By default, the complete matrix is shown. Setting upper=FALSE and diag=FALSE shows a cleaner figure.

> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ, main="24 variables in a circumplex")
> dev.off()
null device
          1

Figure 11: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect. For circumplex structures, it is perhaps useful to show the complete matrix.

> png('spider.png')
> op <- par(mfrow=c(2,2))
> spider(y=c(1,6,12,18), x=1:24, data=r.circ, fill=TRUE, main="Spider plot of 24 circumplex variables")
> op <- par(mfrow=c(1,1))
> dev.off()
null device
          1

Figure 12: A spider plot can show circumplex structure very clearly. Circumplex structures are common in the study of affect.

Table 1: The corr.test function reports correlations, cell sizes, and raw and adjusted probability values. corr.p reports the probability values for a correlation matrix. By default, the adjustment used is that of Holm (1979).

> corr.test(sat.act)
Call:corr.test(x = sat.act)
Correlation matrix
          gender education   age   ACT  SATV  SATQ
gender      1.00      0.09 -0.02 -0.04 -0.02 -0.17
education   0.09      1.00  0.55  0.15  0.05  0.03
age        -0.02      0.55  1.00  0.11 -0.04 -0.03
ACT        -0.04      0.15  0.11  1.00  0.56  0.59
SATV       -0.02      0.05 -0.04  0.56  1.00  0.64
SATQ       -0.17      0.03 -0.03  0.59  0.64  1.00
Sample Size
          gender education age ACT SATV SATQ
gender       700       700 700 700  700  687
education    700       700 700 700  700  687
age          700       700 700 700  700  687
ACT          700       700 700 700  700  687
SATV         700       700 700 700  700  687
SATQ         687       687 687 687  687  687
Probability values (Entries above the diagonal are adjusted for multiple tests.)
          gender education  age  ACT SATV SATQ
gender      0.00      0.17 1.00 1.00    1    0
education   0.02      0.00 0.00 0.00    1    1
age         0.58      0.00 0.00 0.03    1    1
ACT         0.33      0.00 0.00 0.00    0    0
SATV        0.62      0.22 0.26 0.00    0    0
SATQ        0.00      0.36 0.37 0.00    0    0

To see confidence intervals of the correlations, print with the short=FALSE option.

depending upon the input:

1) For a sample size n, find the t and p value for a single correlation as well as the confidence interval:

> r.test(50, .3)
Correlation tests
Call:r.test(n = 50, r12 = 0.3)
Test of significance of a correlation
 t value 2.18 with probability < 0.034
 and confidence interval 0.02 0.53

2) For sample sizes of n and n2 (n2 = n if not specified), find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores:

> r.test(30, .4, .6)
Correlation tests
Call:r.test(n = 30, r12 = 0.4, r34 = 0.6)
Test of difference between two independent correlations
 z value 0.99 with probability 0.32

3) For sample size n, and correlations ra = r12, rb = r23 and r13 specified, test for the difference of two dependent correlations (Steiger case A):

> r.test(103, .4, .5, .1)
Correlation tests
Call:r.test(n = 103, r12 = 0.4, r23 = 0.1, r13 = 0.5)
Test of difference between two correlated correlations
 t value -0.89 with probability < 0.37

4) For sample size n, test for the difference between two dependent correlations involving different variables (Steiger case B):

> r.test(103, .5, .6, .7, .5, .5, .8)  # Steiger Case B
Correlation tests
Call:r.test(n = 103, r12 = 0.5, r34 = 0.6, r23 = 0.7, r13 = 0.5, r14 = 0.5,
    r24 = 0.8)
Test of difference between two dependent correlations
 z value -1.2 with probability 0.23

To test whether a matrix of correlations differs from what would be expected if the population correlations were all zero, the function cortest follows Steiger (1980), who pointed out that the sum of the squared elements of a correlation matrix, or the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix). This is particularly useful for examining whether correlations in a single matrix differ from zero or for comparing two matrices. Although obvious, cortest can be used to test whether the sat.act data matrix produces non-zero correlations (it does). This is a much more appropriate test when testing whether a residual matrix differs from zero.

> cortest(sat.act)


Tests of correlation matrices
Call:cortest(R1 = sat.act)
 Chi Square value 1325.42  with df =  15   with probability < 1.8e-273

3.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process (Figure 13). This is also shown in terms of dichotomizing the bivariate normal density function using the draw.cor function (Figure 14). A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
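As a minimal sketch (the data set mydata and its column split are hypothetical; the continuous, polytomous, and dichotomous variables are passed through the x, p, and d arguments):

> rmix <- mixed.cor(x = mydata[1:4], p = mydata[5:8], d = mydata[9:10])
> rmix$rho    #the combined correlation matrix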

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will sometimes happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
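A minimal sketch of the repair, using the burt matrix just mentioned (eigen is base R; the negative smallest eigenvalue is the symptom being corrected):

> data(burt)
> eigen(burt)$values            #the smallest eigenvalue is negative
> burt.smoothed <- cor.smooth(burt)
> eigen(burt.smoothed)$values   #all eigenvalues are now positive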

3.6 Multiple regression from data or correlation matrices

The typical application of the lm function is to do a linear model of one Y variable as a function of multiple X variables. Because lm is designed to analyze complex interactions, it requires raw data as input. It is, however, sometimes convenient to do multiple regression from a correlation or covariance matrix. The setCor function will do this, taking a set of y variables predicted from a set of x variables, perhaps with a set of z covariates removed from both x and y. Consider the Thurstone correlation matrix and find the multiple correlation of the last five variables as a function of the first four.

> setCor(y = 5:9,x=1:4,data=Thurstone)


> draw.tetra()

[Figure: a bivariate normal distribution (ρ = 0.5) cut at thresholds τ and Τ, yielding the four quadrants of a two by two table with φ = 0.33]

Figure 13: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values.


> draw.cor(expand=20,cuts=c(0,0))

[Figure: bivariate density surface, ρ = 0.5, cut at (0,0)]

Figure 14: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values. It is found (laboriously) by optimizing the fit of the bivariate normal for various values of the correlation to the observed cell frequencies.


Call: setCor(y = 5:9, x = 1:4, data = Thurstone)

Multiple Regression from matrix input

Beta weights
                4.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sentences                 0.09     0.07          0.25      0.21         0.20
Vocabulary                0.09     0.17          0.09      0.16        -0.02
Sent.Completion           0.02     0.05          0.04      0.21         0.08
First.Letters             0.58     0.45          0.21      0.08         0.31

Multiple R
4.Letter.Words       Suffixes  Letter.Series      Pedigrees   Letter.Group
          0.69           0.63           0.50           0.58           0.48

multiple R2
4.Letter.Words       Suffixes  Letter.Series      Pedigrees   Letter.Group
          0.48           0.40           0.25           0.34           0.23

Unweighted multiple R
4.Letter.Words       Suffixes  Letter.Series      Pedigrees   Letter.Group
          0.59           0.58           0.49           0.58           0.45

Unweighted multiple R2
4.Letter.Words       Suffixes  Letter.Series      Pedigrees   Letter.Group
          0.34           0.34           0.24           0.33         0.20

Various estimates of between set correlations

Squared Canonical Correlations
[1] 0.6280 0.1478 0.0076 0.0049
Average squared canonical correlation =  0.2
Cohen's Set Correlation R2  =  0.69
Unweighted correlation between the two sets =  0.73

By specifying the number of subjects in the correlation matrix, appropriate estimates of standard errors, t-values, and probabilities are also found. The next example finds the regressions with variables 1 and 2 used as covariates. The β weights for variables 3 and 4 do not change, but the multiple correlation is much less. It also shows how to find the residual correlations between variables 5-9 with variables 1-4 removed.

> sc <- setCor(y = 5:9,x=3:4,data=Thurstone,z=1:2)

Call: setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)

Multiple Regression from matrix input

Beta weights
                4.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sent.Completion           0.02     0.05          0.04      0.21         0.08
First.Letters             0.58     0.45          0.21      0.08         0.31

Multiple R
4.Letter.Words       Suffixes  Letter.Series      Pedigrees   Letter.Group
          0.58           0.46           0.21           0.18           0.30

multiple R2
4.Letter.Words       Suffixes  Letter.Series      Pedigrees   Letter.Group
         0.331          0.210          0.043          0.032          0.092

Unweighted multiple R
4.Letter.Words       Suffixes  Letter.Series      Pedigrees   Letter.Group
          0.44           0.35           0.17           0.14           0.26

Unweighted multiple R2
4.Letter.Words       Suffixes  Letter.Series      Pedigrees   Letter.Group
          0.19           0.12           0.03           0.02           0.07

Various estimates of between set correlations

Squared Canonical Correlations
[1] 0.405 0.023
Average squared canonical correlation =  0.21
Cohen's Set Correlation R2  =  0.42
Unweighted correlation between the two sets =  0.48

> round(sc$residual,2)

                4.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
4.Letter.Words            0.52     0.11          0.09      0.06         0.13
Suffixes                  0.11     0.60         -0.01      0.01         0.03
Letter.Series             0.09    -0.01          0.75      0.28         0.37
Pedigrees                 0.06     0.01          0.28      0.66         0.20
Letter.Group              0.13     0.03          0.37      0.20         0.77

3.7 Mediation and Moderation analysis

Although multiple regression is a straightforward method for determining the effect of multiple predictors (x1,2,...,i) on a criterion variable, y, some prefer to think of the effect of one predictor, x, as mediated by another variable, m (Preacher and Hayes, 2004). Thus, we may find the indirect path from x to m and then from m to y, as well as the direct path from x to y. Call these paths a, b, and c, respectively. Then the indirect effect of x on y through m is just ab, and the direct effect is c. Statistical tests of the ab effect are best done by bootstrapping.

Consider the example from Preacher and Hayes (2004) as analyzed using the mediate function and the subsequent graphic from mediate.diagram. The data are found in the example for mediate.
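A sketch of the call that produces the output below (the sobel data set ships with psych; the number of bootstrap samples is controlled by the n.iter argument):

> mediate(y = 1, x = 2, m = 3, data = sobel)   #bootstraps the ab effect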

Call: mediate(y = 1, x = 2, m = 3, data = sobel)

The DV (Y) was SATIS. The IV (X) was THERAPY. The mediating variable(s) = ATTRIB.

Total Direct effect(c) of THERAPY on SATIS = 0.76   S.E. = 0.31  t direct = 2.5  with probability = 0.019
Direct effect (c') of THERAPY on SATIS removing ATTRIB = 0.43   S.E. = 0.32  t direct = 1.35  with probability = 0.19
Indirect effect (ab) of THERAPY on SATIS through ATTRIB = 0.33
Mean bootstrapped indirect effect = 0.33 with standard error = 0.17  Lower CI = 0.04   Upper CI = 0.71
R2 of model = 0.31

To see the longer output, specify short = FALSE in the print statement

Full output

Total effect estimates (c)
        SATIS   se   t   Prob
THERAPY  0.76 0.31 2.5 0.0186

Direct effect estimates (c')
        SATIS   se    t  Prob
THERAPY  0.43 0.32 1.35 0.190
ATTRIB   0.40 0.18 2.23 0.034

'a' effect estimates
       THERAPY  se    t   Prob
ATTRIB    0.82 0.3 2.74 0.0106

'b' effect estimates
       SATIS   se    t  Prob
ATTRIB   0.4 0.18 2.23 0.034

'ab' effect estimates
        SATIS boot   sd lower upper
THERAPY  0.33 0.33 0.17  0.04  0.71

4 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.


> mediate.diagram(preacher)


Figure 15: A mediated model taken from Preacher and Hayes (2004) and solved using the mediate function. The direct path from Therapy to Satisfaction has an effect of 0.76, while the indirect path through Attribution has an effect of 0.33. Compare this to the normal regression graphic created by setCor.diagram.


> preacher <- setCor(1,c(2,3),sobel,std=FALSE)
> setCor.diagram(preacher)

[Figure: regression model; THERAPY → SATIS = 0.43, ATTRIB → SATIS = 0.4; unweighted matrix correlation = 0.56]

Figure 16: The conventional regression model for the Preacher and Hayes (2004) data set, solved using the setCor function. Compare this to the previous figure.


4.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.

Several of the functions in psych address the problem of data reduction.

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

fa.poly is useful when finding the factor structure of categorical items. fa.poly first finds the tetrachoric or polychoric correlations between the categorical variables and then proceeds to do a normal factor analysis. By setting the n.iter option to be greater than 1, it will also find confidence intervals for the factor solution. Warning. Finding polychoric correlations is very slow, so think carefully before doing so.

¹Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

²Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

factor.minres (deprecated) Minimum residual factor analysis is a least squares, iterative solution to the factor problem. minres attempts to minimize the residual (off-diagonal) correlation matrix. It produces solutions similar to maximum likelihood solutions, but will work even if the matrix is singular.

factor.pa (deprecated) Principal Axis factor analysis is a least squares, iterative solution to the factor problem. PA will work for cases where maximum likelihood techniques (factanal) will not work. The original communality estimates are either the squared multiple correlations (smc) for each item or 1.

factor.wls (deprecated) Weighted least squares factor analysis is a least squares, iterative solution to the factor problem. It minimizes the (weighted) squared residual matrix. The weights are based upon the independent contribution of each variable.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values. Note that PCA is not the same as factor analysis and the two should not be confused.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ² estimate.

nfactors combines VSS, MAP, and a number of other fit statistics. The depressing reality is that frequently these conventional fit estimates of the number of factors do not agree.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

4.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix of n variables, R, be approximated by the product of a factor matrix of k factors, F, and its transpose plus a diagonal matrix of uniquenesses?

$$R = FF' + U^2 \qquad (1)$$
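A minimal numeric illustration of equation (1), with hypothetical loadings (not from any of the data sets used here):

> F <- matrix(c(.8,.7,.6,0,0,0,  0,0,0,.7,.6,.5),ncol=2)  #6 variables, 2 factors
> U2 <- diag(1 - rowSums(F^2))    #uniquenesses restore the unit diagonal
> R <- F %*% t(F) + U2            #model-implied correlation matrix
> round(diag(R),2)                #all 1s, as required of a correlation matrix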

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set collected by Thurstone and Thurstone (1941), and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 2). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", "varimin" and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ" and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation. The measures of factor adequacy reflect the


multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

Note that if extracting more than one factor and doing any oblique rotation, it is necessary to have the GPArotation package installed. This is checked for in the appropriate functions.

4.1.2 Principal Axis Factor Analysis

An alternative, least squares algorithm (included in fa with the fm="pa" option, or as a standalone function, factor.pa) does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.

The solutions from the fa, the factor.minres and factor.pa, as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax, quartimax, varimin, bifactor. Oblique transformations include oblimin, quartimin, biquartimin and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 3).

4.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 4).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 17). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 17) or by a structure diagram (Figure 18).

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 2: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> if(!require('GPArotation')) {stop('GPArotation must be installed to do rotations')} else {
+ f3t <- fa(Thurstone,3,n.obs=213)
+ f3t }

Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.0
First.Letters    0.00  0.86  0.00 0.73 0.27 1.0
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                                MR1  MR2  MR3
Correlation of scores with factors             0.96 0.92 0.90
Multiple R square of scores with factors       0.93 0.85 0.81
Minimum correlation of possible factor scores  0.86 0.71 0.63


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> if(!require('GPArotation')) {stop('GPArotation must be installed to do rotations')} else {
+ f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
+ f3o <- target.rot(f3)
+ f3o }

Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
Sent.Completion  0.83  0.04  0.03 0.70 0.30
First.Letters   -0.02  0.85 -0.01 0.73 0.27
4.Letter.Words  -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
Letter.Series   -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
Letter.Group    -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


Table 4: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)

Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion  0.833  0.034  0.007 0.735 0.265 1.00
First.Letters   -0.002  0.855  0.003 0.731 0.269 1.00
4.Letter.Words  -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series    0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group    -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                                WLS1  WLS2  WLS3
Correlation of scores with factors             0.964 0.923 0.902
Multiple R square of scores with factors       0.929 0.853 0.814
Minimum correlation of possible factor scores  0.858 0.706 0.627


> plot(f3t)


Figure 17: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.


> fa.diagram(f3t)


Figure 18: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

4.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Some psychologists use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 5 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. Also note how the goodness of fit statistics, based upon the residual off diagonal elements, are much worse than the fa solution.

4.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 18) can themselves be factored. This results in a higher order factor model (Figure 19). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 20). Yet another solution is to use structural equation modeling to directly solve for the general and group factors.


Table 5: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 3. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p

Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion  0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters    0.01  0.84  0.07 0.78 0.22 1.0
4.Letter.Words  -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series    0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group    -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))


Figure 19: A higher order factor solution to the Thurstone 9 variable problem.


> om <- omega(Thurstone,n.obs=213)


Figure 20: A bifactor solution to the Thurstone 9 variable problem.


Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011).

4.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006) where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) reviews why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g., correlation) matrix,

2. Identify the most similar pair of items,

3. Combine this most similar pair of items to form a new variable (cluster),

4. Find the similarity of this cluster to all other items and clusters,

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β),

6. Purify the solution by reassigning items to the most similar cluster center.


iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 21). Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 8). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])


Figure 21: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.


Table 6: The summary statistics from an iclust analysis show three large clusters and a smaller cluster.

> summary(ic)  #show the results

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


The previous analysis (Figure 21) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 22). Note that the first time, finding the polychoric correlations takes some time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using progressBar.

Table 7: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  #the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

4.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 9). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

Bootstrapped confidence intervals can also be found for the loadings of a factoring of a polychoric matrix. fa.poly will find the polychoric correlation matrix and, if the n.iter option is greater than 1, will then randomly resample the data (case wise) to give bootstrapped samples. This will take a long time for a large number of items or iterations.
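A sketch of such a call (the small n.iter keeps the run time tolerable; real analyses would use many more bootstrap samples):

> pfa <- fa.poly(bfi[1:10],2,n.iter=5)   #polychoric factoring with 5 bootstrap draws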

4.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of Burt's congruence coefficient (also known as Tucker's


> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)


Figure 22: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 21), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)


Figure 23: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 22), which was done without specifying the number of clusters, and to the next one (Figure 24), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST, beta.size=3")


Figure 24: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 21, 22, 23).


Table 8: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic,cut=.3)

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O    P    C20   C16   C15   C21
A1  C20  C20
A2  C20  C20  0.59
A3  C20  C20  0.65
A4  C20  C20  0.43
A5  C20  C20  0.65
C1  C15  C15              0.54
C2  C15  C15              0.62
C3  C15  C15              0.54
C4  C15  C15        0.31 -0.66
C5  C15  C15 -0.30  0.36 -0.59
E1  C20  C20 -0.50
E2  C20  C20 -0.61  0.34
E3  C20  C20  0.59             -0.39
E4  C20  C20  0.66
E5  C20  C20  0.50        0.40 -0.32
N1  C16  C16        0.76
N2  C16  C16        0.75
N3  C16  C16        0.74
N4  C16  C16 -0.34  0.62
N5  C16  C16        0.55
O1  C20  C21                   -0.53
O2  C21  C21                    0.44
O3  C20  C21  0.39             -0.62
O4  C21  C21                   -0.33
O5  C21  C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL


Table 9: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)

Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity =  1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with Likelihood Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.006 and the 90 % confidence intervals are 0.006 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                                MR2  MR1
Correlation of scores with factors             0.86 0.88
Multiple R square of scores with factors       0.74 0.77
Minimum correlation of possible factor scores  0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.44 -0.40 -0.37
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.07 -0.03  0.00  0.73  0.77  0.80
A4  0.10  0.15  0.20  0.40  0.44  0.48
A5 -0.02  0.02  0.06  0.57  0.62  0.67
C1  0.54  0.57  0.60 -0.09 -0.05 -0.02
C2  0.58  0.62  0.67 -0.04 -0.01  0.02
C3  0.48  0.54  0.58  0.00  0.03  0.08
C4 -0.69 -0.66 -0.60 -0.03  0.01  0.04
C5 -0.62 -0.57 -0.52 -0.08 -0.05 -0.02

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.27     0.32  0.37


coefficient), which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}$$

Consider the case of a four factor solution and four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 10).

Table 10: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))

     MR1  MR2  MR3  PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01 1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00 0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52 0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51 0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06 1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02 0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

4.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).


Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples the eigen values of random factors will all tend towards 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
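A sketch of that later discussion, under the assumptions that sim.minor takes nvar, nfact, and n arguments and returns the simulated scores in its observed element when n > 0:

> set.seed(17)
> sim <- sim.minor(nvar=24,nfact=4,n=500)   #4 major factors plus minor factors
> fa.parallel(sim$observed)    #how many factors are recovered?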


4.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 25). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")


Figure 25: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit  RMSEA   BIC   SABIC complex
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.0150  9648 10522.1     1.0
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.0100  5287  6084.5     1.2
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.0075  3200  3924.3     1.3
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.0055  1731  2385.1     1.4
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.0030   281   869.3     1.6
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.0018  -296   228.5     1.7
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.0013  -463     1.2     1.9
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.0010  -524  -117.6     1.9
  eChisq  SRMR eCRMS  eBIC
1  23881 0.119 0.125 21698
2  12432 0.086 0.094 10440
3   7232 0.066 0.075  5422
4   3750 0.047 0.057  2115
5   1495 0.030 0.038    27
6    670 0.020 0.027  -639
7    448 0.016 0.023  -711
8    289 0.013 0.020  -727

4.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 26). It is interesting to compare fa.parallel with the paran function from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly. By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
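A sketch of such a call (10 items keep the polychoric step fast; the default of 20 replications, noted above, is used):

> fa.parallel.poly(bfi[1:10])   #parallel analysis of polychoric correlations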


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")

Parallel analysis suggests that the number of factors = 6 and the number of components = 6


Figure 26: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


4.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 27).

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).

4.6 Exploratory Structural Equation Modeling (ESEM)

Generalizing the procedures of factor extension, we can do Exploratory Structural Equation Modeling (ESEM). Traditional Exploratory Factor Analysis (EFA) examines how latent variables can account for the correlations within a data set. All loadings and cross loadings are found, and rotation is done to some approximation of simple structure. Traditional Confirmatory Factor Analysis (CFA) tests such models by fitting just a limited number of loadings and typically does not allow any (or many) cross loadings. Structural Equation Modeling then applies two such measurement models, one to a set of X variables, another to a set of Y variables, and then tries to estimate the correlation between these two sets of latent variables. (Some SEM procedures estimate all the parameters from the same model, thus making the loadings in set Y affect those in set X.) It is possible to do a similar exploratory modeling (ESEM) by conducting two Exploratory Factor Analyses, one in set X, one in set Y, and then finding the correlations of the X factors with the Y factors, as well as the correlations of the Y variables with the X factors and the X variables with the Y factors.

Consider the simulated data set of three ability variables, two motivational variables, and three outcome variables:
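A sketch of code that generates this structure (the call appears in the Call line below; the specific loadings here are reverse-engineered from the population matrix and reliabilities, e.g., .9² = .81 for V, and so are a reconstruction):

> fx <- matrix(c(.9,.8,.6,0,0,  0,0,.6,.8,-.7),ncol=2)
> rownames(fx) <- c("V","Q","A","nach","Anx")
> fy <- matrix(c(.6,.5,.4),ncol=1)
> rownames(fy) <- c("gpa","Pre","MA")
> Phi <- matrix(c(1,0,.7,  0,1,.7,  .7,.7,1),ncol=3)
> gre.gpa <- sim.structural(fx,Phi,fy)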

Call: sim.structural(fx = fx, Phi = Phi, fy = fy)

$model (Population correlation matrix)
        V    Q     A  nach   Anx   gpa   Pre    MA
V    1.00 0.72  0.54  0.00  0.00  0.38  0.32  0.25
Q    0.72 1.00  0.48  0.00  0.00  0.34  0.28  0.22
A    0.54 0.48  1.00  0.48 -0.42  0.50  0.42  0.34
nach 0.00 0.00  0.48  1.00 -0.56  0.34  0.28  0.22
Anx  0.00 0.00 -0.42 -0.56  1.00 -0.29 -0.24 -0.20
gpa  0.38 0.34  0.50  0.34 -0.29  1.00  0.30  0.24
Pre  0.32 0.28  0.42  0.28 -0.24  0.30  1.00  0.20
MA   0.25 0.22  0.34  0.22 -0.20  0.24  0.20  1.00

$reliability (population reliability)
   V    Q    A nach  Anx  gpa  Pre   MA
0.81 0.64 0.72 0.64 0.49 0.36 0.25 0.16


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)


Figure 27: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.



We can fit this by using the esem function and then draw the solution (see Figure 28) using the esem.diagram function (which is normally called automatically by esem).

Exploratory Structural Equation Modeling Analysis using method =  minres
Call: esem(r = gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
    n.obs = 1000, plot = FALSE)

For the X set:
      MR1   MR2
V    0.91 -0.06
Q    0.81 -0.05
A    0.53  0.57
nach -0.10  0.81
Anx  0.08 -0.71

For the Y set:
    MR1
gpa 0.6
Pre 0.5
MA  0.4

Correlations between the X and Y sets.
     X1   X2   Y1
X1 1.00 0.19 0.68
X2 0.19 1.00 0.67
Y1 0.68 0.67 1.00

The degrees of freedom for the null model are 56 and the empirical chi square function was 6930.29
The degrees of freedom for the model are 7 and the empirical chi square function was 21.83
with prob < 0.0027
The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.04

with the empirical chi square 21.83 with prob < 0.0027
The total number of observations was 1000 with fitted Chi Square = 2175.06 with prob < 0

Empirical BIC = -26.53
ESABIC = -4.29
Fit based upon off diagonal values = 1
To see the item loadings for the X and Y sets combined, and the associated fa output, print with short=FALSE.

[Figure 28 diagram: "Exploratory Structural Model" -- the X set (V, Q, A, nach, Anx) loading on MR1 and MR2, the Y set (gpa, Pre, MA) loading on its own MR1, with paths of about .7 between the X and Y factors.]

Figure 28: An example of an Exploratory Structural Equation Model.


5 Classical Test Theory and Reliability

Surprisingly, 110 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V}_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha$$
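As a check on the formula, raw α can be computed directly from an item variance-covariance matrix; a minimal sketch using the built-in attitude data set (which is scored with the alpha function later in this section):

> C <- cov(attitude)                 #the item variance-covariance matrix
> n <- ncol(attitude)                #number of items
> (n/(n-1)) * (1 - tr(C)/sum(C))     #raw alpha; compare with alpha(attitude)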

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. Alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum(1 - r_{smc}^2)}{V_x}$$
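For standardized items, λ6 may be found directly from the squared multiple correlations returned by the smc function; a minimal sketch, taking Vx as the sum of the elements of the correlation matrix:

> R <- cor(attitude)
> 1 - sum(1 - smc(R))/sum(R)    #Guttman's lambda 6 for standardized items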

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.


More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lower bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

5.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure, and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that items 2 and 6 (complaints and critical) are to be scored (incorrectly) negatively.

> alpha(attitude,keys=c("complaints","critical"))

Reliability analysis
Call: alpha(x = attitude, keys = c("complaints", "critical"))

  raw_alpha std.alpha G6(smc) average_r  S/N  ase mean  sd
       0.18      0.27    0.66      0.05 0.37 0.19   53 4.7

 lower alpha upper     95% confidence boundaries
 -0.18  0.18  0.55

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r    S/N alpha se
rating         -0.023     0.090    0.52    0.0162  0.099    0.265
complaints-     0.689     0.666    0.76    0.2496  1.995    0.078
privileges     -0.133     0.021    0.58    0.0036  0.022    0.282
learning       -0.459    -0.251    0.41   -0.0346 -0.201    0.363
raises         -0.178    -0.062    0.47   -0.0098 -0.058    0.295
critical-       0.364     0.473    0.76    0.1299  0.896    0.139
advance        -0.137    -0.016    0.52   -0.0026 -0.016    0.258

 Item statistics
             n  raw.r  std.r r.cor r.drop mean   sd
rating      30  0.600  0.601  0.63   0.28   65 12.2
complaints- 30 -0.542 -0.559 -0.78  -0.75   50 13.3
privileges  30  0.679  0.663  0.57   0.39   53 12.2
learning    30  0.857  0.853  0.90   0.70   56 11.7
raises      30  0.719  0.730  0.77   0.50   65 10.4
critical-   30  0.015  0.036 -0.29  -0.27   42  9.9
advance     30  0.688  0.694  0.66   0.46   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if items 2 and 6 are dropped, then the reliability is improved substantially. This suggests that items 2 and 6 were incorrectly scored. Doing the analysis again with the items positively scored produces much more favorable results.

> alpha(attitude)

Reliability analysis
Call: alpha(x = attitude)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE)  #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0


5.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 29) or omegaSem for a confirmatory analysis using the sem package, based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
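Those steps can be carried out directly; a minimal sketch, assuming the r9 data simulated in section 5.1:

> sl <- schmid(cor(r9), 3)   #oblique factors followed by a Schmid-Leiman transformation
> round(sl$sl[,1], 2)        #the general factor loadings that enter into omega_h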

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal).

For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded regardless of the method chosen to estimate ωh is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the $\sqrt{r_{ab}}$ where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second" in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
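A sketch of these options for a two factor solution, again using the r9 data:

> om2 <- omega(r9, 2)                        #default: the two general loadings are set equal
> om2.first <- omega(r9, 2, option="first")  #equate g with the first group factor instead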

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness $u_j^2$ from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, $V_x$, into four parts: that due to a general factor, $\vec{g}$, that due to a set of group factors, $\vec{f}$ (factors common to some but not all of the items), specific factors, $\vec{s}$, unique to each item, and $\vec{e}$, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting $\vec{x} = \vec{c}g + \vec{A}\vec{f} + \vec{D}\vec{s} + \vec{e}$, then the communality of item $j$, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$, and the unique variance for the item, $u_j^2 = \sigma_j^2(1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item $j$, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}$$

Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor,

$$\omega_h = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{inf} = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}$$

> om9 <- omega(r9,title="9 simulated variables")

[Figure 29 diagram: "9 simulated variables" -- a bifactor solution with a general factor g (loadings of .3 to .7 on V1-V9) and three group factors, F1-F3.]

Figure 29: A bifactor solution for 9 simulated variables with a hierarchical structure.

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.


> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total:           0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57           0.41 0.50 0.50 0.66
V5 0.55           0.32 0.41 0.59 0.73
V6 0.43           0.27 0.28 0.72 0.66
V7 0.47      0.47      0.44 0.56 0.50
V8 0.43      0.42      0.36 0.64 0.51
V9 0.32      0.23      0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.001 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09
RMSEA index = 0.01 and the 90 % confidence intervals are 0.01 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19


5.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9,n.obs=500,lavaan=FALSE)

Call: omegaSem(m = r9, n.obs = 500, lavaan = FALSE)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total:           0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57           0.41 0.50 0.50 0.66
V5 0.55           0.32 0.41 0.59 0.73
V6 0.43           0.27 0.28 0.72 0.66
V7 0.47      0.47      0.44 0.56 0.50
V8 0.43      0.42      0.36 0.64 0.51
V9 0.32      0.23      0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.001 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09
RMSEA index = 0.01 and the 90 % confidence intervals are 0.01 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

The following analyses were done using the sem package

Omega Hierarchical from a confirmatory model using sem = 0.72
Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g   F1   F2   F3   h2   u2   p2
V1 0.74 0.40           0.71 0.29 0.77
V2 0.58 0.36           0.47 0.53 0.72
V3 0.56 0.42           0.50 0.50 0.63
V4 0.57           0.45 0.53 0.47 0.61
V5 0.58           0.25 0.40 0.60 0.84
V6 0.43           0.26 0.26 0.74 0.71
V7 0.49      0.38      0.38 0.62 0.63
V8 0.44      0.45      0.39 0.61 0.50
V9 0.34      0.24      0.17 0.83 0.68

With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33

general/max  5.51   max/min =  1.41
mean percent general = 0.68   with sd = 0.1 and cv of 0.15
Explained Common Variance of the general factor = 0.68

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.87  0.59  0.59  0.55
Multiple R square of scores with factors      0.75  0.35  0.34  0.30
Minimum correlation of factor score estimates 0.50 -0.31 -0.31 -0.39

Total, General and Subset omega for each subset
                                                 g   F1   F2   F3
Omega total for total scores and subscales    0.83 0.79 0.57 0.65
Omega general for total scores and subscales  0.72 0.57 0.33 0.48
Omega group for total scores and subscales    0.11 0.22 0.24 0.18

To get the standard sem fit statistics, ask for summary on the fitted object.
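A sketch of doing so (the $sem slot name is an assumption; use str on the returned object to locate the fitted sem model):

> om9.cfa <- omegaSem(r9, n.obs=500, lavaan=FALSE)  #save the result rather than just printing it
> summary(om9.cfa$sem)                              #standard sem fit statistics (assumed slot name)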

5.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf and guttman functions. These are described in more detail in Revelle and Zinbarg (2009) and in Revelle and Condon (2014). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72

5.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

5.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function and a list of items to be scored on each scale (a keys.list). Items may be listed by location (convenient, but dangerous) or name (probably safer).

Make a keys.list by specifying the items for each scale, preceding items to be negatively keyed with a - sign.

> #the newer way is probably preferred
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+       conscientious=c("C1","C2","C2","-C4","-C5"),
+       extraversion=c("-E1","-E2","E3","E4","E5"),
+       neuroticism=c("N1","N2","N3","N4","N5"),
+       openness = c("O1","-O2","O3","O4","-O5"))
> #this can also be done by location--
> keys.list <- list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+       Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+       Openness = c(21,-22,23,24,-25))
> #These two approaches can be mixed if desired
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),conscientious=c("C1","C2","C2","-C4","-C5"),
+       extraversion=c("-E1","-E2","E3","E4","E5"),
+       neuroticism=c(16:20),openness = c(21,-22,23,24,-25))
> keys.list

$agree
[1] "-A1" "A2"  "A3"  "A4"  "A5"

$conscientious
[1] "C1"  "C2"  "C2"  "-C4" "-C5"

$extraversion
[1] "-E1" "-E2" "E3"  "E4"  "E5"

$neuroticism
[1] 16 17 18 19 20

$openness
[1] 21 -22 23 24 -25

In the past (prior to version 1.6.9), the keys.list was then converted to a keys matrix using the helper function make.keys. This is no longer necessary. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
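To report total rather than average scores, set the totals option of scoreItems; a minimal sketch:

> total.scores <- scoreItems(keys.list, bfi, totals=TRUE)  #sum scores instead of the default item averages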

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copied into R.) Although not normally necessary, show the keys to understand what is happening. There are two ways to make up the keys. You can specify the items by location (the old way) or by name (the newer and probably preferred way). To use the newer way you must specify the file on which you will use the keys. The example below shows how to construct keys either way.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. This is done automatically in the new way of forming keys, but if using the older way, the number must be specified. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

Then use this keys list to score the items:

> scores <- scoreItems(keys.list,bfi)
> scores

Call: scoreItems(keys = keys.list, items = bfi)

(Unstandardized) Alpha:
      agree conscientious extraversion neuroticism openness
alpha   0.7          0.69         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    agree conscientious extraversion neuroticism openness
ASE 0.014         0.016        0.013       0.011    0.017

Average item correlation:
          agree conscientious extraversion neuroticism openness
average.r  0.32          0.35         0.39        0.46     0.23

Guttman 6* reliability:
         agree conscientious extraversion neuroticism openness
Lambda.6   0.7          0.69         0.76        0.81      0.6

Signal/Noise based upon av.r:
             agree conscientious extraversion neuroticism openness
Signal/Noise   2.3           2.2          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.70          0.36         0.63      -0.245     0.23
conscientious  0.25          0.69         0.37      -0.334     0.33
extraversion   0.46          0.27         0.76      -0.284     0.32
neuroticism   -0.18         -0.25        -0.22       0.812    -0.12
openness       0.15          0.21         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE. To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores (see Figure 30).
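A sketch of asking for those additional elements by name:

> names(scores)             #list the components returned by scoreItems
> headTail(scores$scores)   #the first and last few rows of the individual scale scores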

5.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use scoreItems:

> r.bfi <- cor(bfi,use="pairwise")
> scales <- scoreItems(keys.list,r.bfi)
> summary(scales)

Call: scoreItems(keys = keys.list, items = r.bfi)

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.71          0.35         0.64      -0.242     0.25
conscientious  0.24          0.69         0.38      -0.314     0.33
extraversion   0.47          0.27         0.76      -0.278     0.35
neuroticism   -0.18         -0.24        -0.22       0.815    -0.11
openness       0.16          0.22         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.

> png('scores.png')
> pairs.panels(scores$scores,pch='.',jiggle=TRUE)
> dev.off()

Figure 30: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.

5.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors), and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys,iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys,iqitems,score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
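For example, a minimal sketch (the grouping of the 16 items into four content scales, and the object name iq.keys.list, are assumptions for illustration):

> iq.keys.list <- list(reasoning=c("reason.4","reason.16","reason.17","reason.19"),
+      letters=c("letter.7","letter.33","letter.34","letter.58"),
+      matrices=c("matrix.45","matrix.46","matrix.47","matrix.55"),
+      rotation=c("rotate.3","rotate.4","rotate.6","rotate.8"))
> iq.scales <- scoreItems(iq.keys.list, iq.tf)   #score the 0/1 items on four scales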


5.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 31).

5.6.1 Exploring the item structure of scales

The Big Five scales found above can be understood in terms of the item - whole correlations, but it is also useful to think of the endorsement frequency of the items. The item.lookup function will sort items by their factor loading/item-whole correlation, and then resort those above a certain threshold in terms of the item means. Item content is shown by using the dictionary developed for those items. This allows one to see the structure of each scale in terms of its endorsement range. This is a simple way of thinking of items that is also possible to do using the various IRT approaches discussed later.

> m <- colMeans(bfi,na.rm=TRUE)
> item.lookup(scales$item.corrected[,1:3],m,dictionary=bfi.dictionary[1:2])

     agree conscientious extraversion means ItemLabel
A1   -0.40         -0.06        -0.10  2.41     q_146
E1   -0.31         -0.07        -0.59  2.97     q_712
E2   -0.39         -0.26        -0.70  3.14     q_901
E3    0.45          0.21         0.61  4.00    q_1205
E4    0.51          0.24         0.68  4.42    q_1410
E5    0.35          0.40         0.56  4.42    q_1768
A5    0.62          0.22         0.55  4.56    q_1419
A3    0.70          0.22         0.48  4.60    q_1206
A4    0.49          0.30         0.30  4.70    q_1364
A2    0.67          0.21         0.40  4.80    q_1162
C4   -0.23         -0.67        -0.23  2.55     q_626
N4   -0.22         -0.31        -0.39  3.19    q_1479
C5   -0.26         -0.58        -0.29  3.30    q_1949
C3    0.21          0.56         0.15  4.30     q_619
C2    0.21          0.61         0.18  4.37     q_530
E5.1  0.35          0.40         0.56  4.42    q_1768
C1    0.13          0.54         0.20  4.50     q_124
E1.1 -0.31         -0.07        -0.59  2.97     q_712
E2.1 -0.39         -0.26        -0.70  3.14     q_901
N4.1 -0.22         -0.31        -0.39  3.19    q_1479
E3.1  0.45          0.21         0.61  4.00    q_1205
E4.1  0.51          0.24         0.68  4.42    q_1410


> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[Figure 31 plots: four panels (reason.4, reason.16, reason.17, reason.19) showing P(response) for each response alternative (0-6) as a function of theta.]

Figure 31: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


E5.2  0.35          0.40         0.56  4.42    q_1768
O3    0.26          0.22         0.43  4.44     q_492
A5.1  0.62          0.22         0.55  4.56    q_1419
A3.1  0.70          0.22         0.48  4.60    q_1206
A2.1  0.67          0.21         0.40  4.80    q_1162
O1    0.17          0.21         0.33  4.82     q_128

     Item
A1   Am indifferent to the feelings of others.
E1   Don't talk a lot.
E2   Find it difficult to approach others.
E3   Know how to captivate people.
E4   Make friends easily.
E5   Take charge.
A5   Make people feel at ease.
A3   Know how to comfort others.
A4   Love children.
A2   Inquire about others' well-being.
C4   Do things in a half-way manner.
N4   Often feel blue.
C5   Waste my time.
C3   Do things according to a plan.
C2   Continue until everything is perfect.
E5.1 Take charge.
C1   Am exacting in my work.
E1.1 Don't talk a lot.
E2.1 Find it difficult to approach others.
N4.1 Often feel blue.
E3.1 Know how to captivate people.
E4.1 Make friends easily.
E5.2 Take charge.
O3   Carry the conversation to a higher level.
A5.1 Make people feel at ease.
A3.1 Know how to comfort others.
A2.1 Inquire about others' well-being.
O1   Am full of ideas.

5.6.2 Empirical scale construction

There are some situations where one wants to identify those items that most relate to a particular criterion. Although this will capitalize on chance and the results should be interpreted cautiously, it does give a feel for what is being measured. Consider the following example from the bfi data set. The items that best predicted gender, education, and age may be found using the bestScales function. This also shows the use of a dictionary that has the item content.

> data(bfi)
> bestScales(bfi,criteria=c("gender","education","age"),cut=.1,dictionary=bfi.dictionary[1:3])

The items most correlated with the criteria yield r's of:
          correlation n.items
gender           0.32       9
education        0.14       1
age              0.24       9

The best items, their correlations and content are:

$gender
  Rownames     gender ItemLabel                                       Item     Giant3
1       N5  0.2106171    q_1505                              Panic easily.   Stability
2       A2  0.1820202    q_1162          Inquire about others' well-being.   Cohesion
3       A1 -0.1571373     q_146  Am indifferent to the feelings of others.   Cohesion
4       A3  0.1401320    q_1206                Know how to comfort others.   Cohesion
5       A4  0.1261747    q_1364                             Love children.   Cohesion
6       E1 -0.1261018     q_712                          Don't talk a lot. Plasticity
7       N3  0.1207718    q_1099                 Have frequent mood swings.  Stability
8       O1 -0.1031059     q_128                          Am full of ideas. Plasticity
9       A5  0.1007091    q_1419                  Make people feel at ease.   Cohesion

$education
  Rownames  education ItemLabel                                       Item   Giant3
1       A1 -0.1415734     q_146  Am indifferent to the feelings of others. Cohesion

$age
  Rownames        age ItemLabel                                       Item     Giant3
1       A1 -0.1609637     q_146  Am indifferent to the feelings of others.   Cohesion
2       C4 -0.1482659     q_626            Do things in a half-way manner.  Stability
3       A4  0.1442473    q_1364                             Love children.   Cohesion
4       A5  0.1290935    q_1419                  Make people feel at ease.   Cohesion
5       E5  0.1146922    q_1768                               Take charge. Plasticity
6       A2  0.1142767    q_1162          Inquire about others' well-being.   Cohesion
7       N3 -0.1108648    q_1099                 Have frequent mood swings.  Stability
8       E2 -0.1051612     q_901      Find it difficult to approach others. Plasticity
9       N5 -0.1043322    q_1505                              Panic easily.  Stability

6 Item Response Theory analysis

Item Response Theory is often said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 32).


6.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

$$\delta_j = \frac{D\tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \qquad (2)$$

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}$$
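A numeric sketch of equation 2 in the normal ogive metric (D = 1), for an item with an assumed loading of .6 and threshold of .5:

> lambda <- .6; tau <- .5
> (alpha.j <- lambda/sqrt(1 - lambda^2))  #discrimination: 0.75
> (delta.j <- tau/sqrt(1 - lambda^2))     #difficulty: 0.625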

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)
> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.043 and the 90 % confidence intervals are 0.043 0.218
BIC = 1009.07

Similar analyses can be done for polytomous items such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.009 and the 90 % confidence intervals are 0.009 0.111
BIC = 96.23

The item information functions show that not all of the items are equally good (Figure 33).


> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[Figure 32 plots: three panels against the latent trait (normal scale) -- "Item parameters from factor analysis" (probability of response for V1-V9), "Item information from factor analysis", and "Test information -- item parameters from factor analysis" (with reliability marked on the right axis).]

Figure 32: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt,type="IIC")

[Figure 33 plot: "Item information from factor analysis for factor 1" -- item information curves for -E1, -E2, E3, E4, and E5 against the latent trait (normal scale).]

Figure 33: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


These procedures can be generalized to more than one factor by specifying the number of factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm, and these should be used for serious Item Response Theory analysis.

6.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
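A sketch of this reuse, anticipating the ability example below:

> iq.irt <- irt.fa(ability)   #slow: finds and saves the tetrachoric correlations
> om <- omega(iq.irt$rho, 4)  #fast: reuses the saved correlation matrix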

In addition, recent releases of psych take advantage of the parallel package and use multiple cores. The default for Macs and Unix machines is to use two cores, but this can be increased using the options command. The biggest step up in improvement is from 1 to 2 cores, but for large problems using polychoric correlations, the more cores available, the better.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 34) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 36). We can also show the total test information (merely the sum of the item information, Figure 35). This shows that even with just 16 items the test is very reliable for most of the range of ability. The irt.fa function saves the correlation matrix and item statistics so that they can be redrawn with other options.

> detectCores()          #how many are available
> options("mc.cores")    #how many have been set to be used
> options(mc.cores=4)    #set to use 4 cores

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = ability)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
reason.4    0.05  0.24  0.64  0.53  0.16  0.03  0.01
reason.16   0.08  0.22  0.38  0.31  0.14  0.05  0.01
reason.17   0.08  0.33  0.69  0.42  0.11  0.02  0.00
reason.19   0.06  0.17  0.35  0.36  0.19  0.07  0.02
letter.7    0.05  0.18  0.41  0.44  0.20  0.06  0.02
letter.33   0.05  0.15  0.31  0.36  0.20  0.08  0.02
letter.34   0.05  0.19  0.45  0.46  0.20  0.06  0.01
letter.58   0.02  0.09  0.30  0.53  0.35  0.12  0.03
matrix.45   0.05  0.11  0.19  0.23  0.17  0.09  0.04
matrix.46   0.05  0.12  0.22  0.26  0.18  0.09  0.04
matrix.47   0.06  0.17  0.33  0.34  0.19  0.07  0.02
matrix.55   0.04  0.08  0.13  0.17  0.16  0.11  0.06
rotate.3    0.00  0.01  0.06  0.30  0.75  0.49  0.12
rotate.4    0.00  0.01  0.05  0.31  1.00  0.56  0.10
rotate.6    0.01  0.03  0.15  0.53  0.69  0.25  0.05
rotate.8    0.00  0.02  0.08  0.29  0.59  0.41  0.13
Test Info   0.67  2.11  4.73  5.83  5.28  2.55  0.69
SEM         1.22  0.69  0.46  0.41  0.44  0.63  1.20
Reliability -0.49 0.53  0.79  0.83  0.81  0.61 -0.45

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.91
The number of observations was 1525 with Chi Square = 2893.45 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.723
RMSEA index = 0.018 and the 90 % confidence intervals are 0.018 0.137
BIC = 2131.15

6.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for scoreIrt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty.


> iq.irt <- irt.fa(ability)

[Figure 34 plot: "Item information from factor analysis" -- item information curves for the 16 ability items (reason.4 ... rotate.8) against the latent trait (normal scale).]

Figure 34: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


> plot(iq.irt,type="test")

[Figure 35 plot: "Test information -- item parameters from factor analysis" -- total test information against the latent trait (normal scale), with reliability values (0.31, 0.66, 0.77, 0.83) marked on the right axis.]

Figure 35: A graphic analysis of 16 ability items sampled from the SAPA project. The total test information at all levels of difficulty may be shown by specifying the type='test' option in the plot function.


> om <- omega(iq.irt$rho,4)

[Figure 36 diagram: "Omega" -- a bifactor solution for the 16 ability items with a general factor g (loadings of about .4 to .7) and four group factors corresponding to the rotation, letter, matrix, and reasoning items.]

Figure 36: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.


The scoreIrt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9,1000,-2,2,mod="normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- scoreIrt(test,items)
> unitweighted <- scoreIrt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")

These results are seen in Figure 37.

6.3.1 1 versus 2 parameter IRT scoring

In Item Response Theory, items can be assumed to be equally discriminating but to differ in their difficulty (the Rasch model), or to vary in their discriminability. Two functions (scoreIrt.1pl and scoreIrt.2pl) are meant to find multiple IRT based scales using the Rasch model or the 2 parameter model. Both allow for negatively keyed as well as positively keyed items. Consider the bfi data set with scoring keys keys.list and items listed as an item.list. (This is the same as the keys.list, but with the negative signs removed.)

> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+       conscientious=c("C1","C2","C3","-C4","-C5"),
+       extraversion=c("-E1","-E2","E3","E4","E5"),
+       neuroticism=c("N1","N2","N3","N4","N5"),
+       openness = c("O1","-O2","O3","O4","-O5"))
> item.list <- list(agree=c("A1","A2","A3","A4","A5"),
+       conscientious=c("C1","C2","C3","C4","C5"),
+       extraversion=c("E1","E2","E3","E4","E5"),
+       neuroticism=c("N1","N2","N3","N4","N5"),
+       openness = c("O1","O2","O3","O4","O5"))
> bfi.1pl <- scoreIrt.1pl(keys.list,bfi)  #the one parameter solution
> bfi.2pl <- scoreIrt.2pl(item.list,bfi)  #the two parameter solution
> bfi.ctt <- scoreFast(keys.list,bfi)     #fast scoring function


> pairs.panels(scores.df,pch='.',gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring',line=3)

[Figure 37 plot: a pairs.panels display of True theta, irt theta, total, fit, rasch, total, and fit, with the pairwise correlations (e.g., r = 0.83 between True theta and irt theta) shown in the upper panels.]

Figure 37: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.

101

We can compare these three ways of doing the analysis using the cor2 function, which correlates two separate data frames. All three models produce very similar results for the case of almost complete data. It is when we have massively missing completely at random data (MMCAR) that the results show the superiority of the IRT scoring.

> #compare the solutions using the cor2 function
> cor2(bfi.1pl, bfi.ctt)

              agree conscientious extraversion neuroticism openness
agree          0.93          0.27         0.44       -0.20     0.19
conscientious  0.25          0.97         0.25       -0.22     0.17
extraversion   0.43          0.25         0.94       -0.23     0.23
neuroticism   -0.19         -0.23        -0.22        1.00    -0.09
openness       0.12          0.19         0.19       -0.07     0.95

> cor2(bfi.2pl, bfi.ctt)

              agree conscientious extraversion neuroticism openness
agree          0.98          0.25         0.49       -0.17     0.15
conscientious  0.26          0.97         0.25       -0.23     0.18
extraversion   0.43          0.24         0.94       -0.25     0.21
neuroticism   -0.19         -0.22        -0.18        0.98    -0.08
openness       0.17          0.21         0.27       -0.11     0.98

> cor2(bfi.2pl, bfi.1pl)

              agree conscientious extraversion neuroticism openness
agree          0.88          0.25         0.45       -0.17     0.12
conscientious  0.26          1.00         0.24       -0.23     0.16
extraversion   0.43          0.23         0.99       -0.25     0.20
neuroticism   -0.21         -0.21        -0.19        0.98    -0.07
openness       0.20          0.19         0.28       -0.11     0.92

7 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.


7.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

$$r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \qquad (3)$$

where $r_{xy}$ is the normal correlation, which may be decomposed into a within group and a between group correlation, $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and $\eta$ (eta) is the correlation of the data with the within group values or the group means.

7.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
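
For example, a minimal sketch of the first case, assuming the rbg and rwg elements that statsBy returns when cors=TRUE:

sb <- statsBy(sat.act, group="education", cors=TRUE)
round(sb$rbg, 2)   #the weighted correlations of the group means (between groups)
round(sb$rwg, 2)   #the pooled within group correlations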


7.3 Factor analysis by groups

Confirmatory factor analysis comparing the structures in multiple groups can be done in the lavaan package. However, for exploratory analyses of the structure within each of multiple groups, the faBy function may be used in combination with the statsBy function. First run statsBy with the correlation option set to TRUE, and then run faBy on the resulting output.

sb <- statsBy(bfi[c(1:25,27)], group="education", cors=TRUE)

faBy(sb, nfactors=5)   #find the 5 factor solution for each education level

8 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

$$R^2 = 1 - \prod_{i=1}^{n}(1 - \lambda_i)$$

where $\lambda_i$ is the $i$th eigenvalue of the eigenvalue decomposition of the matrix $R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}$.

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. Although the set correlation can then be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized or raw β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272, Adjusted R-squared:  0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setCor.

> #compare with setCor
> setCor(c(4:6), c(1:3), C, n.obs=700)

Call: setCor(y = c(4:6), x = c(1:3), data = C, n.obs = 700)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations

Squared Canonical Correlations
[1] 0.050 0.033 0.008

Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03
 Cohens Set Correlation R2  =  0.09
 Shrunken Set Correlation R2  =  0.08
 F and df of Cohens Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets =  0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
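
This symmetry is easy to check by reversing the roles of the two sets; a brief sketch, reusing the covariance matrix C from above:

setCor(c(1:3), c(4:6), C, n.obs=700)   #swap the predictor and criterion sets

The β weights and multiple R2 values will of course differ, but the reported set correlation R2 should match the 0.09 found above.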


9 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of reference. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.
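
To see this implied pattern, such a correlation matrix is easy to construct directly; a minimal sketch in base R (6 variables and r = .8 are arbitrary choices):

r <- .8
R.simplex <- r^abs(outer(1:6, 1:6, "-"))   #r^|i-j|: r between adjacent variables, r^n for variables n apart
round(R.simplex, 2)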

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
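
A hedged sketch of this use, assuming (as with the other sim functions) that sim.minor returns the simulated raw data in an observed element when the number of cases n is specified:

set.seed(42)
minor <- sim.minor(nvar=12, nfact=3, n=500)   #3 major factors plus minor factors
fa.parallel(minor$observed)   #do the 3 major factors still stand out?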

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

$$P(x | \theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \qquad (4)$$

where $\gamma$ is the lower asymptote or guessing parameter, $\zeta$ is the upper asymptote (normally 1), $\alpha_j$ is item discrimination and $\delta_j$ is item difficulty. For the 1 Parameter Logistic (Rasch) model, $\gamma = 0$, $\zeta = 1$, $\alpha = 1$, and item difficulty is the only free parameter to specify.
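
Equation 4 is straightforward to express directly in R; a minimal sketch (the function name irt.4pl is purely illustrative and is not a psych function):

irt.4pl <- function(theta, delta, alpha=1, gamma=0, zeta=1)
    gamma + (zeta - gamma)/(1 + exp(alpha*(delta - theta)))
irt.4pl(theta=0, delta=0)   #a Rasch item at its own difficulty: p = .5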

(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same.


With the $a = \alpha$ parameter = 1.702 in the logistic model, the two models are practically identical.
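
This near equivalence is easy to verify numerically; a brief sketch:

theta <- seq(-4, 4, .01)
max(abs(pnorm(theta) - 1/(1 + exp(-1.702 * theta))))   #about .0095, so the two curves nearly coincide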

10 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g., (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)

c <- iclust(Thurstone)
plot(c)              #a pretty boring plot
iclust.diagram(c)    #a better diagram

c3 <- iclust(Thurstone, 3)
plot(c3)             #a more interesting plot

data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)

ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.

11 Converting output to APA style tables using LaTeX

Although for most purposes using the Sweave or KnitR packages produces clean output, some prefer output pre formatted for APA style tables. This can be done using the xtable package for almost anything, but there are a few simple functions in psych for the most common tables. fa2latex will convert a factor analysis or components analysis output to a LaTeX table, cor2latex will take a correlation matrix and show the lower (or upper) diagonal, irt2latex converts the item statistics from the irt.fa function to more convenient LaTeX output, and finally, df2latex converts a generic data frame to LaTeX.

An example of converting the output from fa to LaTeX appears in Table 11.
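
A sketch of how such a table may be generated (the heading argument here is an assumption; see ?fa2latex for the actual options):

f3 <- fa(Thurstone, 3)
fa2latex(f3, heading="A factor analysis table from the psych package in R")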


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double", -1)  #for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")              #but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$top, scale=-1)
> dia.curved.arrow(mr, lr$top)


Figure 38: The basic graphic capabilities of the dia functions are shown in this figure.

Table 11: fa2latex: A factor analysis table from the psych package in R

Variable          MR1   MR2   MR3   h2   u2   com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.01
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.01
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.00
First.Letters    0.00  0.86  0.00 0.73 0.27 1.00
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.04
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.20
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.00
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.93
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.23

SS loadings      2.64  1.86  1.50

MR1  1.00 0.59 0.54
MR2  0.59 1.00 0.52
MR3  0.54 0.52 1.00

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.


geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headTail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headTail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems. (See the brief sketch after this list.)
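
As a brief illustration of two of the helpers above (a sketch):

fisherz(.5)           #0.55, i.e., atanh(.5)
A <- matrix(1, 2, 2)
B <- matrix(2, 3, 3)
superMatrix(A, B)     #A on the top left, B on the lower right, 0s elsewhere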

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.


Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1984), this data set allows for examples of basic scaling techniques.


14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.5.0", package="psych")

15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book, An introduction to Psychometric Theory with Applications in R (Revelle, prep)). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()
R Under development (unstable) (2017-01-04 r71889)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GPArotation_2014.11-1 psych_1.6.12

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      lattice_0.20-34  matrixcalc_1.0-3 MASS_7.3-45
 [5] grid_3.4.0       arm_1.8-6        nlme_3.1-128     stats4_3.4.0
 [9] coda_0.18-1      minqa_1.2.4      mi_1.0           nloptr_1.0.4
[13] Matrix_1.2-7.1   boot_1.3-18      splines_3.4.0    lme4_1.1-12
[17] sem_3.1-7        tools_3.4.0      foreign_0.8-67   abind_1.4-3
[21] parallel_3.4.0   compiler_3.4.0   mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel modeling in R (2.3): a brief introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England.

Fox, J., Nie, Z., and Byrnes, J. (2012). sem: Structural Equation Models.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, pages 1–13. 10.1007/s11336-011-9218-4.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1984). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

Preacher, K. J. and Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4):717–731.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2015). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. R package version 1.5.8.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W. and Condon, D. M. (2014). Reliability. In Irwing, P., Booth, T., and Hughes, D., editors, Wiley-Blackwell Handbook of Psychometric Testing. Wiley-Blackwell (in press).

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Smillie, L. D., Cooper, A., Wilt, J., and Revelle, W. (2012). Do extraverts get more bang for the buck? Refining the affective-reactivity hypothesis of extraversion. Journal of Personality and Social Psychology, 103(2):306–326.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2):245–251.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.



  • Jump starting the psych packagendasha guide for the impatient
  • Overview of this and related documents
  • Getting started
  • Basic data analysis
    • Data input from the clipboard
    • Basic descriptive statistics
      • Outlier detection using outlier
      • Basic data cleaning using scrub
      • Recoding categorical variables into dummy coded variables
        • Simple descriptive graphics
          • Scatter Plot Matrices
          • Density or violin plots
          • Means and error bars
          • Error bars for tabular data
          • Two dimensional displays of means and errors
          • Back to back histograms
          • Correlational structure
          • Heatmap displays of correlational structure
            • Testing correlations
            • Polychoric tetrachoric polyserial and biserial correlations
            • Multiple regression from data or correlation matrices
            • Mediation and Moderation analysis
              • Item and scale analysis
                • Dimension reduction through factor analysis and cluster analysis
                  • Minimum Residual Factor Analysis
                  • Principal Axis Factor Analysis
                  • Weighted Least Squares Factor Analysis
                  • Principal Components analysis (PCA)
                  • Hierarchical and bi-factor solutions
                  • Item Cluster Analysis iclust
                    • Confidence intervals using bootstrapping techniques
                    • Comparing factorcomponentcluster solutions
                    • Determining the number of dimensions to extract
                      • Very Simple Structure
                      • Parallel Analysis
                        • Factor extension
                        • Exploratory Structural Equation Modeling (ESEM)
                          • Classical Test Theory and Reliability
                            • Reliability of a single scale
                            • Using omega to find the reliability of a single scale
                            • Estimating h using Confirmatory Factor Analysis
                              • Other estimates of reliability
                                • Reliability and correlations of multiple scales within an inventory
                                  • Scoring from raw data
                                  • Forming scales from a correlation matrix
                                    • Scoring Multiple Choice Items
                                    • Item analysis
                                      • Exploring the item structure of scales
                                      • Empirical scale construction
                                          • Item Response Theory analysis
                                            • Factor analysis and Item Response Theory
                                            • Speeding up analyses
                                            • IRT based scoring
                                              • 1 versus 2 parameter IRT scoring
                                                  • Multilevel modeling
                                                    • Decomposing data into within and between level correlations using statsBy
                                                    • Generating and displaying multilevel data
                                                    • Factor analysis by groups
                                                      • Set Correlation and Multiple Regression from the correlation matrix
                                                      • Simulation functions
                                                      • Graphical Displays
                                                      • Converting output to APA style tables using LaTeX
                                                      • Miscellaneous functions
                                                      • Data sets
                                                      • Development version and a users guide
                                                      • Psychometric Theory
                                                      • SessionInfo
Page 5: An overview of the psych package - The Personality Project

9 Factor analyze (see section 41) the data with a specified number of factors (thedefault is 1) the default method is minimum residual the default rotation for morethan one factor is oblimin There are many more possibilities (see sections 411-413)Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm(Revelle 1979) (see section 416) Also consider a hierarchical factor solution to findcoefficient ω (see 415)

fa(myData)

iclust(myData)

omega(myData)

If you prefer to do a principal components analysis you may use the principalfunction The default is one component

principal(myData)

10 Some people like to find coefficient α as an estimate of reliability This may be donefor a single scale using the alpha function (see 51) Perhaps more useful is theability to create several scales as unweighted averages of specified items using thescoreItems function (see 54) and to find various estimates of internal consistencyfor these scales find their intercorrelations and find scores for all the subjects

alpha(myData) score all of the items as part of one scale

myKeys lt- makekeys(nvar=20list(first = c(1-35-7810)second=c(24-61115-16)))

myscores lt- scoreItems(myKeysmyData) form several scales

myscores show the highlights of the results

At this point you have had a chance to see the highlights of the psych package and todo some basic (and advanced) data analysis You might find reading this entire vignettehelpful to get a broader understanding of what can be done in R using the psych Rememberthat the help command () is available for every function Try running the examples foreach help page

5

1 Overview of this and related documents

The psych package (Revelle 2015) has been developed at Northwestern University since2005 to include functions most useful for personality psychometric and psychological re-search The package is also meant to supplement a text on psychometric theory (Revelleprep) a draft of which is available at httppersonality-projectorgrbook

Some of the functions (eg readclipboard describe pairspanels scatterhisterrorbars multihist bibars) are useful for basic data entry and descriptive analy-ses

Psychometric applications emphasize techniques for dimension reduction including factoranalysis cluster analysis and principal components analysis The fa function includesfive methods of factor analysis (minimum residual principal axis weighted least squaresgeneralized least squares and maximum likelihood factor analysis) Principal ComponentsAnalysis (PCA) is also available through the use of the principal function Determiningthe number of factors or components to extract may be done by using the Very SimpleStructure (Revelle and Rocklin 1979) (vss) Minimum Average Partial correlation (Velicer1976) (MAP) or parallel analysis (faparallel) criteria These and several other criteriaare included in the nfactors function Two parameter Item Response Theory (IRT)models for dichotomous or polytomous items may be found by factoring tetrachoric

or polychoric correlation matrices and expressing the resulting parameters in terms oflocation and discrimination using irtfa

Bifactor and hierarchical factor structures may be estimated by using Schmid Leimantransformations (Schmid and Leiman 1957) (schmid) to transform a hierarchical factorstructure into a bifactor solution (Holzinger and Swineford 1937)

Scale construction can be done using the Item Cluster Analysis (Revelle 1979) (iclust)function to determine the structure and to calculate reliability coefficients α (Cronbach1951)(alpha scoreItems scoremultiplechoice) β (Revelle 1979 Revelle and Zin-barg 2009) (iclust) and McDonaldrsquos ωh and ωt (McDonald 1999) (omega) Guttmanrsquos sixestimates of internal consistency reliability (Guttman (1945) as well as additional estimates(Revelle and Zinbarg 2009) are in the guttman function The six measures of Intraclasscorrelation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairspanels cor-relation ldquoheat mapsrdquo (corPlot) factor cluster and structural diagrams using fadiagramiclustdiagram structurediagram and hetdiagram as well as item response charac-teristics and item and test information characteristic curves plotirt and plotpoly

This vignette is meant to give an overview of the psych package That is it is meantto give a summary of the main functions in the psych package with examples of how

6

they are used for data description dimension reduction and scale construction The ex-tended user manual at psych_manualpdf includes examples of graphic output and moreextensive demonstrations than are found in the help menus (Also available at http

personality-projectorgrpsych_manualpdf) The vignette psych for sem atpsych_for_sempdf discusses how to use psych as a front end to the sem package of JohnFox (Fox et al 2012) (The vignette is also available at httppersonality-project

orgrbookpsych_for_sempdf)

For a step by step tutorial in the use of the psych package and the base functions inR for basic personality research see the guide for using R for personality research athttppersonalitytheoryorgrrshorthtml For an introduction to psychometrictheory with applications in R see the draft chapters at httppersonality-project

orgrbook)

2 Getting started

Some of the functions described in this overview require other packages Particularlyuseful for rotating the results of factor analyses (from eg fa factorminres factorpafactorwls or principal) or hierarchical factor models using omega or schmid is theGPArotation package These and other useful packages may be installed by first installingand then using the task views (ctv) package to install the ldquoPsychometricsrdquo task view butdoing it this way is not necessary

installpackages(ctv)

library(ctv)

taskviews(Psychometrics)

The ldquoPsychometricsrdquo task view will install a large number of useful packages To installthe bare minimum for the examples in this vignette it is necessary to install just 3 pack-ages

installpackages(list(c(GPArotationmnormt)

Because of the difficulty of installing the package Rgraphviz alternative graphics have beendeveloped and are available as diagram functions If Rgraphviz is available some functionswill take advantage of it An alternative is to useldquodotrdquooutput of commands for any externalgraphics package that uses the dot language

7

3 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptivestatistics

Remember to run any of the psych functions it is necessary to make the package activeby using the library command

gt library(psych)

The other packages once installed will be called automatically by psych

It is possible to automatically load psych and other functions by creating and then savinga ldquoFirstrdquo function eg

First lt- function(x) library(psych)

31 Data input from the clipboard

There are of course many ways to enter data into R Reading from a local file usingreadtable is perhaps the most preferred However many users will enter their datain a text editor or spreadsheet program and then want to copy and paste into R Thismay be done by using readtable and specifying the input file as ldquoclipboardrdquo (PCs) orldquopipe(pbpaste)rdquo (Macs) Alternatively the readclipboard set of functions are perhapsmore user friendly

readclipboard is the base function for reading data from the clipboard

readclipboardcsv for reading text that is comma delimited

readclipboardtab for reading text that is tab delimited (eg copied directly from anExcel file)

readclipboardlower for reading input of a lower triangular matrix with or without adiagonal The resulting object is a square matrix

readclipboardupper for reading input of an upper triangular matrix

readclipboardfwf for reading in fixed width fields (some very old data sets)

For example given a data set copied to the clipboard from a spreadsheet just enter thecommand

gt mydata lt- readclipboard()

This will work if every data field has a value and even missing data are given some values(eg NA or -999) If the data were entered in a spreadsheet and the missing values

8

were just empty cells then the data should be read in as a tab delimited or by using thereadclipboardtab function

gt mydata lt- readclipboard(sep=t) define the tab option or

gt mytabdata lt- readclipboardtab() just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format)copy to the clipboard and then specify the width of each field (in the example below thefirst variable is 5 columns the second is 2 columns the next 5 are 1 column the last 4 are3 columns)

gt mydata lt- readclipboardfwf(widths=c(52rep(15)rep(34))

32 Basic descriptive statistics

Once the data are read in then describe or describeBy will provide basic descriptivestatistics arranged in a data frame format Consider the data set satact which in-cludes data from 700 web based participants on 3 demographic variables and 3 abilitymeasures

describe reports means standard deviations medians min max range skew kurtosisand standard errors for integer or real data Non-numeric data although the statisticsare meaningless will be treated as if numeric (based upon the categorical coding ofthe data) and will be flagged with an

describeBy reports descriptive statistics broken down by some categorizing variable (eggender age etc)

gt library(psych)

gt data(satact)

gt describe(satact) basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59
          kurtosis   se
gender       -1.62 0.02
education    -0.07 0.05
age           2.42 0.36
ACT           0.53 0.18
SATV          0.33 4.27
SATQ         -0.02 4.41

These data may then be analyzed by groups defined in a logical statement or by some other variable. E.g., break down the descriptive data for males or females. These descriptive data can also be seen graphically using the error.bars.by function (Figure 5). By setting skew=FALSE and ranges=FALSE, the output is limited to the most basic statistics.


> #basic descriptive statistics by a grouping variable
> describeBy(sat.act,sat.act$gender,skew=FALSE,ranges=FALSE)

$`1`
          vars   n   mean     sd   se
gender       1 247   1.00   0.00 0.00
education    2 247   3.00   1.54 0.10
age          3 247  25.86   9.74 0.62
ACT          4 247  28.79   5.06 0.32
SATV         5 247 615.11 114.16 7.26
SATQ         6 245 635.87 116.02 7.41

$`2`
          vars   n   mean     sd   se
gender       1 453   2.00   0.00 0.00
education    2 453   3.26   1.35 0.06
age          3 453  25.45   9.37 0.44
ACT          4 453  28.42   4.69 0.22
SATV         5 453 610.66 112.31 5.28
SATQ         6 442 596.00 113.07 5.38

attr(,"call")
by.data.frame(data = x, INDICES = group, FUN = describe, type = type,
    skew = FALSE, ranges = FALSE)

The output from the describeBy function can be forced into a matrix form for easy analysis by other programs. In addition, describeBy can group by several grouping variables at the same time.

> sa.mat <- describeBy(sat.act,list(sat.act$gender,sat.act$education),
+        skew=FALSE,ranges=FALSE,mat=TRUE)
> headTail(sa.mat)

        item group1 group2 vars   n   mean     sd    se
gender1    1      1      0    1  27      1      0     0
gender2    2      2      0    1  30      2      0     0
gender3    3      1      1    1  20      1      0     0
gender4    4      2      1    1  25      2      0     0
...      ...   <NA>   <NA>  ... ...    ...    ...   ...
SATQ9     69      1      4    6  51  635.9 104.12 14.58
SATQ10    70      2      4    6  86 597.59 106.24 11.46
SATQ11    71      1      5    6  46 657.83  89.61 13.21
SATQ12    72      2      5    6  93 606.72 105.55 10.95

3.2.1 Outlier detection using outlier

One way to detect unusual data is to consider how far each data point is from the multivariate centroid of the data. That is, find the squared Mahalanobis distance for each data point and then compare these to the expected values of χ2. This produces a Q-Q (quantile-quantile) plot with the n most extreme data points labeled (Figure 1). The outlier values are in the vector d2.


> png( 'outlier.png' )
> d2 <- outlier(sat.act,cex=.8)
> dev.off()

null device

1

Figure 1: Using the outlier function to graphically show outliers. The y axis is the Mahalanobis D2, the X axis is the distribution of χ2 for the same number of degrees of freedom. The outliers detected here may be shown graphically using pairs.panels (see Figure 2), and may be found by sorting d2.


3.2.2 Basic data cleaning using scrub

If, after describing the data, it is apparent that there were data entry errors that need to be globally replaced with NA, or that only certain ranges of data will be analyzed, the data can be "cleaned" using the scrub function.

Consider a data set of 12 rows of 10 columns with values from 1 - 120 (matching the output below). All values of columns 3 - 5 that are less than 30, 40, or 50, respectively, or greater than 70 in any of the three columns, will be replaced with NA. In addition, any value exactly equal to 45 will be set to NA (max and isvalue are set to one value here, but they could be a different value for every column).

> x <- matrix(1:120,ncol=10,byrow=TRUE)
> colnames(x) <- paste("V",1:10,sep="")
> new.x <- scrub(x,3:5,min=c(30,40,50),max=70,isvalue=45,newvalue=NA)
> new.x

       V1  V2 V3 V4 V5  V6  V7  V8  V9 V10
 [1,]   1   2 NA NA NA   6   7   8   9  10
 [2,]  11  12 NA NA NA  16  17  18  19  20
 [3,]  21  22 NA NA NA  26  27  28  29  30
 [4,]  31  32 33 NA NA  36  37  38  39  40
 [5,]  41  42 43 44 NA  46  47  48  49  50
 [6,]  51  52 53 54 55  56  57  58  59  60
 [7,]  61  62 63 64 65  66  67  68  69  70
 [8,]  71  72 NA NA NA  76  77  78  79  80
 [9,]  81  82 NA NA NA  86  87  88  89  90
[10,]  91  92 NA NA NA  96  97  98  99 100
[11,] 101 102 NA NA NA 106 107 108 109 110
[12,] 111 112 NA NA NA 116 117 118 119 120

Note that the number of subjects for those columns has decreased, and the minimums have gone up but the maximums down. Data cleaning and examination for outliers should be a routine part of any data analysis.

3.2.3 Recoding categorical variables into dummy coded variables

Sometimes categorical variables (e.g., college major, occupation, ethnicity) are to be analyzed using correlation or regression. To do this, one can form "dummy codes", which are merely binary variables for each category. This may be done using dummy.code (see the sketch below). Subsequent analyses using these dummy coded variables may use biserial or point biserial (regular Pearson r) correlations to show effect sizes, and may be plotted in, e.g., spider plots.
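A minimal sketch of this (the categories below are invented purely for illustration):

> major <- c("biology","psychology","english","psychology","biology")
> dummy.code(major)   #one binary (0/1) column per category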

3.3 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries. By default, error.bars.by and error.bars will show "cats eyes" to graphically show the confidence limits (Figure 5). This may be turned off by specifying eyes=FALSE. densityBy or violinBy may be used to show the distribution of the data in "violin" plots (Figure 4). (These are sometimes called "lava-lamp" plots.)

3.3.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 2). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (See Figure 2 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic. However, in this figure, to show the outliers, we use colors and a larger plot character.

Another example of pairs.panels is to show differences between experimental groups. Consider the data in the affect data set. The scores reflect post test scores on positive and negative affect and energetic and tense arousal. The colors show the results for four movie conditions: depressing, frightening movie, neutral, and a comedy.

3.3.2 Density or violin plots

Graphical presentation of data may be shown using box plots to show the median and 25th and 75th percentiles. A powerful alternative is to show the density distribution using the violinBy function (Figure 4).


> png( 'pairspanels.png' )
> sat.d2 <- data.frame(sat.act,d2)   #combine the d2 statistics from before with the sat.act data.frame
> pairs.panels(sat.d2,bg=c("yellow","blue")[(d2 > 25)+1],pch=21)
> dev.off()

null device

1

Figure 2: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. If the plot character were set to a period (pch='.') it would make a cleaner graphic, but in order to show the outliers in color we use the plot characters 21 and 22.


> png('affect.png')
> pairs.panels(affect[14:17],bg=c("red","black","white","blue")[affect$Film],pch=21,
+      main="Affect varies by movies ")
> dev.off()

null device

1

Figure 3: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. The coloring represents four different movie conditions.


> data(sat.act)
> violinBy(sat.act[5:6],sat.act$gender,grp.name=c("M","F"),main="Density Plot by gender for SAT V and Q")


Figure 4: Using the violinBy function to show the distribution of SAT V and Q for males and females. The plot shows the medians, and the 25th and 75th percentiles, as well as the entire range and the density distribution.


3.3.3 Means and error bars

Additional descriptive graphics include the ability to draw error bars on sets of data, as well as to draw error bars in both the x and y directions for paired data. These are the functions error.bars, error.bars.by, error.bars.tab, and error.crosses.

error.bars show the 95% confidence intervals for each variable in a data frame or matrix. These errors are based upon normal theory and the standard errors of the mean. Alternative options include +/- one standard deviation or 1 standard error. If the data are repeated measures, the error bars will reflect the between variable correlations. By default, the confidence intervals are displayed using a "cats eyes" plot which emphasizes the distribution of confidence within the confidence interval.

error.bars.by does the same, but grouping the data by some condition.

error.bars.tab draws bar graphs from tabular data with error bars based upon the standard error of a proportion (σp = √(pq/N)).

error.crosses draw the confidence intervals for an x set and a y set of the same size.

The use of the error.bars.by function allows for graphic comparisons of different groups (see Figure 5). Five personality measures are shown as a function of high versus low scores on a "lie" scale. People with higher lie scores tend to report being more agreeable, conscientious and less neurotic than people with lower lie scores. The error bars are based upon normal theory and thus are symmetric rather than reflect any skewing in the data.

Although not recommended, it is possible to use the error.bars function to draw bar graphs with associated error bars. This kind of dynamite plot (Figure 7) can be very misleading in that the scale is arbitrary. Go to a discussion of the problems in presenting data this way at http://emdbolker.wikidot.com/blog:dynamite. In the example shown, note that the graph starts at 0, although 0 is out of the range of the data. This is a function of using bars, which always are assumed to start at zero. Consider other ways of showing your data.

3.3.4 Error bars for tabular data

However, it is sometimes useful to show error bars for tabular data, either found by the table function or just directly input. These may be found using the error.bars.tab function.


> data(epi.bfi)
> error.bars.by(epi.bfi[6:10],epi.bfi$epilie<4)


Figure 5: Using the error.bars.by function shows that self reported personality scales on the Big Five Inventory vary as a function of the Lie scale on the EPI. The "cats eyes" show the distribution of the confidence.


> error.bars.by(sat.act[5:6],sat.act$gender,bars=TRUE,
+       labels=c("Male","Female"),ylab="SAT score",xlab="")


Figure 6: A "Dynamite plot" of SAT scores as a function of gender is one way of misleading the reader. By using a bar graph, the range of scores is ignored. Bar graphs start from 0.


> T <- with(sat.act,table(gender,education))
> rownames(T) <- c("M","F")
> error.bars.tab(T,way="both",ylab="Proportion of Education Level",xlab="Level of Education",
+       main="Proportion of sample by education level")


Figure 7: The proportion of each education level that is Male or Female. By using the way="both" option, the percentages and errors are based upon the grand total. Alternatively, way="columns" finds column wise percentages, and way="rows" finds rowwise percentages. The data can be converted to percentages (as shown) or plotted by total count (raw=TRUE). The function invisibly returns the probabilities and standard errors. See the help menu for an example of entering the data as a data.frame.


3.3.5 Two dimensional displays of means and errors

Yet another way to display data for different conditions is to use the errorCircles function. For instance, the effect of various movies on both "Energetic Arousal" and "Tense Arousal" can be seen in one graph and compared to the same movie manipulations on "Positive Affect" and "Negative Affect". Note how Energetic Arousal is increased by three of the movie manipulations, but that Positive Affect increases following the Happy movie only.


> op <- par(mfrow=c(1,2))
> data(affect)
> colors <- c("black","red","white","blue")
> films <- c("Sad","Horror","Neutral","Happy")
> affect.stats <- errorCircles("EA2","TA2",data=affect[-c(1,20)],group="Film",labels=films,
+       xlab="Energetic Arousal", ylab="Tense Arousal",ylim=c(10,22),xlim=c(8,20),pch=16,
+       cex=2,colors=colors, main ="Movies effect on arousal")
> errorCircles("PA2","NA2",data=affect.stats,labels=films,xlab="Positive Affect",
+       ylab="Negative Affect", pch=16,cex=2,colors=colors, main ="Movies effect on affect")
> op <- par(mfrow=c(1,1))


Figure 8: The use of the errorCircles function allows for two dimensional displays of means and error bars. The first call to errorCircles finds descriptive statistics for the affect data.frame based upon the grouping variable of Film. These data are returned and then used by the second call, which examines the effect of the same grouping variable upon different measures. The size of the circles represent the relative sample sizes for each group. The data are from the PMC lab and reported in Smillie et al. (2012).


3.3.6 Back to back histograms

The bibars function summarizes the characteristics of two groups (e.g., males and females) on a second variable (e.g., age) by drawing back to back histograms (see Figure 9).

> data(bfi)
> with(bfi,bibars(age,gender,ylab="Age",main="Age by males and females"))


Figure 9: A bar plot of the age distribution for males and females shows the use of bibars. The data are males and females from 2800 cases collected using the SAPA procedure and are available as part of the bfi data set.


3.3.7 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act,sat.act$gender==2)
> male <- subset(sat.act,sat.act$gender==1)

> lower <- lowerCor(male[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower,upper)
> round(both,2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal.


> diffs <- lowerUpper(lower,upper,diff=TRUE)
> round(diffs,2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

3.3.8 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 10), or a simulated data set of 24 variables with a circumplex structure (Figure 11). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

Yet another way to show structure is to use "spider" plots. Particularly if variables are ordered in some meaningful way (e.g., in a circumplex), a spider plot will show this structure easily. This is just a plot of the magnitude of the correlation as a radial line, with length ranging from 0 (for a correlation of -1) to 1 (for a correlation of 1). (See Figure 12.)

3.4 Testing correlations

Correlations are wonderful descriptive statistics of the data, but some people like to test whether these correlations differ from zero or differ from each other. The cor.test function (in the stats package) will test the significance of a single correlation, and the rcorr function in the Hmisc package will do this for many correlations. In the psych package, the corr.test function reports the correlation (Pearson, Spearman, or Kendall) between all variables in either one or two data frames or matrices, as well as the number of observations for each case and the (two-tailed) probability for each correlation. Unfortunately, these probability values have not been corrected for multiple comparisons and so should be taken with a great deal of salt. Thus, in corr.test and corr.p the raw probabilities are reported below the diagonal and the probabilities adjusted for multiple comparisons using (by default) the Holm correction are reported above the diagonal (Table 1). (See the p.adjust function for a discussion of Holm (1979) and other corrections.)

Testing the difference between any two correlations can be done using the r.test function.


> png('corplot.png')
> corPlot(Thurstone,numbers=TRUE,upper=FALSE,diag=FALSE,main="9 cognitive variables from Thurstone")
> dev.off()

null device

1

Figure 10: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well. By default, the complete matrix is shown. Setting upper=FALSE and diag=FALSE shows a cleaner figure.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ,main='24 variables in a circumplex')
> dev.off()

null device

1

Figure 11: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect. For circumplex structures, it is perhaps useful to show the complete matrix.


> png('spider.png')
> op <- par(mfrow=c(2,2))
> spider(y=c(1,6,12,18),x=1:24,data=r.circ,fill=TRUE,main='Spider plot of 24 circumplex variables')
> op <- par(mfrow=c(1,1))
> dev.off()

null device

1

Figure 12: A spider plot can show circumplex structure very clearly. Circumplex structures are common in the study of affect.


Table 1: The corr.test function reports correlations, cell sizes, and raw and adjusted probability values. corr.p reports the probability values for a correlation matrix. By default, the adjustment used is that of Holm (1979).

> corr.test(sat.act)

Call:corr.test(x = sat.act)

Correlation matrix
          gender education   age   ACT  SATV  SATQ
gender      1.00      0.09 -0.02 -0.04 -0.02 -0.17
education   0.09      1.00  0.55  0.15  0.05  0.03
age        -0.02      0.55  1.00  0.11 -0.04 -0.03
ACT        -0.04      0.15  0.11  1.00  0.56  0.59
SATV       -0.02      0.05 -0.04  0.56  1.00  0.64
SATQ       -0.17      0.03 -0.03  0.59  0.64  1.00
Sample Size
          gender education age ACT SATV SATQ
gender       700       700 700 700  700  687
education    700       700 700 700  700  687
age          700       700 700 700  700  687
ACT          700       700 700 700  700  687
SATV         700       700 700 700  700  687
SATQ         687       687 687 687  687  687
Probability values (Entries above the diagonal are adjusted for multiple tests.)
          gender education  age  ACT SATV SATQ
gender      0.00      0.17 1.00 1.00    1    0
education   0.02      0.00 0.00 0.00    1    1
age         0.58      0.00 0.00 0.03    1    1
ACT         0.33      0.00 0.00 0.00    0    0
SATV        0.62      0.22 0.26 0.00    0    0
SATQ        0.00      0.36 0.37 0.00    0    0

To see confidence intervals of the correlations, print with the short=FALSE option.


The function actually does four different tests (based upon an article by Steiger (1980)), depending upon the input:

1) For a sample size n, find the t and p value for a single correlation as well as the confidence interval.

> r.test(50,.3)

Correlation tests
Call:r.test(n = 50, r12 = 0.3)
Test of significance of a correlation
 t value 2.18    with probability < 0.034
 and confidence interval 0.02   0.53

2) For sample sizes of n and n2 (n2 = n if not specified), find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores.

> r.test(30,.4,.6)

Correlation tests
Call:r.test(n = 30, r12 = 0.4, r34 = 0.6)
Test of difference between two independent correlations
 z value 0.99    with probability  0.32

3) For sample size n, and correlations ra = r12, rb = r23, and r13 specified, test for the difference of two dependent correlations (Steiger case A).

> r.test(103,.4,.5,.1)

Correlation tests
Call:[1] "r.test(n = 103, r12 = 0.4, r23 = 0.1, r13 = 0.5)"
Test of difference between two correlated correlations
 t value -0.89   with probability < 0.37

4) For sample size n, test for the difference between two dependent correlations involving different variables (Steiger case B).

> r.test(103,.5,.6,.7,.5,.5,.8)   #Steiger case B

Correlation tests
Call:r.test(n = 103, r12 = 0.5, r34 = 0.6, r23 = 0.7, r13 = 0.5, r14 = 0.5,
    r24 = 0.8)
Test of difference between two dependent correlations
 z value -1.2    with probability  0.23

To test whether a matrix of correlations differs from what would be expected if the population correlations were all zero, the function cortest follows Steiger (1980), who pointed out that the sum of the squared elements of a correlation matrix, or the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix). This is particularly useful for examining whether correlations in a single matrix differ from zero or for comparing two matrices. Although obvious, cortest can be used to test whether the sat.act data matrix produces non-zero correlations (it does). This is a much more appropriate test when testing whether a residual matrix differs from zero.

> cortest(sat.act)


Tests of correlation matrices
Call:cortest(R1 = sat.act)
 Chi Square value 1325.42  with df =  15   with probability < 1.8e-273

3.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process (Figure 13). This is also shown in terms of dichotomizing the bivariate normal density function using the draw.cor function (Figure 14). A simple generalization of this to the case of multiple cuts is the polychoric correlation.
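A minimal sketch of the φ versus tetrachoric contrast (the cut point used to dichotomize the two bfi items is arbitrary and purely for illustration):

> data(bfi)
> d <- (bfi[c("A4","C4")] > 3) + 0   #dichotomize two polytomous items at an arbitrary cut
> cor(d,use="pairwise")              #the phi coefficient on the observed 0/1 values
> tetrachoric(d)                     #the (larger) estimated latent correlation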

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
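A minimal sketch, assuming the bfi data set (items A1-A5 are polytomous, gender is dichotomous, age is continuous); in the version of psych current when this was written the three variable types are passed separately, while newer releases (where the function is renamed mixedCor) can also detect the types automatically:

> data(bfi)
> r.mixed <- mixed.cor(x = bfi["age"], p = bfi[1:5], d = bfi["gender"])
> round(r.mixed$rho,2)   #the combined correlation matrix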

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will sometimes happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
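A minimal sketch of the repair, using the burt data set just mentioned:

> data(burt)
> round(eigen(burt)$values,3)   #note the (slightly) negative last eigen value
> burt.s <- cor.smooth(burt)    #adjust the eigen values and rescale
> round(eigen(burt.s)$values,3) #all eigen values are now positive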

3.6 Multiple regression from data or correlation matrices

The typical application of the lm function is to do a linear model of one Y variable as a function of multiple X variables. Because lm is designed to analyze complex interactions, it requires raw data as input. It is, however, sometimes convenient to do multiple regression from a correlation or covariance matrix. The setCor function will do this, taking a set of y variables predicted from a set of x variables, perhaps with a set of z covariates removed from both x and y. Consider the Thurstone correlation matrix and find the multiple correlation of the last five variables as a function of the first 4.

> setCor(y = 5:9,x=1:4,data=Thurstone)


> draw.tetra()


Figure 13: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values.


> draw.cor(expand=20,cuts=c(0,0))


Figure 14: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values. It is found (laboriously) by optimizing the fit of the bivariate normal for various values of the correlation to the observed cell frequencies.


Call: setCor(y = 5:9, x = 1:4, data = Thurstone)

Multiple Regression from matrix input

 Beta weights
                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sentences                    0.09     0.07          0.25      0.21         0.20
Vocabulary                   0.09     0.17          0.09      0.16        -0.02
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31

 Multiple R
Four.Letter.Words     Suffixes  Letter.Series     Pedigrees  Letter.Group
             0.69         0.63           0.50          0.58          0.48

 Multiple R2
Four.Letter.Words     Suffixes  Letter.Series     Pedigrees  Letter.Group
             0.48         0.40           0.25          0.34          0.23

 Unweighted multiple R
Four.Letter.Words     Suffixes  Letter.Series     Pedigrees  Letter.Group
             0.59         0.58           0.49          0.58          0.45

 Unweighted multiple R2
Four.Letter.Words     Suffixes  Letter.Series     Pedigrees  Letter.Group
             0.34         0.34           0.24          0.33          0.20

 Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.6280 0.1478 0.0076 0.0049
Average squared canonical correlation =  0.2
Cohen's Set Correlation R2  =  0.69
Unweighted correlation between the two sets =  0.73

By specifying the number of subjects in the correlation matrix, appropriate estimates of standard errors, t-values, and probabilities are also found. The next example finds the regressions with variables 1 and 2 used as covariates. The β weights for variables 3 and 4 do not change, but the multiple correlation is much less. It also shows how to find the residual correlations between variables 5-9 with variables 1-4 removed.

> sc <- setCor(y = 5:9,x=3:4,data=Thurstone,z=1:2)

Call: setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)

Multiple Regression from matrix input

 Beta weights
                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31

 Multiple R
Four.Letter.Words     Suffixes  Letter.Series     Pedigrees  Letter.Group
             0.58         0.46           0.21          0.18          0.30

 Multiple R2
Four.Letter.Words     Suffixes  Letter.Series     Pedigrees  Letter.Group
            0.331        0.210          0.043         0.032         0.092

 Unweighted multiple R
Four.Letter.Words     Suffixes  Letter.Series     Pedigrees  Letter.Group
             0.44         0.35           0.17          0.14          0.26

 Unweighted multiple R2
Four.Letter.Words     Suffixes  Letter.Series     Pedigrees  Letter.Group
             0.19         0.12           0.03          0.02          0.07

 Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.405 0.023
Average squared canonical correlation =  0.21
Cohen's Set Correlation R2  =  0.42
Unweighted correlation between the two sets =  0.48

> round(sc$residual,2)

                  Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Four.Letter.Words              0.52     0.11          0.09      0.06         0.13
Suffixes                       0.11     0.60         -0.01      0.01         0.03
Letter.Series                  0.09    -0.01          0.75      0.28         0.37
Pedigrees                      0.06     0.01          0.28      0.66         0.20
Letter.Group                   0.13     0.03          0.37      0.20         0.77

3.7 Mediation and Moderation analysis

Although multiple regression is a straightforward method for determining the effect of multiple predictors (x1,2,...i) on a criterion variable y, some prefer to think of the effect of one predictor, x, as mediated by another variable, m (Preacher and Hayes, 2004). Thus, we may find the indirect path from x to m, and then from m to y, as well as the direct path from x to y. Call these paths a, b, and c, respectively. Then the indirect effect of x on y through m is just ab, and the direct effect is c. Statistical tests of the ab effect are best done by bootstrapping.

Consider the example from Preacher and Hayes (2004) as analyzed using the mediate function and the subsequent graphic from mediate.diagram. The data are found in the example for mediate.

Call: mediate(y = 1, x = 2, m = 3, data = sobel)

The DV (Y) was SATIS. The IV (X) was THERAPY. The mediating variable(s) = ATTRIB.

Total Direct effect(c) of THERAPY on SATIS = 0.76   S.E. = 0.31  t direct = 2.5  with probability = 0.019
Direct effect (c') of THERAPY on SATIS removing ATTRIB = 0.43   S.E. = 0.32  t direct = 1.35  with probability = 0.19
Indirect effect (ab) of THERAPY on SATIS through ATTRIB = 0.33
Mean bootstrapped indirect effect = 0.33 with standard error = 0.17  Lower CI = 0.04   Upper CI = 0.71
R2 of model = 0.31

 To see the longer output, specify short = FALSE in the print statement

 Full output

 Total effect estimates (c)
        SATIS   se   t   Prob
THERAPY  0.76 0.31 2.5 0.0186

 Direct effect estimates (c')
        SATIS   se    t  Prob
THERAPY  0.43 0.32 1.35 0.190
ATTRIB   0.40 0.18 2.23 0.034

 'a' effect estimates
       THERAPY  se    t   Prob
ATTRIB    0.82 0.3 2.74 0.0106

 'b' effect estimates
       SATIS   se    t  Prob
ATTRIB   0.4 0.18 2.23 0.034

 'ab' effect estimates
        SATIS boot   sd lower upper
THERAPY  0.33 0.33 0.17  0.04  0.71

4 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.


> mediate.diagram(preacher)


Figure 15: A mediated model taken from Preacher and Hayes (2004) and solved using the mediate function. The direct path from Therapy to Satisfaction has an effect of 0.76, while the indirect path through Attribution has an effect of 0.33. Compare this to the normal regression graphic created by setCor.diagram.


> preacher <- setCor(1,c(2,3),sobel,std=FALSE)
> setCor.diagram(preacher)


Figure 16: The conventional regression model for the Preacher and Hayes (2004) data set solved using the setCor function. Compare this to the previous figure.


4.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain, or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will be eventually phased out.

fa.poly is useful when finding the factor structure of categorical items. fa.poly first finds the tetrachoric or polychoric correlations between the categorical variables and then proceeds to do a normal factor analysis. By setting the n.iter option to be greater than 1, it will also find confidence intervals for the factor solution. Warning: Finding polychoric correlations is very slow, so think carefully before doing so.

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

factor.minres (deprecated) Minimum residual factor analysis is a least squares, iterative solution to the factor problem. minres attempts to minimize the residual (off-diagonal) correlation matrix. It produces solutions similar to maximum likelihood solutions, but will work even if the matrix is singular.

factor.pa (deprecated) Principal Axis factor analysis is a least squares, iterative solution to the factor problem. PA will work for cases where maximum likelihood techniques (factanal) will not work. The original communality estimates are either the squared multiple correlations (smc) for each item or 1.

factor.wls (deprecated) Weighted least squares factor analysis is a least squares, iterative solution to the factor problem. It minimizes the (weighted) squared residual matrix. The weights are based upon the independent contribution of each variable.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values. Note that PCA is not the same as factor analysis and the two should not be confused.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings (see the short sketch following this list).

vss (Very Simple Structure; Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

nfactors combines VSS, MAP, and a number of other fit statistics. The depressing reality is that frequently these conventional fit estimates of the number of factors do not agree.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
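The sketch promised above: comparing the minres and principal axis solutions of the Thurstone data with factor.congruence (the choice of these two methods is just for illustration):

> f3.min <- fa(Thurstone,3,n.obs=213)          #minimum residual (the default)
> f3.pa  <- fa(Thurstone,3,n.obs=213,fm="pa")  #principal axis
> factor.congruence(f3.min,f3.pa)              #cosines between the two sets of loadings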

4.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix nRn be approximated by the product of a factor matrix nFk and its transpose, plus a diagonal matrix of uniqueness:

R = FF' + U2   (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
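A minimal sketch of equation 1, using an unrotated solution so that the model implied matrix is simply FF' plus the diagonal matrix of uniquenesses:

> f3 <- fa(Thurstone,3,rotate="none",n.obs=213)
> R.hat <- f3$loadings %*% t(f3$loadings) + diag(f3$uniquenesses)  #FF' + U^2
> round(Thurstone - R.hat,2)   #off diagonal residuals are near zero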

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 2). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", "varimin" and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ" and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

Note that if extracting more than one factor and doing any oblique rotation, it is necessary to have the GPArotation package installed. This is checked for in the appropriate functions.

4.1.2 Principal Axis Factor Analysis

An alternative least squares algorithm (included in fa with the fm="pa" option, or as a standalone function (factor.pa)) does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
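A minimal sketch of that last point (assuming the Thurstone data used throughout this section):

> f3.pa1 <- fa(Thurstone,3,n.obs=213,fm="pa",max.iter=1)   #one iteration matches the SAS result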

The solutions from the fa, the factor.minres and factor.pa, as well as the principal functions, can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax, quartimax, varimin, and bifactor. Oblique transformations include oblimin, quartimin, biquartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 3).

4.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 4).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 17). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 17) or by a structure diagram (Figure 18).

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution, and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 2: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> if(!require(GPArotation)) {stop("GPArotation must be installed to do rotations")} else {
+ f3t <- fa(Thurstone,3,n.obs=213)
+ f3t }

Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    MR1   MR2   MR3   h2   u2 com
Sentences          0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary         0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion    0.83  0.04  0.00 0.73 0.27 1.0
First.Letters      0.00  0.86  0.00 0.73 0.27 1.0
Four.Letter.Words -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes           0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series      0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees          0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group      -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability =  1.027
RMSEA index =  0  and the 90% confidence intervals are NA NA
BIC =  -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> if(!require(GPArotation)) {stop("GPArotation must be installed to do rotations")} else {
+ f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
+ f3o <- target.rot(f3)
+ f3o }

Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                    PA1   PA2   PA3   h2   u2
Sentences          0.89 -0.03  0.07 0.81 0.19
Vocabulary         0.89  0.07  0.00 0.80 0.20
Sent.Completion    0.83  0.04  0.03 0.70 0.30
First.Letters     -0.02  0.85 -0.01 0.73 0.27
Four.Letter.Words -0.05  0.74  0.09 0.57 0.43
Suffixes           0.17  0.63 -0.09 0.43 0.57
Letter.Series     -0.06 -0.08  0.84 0.69 0.31
Pedigrees          0.33 -0.09  0.48 0.37 0.63
Letter.Group      -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


Table 4: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)

Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                    WLS1   WLS2   WLS3    h2    u2  com
Sentences          0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary         0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion    0.833  0.034  0.007 0.735 0.265 1.00
First.Letters     -0.002  0.855  0.003 0.731 0.269 1.00
Four.Letter.Words -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes           0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series      0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees          0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group      -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability =  1.0264
RMSEA index =  0  and the 90% confidence intervals are NA NA
BIC =  -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                                WLS1  WLS2  WLS3
Correlation of scores with factors             0.964 0.923 0.902
Multiple R square of scores with factors       0.929 0.853 0.814
Minimum correlation of possible factor scores  0.858 0.706 0.627


> plot(f3t)


Figure 17: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.


> fa.diagram(f3t)


Figure 18: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

4.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Some psychologists use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 5 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.

4.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 18) can themselves be factored. This results in a higher order factor model (Figure 19). An alternative solution is to take this higher order model and then solve for the general factor loadings, as well as the loadings on the residualized lower order factors, using the Schmid-Leiman procedure (Figure 20). Yet another solution is to use structural equation modeling to directly solve for the general and group factors.


Table 5: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 3. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p

Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    RC1   RC2   RC3   h2   u2 com
Sentences          0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary         0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion    0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters      0.01  0.84  0.07 0.78 0.22 1.0
Four.Letter.Words -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes           0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series      0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees          0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group      -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))


Figure 19: A higher order factor solution to the Thurstone 9 variable problem.


> om <- omega(Thurstone,n.obs=213)


Figure 20: A bifactor solution to the Thurstone 9 variable problem.


Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011).

4.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxonmetric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) reviews why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g., correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.


iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 21). Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 8). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])


Figure 21: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.


Table 6: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.

> summary(ic)  # show the results

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


The previous analysis (Figure 21) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 22). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (rpoly$rho). When using the console for input, polychoric will report on its progress while working using progressBar.

Table 7: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> rpoly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

4.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 9). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

Bootstrapped confidence intervals can also be found for the loadings of a factoring of a polychoric matrix. fa.poly will find the polychoric correlation matrix and, if the n.iter option is greater than 1, will then randomly resample the data (case wise) to give bootstrapped samples. This will take a long time for a large number of items or iterations.
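A minimal sketch (the iteration count is kept very small here purely for speed; real analyses would use many more):

> fp <- fa.poly(bfi[1:10], 2, n.iter = 5)  # factor a polychoric matrix with 5 bootstrap resamples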

4.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of Burt's congruence coefficient (also known as Tucker's


> ic.poly <- iclust(rpoly$rho, title = "ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)


Figure 22: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 21), which was done using Pearson correlations.


> ic.poly <- iclust(rpoly$rho, 5, title = "ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)


Figure 23: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 22), which was done without specifying the number of clusters, and to the next one (Figure 24), which was done by changing the beta criterion.


> ic.poly <- iclust(rpoly$rho, beta.size = 3, title = "ICLUST beta.size=3")


Figure 24: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 21, 22, 23).


Table 8: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic, cut = .3)

ICLUST (Item Cluster Analysis)

Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
   O   P   C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15  0.54
C2 C15 C15  0.62
C3 C15 C15  0.54
C4 C15 C15  0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59 -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50  0.40 -0.32
N1 C16 C16  0.76
N2 C16 C16  0.75
N3 C16 C16  0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16  0.55
O1 C20 C21 -0.53
O2 C21 C21  0.44
O3 C20 C21  0.39 -0.62
O4 C21 C21 -0.33
O5 C21 C21  0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit =  0.68   Pattern fit =  0.96   RMSR =  0.05
NULL


Table 9: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter = 20)

Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with Likelihood Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.006 and the 90 % confidence intervals are 0.006 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98

Measures of factor score adequacy
                                           MR2  MR1
Correlation of scores with factors        0.86 0.88
Multiple R square of scores with factors  0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.44 -0.40 -0.37
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.07 -0.03  0.00  0.73  0.77  0.80
A4  0.10  0.15  0.20  0.40  0.44  0.48
A5 -0.02  0.02  0.06  0.57  0.62  0.67
C1  0.54  0.57  0.60 -0.09 -0.05 -0.02
C2  0.58  0.62  0.67 -0.04 -0.01  0.02
C3  0.48  0.54  0.58  0.00  0.03  0.08
C4 -0.69 -0.66 -0.60 -0.03  0.01  0.04
C5 -0.62 -0.57 -0.52 -0.08 -0.05 -0.02

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.27     0.32  0.37
>


coefficient), which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}.$$

Consider the case of a four factor solution and four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25], 4, fm = "pa")
> factor.congruence(f4, ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 10).

Table 10: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))

     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2   RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99  0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01  0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58  0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56  0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06  0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00  0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05  1.00

4.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).


Techniques most commonly used include

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples the eigen values of random factors will all tend towards 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
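Several of these criteria may be compared at once. A sketch using the nfactors function (the argument values are illustrative):

> nfactors(bfi[1:25], n = 8)  # compare VSS, MAP, chi square, and BIC for 1 to 8 factors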

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
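A sketch of this idea, assuming (as with the other psych simulation functions) that sim.minor returns the simulated raw scores in its $observed element:

> set.seed(42)
> minor <- sim.minor(nvar = 12, nfact = 3, n = 500)  # 3 major factors plus minor factors; $observed assumed
> fa.parallel(minor$observed)                        # does the criterion overextract?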


4.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 25). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")


Figure 25: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit  RMSEA  BIC   SABIC complex
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.0150 9648 10522.1     1.0
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.0100 5287  6084.5     1.2
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.0075 3200  3924.3     1.3
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.0055 1731  2385.1     1.4
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.0030  281   869.3     1.6
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.0018 -296   228.5     1.7
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.0013 -463     1.2     1.9
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.0010 -524  -117.6     1.9

  eChisq  SRMR eCRMS  eBIC
1  23881 0.119 0.125 21698
2  12432 0.086 0.094 10440
3   7232 0.066 0.075  5422
4   3750 0.047 0.057  2115
5   1495 0.030 0.038    27
6    670 0.020 0.027  -639
7    448 0.016 0.023  -711
8    289 0.013 0.020  -727

4.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 26). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly. By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
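For example (a sketch; with many polytomous items this can be quite slow):

> pa.poly <- fa.parallel.poly(bfi[1:25], n.iter = 20)  # parallel analysis of polychoric correlations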


> fa.parallel(bfi[1:25], main = "Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6


Figure 26: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors, and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


4.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by (Dwyer, 1937). Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 27).

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).

4.6 Exploratory Structural Equation Modeling (ESEM)

Generalizing the procedures of factor extension, we can do Exploratory Structural Equation Modeling (ESEM). Traditional Exploratory Factor Analysis (EFA) examines how latent variables can account for the correlations within a data set. All loadings and cross loadings are found and rotation is done to some approximation of simple structure. Traditional Confirmatory Factor Analysis (CFA) tests such models by fitting just a limited number of loadings and typically does not allow any (or many) cross loadings. Structural Equation Modeling then applies two such measurement models, one to a set of X variables, another to a set of Y variables, and then tries to estimate the correlation between these two sets of latent variables. (Some SEM procedures estimate all the parameters from the same model, thus making the loadings in set Y affect those in set X.) It is possible to do a similar exploratory modeling (ESEM) by conducting two Exploratory Factor Analyses, one in set X, one in set Y, and then finding the correlations of the X factors with the Y factors, as well as the correlations of the Y variables with the X factors and the X variables with the Y factors.

Consider the simulated data set of three ability variables, two motivational variables, and three outcome variables:
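The simulation code itself is not shown in the output below. A sketch that reproduces the printed population matrix follows; the loading and Phi values are reverse-engineered from the $model matrix shown below and so are assumptions, as are the attached variable names:

> fx <- matrix(c(0.9, 0.8, 0.6, 0, 0,              # X factor 1: ability
+               0, 0, 0.6, 0.8, -0.7), ncol = 2)   # X factor 2: motivation
> rownames(fx) <- c("V", "Q", "A", "nach", "Anx")
> fy <- matrix(c(0.6, 0.5, 0.4), ncol = 1)         # Y factor: outcomes
> rownames(fy) <- c("gpa", "Pre", "MA")
> Phi <- matrix(c(1, 0, 0.7,
+                 0, 1, 0.7,
+                 0.7, 0.7, 1), ncol = 3)          # correlations among the latent variables
> gre.gpa <- sim.structural(fx, Phi, fy)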

Call: sim.structural(fx = fx, Phi = Phi, fy = fy)

$model (Population correlation matrix)
        V    Q     A  nach   Anx   gpa   Pre    MA
V    1.00 0.72  0.54  0.00  0.00  0.38  0.32  0.25
Q    0.72 1.00  0.48  0.00  0.00  0.34  0.28  0.22
A    0.54 0.48  1.00  0.48 -0.42  0.50  0.42  0.34
nach 0.00 0.00  0.48  1.00 -0.56  0.34  0.28  0.22


> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[, s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe = fe)


Figure 27: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


Anx  0.00 0.00 -0.42 -0.56  1.00 -0.29 -0.24 -0.20
gpa  0.38 0.34  0.50  0.34 -0.29  1.00  0.30  0.24
Pre  0.32 0.28  0.42  0.28 -0.24  0.30  1.00  0.20
MA   0.25 0.22  0.34  0.22 -0.20  0.24  0.20  1.00

$reliability (population reliability)
   V    Q    A nach  Anx  gpa  Pre   MA
0.81 0.64 0.72 0.64 0.49 0.36 0.25 0.16

We can fit this by using the esem function and then draw the solution (see Figure 28) using the esem.diagram function (which is normally called automatically by esem).
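The call that produced the following output (taken from its Call line) was:

> esem.ex <- esem(gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
+                 n.obs = 1000, plot = FALSE)
> esem.diagram(esem.ex)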

Exploratory Structural Equation Modeling Analysis using method =  minres
Call: esem(r = gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
    n.obs = 1000, plot = FALSE)

For the X set:
       MR1   MR2
V     0.91 -0.06
Q     0.81 -0.05
A     0.53  0.57
nach -0.10  0.81
Anx   0.08 -0.71

For the Y set:
    MR1
gpa 0.6
Pre 0.5
MA  0.4

Correlations between the X and Y sets
     X1   X2   Y1
X1 1.00 0.19 0.68
X2 0.19 1.00 0.67
Y1 0.68 0.67 1.00

The degrees of freedom for the null model are 56 and the empirical chi square function was 6930.29
The degrees of freedom for the model are 7 and the empirical chi square function was 21.83
with prob < 0.0027

The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.04

with the empirical chi square 21.83 with prob < 0.0027
The total number of observations was 1000 with fitted Chi Square = 2175.06 with prob < 0

Empirical BIC = -26.53
ESABIC = -4.29
Fit based upon off diagonal values = 1

To see the item loadings for the X and Y sets combined, and the associated fa output, print with short=FALSE.


Figure 28: An example of an Exploratory Structure Equation Model.


5 Classical Test Theory and Reliability

Surprisingly, 110 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V}_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha$$

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. Alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum(1 - r_{smc}^2)}{V_x}.$$

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
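As a small check of the λ6 definition (a sketch: for standardized items, Vx is the sum of the elements of the correlation matrix):

> R <- sim.hierarchical()          # a 9 variable population correlation matrix
> 1 - sum(1 - smc(R)) / sum(R)     # Guttman's lambda 6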

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.


More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 … μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

5.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as when one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n = 500, raw = TRUE)$observed
> round(cor(r9), 2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reversed scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that items 2 and 6 (complaints and critical) are to be scored (incorrectly) negatively.

> alpha(attitude, keys = c("complaints", "critical"))

Reliability analysis
Call: alpha(x = attitude, keys = c("complaints", "critical"))

  raw_alpha std.alpha G6(smc) average_r  S/N  ase mean  sd
       0.18      0.27    0.66      0.05 0.37 0.19   53 4.7

 lower alpha upper     95% confidence boundaries
 -0.18  0.18  0.55

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r    S/N alpha se
rating         -0.023     0.090    0.52    0.0162  0.099    0.265
complaints-     0.689     0.666    0.76    0.2496  1.995    0.078
privileges     -0.133     0.021    0.58    0.0036  0.022    0.282
learning       -0.459    -0.251    0.41   -0.0346 -0.201    0.363
raises         -0.178    -0.062    0.47   -0.0098 -0.058    0.295
critical-       0.364     0.473    0.76    0.1299  0.896    0.139
advance        -0.137    -0.016    0.52   -0.0026 -0.016    0.258

 Item statistics
             n  raw.r  std.r r.cor r.drop mean   sd
rating      30  0.600  0.601  0.63   0.28   65 12.2
complaints- 30 -0.542 -0.559 -0.78  -0.75   50 13.3
privileges  30  0.679  0.663  0.57   0.39   53 12.2
learning    30  0.857  0.853  0.90   0.70   56 11.7
raises      30  0.719  0.730  0.77   0.50   65 10.4
critical-   30  0.015  0.036 -0.29  -0.27   42  9.9
advance     30  0.688  0.694  0.66   0.46   43 10.3
>

Note how the reliability of the 7 item scale with the incorrectly reversed items is very poor, but if items 2 and 6 are dropped, then the reliability is improved substantially. This suggests that items 2 and 6 were incorrectly scored. Doing the analysis again with the items positively scored produces much more favorable results.

> alpha(attitude)

Reliability analysis
Call: alpha(x = attitude)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N = 500, short = FALSE, low = -2, high = 2, categorical = TRUE)  # 500 responses to 4 discrete items
> alpha(items$observed)  # item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0


5.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 29) or omegaSem for a confirmatory analysis using the sem package, based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal).

For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be $\sqrt{r_{ab}}$ where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. 2007. To do this in omega, add the option="first" or option="second" to the call.

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness $u_j^2$ from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score $V_x$ into four parts: that due to a general factor $\vec{g}$, that due to a set of group factors $\vec{f}$ (factors common to some but not all of the items), specific factors $\vec{s}$ unique to each item, and $\vec{e}$, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting $\vec{x} = \vec{c}g + \vec{A}\vec{f} + \vec{D}\vec{s} + \vec{e}$, then the communality of item $j$, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$, and the unique variance for the item, $u_j^2 = \sigma_j^2(1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item $j$, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u_j^2}{V_x}$$

Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

$$\omega_h = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{V_x}.$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{inf} = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}.$$

> om9 <- omega(r9, title = "9 simulated variables")


Figure 29: A bifactor solution for 9 simulated variables with a hierarchical structure.

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.


> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max 4.76   max/min = 1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.001 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09
RMSEA index = 0.01 and the 90 % confidence intervals are 0.01 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g   F1   F2   F3
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19


5.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9, n.obs = 500, lavaan = FALSE)

Call: omegaSem(m = r9, n.obs = 500, lavaan = FALSE)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g   F1   F2   F3   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g   F1   F2   F3
2.48 0.52 0.47 0.36

general/max 4.76   max/min = 1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.001 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09
RMSEA index = 0.01 and the 90 % confidence intervals are 0.01 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                g   F1   F2   F3
Omega total for total scores and subscales   0.83 0.79 0.56 0.65
Omega general for total scores and subscales 0.69 0.55 0.30 0.46
Omega group for total scores and subscales   0.12 0.24 0.26 0.19

 The following analyses were done using the sem package

 Omega Hierarchical from a confirmatory model using sem = 0.72
 Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g   F1   F2   F3   h2   u2   p2
V1 0.74 0.40           0.71 0.29 0.77
V2 0.58 0.36           0.47 0.53 0.72
V3 0.56 0.42           0.50 0.50 0.63
V4 0.57      0.45      0.53 0.47 0.61
V5 0.58      0.25      0.40 0.60 0.84
V6 0.43      0.26      0.26 0.74 0.71
V7 0.49           0.38 0.38 0.62 0.63
V8 0.44           0.45 0.39 0.61 0.50
V9 0.34           0.24 0.17 0.83 0.68

With eigenvalues of:
   g   F1   F2   F3
2.60 0.47 0.40 0.33

general/max 5.51   max/min = 1.41
mean percent general = 0.68   with sd = 0.1 and cv of 0.15
Explained Common Variance of the general factor = 0.68

Measures of factor score adequacy
                                                 g    F1    F2    F3
Correlation of scores with factors            0.87  0.59  0.59  0.55
Multiple R square of scores with factors      0.75  0.35  0.34  0.30
Minimum correlation of factor score estimates 0.50 -0.31 -0.31 -0.39

Total, General and Subset omega for each subset
                                                g   F1   F2   F3
Omega total for total scores and subscales   0.83 0.79 0.57 0.65
Omega general for total scores and subscales 0.72 0.57 0.33 0.48
Omega group for total scores and subscales   0.11 0.22 0.24 0.18

To get the standard sem fit statistics, ask for summary on the fitted object.

5.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf and guttman functions. These are described in more detail in Revelle and Zinbarg (2009) and in Revelle and Condon (2014). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72

5.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

5.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function and a list of items to be scored on each scale (a keys.list). Items may be listed by location (convenient, but dangerous) or name (probably safer).

Make a keys.list by specifying the items for each scale, preceding items to be negatively keyed with a - sign.

> #the newer way is probably preferred
>
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+      conscientious=c("C1","C2","C2","-C4","-C5"),
+      extraversion=c("-E1","-E2","E3","E4","E5"),
+      neuroticism=c("N1","N2","N3","N4","N5"),
+      openness = c("O1","-O2","O3","O4","-O5"))
> #this can also be done by location--
> keys.list <- list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+      Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+      Openness = c(21,-22,23,24,-25))
> #These two approaches can be mixed if desired
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),conscientious=c("C1","C2","C2","-C4","-C5"),
+      extraversion=c("-E1","-E2","E3","E4","E5"),
+      neuroticism=c(16:20),openness = c(21,-22,23,24,-25))
> keys.list

$agree
[1] -A1 A2  A3  A4  A5

$conscientious
[1] C1  C2  C2  -C4 -C5

$extraversion
[1] -E1 -E2 E3  E4  E5

$neuroticism
[1] 16 17 18 19 20

$openness
[1] 21 -22 23 24 -25

In the past (prior to version 1.6.9), the keys.list was then converted to a keys matrix using the helper function make.keys. This is no longer necessary. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
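To get totals rather than means, set the totals argument (a minimal sketch):

> total.scores <- scoreItems(keys.list, bfi, totals = TRUE)  # sum scores instead of item averages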

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create the keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening. There are two ways to make up the keys. You can specify the items by location (the old way) or by name (the newer and probably preferred way). To use the newer way you must specify the file on which you will use the keys. The example below shows how to construct keys either way.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. This is done automatically in the new way of forming keys, but if using the older way, the number must be specified. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

Then use this keys list to score the items

> scores <- scoreItems(keys.list, bfi)
> scores

Call: scoreItems(keys = keys.list, items = bfi)

(Unstandardized) Alpha:
      agree conscientious extraversion neuroticism openness
alpha   0.7          0.69         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      agree conscientious extraversion neuroticism openness
ASE   0.014         0.016        0.013       0.011    0.017

Average item correlation:
          agree conscientious extraversion neuroticism openness
average.r  0.32          0.35         0.39        0.46     0.23

 Guttman 6* reliability:
         agree conscientious extraversion neuroticism openness
Lambda.6   0.7          0.69         0.76        0.81      0.6

Signal/Noise based upon av.r:
             agree conscientious extraversion neuroticism openness
Signal/Noise   2.3           2.2          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.70          0.36         0.63      -0.245     0.23
conscientious  0.25          0.69         0.37      -0.334     0.33
extraversion   0.46          0.27         0.76      -0.284     0.32
neuroticism   -0.18         -0.25        -0.22       0.812    -0.12
openness       0.15          0.21         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores (see Figure 30).

5.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations and then use scoreItems:

> r.bfi <- cor(bfi, use = "pairwise")
> scales <- scoreItems(keys.list, r.bfi)
> summary(scales)

Call: scoreItems(keys = keys.list, items = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.71          0.35         0.64      -0.242     0.25
conscientious  0.24          0.69         0.38      -0.314     0.33
extraversion   0.47          0.27         0.76      -0.278     0.35
neuroticism   -0.18         -0.24        -0.22       0.815    -0.11
openness       0.16          0.22         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the

> png("scores.png")
> pairs.panels(scores$scores, pch = ".", jiggle = TRUE)
> dev.off()


Figure 30: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
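A sketch of the cluster.loadings approach (the keys list must first be converted to a keys matrix, e.g., with make.keys; object names are arbitrary):

> keys <- make.keys(r.bfi, keys.list)   # keys matrix from the keys list
> cl <- cluster.loadings(keys, r.bfi)   # structure and pattern matrices for the scales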

5.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys, iqitems)

Call scoremultiplechoice(key = iqkeys data = iqitems)

(Unstandardized) Alpha

[1] 084

Average item correlation

[1] 025

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19

            sd
reason.4  0.48
reason.16 0.46
reason.17 0.46
reason.19 0.49
letter.7  0.49
letter.33 0.50
letter.34 0.49
letter.58 0.50
matrix.45 0.50
matrix.46 0.50
matrix.47 0.49
matrix.55 0.48
rotate.3  0.40
rotate.4  0.41
rotate.6  0.46
rotate.8  0.39

> # just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)  # compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63

            se
reason.4  0.01
reason.16 0.01
reason.17 0.01
reason.19 0.01
letter.7  0.01
letter.33 0.01
letter.34 0.01
letter.58 0.01
matrix.45 0.01
matrix.46 0.01
matrix.47 0.01
matrix.55 0.01
rotate.3  0.01
rotate.4  0.01
rotate.6  0.01
rotate.8  0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
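For example, a minimal sketch (the grouping of the items into two content scales is hypothetical, chosen only to illustrate the call):

# hypothetical keys list for the scored true/false ability items
ability.keys <- list(reasoning=c("reason.4","reason.16","reason.17","reason.19"),
                     rotation=c("rotate.3","rotate.4","rotate.6","rotate.8"))
ability.scales <- scoreItems(ability.keys, iq.tf)
summary(ability.scales)   # alpha and scale intercorrelations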


5.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the set.cor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
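A minimal sketch of that sequence for the five bfi agreeableness items (check.keys=TRUE lets alpha reverse negatively keyed items automatically):

describe(bfi[1:5])                # item level descriptive statistics
fa.parallel(bfi[1:5])             # how many dimensions do the items suggest?
alpha(bfi[1:5], check.keys=TRUE)  # item-whole statistics for a single scale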

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 31).

5.6.1 Exploring the item structure of scales

The Big Five scales found above can be understood in terms of the item - whole correlations, but it is also useful to think of the endorsement frequency of the items. The item.lookup function will sort items by their factor loading/item-whole correlation, and then re-sort those above a certain threshold in terms of the item means. Item content is shown by using the dictionary developed for those items. This allows one to see the structure of each scale in terms of its endorsement range. This is a simple way of thinking of items that is also possible to do using the various IRT approaches discussed later.

> m <- colMeans(bfi, na.rm=TRUE)
> item.lookup(scales$item.corrected[,1:3], m, dictionary=bfi.dictionary[1:2])

     agree conscientious extraversion means ItemLabel
A1   -0.40         -0.06        -0.10  2.41     q_146
E1   -0.31         -0.07        -0.59  2.97     q_712
E2   -0.39         -0.26        -0.70  3.14     q_901
E3    0.45          0.21         0.61  4.00    q_1205
E4    0.51          0.24         0.68  4.42    q_1410
E5    0.35          0.40         0.56  4.42    q_1768
A5    0.62          0.22         0.55  4.56    q_1419
A3    0.70          0.22         0.48  4.60    q_1206
A4    0.49          0.30         0.30  4.70    q_1364
A2    0.67          0.21         0.40  4.80    q_1162
C4   -0.23         -0.67        -0.23  2.55     q_626
N4   -0.22         -0.31        -0.39  3.19    q_1479
C5   -0.26         -0.58        -0.29  3.30    q_1949
C3    0.21          0.56         0.15  4.30     q_619
C2    0.21          0.61         0.18  4.37     q_530
E5.1  0.35          0.40         0.56  4.42    q_1768
C1    0.13          0.54         0.20  4.50     q_124
E1.1 -0.31         -0.07        -0.59  2.97     q_712
E2.1 -0.39         -0.26        -0.70  3.14     q_901
N4.1 -0.22         -0.31        -0.39  3.19    q_1479
E3.1  0.45          0.21         0.61  4.00    q_1205
E4.1  0.51          0.24         0.68  4.42    q_1410


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> # note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  # set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)

[four-panel figure: P(response) for each alternative (0-6) plotted against theta for reason.4, reason.16, reason.17, and reason.19]

Figure 31: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


E5.2  0.35          0.40         0.56  4.42    q_1768
O3    0.26          0.22         0.43  4.44     q_492
A5.1  0.62          0.22         0.55  4.56    q_1419
A3.1  0.70          0.22         0.48  4.60    q_1206
A2.1  0.67          0.21         0.40  4.80    q_1162
O1    0.17          0.21         0.33  4.82     q_128

     Item
A1   Am indifferent to the feelings of others
E1   Don't talk a lot
E2   Find it difficult to approach others
E3   Know how to captivate people
E4   Make friends easily
E5   Take charge
A5   Make people feel at ease
A3   Know how to comfort others
A4   Love children
A2   Inquire about others' well-being
C4   Do things in a half-way manner
N4   Often feel blue
C5   Waste my time
C3   Do things according to a plan
C2   Continue until everything is perfect
E5.1 Take charge
C1   Am exacting in my work
E1.1 Don't talk a lot
E2.1 Find it difficult to approach others
N4.1 Often feel blue
E3.1 Know how to captivate people
E4.1 Make friends easily
E5.2 Take charge
O3   Carry the conversation to a higher level
A5.1 Make people feel at ease
A3.1 Know how to comfort others
A2.1 Inquire about others' well-being
O1   Am full of ideas

5.6.2 Empirical scale construction

There are some situations where one wants to identify those items that most relate to a particular criterion. Although this will capitalize on chance, and the results should be interpreted cautiously, it does give a feel for what is being measured. Consider the following example from the bfi data set. The items that best predicted gender, education, and age may be found using the bestScales function. This also shows the use of a dictionary that has the item content.

> data(bfi)
> bestScales(bfi, criteria=c("gender","education","age"), cut=.1, dictionary=bfi.dictionary[1:3])

The items most correlated with the criteria yield r's of:
          correlation n.items
gender           0.32       9
education        0.14       1
age              0.24       9

The best items, their correlations and content are:

$gender
  Rownames     gender ItemLabel                                      Item
1       N5  0.2106171    q_1505                              Panic easily
2       A2  0.1820202    q_1162          Inquire about others' well-being
3       A1 -0.1571373     q_146 Am indifferent to the feelings of others
4       A3  0.1401320    q_1206                Know how to comfort others
5       A4  0.1261747    q_1364                             Love children
6       E1 -0.1261018     q_712                          Don't talk a lot
7       N3  0.1207718    q_1099                 Have frequent mood swings
8       O1 -0.1031059     q_128                          Am full of ideas
9       A5  0.1007091    q_1419                  Make people feel at ease
      Giant3
1  Stability
2   Cohesion
3   Cohesion
4   Cohesion
5   Cohesion
6 Plasticity
7  Stability
8 Plasticity
9   Cohesion

$education
  Rownames  education ItemLabel                                      Item
1       A1 -0.1415734     q_146 Am indifferent to the feelings of others
    Giant3
1 Cohesion

$age
  Rownames        age ItemLabel                                      Item
1       A1 -0.1609637     q_146 Am indifferent to the feelings of others
2       C4 -0.1482659     q_626            Do things in a half-way manner
3       A4  0.1442473    q_1364                             Love children
4       A5  0.1290935    q_1419                  Make people feel at ease
5       E5  0.1146922    q_1768                               Take charge
6       A2  0.1142767    q_1162          Inquire about others' well-being
7       N3 -0.1108648    q_1099                 Have frequent mood swings
8       E2 -0.1051612     q_901      Find it difficult to approach others
9       N5 -0.1043322    q_1505                              Panic easily
      Giant3
1   Cohesion
2  Stability
3   Cohesion
4   Cohesion
5 Plasticity
6   Cohesion
7  Stability
8 Plasticity
9  Stability

6 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location, as well as item and test information (see Figure 32).


6.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δ_j, and item discrimination, α_j, may be found from factor analysis of the tetrachoric correlations, where λ_j is just the factor loading on the first factor and τ_j is the normal threshold reported by the tetrachoric function:

\delta_j = \frac{D \tau_j}{\sqrt{1 - \lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1 - \lambda_j^2}} \qquad (2)

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λ_j) and item thresholds (τ_j) are just

\lambda_j = \frac{\alpha_j}{\sqrt{1 + \alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1 + \alpha_j^2}}.
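Equation 2 maps directly onto psych output. The following is a minimal sketch, assuming items is a matrix of dichotomous (0/1) responses; tetrachoric returns the correlations in rho and the thresholds in tau, which is what irt.fa uses internally:

tet    <- tetrachoric(items)           # rho: correlations; tau: normal thresholds
f1     <- fa(tet$rho)                  # one factor solution; loadings are lambda
lambda <- f1$loadings[, 1]
tau    <- tet$tau
alpha  <- lambda / sqrt(1 - lambda^2)  # item discrimination (equation 2)
delta  <- tau / sqrt(1 - lambda^2)     # item difficulty, normal metric (D = 1)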

Consider 9 dichotomous items representing one factor, but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")  # dichotomous items
> test <- irt.fa(d9$items)
> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.043 and the 90% confidence intervals are 0.043 0.218
BIC = 1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale.

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.009 and the 90% confidence intervals are 0.009 0.111
BIC = 96.23

The item information functions show that not all of the items are equally good (Figure 33).

These procedures can be generalized to more than one factor by specifying the number of


> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))

[three-panel figure: item parameters from factor analysis (probability of response), item information, and test information with reliability, each plotted against the latent trait (normal scale) for items V1-V9]

Figure 32: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt, type="IIC")

[figure: item information curves from factor analysis for factor 1, plotted against the latent trait (normal scale), for items E1-E5]

Figure 33: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info, sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm, and should be used for serious Item Response Theory analysis.

6.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
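A minimal sketch of that reuse pattern, using the ability items analyzed below (the rho element of the irt.fa output is the saved tetrachoric matrix):

iq.irt <- irt.fa(ability)        # slow step: finds the tetrachoric correlations
om     <- omega(iq.irt$rho, 4)   # fast: reuses the saved correlation matrix
plot(iq.irt, type="test")        # redraws from the saved item statistics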

In addition, recent releases of psych take advantage of the parallel package and use multiple cores. The default for Macs and Unix machines is to use two cores, but this can be increased using the options command. The biggest step up in improvement is from 1 to 2 cores, but for large problems using polychoric correlations, the more cores available, the better.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 34) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 36). We can also show the total test information (merely the sum of the item information; Figure 35). This shows that even with just 16 items the test is very reliable for most of the range of ability. The irt.fa function saves the correlation matrix and item statistics so that they can be redrawn with other options.

detectCores()        # how many cores are available (from the parallel package)
options("mc.cores")  # how many have been set to be used
options(mc.cores=4)  # set to use 4 cores

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = ability)
Item Response Analysis using Factor Analysis

Summary information by factor and item
 Factor = 1
              -3    -2    -1     0     1     2     3
reason.4    0.05  0.24  0.64  0.53  0.16  0.03  0.01
reason.16   0.08  0.22  0.38  0.31  0.14  0.05  0.01
reason.17   0.08  0.33  0.69  0.42  0.11  0.02  0.00
reason.19   0.06  0.17  0.35  0.36  0.19  0.07  0.02
letter.7    0.05  0.18  0.41  0.44  0.20  0.06  0.02
letter.33   0.05  0.15  0.31  0.36  0.20  0.08  0.02
letter.34   0.05  0.19  0.45  0.46  0.20  0.06  0.01
letter.58   0.02  0.09  0.30  0.53  0.35  0.12  0.03
matrix.45   0.05  0.11  0.19  0.23  0.17  0.09  0.04
matrix.46   0.05  0.12  0.22  0.26  0.18  0.09  0.04
matrix.47   0.06  0.17  0.33  0.34  0.19  0.07  0.02
matrix.55   0.04  0.08  0.13  0.17  0.16  0.11  0.06
rotate.3    0.00  0.01  0.06  0.30  0.75  0.49  0.12
rotate.4    0.00  0.01  0.05  0.31  1.00  0.56  0.10
rotate.6    0.01  0.03  0.15  0.53  0.69  0.25  0.05
rotate.8    0.00  0.02  0.08  0.29  0.59  0.41  0.13
Test Info   0.67  2.11  4.73  5.83  5.28  2.55  0.69
SEM         1.22  0.69  0.46  0.41  0.44  0.63  1.20
Reliability -0.49 0.53  0.79  0.83  0.81  0.61 -0.45

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.91
The number of observations was 1525 with Chi Square = 2893.45 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.723
RMSEA index = 0.018 and the 90% confidence intervals are 0.018 0.137
BIC = 2131.15

6.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for scoreIrt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the


> iq.irt <- irt.fa(ability)

[figure: item information curves plotted against the latent trait (normal scale) for the 16 ability items, reason.4 through rotate.8]

Figure 34: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


> plot(iq.irt, type="test")

[figure: total test information (with a reliability axis) plotted against the latent trait (normal scale) for the 16 ability items]

Figure 35: A graphic analysis of 16 ability items sampled from the SAPA project. The total test information at all levels of difficulty may be shown by specifying the type='test' option in the plot function.


> om <- omega(iq.irt$rho, 4)

[figure: omega diagram showing the general factor g (loadings 0.4-0.7) and four group factors F1-F4 for the 16 ability items]

Figure 36: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The scoreIrt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.

> v9 <- sim.irt(9, 1000, -2, 2, mod="normal")  # dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> # now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- scoreIrt(test, items)
> unitweighted <- scoreIrt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 37.

6.3.1 1 versus 2 parameter IRT scoring

In Item Response Theory, items can be assumed to be equally discriminating but to differ in their difficulty (the Rasch model), or to vary in their discriminability. Two functions (scoreIrt.1pl and scoreIrt.2pl) are meant to find multiple IRT based scales using the Rasch model or the 2 parameter model. Both allow for negatively keyed as well as positively keyed items. Consider the bfi data set with scoring keys keys.list and items listed as an item.list. (This is the same as the keys.list, but with the negative signs removed.)

> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+   conscientious=c("C1","C2","C3","-C4","-C5"),
+   extraversion=c("-E1","-E2","E3","E4","E5"),
+   neuroticism=c("N1","N2","N3","N4","N5"),
+   openness = c("O1","-O2","O3","O4","-O5"))
> item.list <- list(agree=c("A1","A2","A3","A4","A5"),
+   conscientious=c("C1","C2","C3","C4","C5"),
+   extraversion=c("E1","E2","E3","E4","E5"),
+   neuroticism=c("N1","N2","N3","N4","N5"),
+   openness = c("O1","O2","O3","O4","O5"))
> bfi.1pl <- scoreIrt.1pl(keys.list, bfi)   # the one parameter solution
> bfi.2pl <- scoreIrt.2pl(item.list, bfi)   # the two parameter solution
> bfi.ctt <- scoreFast(keys.list, bfi)      # fast scoring function


Figure 37: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported, and then the IRT and total scoring systems.
> pairs.panels(scores.df, pch='.', gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)

[figure: pairs.panels scatter plot matrix comparing True theta, irt theta, total, fit, rasch, total, and fit, titled "Comparing true theta for IRT, Rasch and classically based scoring"]

We can compare these three ways of doing the analysis using the cor2 function, which correlates two separate data frames. All three models produce very similar results for the case of almost complete data. It is when we have massively missing completely at random data (MMCAR) that the results show the superiority of the IRT scoring.

> # compare the solutions using the cor2 function
> cor2(bfi.1pl, bfi.ctt)

              agree conscientious extraversion neuroticism openness
agree          0.93          0.27         0.44       -0.20     0.19
conscientious  0.25          0.97         0.25       -0.22     0.17
extraversion   0.43          0.25         0.94       -0.23     0.23
neuroticism   -0.19         -0.23        -0.22        1.00    -0.09
openness       0.12          0.19         0.19       -0.07     0.95

> cor2(bfi.2pl, bfi.ctt)

              agree conscientious extraversion neuroticism openness
agree          0.98          0.25         0.49       -0.17     0.15
conscientious  0.26          0.97         0.25       -0.23     0.18
extraversion   0.43          0.24         0.94       -0.25     0.21
neuroticism   -0.19         -0.22        -0.18        0.98    -0.08
openness       0.17          0.21         0.27       -0.11     0.98

> cor2(bfi.2pl, bfi.1pl)

              agree conscientious extraversion neuroticism openness
agree          0.88          0.25         0.45       -0.17     0.12
conscientious  0.26          1.00         0.24       -0.23     0.16
extraversion   0.43          0.23         0.99       -0.25     0.20
neuroticism   -0.21         -0.21        -0.19        0.98    -0.07
openness       0.20          0.19         0.28       -0.11     0.92

7 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.


7.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

r_{xy} = \eta_{x_{wg}} \ast \eta_{y_{wg}} \ast r_{xy_{wg}} + \eta_{x_{bg}} \ast \eta_{y_{bg}} \ast r_{xy_{bg}} \qquad (3)

where r_{xy} is the normal correlation, which may be decomposed into a within group and a between group correlation, r_{xy_{wg}} and r_{xy_{bg}}, and η (eta) is the correlation of the data with the within group values or the group means.

7.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)), or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
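A minimal sketch of the first case, assuming (as described above) that the rwg and rbg elements of the statsBy output hold the pooled within group correlations and the correlations of the group means:

sb <- statsBy(sat.act, group="education", cors=TRUE)
round(sb$rwg, 2)   # pooled within group correlations
round(sb$rbg, 2)   # correlations of the group means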


7.3 Factor analysis by groups

Confirmatory factor analysis comparing the structures in multiple groups can be done in the lavaan package. However, for exploratory analyses of the structure within each of multiple groups, the faBy function may be used in combination with the statsBy function. First run statsBy with the correlation option set to TRUE, and then run faBy on the resulting output.

sb <- statsBy(bfi[c(1:25,27)], group="education", cors=TRUE)
faBy(sb, nfactors=5)   # find the 5 factor solution for each education level

8 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

R^2 = 1 - \prod_{i=1}^{n} (1 - \lambda_i)

where λ_i is the ith eigenvalue of the eigenvalue decomposition of the matrix

R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}.

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized or raw β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476

Compare this with the output from setCor:

> # compare with setCor
> setCor(c(4:6), c(1:3), C, n.obs=700)

Call: setCor(y = c(4:6), x = c(1:3), data = C, n.obs = 700)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation = 0.03
 Cohen's Set Correlation R2 = 0.09
 Shrunken Set Correlation R2 = 0.08
 F and df of Cohen's Set Correlation: 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.


9 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item (a more general item simulation), sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural (a general simulation of structural models), sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
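A minimal sketch, using sim.minor's defaults of 12 variables and 3 major factors, and assuming the simulated raw data are returned in the observed element when n > 0:

set.seed(42)
minor <- sim.minor(nvar=12, nfact=3, n=1000)  # 3 major factors plus 6 minor factors
fa.parallel(minor$observed)  # do the minor factors inflate the apparent dimensionality?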

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models), depending upon the specification of the model.

The logistic model is

P(x | \theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \qquad (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination, and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.

(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same.


With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.

10 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)

c <- iclust(Thurstone)
plot(c)              # a pretty boring plot
iclust.diagram(c)    # a better diagram

c3 <- iclust(Thurstone, 3)
plot(c3)             # a more interesting plot

data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)

ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.

11 Converting output to APA style tables using LaTeX

Although for most purposes using the Sweave or KnitR packages produces clean output, some prefer output pre-formatted for APA style tables. This can be done using the xtable package for almost anything, but there are a few simple functions in psych for the most common tables. fa2latex will convert a factor analysis or components analysis output to a LaTeX table, cor2latex will take a correlation matrix and show the lower (or upper) diagonal, irt2latex converts the item statistics from the irt.fa function to more convenient LaTeX output, and finally, df2latex converts a generic data frame to LaTeX.

An example of converting the output from fa to LaTeX appears in Table 11.
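A minimal sketch of the kind of call that produced Table 11 (the heading and caption arguments are assumptions; see ?fa2latex for the exact argument names):

f3 <- fa(Thurstone, 3)
fa2latex(f3, heading="A factor analysis table from the psych package in R",
         caption="fa2latex")   # writes the LaTeX source for the table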


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double", -1)  # for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")              # but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$top, scale=-1)
> dia.curved.arrow(mr, lr$top)

[figure: the "Demonstration of dia functions" plot produced by the code above, showing labeled rectangles, ellipses, arrows, and curves]

Figure 38: The basic graphic capabilities of the dia functions are shown in this figure.

Table 11: fa2latex. A factor analysis table from the psych package in R.

Variable          MR1   MR2   MR3   h2   u2  com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.01
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.01
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.00
First.Letters    0.00  0.86  0.00 0.73 0.27 1.00
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.04
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.20
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.00
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.93
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.23

SS loadings      2.64  1.86  1.50

     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail. Combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also set.cor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
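A minimal sketch of the superMatrix idea:

A <- matrix(1, nrow=2, ncol=2)  # any two keys matrices will do
B <- matrix(2, nrow=3, ncol=3)
superMatrix(A, B)   # A in the top left, B in the lower right, 0s elsewhere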

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.


Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 × 14 matrix from their paper. The Thurstone correlation matrix is a 9 × 9 matrix of correlations of ability items. The Reise data set is a 16 × 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 × 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1984), this data set allows for examples of basic scaling techniques.


14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered, ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.5.0", package="psych")

15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book), An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()

R Under development (unstable) (2017-01-04 r71889)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GPArotation_2014.11-1 psych_1.6.12

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      lattice_0.20-34  matrixcalc_1.0-3 MASS_7.3-45
 [5] grid_3.4.0       arm_1.8-6        nlme_3.1-128     stats4_3.4.0
 [9] coda_0.18-1      minqa_1.2.4      mi_1.0           nloptr_1.0.4
[13] Matrix_1.2-7.1   boot_1.3-18      splines_3.4.0    lme4_1.1-12
[17] sem_3.1-7        tools_3.4.0      foreign_0.8-67   abind_1.4-3
[21] parallel_3.4.0   compiler_3.4.0   mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405-432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439-458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447-473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel modeling in R (2.3): a brief introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245-276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78-98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297-334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173-178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England.

Fox, J., Nie, Z., and Byrnes, J. (2012). sem: Structural Equation Models.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430-450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255-282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121-132.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65-70.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41-54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179-185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283-300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, pages 1-13. 10.1007/s11336-011-9218-4.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231-258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309-317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153-175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676-1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481-495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1984). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

Preacher, K. J. and Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4):717-731.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57-74.

Revelle, W. (2015). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. R package version 1.5.8.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W. and Condon, D. M. (2014). Reliability. In Irwing, P., Booth, T., and Hughes, D., editors, Wiley-Blackwell Handbook of Psychometric Testing. Wiley-Blackwell (in press).

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39-73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403-414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27-49. Springer.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145-154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83-90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420-428.

Smillie, L. D., Cooper, A., Wilt, J., and Revelle, W. (2012). Do extraverts get more bang for the buck? Refining the affective-reactivity hypothesis of extraversion. Journal of Personality and Social Psychology, 103(2):306-326.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72-101.

Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2):245-251.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345-353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors". Psychological Review, 42(5):425-454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321-327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123-133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121-144.



1 Overview of this and related documents

The psych package (Revelle, 2015) has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory (Revelle, prep), a draft of which is available at http://personality-project.org/r/book/.

Some of the functions (e.g., read.clipboard, describe, pairs.panels, scatter.hist, error.bars, multi.hist, bibars) are useful for basic data entry and descriptive analyses.

Psychometric applications emphasize techniques for dimension reduction including factor analysis, cluster analysis, and principal components analysis. The fa function includes five methods of factor analysis (minimum residual, principal axis, weighted least squares, generalized least squares and maximum likelihood factor analysis). Principal Components Analysis (PCA) is also available through the use of the principal function. Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP) or parallel analysis (fa.parallel) criteria. These and several other criteria are included in the nfactors function. Two parameter Item Response Theory (IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination using irt.fa.

Bifactor and hierarchical factor structures may be estimated by using Schmid-Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937).

Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust) and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945) as well as additional estimates (Revelle and Zinbarg, 2009) are in the guttman function. The six measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, correlation "heat maps" (corPlot), factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, structure.diagram and het.diagram, as well as item response characteristics and item and test information characteristic curves plot.irt and plot.poly.

This vignette is meant to give an overview of the psych package. That is, it is meant to give a summary of the main functions in the psych package with examples of how they are used for data description, dimension reduction, and scale construction. The extended user manual at psych_manual.pdf includes examples of graphic output and more extensive demonstrations than are found in the help menus. (Also available at http://personality-project.org/r/psych_manual.pdf.) The vignette, psych for sem, at psych_for_sem.pdf, discusses how to use psych as a front end to the sem package of John Fox (Fox et al., 2012). (The vignette is also available at http://personality-project.org/r/book/psych_for_sem.pdf.)

For a step by step tutorial in the use of the psych package and the base functions in R for basic personality research, see the guide for using R for personality research at http://personalitytheory.org/r/r.short.html. For an introduction to psychometric theory with applications in R, see the draft chapters at http://personality-project.org/r/book/.

2 Getting started

Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa, factor.minres, factor.pa, factor.wls, or principal) or hierarchical factor models using omega or schmid, is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.

install.packages("ctv")

library(ctv)

task.views("Psychometrics")

The "Psychometrics" task view will install a large number of useful packages. To install the bare minimum for the examples in this vignette, it is necessary to install just two packages:

install.packages(c("GPArotation","mnormt"))

Because of the difficulty of installing the package Rgraphviz, alternative graphics have been developed and are available as diagram functions. If Rgraphviz is available, some functions will take advantage of it. An alternative is to use "dot" output of commands for any external graphics package that uses the dot language.


3 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions, it is necessary to make the package active by using the library command:

> library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

.First <- function(x) {library(psych)}

3.1 Data input from the clipboard

There are of course many ways to enter data into R. Reading from a local file using read.table is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command

> my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.

> my.data <- read.clipboard(sep="\t")   #define the tab option, or

> my.tab.data <- read.clipboard.tab()   #just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column, the last 4 are 3 columns).

> my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))

3.2 Basic descriptive statistics

Once the data are read in, then describe or describeBy will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

describeBy reports descriptive statistics broken down by some categorizing variable (e.g., gender, age, etc.).

> library(psych)

> data(sat.act)

> describe(sat.act)   #basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59
          kurtosis   se
gender       -1.62 0.02
education    -0.07 0.05
age           2.42 0.36
ACT           0.53 0.18
SATV          0.33 4.27
SATQ         -0.02 4.41

These data may then be analyzed by groups defined in a logical statement or by some other variable. E.g., break down the descriptive data for males or females. These descriptive data can also be seen graphically using the error.bars.by function (Figure 5). By setting skew=FALSE and ranges=FALSE, the output is limited to the most basic statistics.
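For instance, a minimal sketch of that reduced output (this particular call is not shown in the original examples; it simply turns the extra statistics off):

> describe(sat.act, skew=FALSE, ranges=FALSE)   #means, sds, and standard errors only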


> #basic descriptive statistics by a grouping variable

> describeBy(sat.act,sat.act$gender,skew=FALSE,ranges=FALSE)

$`1`
          vars   n   mean     sd   se
gender       1 247   1.00   0.00 0.00
education    2 247   3.00   1.54 0.10
age          3 247  25.86   9.74 0.62
ACT          4 247  28.79   5.06 0.32
SATV         5 247 615.11 114.16 7.26
SATQ         6 245 635.87 116.02 7.41

$`2`
          vars   n   mean     sd   se
gender       1 453   2.00   0.00 0.00
education    2 453   3.26   1.35 0.06
age          3 453  25.45   9.37 0.44
ACT          4 453  28.42   4.69 0.22
SATV         5 453 610.66 112.31 5.28
SATQ         6 442 596.00 113.07 5.38

attr(,"call")
by.data.frame(data = x, INDICES = group, FUN = describe, type = type,
    skew = FALSE, ranges = FALSE)

The output from the describeBy function can be forced into a matrix form for easy analysis by other programs. In addition, describeBy can group by several grouping variables at the same time.

> sa.mat <- describeBy(sat.act,list(sat.act$gender,sat.act$education),
+                      skew=FALSE,ranges=FALSE,mat=TRUE)

> headTail(sa.mat)

        item group1 group2 vars   n   mean     sd    se
gender1    1      1      0    1  27      1      0     0
gender2    2      2      0    1  30      2      0     0
gender3    3      1      1    1  20      1      0     0
gender4    4      2      1    1  25      2      0     0
...      ...   <NA>   <NA>  ... ...    ...    ...   ...
SATQ9     69      1      4    6  51  635.9 104.12 14.58
SATQ10    70      2      4    6  86 597.59 106.24 11.46
SATQ11    71      1      5    6  46 657.83  89.61 13.21
SATQ12    72      2      5    6  93 606.72 105.55 10.95

3.2.1 Outlier detection using outlier

One way to detect unusual data is to consider how far each data point is from the multivariate centroid of the data. That is, find the squared Mahalanobis distance for each data point and then compare these to the expected values of χ2. This produces a Q-Q (quantile-quantile) plot with the n most extreme data points labeled (Figure 1). The outlier values are in the vector d2.


> png( 'outlier.png' )

> d2 <- outlier(sat.act,cex=.8)

> dev.off()

null device

1

Figure 1: Using the outlier function to graphically show outliers. The y axis is the Mahalanobis D2, the X axis is the distribution of χ2 for the same number of degrees of freedom. The outliers detected here may be shown graphically using pairs.panels (see Figure 2), and may be found by sorting d2.
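As the caption notes, the most extreme cases may be listed by sorting d2; a minimal sketch (the cutoff of five cases is an arbitrary illustration):

> sat.act[order(d2, decreasing=TRUE)[1:5], ]   #show the five most extreme cases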


3.2.2 Basic data cleaning using scrub

If, after describing the data, it is apparent that there were data entry errors that need to be globally replaced with NA, or only certain ranges of data will be analyzed, the data can be "cleaned" using the scrub function.

Consider a data set of 12 rows of 10 columns with values from 1 - 120. All values of columns 3 - 5 that are less than 30, 40, or 50 respectively, or greater than 70 in any of the three columns, will be replaced with NA. In addition, any value exactly equal to 45 will be set to NA. (max and isvalue are set to one value here, but they could be a different value for every column.)

> x <- matrix(1:120,ncol=10,byrow=TRUE)

> colnames(x) <- paste("V",1:10,sep="")
> new.x <- scrub(x,3:5,min=c(30,40,50),max=70,isvalue=45,newvalue=NA)

> new.x

       V1  V2 V3 V4 V5  V6  V7  V8  V9 V10
 [1,]   1   2 NA NA NA   6   7   8   9  10
 [2,]  11  12 NA NA NA  16  17  18  19  20
 [3,]  21  22 NA NA NA  26  27  28  29  30
 [4,]  31  32 33 NA NA  36  37  38  39  40
 [5,]  41  42 43 44 NA  46  47  48  49  50
 [6,]  51  52 53 54 55  56  57  58  59  60
 [7,]  61  62 63 64 65  66  67  68  69  70
 [8,]  71  72 NA NA NA  76  77  78  79  80
 [9,]  81  82 NA NA NA  86  87  88  89  90
[10,]  91  92 NA NA NA  96  97  98  99 100
[11,] 101 102 NA NA NA 106 107 108 109 110
[12,] 111 112 NA NA NA 116 117 118 119 120

Note that the number of subjects for those columns has decreased, and the minimums have gone up but the maximums down. Data cleaning and examination for outliers should be a routine part of any data analysis.

3.2.3 Recoding categorical variables into dummy coded variables

Sometimes categorical variables (e.g., college major, occupation, ethnicity) are to be analyzed using correlation or regression. To do this, one can form "dummy codes" which are merely binary variables for each category. This may be done using dummy.code. Subsequent analyses using these dummy coded variables may be using biserial or point biserial (regular Pearson r) correlations to show effect sizes, and may be plotted in e.g., spider plots.
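A minimal sketch of dummy coding (the "major" vector here is hypothetical illustrative data, not part of any psych data set):

> major <- c("psych","bio","psych","econ","bio","psych")   #hypothetical categories
> dummy.code(major)   #returns a matrix with one binary column per category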

3.3 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries. By default, error.bars.by and error.bars will show "cats eyes" to graphically show the confidence limits (Figure 5). This may be turned off by specifying eyes=FALSE. densityBy or violinBy may be used to show the distribution of the data in "violin" plots (Figure 4). (These are sometimes called "lava-lamp" plots.)

3.3.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean with the axis length reflecting one standard deviation of the x and y variables is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 2). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.'. (See Figure 2 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic. However, in this figure, to show the outliers, we use colors and a larger plot character.

Another example of pairs.panels is to show differences between experimental groups. Consider the data in the affect data set. The scores reflect post test scores on positive and negative affect and energetic and tense arousal. The colors show the results for four movie conditions: depressing, frightening movie, neutral and a comedy.

3.3.2 Density or violin plots

Graphical presentation of data may be shown using box plots to show the median and 25th and 75th percentiles. A powerful alternative is to show the density distribution using the violinBy function (Figure 4).


> png( 'pairspanels.png' )

> sat.d2 <- data.frame(sat.act,d2)   #combine the d2 statistics from before with the sat.act data.frame

> pairs.panels(sat.d2,bg=c("yellow","blue")[(d2 > 25)+1],pch=21)

> dev.off()

null device

1

Figure 2: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. If the plot character were set to a period (pch='.') it would make a cleaner graphic, but in order to show the outliers in color we use the plot characters 21 and 22.


> png('affect.png')
> pairs.panels(affect[14:17],bg=c("red","black","white","blue")[affect$Film],pch=21,
+              main="Affect varies by movies")

> dev.off()

null device

1

Figure 3: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. The colors represent four different movie conditions.


> data(sat.act)

> violinBy(sat.act[5:6],sat.act$gender,grp.name=c("M","F"),main="Density Plot by gender for SAT V and Q")


Figure 4: Using the violinBy function to show the distribution of SAT V and Q for males and females. The plot shows the medians, and 25th and 75th percentiles, as well as the entire range and the density distribution.


3.3.3 Means and error bars

Additional descriptive graphics include the ability to draw error bars on sets of data, as well as to draw error bars in both the x and y directions for paired data. These are the functions error.bars, error.bars.by, error.bars.tab, and error.crosses.

error.bars show the 95% confidence intervals for each variable in a data frame or matrix. These errors are based upon normal theory and the standard errors of the mean. Alternative options include +/- one standard deviation or 1 standard error. If the data are repeated measures, the error bars will reflect the between variable correlations. By default, the confidence intervals are displayed using a "cats eyes" plot which emphasizes the distribution of confidence within the confidence interval.

error.bars.by does the same, but grouping the data by some condition.

error.bars.tab draws bar graphs from tabular data with error bars based upon the standard error of a proportion (σp = √(pq/N)).

error.crosses draw the confidence intervals for an x set and a y set of the same size.

The use of the error.bars.by function allows for graphic comparisons of different groups (see Figure 5). Five personality measures are shown as a function of high versus low scores on a "lie" scale. People with higher lie scores tend to report being more agreeable, conscientious and less neurotic than people with lower lie scores. The error bars are based upon normal theory and thus are symmetric rather than reflect any skewing in the data.

Although not recommended, it is possible to use the error.bars function to draw bar graphs with associated error bars. (This kind of dynamite plot (Figure 6) can be very misleading in that the scale is arbitrary. Go to a discussion of the problems in presenting data this way at http://emdbolker.wikidot.com/blog:dynamite.) In the example shown, note that the graph starts at 0, although 0 is out of the range of the data. This is a function of using bars, which always are assumed to start at zero. Consider other ways of showing your data.

3.3.4 Error bars for tabular data

However, it is sometimes useful to show error bars for tabular data, either found by the table function or just directly input. These may be found using the error.bars.tab function.


> data(epi.bfi)

> error.bars.by(epi.bfi[,6:10],epi.bfi$epilie<4)


Figure 5: Using the error.bars.by function shows that self reported personality scales on the Big Five Inventory vary as a function of the Lie scale on the EPI. The "cats eyes" show the distribution of the confidence.


> error.bars.by(sat.act[5:6],sat.act$gender,bars=TRUE,
+               labels=c("Male","Female"),ylab="SAT score",xlab="")


Figure 6: A "Dynamite plot" of SAT scores as a function of gender is one way of misleading the reader. By using a bar graph, the range of scores is ignored. Bar graphs start from 0.


> T <- with(sat.act,table(gender,education))

> rownames(T) <- c("M","F")

> error.bars.tab(T,way="both",ylab="Proportion of Education Level",xlab="Level of Education",
+                main="Proportion of sample by education level")


Figure 7: The proportion of each education level that is Male or Female. By using the way="both" option, the percentages and errors are based upon the grand total. Alternatively, way="columns" finds column wise percentages, way="rows" finds rowwise percentages. The data can be converted to percentages (as shown) or by total count (raw=TRUE). The function invisibly returns the probabilities and standard errors. See the help menu for an example of entering the data as a data.frame.


3.3.5 Two dimensional displays of means and errors

Yet another way to display data for different conditions is to use the errorCircles function. For instance, the effect of various movies on both "Energetic Arousal" and "Tense Arousal" can be seen in one graph and compared to the same movie manipulations on "Positive Affect" and "Negative Affect". Note how Energetic Arousal is increased by three of the movie manipulations, but that Positive Affect increases following the Happy movie only.


> op <- par(mfrow=c(1,2))

> data(affect)

> colors <- c("black","red","white","blue")

> films <- c("Sad","Horror","Neutral","Happy")

> affect.stats <- errorCircles("EA2","TA2",data=affect[-c(1,20)],group="Film",labels=films,
+    xlab="Energetic Arousal", ylab="Tense Arousal",ylim=c(10,22),xlim=c(8,20),pch=16,
+    cex=2,colors=colors, main ="Movies effect on arousal")
> errorCircles("PA2","NA2",data=affect.stats,labels=films,xlab="Positive Affect",
+    ylab="Negative Affect", pch=16,cex=2,colors=colors, main ="Movies effect on affect")

> op <- par(mfrow=c(1,1))


Figure 8: The use of the errorCircles function allows for two dimensional displays of means and error bars. The first call to errorCircles finds descriptive statistics for the affect data.frame based upon the grouping variable of Film. These data are returned and then used by the second call which examines the effect of the same grouping variable upon different measures. The size of the circles represent the relative sample sizes for each group. The data are from the PMC lab and reported in Smillie et al. (2012).


3.3.6 Back to back histograms

The bibars function summarizes the characteristics of two groups (e.g., males and females) on a second variable (e.g., age) by drawing back to back histograms (see Figure 9).

> data(bfi)

> with(bfi,bibars(age,gender,ylab="Age",main="Age by males and females"))


Figure 9: A bar plot of the age distribution for males and females shows the use of bibars. The data are males and females from 2800 cases collected using the SAPA procedure, and are available as part of the bfi data set.


3.3.7 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act,sat.act$gender==2)

> male <- subset(sat.act,sat.act$gender==1)

> lower <- lowerCor(male[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower,upper)

> round(both,2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:


> diffs <- lowerUpper(lower,upper,diff=TRUE)

> round(diffs,2)

          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

3.3.8 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set which has a clear 3 factor solution (Figure 10) or a simulated data set of 24 variables with a circumplex structure (Figure 11). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

Yet another way to show structure is to use "spider" plots. Particularly if variables are ordered in some meaningful way (e.g., in a circumplex), a spider plot will show this structure easily. This is just a plot of the magnitude of the correlation as a radial line, with length ranging from 0 (for a correlation of -1) to 1 (for a correlation of 1). (See Figure 12.)

3.4 Testing correlations

Correlations are wonderful descriptive statistics of the data, but some people like to test whether these correlations differ from zero, or differ from each other. The cor.test function (in the stats package) will test the significance of a single correlation, and the rcorr function in the Hmisc package will do this for many correlations. In the psych package, the corr.test function reports the correlation (Pearson, Spearman, or Kendall) between all variables in either one or two data frames or matrices, as well as the number of observations for each case, and the (two-tailed) probability for each correlation. Unfortunately, these probability values have not been corrected for multiple comparisons and so should be taken with a great deal of salt. Thus, in corr.test and corr.p the raw probabilities are reported below the diagonal and the probabilities adjusted for multiple comparisons using (by default) the Holm correction are reported above the diagonal (Table 1). (See the p.adjust function for a discussion of Holm (1979) and other corrections.)

Testing the difference between any two correlations can be done using the r.test function. The function actually does four different tests (based upon an article by Steiger (1980)),


> png('corplot.png')
> corPlot(Thurstone,numbers=TRUE,upper=FALSE,diag=FALSE,main="9 cognitive variables from Thurstone")

> dev.off()

null device

1

Figure 10: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well. By default, the complete matrix is shown. Setting upper=FALSE and diag=FALSE shows a cleaner figure.


> png('circplot.png')
> circ <- sim.circ(24)

> r.circ <- cor(circ)

> corPlot(r.circ,main="24 variables in a circumplex")
> dev.off()

null device

1

Figure 11: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect. For circumplex structures, it is perhaps useful to show the complete matrix.


> png('spider.png')
> op <- par(mfrow=c(2,2))

> spider(y=c(1,6,12,18),x=1:24,data=r.circ,fill=TRUE,main="Spider plot of 24 circumplex variables")

> op <- par(mfrow=c(1,1))

> dev.off()

null device

1

Figure 12: A spider plot can show circumplex structure very clearly. Circumplex structures are common in the study of affect.


Table 1: The corr.test function reports correlations, cell sizes, and raw and adjusted probability values. corr.p reports the probability values for a correlation matrix. By default, the adjustment used is that of Holm (1979).
> corr.test(sat.act)

Call:corr.test(x = sat.act)

Correlation matrix

          gender education   age   ACT  SATV  SATQ
gender      1.00      0.09 -0.02 -0.04 -0.02 -0.17
education   0.09      1.00  0.55  0.15  0.05  0.03
age        -0.02      0.55  1.00  0.11 -0.04 -0.03
ACT        -0.04      0.15  0.11  1.00  0.56  0.59
SATV       -0.02      0.05 -0.04  0.56  1.00  0.64
SATQ       -0.17      0.03 -0.03  0.59  0.64  1.00

Sample Size

gender education age ACT SATV SATQ

gender 700 700 700 700 700 687

education 700 700 700 700 700 687

age 700 700 700 700 700 687

ACT 700 700 700 700 700 687

SATV 700 700 700 700 700 687

SATQ 687 687 687 687 687 687

Probability values (Entries above the diagonal are adjusted for multiple tests)

          gender education  age  ACT SATV SATQ
gender      0.00      0.17 1.00 1.00    1    0
education   0.02      0.00 0.00 0.00    1    1
age         0.58      0.00 0.00 0.03    1    1
ACT         0.33      0.00 0.00 0.00    0    0
SATV        0.62      0.22 0.26 0.00    0    0
SATQ        0.00      0.36 0.37 0.00    0    0

To see confidence intervals of the correlations, print with the short=FALSE option.
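A sketch of that print call (short is an option to the print method for corr.test output):

> print(corr.test(sat.act), short=FALSE)   #adds confidence intervals to the output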


depending upon the input:

1) For a sample size n, find the t and p value for a single correlation as well as the confidence interval.

> r.test(50,.3)

Correlation tests

Call:r.test(n = 50, r12 = 0.3)

Test of significance of a correlation

t value 2.18 with probability < 0.034

and confidence interval 0.02 0.53

2) For sample sizes of n and n2 (n2 = n if not specified), find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores.

> r.test(30,.4,.6)

Correlation tests

Call:r.test(n = 30, r12 = 0.4, r34 = 0.6)

Test of difference between two independent correlations

z value 0.99 with probability 0.32

3) For sample size n, and correlations ra = r12, rb = r23 and r13 specified, test for the difference of two dependent correlations (Steiger case A).

> r.test(103,.4,.5,.1)

Correlation tests

Call:[1] "r.test(n = 103 , r12 = 0.4 , r23 = 0.1 , r13 = 0.5 )"

Test of difference between two correlated correlations

t value -0.89 with probability < 0.37

4) For sample size n, test for the difference between two dependent correlations involving different variables (Steiger case B).

> r.test(103,.5,.6,.7,.5,.5,.8)   #steiger Case B

Correlation tests

Call:r.test(n = 103, r12 = 0.5, r34 = 0.6, r23 = 0.7, r13 = 0.5, r14 = 0.5,
    r24 = 0.8)

Test of difference between two dependent correlations

z value -1.2 with probability 0.23

To test whether a matrix of correlations differs from what would be expected if the population correlations were all zero, the function cortest follows Steiger (1980) who pointed out that the sum of the squared elements of a correlation matrix, or the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix). This is particularly useful for examining whether correlations in a single matrix differ from zero or for comparing two matrices. Although obvious, cortest can be used to test whether the sat.act data matrix produces non-zero correlations (it does). This is a much more appropriate test when testing whether a residual matrix differs from zero.

> cortest(sat.act)


Tests of correlation matrices

Call:cortest(R1 = sat.act)

Chi Square value 1325.42 with df = 15 with probability < 1.8e-273
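cortest will also compare two matrices. A minimal sketch, reusing the male and female correlation matrices found in section 3.3.7 (the sample sizes are those reported by describeBy above):

> cortest(lower, upper, n1=247, n2=453)   #do the male and female matrices differ?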

3.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process (Figure 13). This is also shown in terms of dichotomizing the bivariate normal density function using the draw.cor function (Figure 14). A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will sometimes happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is a data set of burt which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
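As a minimal sketch of these functions (the choice of the first five bfi items and of the cut point is an arbitrary illustration, not a recommended analysis):

> data(bfi)
> items <- bfi[1:5]                    #five polytomous (1-6) items
> pc <- polychoric(items)              #polychoric correlations
> pc$rho                               #the estimated latent correlations
> tc <- tetrachoric((items > 3) + 0)   #dichotomize at an arbitrary cut, then find tetrachorics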

3.6 Multiple regression from data or correlation matrices

The typical application of the lm function is to do a linear model of one Y variable as a function of multiple X variables. Because lm is designed to analyze complex interactions, it requires raw data as input. It is, however, sometimes convenient to do multiple regression from a correlation or covariance matrix. The setCor function will do this, taking a set of y variables predicted from a set of x variables, perhaps with a set of z covariates removed from both x and y. Consider the Thurstone correlation matrix and find the multiple correlation of the last five variables as a function of the first 4.

> setCor(y = 5:9,x=1:4,data=Thurstone)


> draw.tetra()


Figure 13: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values.


> draw.cor(expand=20,cuts=c(0,0))


Figure 14: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values. It is found (laboriously) by optimizing the fit of the bivariate normal for various values of the correlation to the observed cell frequencies.


Call: setCor(y = 5:9, x = 1:4, data = Thurstone)

Multiple Regression from matrix input

Beta weights

                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sentences                    0.09     0.07          0.25      0.21         0.20
Vocabulary                   0.09     0.17          0.09      0.16        -0.02
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31

Multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees
             0.69              0.63              0.50              0.58
     Letter.Group
             0.48

multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees
             0.48              0.40              0.25              0.34
     Letter.Group
             0.23

Unweighted multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees
             0.59              0.58              0.49              0.58
     Letter.Group
             0.45

Unweighted multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees
             0.34              0.34              0.24              0.33
     Letter.Group
             0.20

Various estimates of between set correlations

Squared Canonical Correlations
[1] 0.6280 0.1478 0.0076 0.0049

Average squared canonical correlation = 0.2

Cohen's Set Correlation R2 = 0.69

Unweighted correlation between the two sets = 0.73

By specifying the number of subjects in the correlation matrix, appropriate estimates of standard errors, t-values, and probabilities are also found. The next example finds the regressions with variables 1 and 2 used as covariates. The β weights for variables 3 and 4 do not change, but the multiple correlation is much less. It also shows how to find the residual correlations between variables 5-9 with variables 1-4 removed.

> sc <- setCor(y = 5:9,x=3:4,data=Thurstone,z=1:2)

Call: setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)

Multiple Regression from matrix input

Beta weights

                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31


Multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees
             0.58              0.46              0.21              0.18
     Letter.Group
             0.30

multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees
            0.331             0.210             0.043             0.032
     Letter.Group
            0.092

Unweighted multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees
             0.44              0.35              0.17              0.14
     Letter.Group
             0.26

Unweighted multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees
             0.19              0.12              0.03              0.02
     Letter.Group
             0.07

Various estimates of between set correlations

Squared Canonical Correlations
[1] 0.405 0.023

Average squared canonical correlation = 0.21

Cohen's Set Correlation R2 = 0.42

Unweighted correlation between the two sets = 0.48

> round(sc$residual,2)

                  Four.Letter.Words Suffixes Letter.Series Pedigrees
Four.Letter.Words              0.52     0.11          0.09      0.06
Suffixes                       0.11     0.60         -0.01      0.01
Letter.Series                  0.09    -0.01          0.75      0.28
Pedigrees                      0.06     0.01          0.28      0.66
Letter.Group                   0.13     0.03          0.37      0.20
                  Letter.Group
Four.Letter.Words         0.13
Suffixes                  0.03
Letter.Series             0.37
Pedigrees                 0.20
Letter.Group              0.77

3.7 Mediation and Moderation analysis

Although multiple regression is a straightforward method for determining the effect of multiple predictors (x1, x2, ..., xi) on a criterion variable y, some prefer to think of the effect of one predictor, x, as mediated by another variable, m (Preacher and Hayes, 2004). Thus, we may find the indirect path from x to m, and then from m to y, as well as the direct path from x to y. Call these paths a, b, and c, respectively. Then the indirect effect of x on y through m is just ab, and the direct effect is c. Statistical tests of the ab effect are best done by bootstrapping.

Consider the example from Preacher and Hayes (2004) as analyzed using the mediate function and the subsequent graphic from mediate.diagram. The data are found in the example for mediate.

Call: mediate(y = 1, x = 2, m = 3, data = sobel)

The DV (Y) was SATIS. The IV (X) was THERAPY. The mediating variable(s) = ATTRIB.

Total Direct effect(c) of THERAPY on SATIS = 0.76  S.E. = 0.31  t direct = 2.5 with probability = 0.019

Direct effect (c') of THERAPY on SATIS removing ATTRIB = 0.43  S.E. = 0.32  t direct = 1.35 with probability = 0.19

Indirect effect (ab) of THERAPY on SATIS through ATTRIB = 0.33

Mean bootstrapped indirect effect = 0.33 with standard error = 0.17  Lower CI = 0.04  Upper CI = 0.71

R2 of model = 0.31

To see the longer output, specify short = FALSE in the print statement.

Full output

Total effect estimates (c)
        SATIS   se   t   Prob
THERAPY  0.76 0.31 2.5 0.0186

Direct effect estimates (c')
        SATIS   se    t  Prob
THERAPY  0.43 0.32 1.35 0.190
ATTRIB   0.40 0.18 2.23 0.034

'a' effect estimates
       THERAPY  se    t   Prob
ATTRIB    0.82 0.3 2.74 0.0106

'b' effect estimates
       SATIS   se    t  Prob
ATTRIB   0.4 0.18 2.23 0.034

'ab' effect estimates
        SATIS boot   sd lower upper
THERAPY  0.33 0.33 0.17  0.04  0.71

4 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.


> mediate.diagram(preacher)


Figure 15: A mediated model taken from Preacher and Hayes, 2004, and solved using the mediate function. The direct path from Therapy to Satisfaction has an effect of 0.76, while the indirect path through Attribution has an effect of 0.33. Compare this to the normal regression graphic created by setCor.diagram.


> preacher <- setCor(1,c(2,3),sobel,std=FALSE)

> setCor.diagram(preacher)


Figure 16: The conventional regression model for the Preacher and Hayes, 2004 data set solved using the setCor function. Compare this to the previous figure.


4.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or estimate (for factors) factor scores.
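A minimal sketch of the SVD route, in base R (the choice of data set and the listwise deletion via na.omit are arbitrary illustrative choices):

> sv <- svd(scale(na.omit(sat.act)))    #SVD of the standardized data
> round(sv$d^2/sum(sv$d^2), 2)          #proportion of variance for each dimension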

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will be eventually phased out.

fa.poly is useful when finding the factor structure of categorical items. fa.poly first finds the tetrachoric or polychoric correlations between the categorical variables and then proceeds to do a normal factor analysis. By setting the n.iter option to be greater than 1, it will also find confidence intervals for the factor solution. Warning. Finding polychoric correlations is very slow, so think carefully before doing so.

factor.minres (deprecated) Minimum residual factor analysis is a least squares, iterative solution to the factor problem. minres attempts to minimize the residual (off-diagonal) correlation matrix. It produces solutions similar to maximum likelihood solutions, but will work even if the matrix is singular.

factor.pa (deprecated) Principal Axis factor analysis is a least squares, iterative solution to the factor problem. PA will work for cases where maximum likelihood techniques (factanal) will not work. The original communality estimates are either the squared multiple correlations (smc) for each item or 1.

factor.wls (deprecated) Weighted least squares factor analysis is a least squares, iterative solution to the factor problem. It minimizes the (weighted) squared residual matrix. The weights are based upon the independent contribution of each variable.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values. Note that PCA is not the same as factor analysis and the two should not be confused.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ² estimate.

nfactors combines VSS, MAP, and a number of other fit statistics. The depressing reality is that frequently these conventional fit estimates of the number of factors do not agree.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
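A minimal sketch of typical calls to these data reduction functions, using the built-in bfi data set (the choices of five dimensions and of the first 25 items are illustrative only):

library(psych)
f5 <- fa(bfi[1:25], nfactors = 5)               # minres factor analysis (the default)
p5 <- principal(bfi[1:25], nfactors = 5)        # principal components
ic <- iclust(cor(bfi[1:25], use = "pairwise"))  # hierarchical item clustering
fa.parallel(bfi[1:25])                          # compare observed and random eigen values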

4.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix R be approximated by the product of a factor matrix F and its transpose plus a diagonal matrix of uniquenesses?

R = FF′ + U²    (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych, all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
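A minimal sketch (not part of the original text) of checking how well a particular solution satisfies equation 1; for an oblique solution the factor intercorrelations Φ enter the reproduced matrix as FΦF′:

f3 <- fa(Thurstone, 3, n.obs = 213)                # oblique (oblimin) minres solution
Rhat <- f3$loadings %*% f3$Phi %*% t(f3$loadings)  # the common part, F Phi F'
diag(Rhat) <- diag(Rhat) + f3$uniquenesses         # add U^2 to the diagonal
round(Thurstone - Rhat, 2)                         # residuals should be near zero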

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 2). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", "varimin" and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ" and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation. The measures of factor adequacy reflect the


multiple correlations of the factors with the best fitting linear regression estimates of thefactor scores (Grice 2001)

Note that if extracting more than one factor and doing any oblique rotation, it is necessary to have the GPArotation package installed. This is checked for in the appropriate functions.

4.1.2 Principal Axis Factor Analysis

An alternative, least squares algorithm (included in fa with the fm="pa" option, or as a standalone function (factor.pa)) does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.

The solutions from the fa, the factor.minres and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax, quartimax, varimin, bifactor. Oblique transformations include oblimin, quartimin, biquartimin and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 3).

4.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 4).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 17). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 17) or by a structure diagram (Figure 18).

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 2: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> if(!require(GPArotation)) {stop("GPArotation must be installed to do rotations")} else {
+ f3t <- fa(Thurstone,3,n.obs=213)
+ f3t }

Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    MR1   MR2   MR3   h2   u2 com
Sentences          0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary         0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion    0.83  0.04  0.00 0.73 0.27 1.0
First.Letters      0.00  0.86  0.00 0.73 0.27 1.0
Four.Letter.Words -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes           0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series      0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees          0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group      -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> if(!require(GPArotation)) {stop("GPArotation must be installed to do rotations")} else {
+ f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
+ f3o <- target.rot(f3)
+ f3o }

Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                    PA1   PA2   PA3   h2   u2
Sentences          0.89 -0.03  0.07 0.81 0.19
Vocabulary         0.89  0.07  0.00 0.80 0.20
Sent.Completion    0.83  0.04  0.03 0.70 0.30
First.Letters     -0.02  0.85 -0.01 0.73 0.27
Four.Letter.Words -0.05  0.74  0.09 0.57 0.43
Suffixes           0.17  0.63 -0.09 0.43 0.57
Letter.Series     -0.06 -0.08  0.84 0.69 0.31
Pedigrees          0.33 -0.09  0.48 0.37 0.63
Letter.Group      -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


Table 4: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)

Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                    WLS1   WLS2   WLS3    h2    u2  com
Sentences          0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary         0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion    0.833  0.034  0.007 0.735 0.265 1.00
First.Letters     -0.002  0.855  0.003 0.731 0.269 1.00
Four.Letter.Words -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes           0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series      0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees          0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group      -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                                WLS1  WLS2  WLS3
Correlation of scores with factors             0.964 0.923 0.902
Multiple R square of scores with factors       0.929 0.853 0.814
Minimum correlation of possible factor scores  0.858 0.706 0.627


> plot(f3t)


Figure 17: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.


> fa.diagram(f3t)


Figure 18: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

4.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Some psychologists use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 5 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.
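The similarity of the component and factor solutions may be quantified; a minimal sketch (an illustration, not from the original text):

f3 <- fa(Thurstone, 3, n.obs = 213)  # common factor solution
p3 <- principal(Thurstone, 3)        # component solution
factor.congruence(f3, p3)            # cosines between the two sets of loadings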

4.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 18) can themselves be factored. This results in a higher order factor model (Figure 19). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 20). Yet another solution is to use structural equation modeling to directly solve for the general and group factors.
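A minimal sketch (not from the original text) of the Schmid-Leiman step by itself; omega wraps this logic, but schmid may also be called directly:

sl <- schmid(Thurstone, 3)  # Schmid-Leiman transformation of a 3 factor solution
round(sl$sl, 2)             # g loadings, residualized group loadings, h2 and u2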


Table 5: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 3. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p

Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    RC1   RC2   RC3   h2   u2 com
Sentences          0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary         0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion    0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters      0.01  0.84  0.07 0.78 0.22 1.0
Four.Letter.Words -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes           0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series      0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees          0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group      -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))


Figure 19: A higher order factor solution to the Thurstone 9 variable problem.


> om <- omega(Thurstone,n.obs=213)


Figure 20: A bifactor solution to the Thurstone 9 variable problem.


Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011).

4.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxonmetric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for, as has been pointed out by e.g. Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) reviews why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979). The algorithm proceeds as follows (a generic sketch follows the list):

1. Find the proximity (e.g., correlation) matrix,

2. Identify the most similar pair of items,

3. Combine this most similar pair of items to form a new variable (cluster),

4. Find the similarity of this cluster to all other items and clusters,

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust if there is a failure to increase reliability coefficients α or β),

6. Purify the solution by reassigning items to the most similar cluster center.
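As a generic illustration of steps 1-5 (using base R's hclust rather than iclust, so the reliability based stopping rule of step 5 is not shown):

r <- cor(bfi[1:25], use = "pairwise")  # step 1: the proximity matrix
d <- as.dist(1 - abs(r))               # highly correlated items are "close"
plot(hclust(d, method = "average"))    # steps 2-5: agglomerate and draw the tree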


iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 21). Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 8). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])


Figure 21: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.


Table 6: The summary statistics from an iclust analysis shows three large clusters and a smaller one.
> summary(ic)  #show the results

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


The previous analysis (Figure 21) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 22). Note that the first time, finding the polychoric correlations takes some time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using progressBar.

Table 7: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)
> r.poly <- polychoric(bfi[1:25])  #the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

4.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 9). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

Bootstrapped confidence intervals can also be found for the loadings of a factoring of a polychoric matrix. fa.poly will find the polychoric correlation matrix and, if the n.iter option is greater than 1, will then randomly resample the data (case wise) to give bootstrapped samples. This will take a long time for a large number of items or iterations.
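A minimal sketch of such a call (the item subset and the small number of iterations are illustrative only):

fp <- fa.poly(bfi[1:10], nfactors = 2, n.iter = 20)  # polychoric factoring with bootstrap
print(fp, digits = 2)  # loadings with their bootstrapped confidence bounds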

4.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of Burt's congruence coefficient (also known as Tucker's


> ic.poly <- iclust(r.poly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)


Figure 22: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 21), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)


Figure 23: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 22), which was done without specifying the number of clusters, and to the next one (Figure 24), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho,beta.size=3,title="ICLUST beta.size=3")


Figure 24: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 21, 22, 23).


Table 8: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.
> print(ic,cut=.3)

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit =  0.68   Pattern fit =  0.96  RMSR =  0.05
NULL


Table 9: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.
> fa(bfi[1:10],2,n.iter=20)

Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with Likelihood Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.006 and the 90 % confidence intervals are 0.006 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
    low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.44 -0.40 -0.37
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.07 -0.03  0.00  0.73  0.77  0.80
A4  0.10  0.15  0.20  0.40  0.44  0.48
A5 -0.02  0.02  0.06  0.57  0.62  0.67
C1  0.54  0.57  0.60 -0.09 -0.05 -0.02
C2  0.58  0.62  0.67 -0.04 -0.01  0.02
C3  0.48  0.54  0.58  0.00  0.03  0.08
C4 -0.69 -0.66 -0.60 -0.03  0.01  0.04
C5 -0.62 -0.57 -0.52 -0.08 -0.05 -0.02

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.27     0.32  0.37



coefficient), which is just the cosine of the angle between the dimensions:

c_{f_i f_j} = Σ_{k=1}^{n} f_{ik} f_{jk} / √(Σ f_{ik}² Σ f_{jk}²)

Consider the case of a four factor solution and four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 10).

Table 10: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.
> factor.congruence(list(f3t,f3o,om,p3p))

     MR1  MR2  MR3  PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01 1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00 0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52 0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51 0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06 1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02 0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

4.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).


Techniques most commonly used include

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples the eigen values of random factors will all tend towards 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigators creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.
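Several of these criteria may be examined in a single call; a minimal sketch (illustrative, not from the original text):

nfactors(bfi[1:25], n = 8)  # VSS, MAP and other indices for 1 to 8 factors
fa.parallel(bfi[1:25])      # parallel analysis of the same items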

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
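A minimal sketch of simulating such a structure (the parameter values here are arbitrary):

set.seed(42)
sm <- sim.minor(nvar = 12, nfact = 3, n = 500)  # 3 major factors plus many minor ones
fa.parallel(sm$observed)  # how many factors are now suggested?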


4.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 25). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")


Figure 25: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit  RMSEA  BIC   SABIC complex
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.0150 9648 10522.1     1.0
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.0100 5287  6084.5     1.2
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.0075 3200  3924.3     1.3
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.0055 1731  2385.1     1.4
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.0030  281   869.3     1.6
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.0018 -296   228.5     1.7
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.0013 -463     1.2     1.9
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.0010 -524  -117.6     1.9
  eChisq  SRMR eCRMS  eBIC
1  23881 0.119 0.125 21698
2  12432 0.086 0.094 10440
3   7232 0.066 0.075  5422
4   3750 0.047 0.057  2115
5   1495 0.030 0.038    27
6    670 0.020 0.027  -639
7    448 0.016 0.023  -711
8    289 0.013 0.020  -727

4.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 26). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly. By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")

Parallel analysis suggests that the number of factors = 6 and the number of components = 6


Figure 26: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors, and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


4.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 27).

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).

4.6 Exploratory Structural Equation Modeling (ESEM)

Generalizing the procedures of factor extension, we can do Exploratory Structural Equation Modeling (ESEM). Traditional Exploratory Factor Analysis (EFA) examines how latent variables can account for the correlations within a data set. All loadings and cross loadings are found and rotation is done to some approximation of simple structure. Traditional Confirmatory Factor Analysis (CFA) tests such models by fitting just a limited number of loadings and typically does not allow any (or many) cross loadings. Structural Equation Modeling then applies two such measurement models, one to a set of X variables, another to a set of Y variables, and then tries to estimate the correlation between these two sets of latent variables. (Some SEM procedures estimate all the parameters from the same model, thus making the loadings in set Y affect those in set X.) It is possible to do a similar, exploratory modeling (ESEM) by conducting two Exploratory Factor Analyses, one in set X, one in set Y, and then finding the correlations of the X factors with the Y factors, as well as the correlations of the Y variables with the X factors and the X variables with the Y factors.

Consider the simulated data set of three ability variables, two motivational variables, and three outcome variables:
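The population model shown below can be generated with sim.structural; the loading and correlation matrices in this sketch are reconstructed from the population correlations that follow, and should be read as an illustration rather than the exact code used:

fx <- matrix(c(0.9, 0.8, 0.6, 0, 0,      # loadings of V, Q, A, nach, Anx on X1
               0, 0, 0.6, 0.8, -0.7),    # and on X2 (filled by column)
             ncol = 2)
rownames(fx) <- c("V", "Q", "A", "nach", "Anx")
fy <- matrix(c(0.6, 0.5, 0.4), ncol = 1) # loadings of gpa, Pre, MA on Y1
rownames(fy) <- c("gpa", "Pre", "MA")
Phi <- matrix(c(1, 0, 0.7,               # X1 and X2 uncorrelated, each
                0, 1, 0.7,               # correlating 0.7 with the Y factor
                0.7, 0.7, 1), ncol = 3)
gre.gpa <- sim.structural(fx, Phi, fy)   # the population correlation matrix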

Call: sim.structural(fx = fx, Phi = Phi, fy = fy)

 $model (Population correlation matrix)
        V    Q     A  nach   Anx   gpa  Pre   MA
V    1.00 0.72  0.54  0.00  0.00  0.38 0.32 0.25
Q    0.72 1.00  0.48  0.00  0.00  0.34 0.28 0.22
A    0.54 0.48  1.00  0.48 -0.42  0.50 0.42 0.34
nach 0.00 0.00  0.48  1.00 -0.56  0.34 0.28 0.22


> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)


Figure 27: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


Anx  0.00 0.00 -0.42 -0.56  1.00 -0.29 -0.24 -0.20
gpa  0.38 0.34  0.50  0.34 -0.29  1.00  0.30  0.24
Pre  0.32 0.28  0.42  0.28 -0.24  0.30  1.00  0.20
MA   0.25 0.22  0.34  0.22 -0.20  0.24  0.20  1.00

 $reliability (population reliability)
   V    Q    A nach  Anx  gpa  Pre   MA
0.81 0.64 0.72 0.64 0.49 0.36 0.25 0.16

We can fit this by using the esem function and then draw the solution (see Figure 28) using the esem.diagram function (which is normally called automatically by esem).

Exploratory Structural Equation Modeling Analysis using method =  minres
Call: esem(r = gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
    n.obs = 1000, plot = FALSE)

 For the X set:
       MR1   MR2
V     0.91 -0.06
Q     0.81 -0.05
A     0.53  0.57
nach -0.10  0.81
Anx   0.08 -0.71

 For the Y set:
    MR1
gpa 0.6
Pre 0.5
MA  0.4

Correlations between the X and Y sets.
     X1   X2   Y1
X1 1.00 0.19 0.68
X2 0.19 1.00 0.67
Y1 0.68 0.67 1.00

The degrees of freedom for the null model are 56 and the empirical chi square function was 6930.29
The degrees of freedom for the model are 7 and the empirical chi square function was 21.83
with prob < 0.0027

The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.04

with the empirical chi square 21.83 with prob < 0.0027
The total number of observations was 1000 with fitted Chi Square = 2175.06 with prob < 0

Empirical BIC = -26.53
ESABIC = -4.29
Fit based upon off diagonal values = 1
To see the item loadings for the X and Y sets combined, and the associated fa output, print with short=FALSE.


Figure 28: An example of an Exploratory Structure Equation Model.


5 Classical Test Theory and Reliability

Surprisingly, 110 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and over estimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ₃ (Guttman, 1945) and may be found by

λ₃ = (n/(n−1)) (1 − tr(Vₓ)/Vₓ) = (n/(n−1)) (Vₓ − tr(Vₓ))/Vₓ = α
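A minimal worked example (not from the original text) computing λ₃ directly from a covariance matrix and comparing it with the alpha function; tr is psych's trace function, and the sum of all elements of the covariance matrix is the variance of the total score:

C <- cov(sim.congeneric(N = 500, short = FALSE)$observed)  # item covariance matrix
n <- ncol(C)
lambda3 <- (n/(n - 1)) * (1 - tr(C)/sum(C))  # sum(C) is the total test variance Vx
lambda3  # compare with the raw_alpha reported by alpha()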

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. Alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ₆ (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e²ⱼ, and is

λ₆ = 1 − (Σ e²ⱼ)/Vₓ = 1 − (Σ(1 − r²_smc))/Vₓ

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.
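A minimal sketch (not from the original text) of λ₆ computed with psych's smc function; for standardized items Vₓ is the sum of the elements of the correlation matrix, and the item subset here is illustrative:

R <- cor(bfi[1:5], use = "pairwise")     # five agreeableness items, for illustration
lambda6 <- 1 - sum(1 - smc(R)) / sum(R)  # 1 - (sum of error variances) / total variance
lambda6  # compare with the G6(smc) reported by alpha()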

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal, or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.


More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include the following (a short scoring example follows the list):

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ₀ ... μ₃), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha, coefficient G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
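A minimal sketch of multi-scale scoring with a keys list (the scale names and item selections are illustrative):

keys.list <- list(agree = c("-A1", "A2", "A3", "A4", "A5"),
                  consc = c("C1", "C2", "C3", "-C4", "-C5"))
keys <- make.keys(bfi, keys.list)  # build the -1/0/1 keying matrix
scores <- scoreItems(keys, bfi)    # scale scores, alpha, G6, scale intercorrelations
scores$alpha                       # the alpha of each scale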

5.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β,


McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure, and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reversed scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that items 2 and 6 (complaints and critical) are to be scored (incorrectly) negatively.

> alpha(attitude,keys=c("complaints","critical"))

Reliability analysis
Call: alpha(x = attitude, keys = c("complaints", "critical"))

  raw_alpha std.alpha G6(smc) average_r  S/N  ase mean  sd
       0.18      0.27    0.66      0.05 0.37 0.19   53 4.7

 lower alpha upper     95% confidence boundaries
-0.18 0.18 0.55

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r    S/N alpha se
rating         -0.023     0.090    0.52    0.0162  0.099    0.265
complaints-     0.689     0.666    0.76    0.2496  1.995    0.078
privileges     -0.133     0.021    0.58    0.0036  0.022    0.282
learning       -0.459    -0.251    0.41   -0.0346 -0.201    0.363
raises         -0.178    -0.062    0.47   -0.0098 -0.058    0.295
critical-       0.364     0.473    0.76    0.1299  0.896    0.139
advance        -0.137    -0.016    0.52   -0.0026 -0.016    0.258

 Item statistics
             n  raw.r  std.r r.cor r.drop mean   sd
rating      30  0.600  0.601  0.63   0.28   65 12.2
complaints- 30 -0.542 -0.559 -0.78  -0.75   50 13.3
privileges  30  0.679  0.663  0.57   0.39   53 12.2
learning    30  0.857  0.853  0.90   0.70   56 11.7
raises      30  0.719  0.730  0.77   0.50   65 10.4
critical-   30  0.015  0.036 -0.29  -0.27   42  9.9
advance     30  0.688  0.694  0.66   0.46   43 10.3


Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if items 2 and 6 are dropped, then the reliability is improved substantially. This suggests that items 2 and 6 were incorrectly scored. Doing the analysis again with the items positively scored produces much more favorable results.

> alpha(attitude)

Reliability analysis
Call: alpha(x = attitude)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
0.76 0.84 0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE)  #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
0.68 0.72 0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0


5.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 29) or omegaSem for a confirmatory analysis using the sem package, based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) at http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal).
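A minimal sketch of these choices, assuming the simulated correlation matrix r9 that is analyzed later in this section:

om.mr  <- omega(r9)            #the default: minimum residual factor analysis
om.pa  <- omega(r9, fm="pa")   #principal axes factor analysis
om.mle <- omega(r9, fm="mle")  #maximum likelihood, using factanal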

For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip=TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex).

β, an alternative to ωh, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).
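A minimal sketch, again assuming the r9 correlation matrix used below:

ic <- iclust(r9)   #hierarchical item cluster analysis
ic                 #the printed output includes beta, the worst split half reliability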

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √rab where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.
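A minimal sketch of these options, assuming the r9 matrix used below:

om2       <- omega(r9, nfactors=2)                  #the default: "equal" loadings
om2.first <- omega(r9, nfactors=2, option="first")  #equate g with the first group factor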

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness (u²) from factor analysis to find e²j. This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error).

Letting x = cg + Af + Ds + e, then the communality of item j, based upon general as well as group factors, is h²j = c²j + Σ f²ij, and the unique variance for the item, u²j = σ²j(1 − h²j), may be used to estimate the test reliability. That is, if h²j is the communality of item j, based upon general as well as group factors, then for standardized items, e²j = 1 − h²j and

   ωt = (1′cc′1 + 1′AA′1) / Vx = 1 − Σ(1 − h²j)/Vx = 1 − Σu²j/Vx

Because h²j ≥ r²smc, ωt ≥ λ6.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

   ωh = 1′cc′1 / Vx

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

   ωinf = 1′cc′1 / (1′cc′1 + 1′AA′1)

> om9 <- omega(r9, title="9 simulated variables")

9 simulated variables

V1

V3

V2

V7

V8

V9

V4

V5

V6

F1

05

04

04

F2

05

04

02

F304

03

03

g

07

06

06

05

04

03

06

05

04

Figure 29: A bifactor solution for 9 simulated variables with a hierarchical structure.

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.


> om9

9 simulated variables 
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8 
G.6:                   0.8 
Omega Hierarchical:    0.69 
Omega H asymptotic:    0.82 
Omega Total            0.83 

Schmid Leiman Factor loadings greater than  0.2 
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.72  0.45             0.72 0.28 0.71
V2 0.58  0.37             0.47 0.53 0.71
V3 0.57  0.41             0.49 0.51 0.66
V4 0.57              0.41 0.50 0.50 0.66
V5 0.55              0.32 0.41 0.59 0.73
V6 0.43              0.27 0.28 0.72 0.66
V7 0.47        0.47       0.44 0.56 0.50
V8 0.43        0.42       0.36 0.64 0.51
V9 0.32        0.23       0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3* 
2.48 0.52 0.47 0.36 

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13 
Explained Common Variance of the general factor =  0.65 

The degrees of freedom are 12  and the fit is  0.03 
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02 
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.001  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32 
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08 
The df corrected root mean square of the residuals is  0.09 
RMSEA index =  0.01  and the 90 % confidence intervals are  0.01 0.114
BIC =  -7.97 

Measures of factor score adequacy             
                                                  g   F1*   F2*   F3*
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                  g  F1*  F2*  F3*
Omega total for total scores and subscales     0.83 0.79 0.56 0.65
Omega general for total scores and subscales   0.69 0.55 0.30 0.46
Omega group for total scores and subscales     0.12 0.24 0.26 0.19


5.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9, n.obs=500, lavaan=FALSE)

Call: omegaSem(m = r9, n.obs = 500, lavaan = FALSE)
Omega 
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip, 
    digits = digits, title = title, sl = sl, labels = labels, 
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8 
G.6:                   0.8 
Omega Hierarchical:    0.69 
Omega H asymptotic:    0.82 
Omega Total            0.83 

Schmid Leiman Factor loadings greater than  0.2 
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.72  0.45             0.72 0.28 0.71
V2 0.58  0.37             0.47 0.53 0.71
V3 0.57  0.41             0.49 0.51 0.66
V4 0.57              0.41 0.50 0.50 0.66
V5 0.55              0.32 0.41 0.59 0.73
V6 0.43              0.27 0.28 0.72 0.66
V7 0.47        0.47       0.44 0.56 0.50
V8 0.43        0.42       0.36 0.64 0.51
V9 0.32        0.23       0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3* 
2.48 0.52 0.47 0.36 

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13 
Explained Common Variance of the general factor =  0.65 

The degrees of freedom are 12  and the fit is  0.03 
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02 
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.001  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32 
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08 
The df corrected root mean square of the residuals is  0.09 
RMSEA index =  0.01  and the 90 % confidence intervals are  0.01 0.114
BIC =  -7.97 

Measures of factor score adequacy             
                                                  g   F1*   F2*   F3*
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                  g  F1*  F2*  F3*
Omega total for total scores and subscales     0.83 0.79 0.56 0.65
Omega general for total scores and subscales   0.69 0.55 0.30 0.46
Omega group for total scores and subscales     0.12 0.24 0.26 0.19

 The following analyses were done using the  sem  package 

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83 
With loadings of 
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.74  0.40             0.71 0.29 0.77
V2 0.58  0.36             0.47 0.53 0.72
V3 0.56  0.42             0.50 0.50 0.63
V4 0.57              0.45 0.53 0.47 0.61
V5 0.58              0.25 0.40 0.60 0.84
V6 0.43              0.26 0.26 0.74 0.71
V7 0.49        0.38       0.38 0.62 0.63
V8 0.44        0.45       0.39 0.61 0.50
V9 0.34        0.24       0.17 0.83 0.68

With eigenvalues of:
   g  F1*  F2*  F3* 
2.60 0.47 0.40 0.33 

general/max  5.51   max/min =   1.41
mean percent general =  0.68    with sd =  0.1 and cv of  0.15 
Explained Common Variance of the general factor =  0.68 

Measures of factor score adequacy             
                                                  g   F1*   F2*   F3*
Correlation of scores with factors             0.87  0.59  0.59  0.55
Multiple R square of scores with factors       0.75  0.35  0.34  0.30
Minimum correlation of factor score estimates  0.50 -0.31 -0.31 -0.39

 Total, General and Subset omega for each subset
                                                  g  F1*  F2*  F3*
Omega total for total scores and subscales     0.83 0.79 0.57 0.65
Omega general for total scores and subscales   0.72 0.57 0.33 0.48
Omega group for total scores and subscales     0.11 0.22 0.24 0.18

To get the standard sem fit statistics, ask for summary on the fitted object.

5.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf and guttman functions. These are described in more detail in Revelle and Zinbarg (2009) and in Revelle and Condon (2014). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities  
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability  (beta)    =  0.72

5.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

5.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function and a list of items to be scored on each scale (a keys.list). Items may be listed by location (convenient but dangerous) or name (probably safer).

Make a keys.list by specifying the items for each scale, preceding items to be negatively keyed with a - sign.

> #the newer way is probably preferred 
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+       conscientious=c("C1","C2","C2","-C4","-C5"),
+       extraversion=c("-E1","-E2","E3","E4","E5"),
+       neuroticism=c("N1","N2","N3","N4","N5"),
+       openness = c("O1","-O2","O3","O4","-O5")) 
> #this can also be done by location-- 
> keys.list <- list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+       Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+       Openness = c(21,-22,23,24,-25))
> #These two approaches can be mixed if desired 
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),conscientious=c("C1","C2","C2","-C4","-C5"),
+       extraversion=c("-E1","-E2","E3","E4","E5"),
+       neuroticism=c(16:20),openness = c(21,-22,23,24,-25))

> keys.list

$agree
[1] "-A1" "A2"  "A3"  "A4"  "A5" 

$conscientious
[1] "C1"  "C2"  "C2"  "-C4" "-C5"

$extraversion
[1] "-E1" "-E2" "E3"  "E4"  "E5" 

$neuroticism
[1] 16 17 18 19 20

$openness
[1] 21 -22 23 24 -25

In the past (prior to version 1.6.9) the keys.list was then converted to a keys matrix using the helper function make.keys. This is no longer necessary. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
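A minimal sketch contrasting the two conventions, using the keys.list and bfi data from above:

scores.mean  <- scoreItems(keys.list, bfi)               #the default: average item score
scores.total <- scoreItems(keys.list, bfi, totals=TRUE)  #total item score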

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R). Although not normally necessary, show the keys to understand what is happening. There are two ways to make up the keys. You can specify the items by location (the old way) or by name (the newer and probably preferred way). To use the newer way you must specify the file on which you will use the keys. The example below shows how to construct keys either way.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. This is done automatically in the new way of forming keys, but if using the older way, the number must be specified. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
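A minimal sketch of this older approach (make.keys is no longer necessary, but still works):

keys <- make.keys(bfi, keys.list)  #the newer way: the number of variables is taken from bfi
#if the keys.list is given purely by location, the total number of variables
#(28 for bfi: 25 items plus 3 demographic variables) must be supplied explicitly
keys <- make.keys(28, keys.list)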

Then use this keys list to score the items:

> scores <- scoreItems(keys.list, bfi)
> scores

Call: scoreItems(keys = keys.list, items = bfi)

(Unstandardized) Alpha:
      agree conscientious extraversion neuroticism openness
alpha   0.7          0.69         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      agree conscientious extraversion neuroticism openness
ASE   0.014         0.016        0.013       0.011    0.017

Average item correlation:
          agree conscientious extraversion neuroticism openness
average.r  0.32          0.35         0.39        0.46     0.23

 Guttman 6* reliability: 
         agree conscientious extraversion neuroticism openness
Lambda.6   0.7          0.69         0.76        0.81      0.6

Signal/Noise based upon av.r: 
             agree conscientious extraversion neuroticism openness
Signal/Noise   2.3           2.2          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation 
 raw correlations below the diagonal, alpha on the diagonal 
 corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.70          0.36         0.63      -0.245     0.23
conscientious  0.25          0.69         0.37      -0.334     0.33
extraversion   0.46          0.27         0.76      -0.284     0.32
neuroticism   -0.18         -0.25        -0.22       0.812    -0.12
openness       0.15          0.21         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See figure 30).

5.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations and then use scoreItems:

> r.bfi <- cor(bfi, use="pairwise")
> scales <- scoreItems(keys.list, r.bfi)
> summary(scales)

Call: scoreItems(keys = keys.list, items = r.bfi)

Scale intercorrelations corrected for attenuation 
 raw correlations below the diagonal, (standardized) alpha on the diagonal 
 corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.71          0.35         0.64      -0.242     0.25
conscientious  0.24          0.69         0.38      -0.314     0.33
extraversion   0.47          0.27         0.76      -0.278     0.35
neuroticism   -0.18         -0.24        -0.22       0.815    -0.11
openness       0.16          0.22         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.

> png('scores.png')
> pairs.panels(scores$scores, pch='.', jiggle=TRUE)
> dev.off()

pdf 
  2 

Figure 30: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.

5.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice items where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys, iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

 item statistics 
         key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false 
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)  #compare to previous results

         vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
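A minimal sketch, assuming a hypothetical keys list that groups the 16 ability items into four content scales:

iq.keys.list <- list(reasoning=c("reason4","reason16","reason17","reason19"),
          letters=c("letter7","letter33","letter34","letter58"),
          matrices=c("matrix45","matrix46","matrix47","matrix55"),
          rotation=c("rotate3","rotate4","rotate6","rotate8"))
iq.scales <- scoreItems(iq.keys.list, iq.tf)  #score the dichotomous items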


5.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
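A minimal sketch of this sequence, using the first five (agreeableness) items of the bfi:

describe(bfi[1:5])                #basic descriptive statistics
fa.parallel(bfi[1:5])             #how many dimensions?
alpha(bfi[1:5], check.keys=TRUE)  #item whole correlations, reversing items if needed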

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 31).

5.6.1 Exploring the item structure of scales

The Big Five scales found above can be understood in terms of the item - whole correlations, but it is also useful to think of the endorsement frequency of the items. The item.lookup function will sort items by their factor loading/item-whole correlation, and then resort those above a certain threshold in terms of the item means. Item content is shown by using the dictionary developed for those items. This allows one to see the structure of each scale in terms of its endorsement range. This is a simple way of thinking of items that is also possible to do using the various IRT approaches discussed later.

> m <- colMeans(bfi, na.rm=TRUE)
> item.lookup(scales$item.corrected[,1:3], m, dictionary=bfi.dictionary[1:2])

     agree conscientious extraversion means ItemLabel
A1   -0.40         -0.06        -0.10  2.41     q_146
E1   -0.31         -0.07        -0.59  2.97     q_712
E2   -0.39         -0.26        -0.70  3.14     q_901
E3    0.45          0.21         0.61  4.00    q_1205
E4    0.51          0.24         0.68  4.42    q_1410
E5    0.35          0.40         0.56  4.42    q_1768
A5    0.62          0.22         0.55  4.56    q_1419
A3    0.70          0.22         0.48  4.60    q_1206
A4    0.49          0.30         0.30  4.70    q_1364
A2    0.67          0.21         0.40  4.80    q_1162
C4   -0.23         -0.67        -0.23  2.55     q_626
N4   -0.22         -0.31        -0.39  3.19    q_1479
C5   -0.26         -0.58        -0.29  3.30    q_1949
C3    0.21          0.56         0.15  4.30     q_619
C2    0.21          0.61         0.18  4.37     q_530
E5.1  0.35          0.40         0.56  4.42    q_1768
C1    0.13          0.54         0.20  4.50     q_124
E1.1 -0.31         -0.07        -0.59  2.97     q_712
E2.1 -0.39         -0.26        -0.70  3.14     q_901
N4.1 -0.22         -0.31        -0.39  3.19    q_1479
E3.1  0.45          0.21         0.61  4.00    q_1205
E4.1  0.51          0.24         0.68  4.42    q_1410


> data(iqitems)
> iq.keys <- c(4,4,4,6, 6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)


Figure 31: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


     agree conscientious extraversion means ItemLabel
E5.2  0.35          0.40         0.56  4.42    q_1768
O3    0.26          0.22         0.43  4.44     q_492
A5.1  0.62          0.22         0.55  4.56    q_1419
A3.1  0.70          0.22         0.48  4.60    q_1206
A2.1  0.67          0.21         0.40  4.80    q_1162
O1    0.17          0.21         0.33  4.82     q_128

     Item
A1   Am indifferent to the feelings of others.
E1   Don't talk a lot.
E2   Find it difficult to approach others.
E3   Know how to captivate people.
E4   Make friends easily.
E5   Take charge.
A5   Make people feel at ease.
A3   Know how to comfort others.
A4   Love children.
A2   Inquire about others' well-being.
C4   Do things in a half-way manner.
N4   Often feel blue.
C5   Waste my time.
C3   Do things according to a plan.
C2   Continue until everything is perfect.
E5.1 Take charge.
C1   Am exacting in my work.
E1.1 Don't talk a lot.
E2.1 Find it difficult to approach others.
N4.1 Often feel blue.
E3.1 Know how to captivate people.
E4.1 Make friends easily.
E5.2 Take charge.
O3   Carry the conversation to a higher level.
A5.1 Make people feel at ease.
A3.1 Know how to comfort others.
A2.1 Inquire about others' well-being.
O1   Am full of ideas.

5.6.2 Empirical scale construction

There are some situations where one wants to identify those items that most relate to a particular criterion. Although this will capitalize on chance and the results should be interpreted cautiously, it does give a feel for what is being measured. Consider the following example from the bfi data set. The items that best predicted gender, education, and age may be found using the bestScales function. This also shows the use of a dictionary that has the item content.

> data(bfi)
> bestScales(bfi, criteria=c("gender","education","age"), cut=.1,
      dictionary=bfi.dictionary[1:3])

The items most correlated with the criteria yield r's of 
          correlation n.items
gender           0.32       9
education        0.14       1
age              0.24       9

The best items, their correlations and content are 
$gender
  Rownames     gender ItemLabel                                      Item     Giant3
1       N5  0.2106171    q_1505                             Panic easily.  Stability
2       A2  0.1820202    q_1162         Inquire about others' well-being.   Cohesion
3       A1 -0.1571373     q_146 Am indifferent to the feelings of others.   Cohesion
4       A3  0.1401320    q_1206               Know how to comfort others.   Cohesion
5       A4  0.1261747    q_1364                            Love children.   Cohesion
6       E1 -0.1261018     q_712                         Don't talk a lot. Plasticity
7       N3  0.1207718    q_1099                Have frequent mood swings.  Stability
8       O1 -0.1031059     q_128                         Am full of ideas. Plasticity
9       A5  0.1007091    q_1419                 Make people feel at ease.   Cohesion

$education
  Rownames  education ItemLabel                                      Item   Giant3
1       A1 -0.1415734     q_146 Am indifferent to the feelings of others. Cohesion

$age
  Rownames        age ItemLabel                                      Item     Giant3
1       A1 -0.1609637     q_146 Am indifferent to the feelings of others.   Cohesion
2       C4 -0.1482659     q_626            Do things in a half-way manner.  Stability
3       A4  0.1442473    q_1364                            Love children.   Cohesion
4       A5  0.1290935    q_1419                 Make people feel at ease.   Cohesion
5       E5  0.1146922    q_1768                              Take charge. Plasticity
6       A2  0.1142767    q_1162         Inquire about others' well-being.   Cohesion
7       N3 -0.1108648    q_1099                Have frequent mood swings.  Stability
8       E2 -0.1051612     q_901     Find it difficult to approach others. Plasticity
9       N5 -0.1043322    q_1505                             Panic easily.  Stability

6 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty (or location) and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 32).


6.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

   δj = Dτj / √(1 − λj²)        αj = λj / √(1 − λj²)        (2)

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

   λj = αj / √(1 + αj²)         τj = δj / √(1 + αj²)
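A minimal sketch of equation 2 in R (loading2irt is a hypothetical helper name, not a psych function):

loading2irt <- function(lambda, tau, D=1.702) {
   alpha <- lambda/sqrt(1 - lambda^2)   #item discrimination
   delta <- D*tau/sqrt(1 - lambda^2)    #item difficulty
   list(discrimination=alpha, difficulty=delta)
}
loading2irt(lambda=.6, tau=.5)   #e.g., a loading of .6 and a threshold of .5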

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal") #dichotomous items
> test <- irt.fa(d9$items)
> test

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor =  1 
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2 
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09 
The df corrected root mean square of the residuals is  0.1 

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.043  and the 90 % confidence intervals are  0.043 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor =  1 
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05 
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04 
The df corrected root mean square of the residuals is  0.06 

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.009  and the 90 % confidence intervals are  0.009 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 33).


> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))


Figure 32: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt, type="IIC")


Figure 33: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


These procedures can be generalized to more than one factor by specifying the number of factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info, sort=TRUE)

Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor =  1 
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm, and these should be used for serious Item Response Theory analysis.

6.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
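A minimal sketch, assuming the d9 simulated items from the irt.fa example above:

test <- irt.fa(d9$items)    #slow: finds the tetrachoric correlations
om   <- omega(test$rho, 3)  #fast: reuses the saved tetrachoric matrix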

In addition, recent releases of psych take advantage of the parallel package and use multiple cores. The default for Macs and Unix machines is to use two cores, but this can be increased using the options command. The biggest improvement is from 1 to 2 cores, but for large problems using polychoric correlations, the more cores available, the better.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 36) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items. We can also show the total test information (merely the sum of the item information). This shows that even with just 16 items, the test is very reliable for most of the range of ability. The irt.fa function saves the correlation matrix and item statistics so that they can be redrawn with other options.

library(parallel)    #the parallel package supplies detectCores()
detectCores()        #how many cores are available
options("mc.cores")  #how many have been set to be used
options(mc.cores=4)  #set to use 4 cores

> iq.irt

Item Response Analysis using Factor Analysis 

Call: irt.fa(x = ability)
Item Response Analysis using Factor Analysis 

 Summary information by factor and item
 Factor =  1 
              -3    -2    -1     0     1     2     3
reason4     0.05  0.24  0.64  0.53  0.16  0.03  0.01
reason16    0.08  0.22  0.38  0.31  0.14  0.05  0.01
reason17    0.08  0.33  0.69  0.42  0.11  0.02  0.00
reason19    0.06  0.17  0.35  0.36  0.19  0.07  0.02
letter7     0.05  0.18  0.41  0.44  0.20  0.06  0.02
letter33    0.05  0.15  0.31  0.36  0.20  0.08  0.02
letter34    0.05  0.19  0.45  0.46  0.20  0.06  0.01
letter58    0.02  0.09  0.30  0.53  0.35  0.12  0.03
matrix45    0.05  0.11  0.19  0.23  0.17  0.09  0.04
matrix46    0.05  0.12  0.22  0.26  0.18  0.09  0.04
matrix47    0.06  0.17  0.33  0.34  0.19  0.07  0.02
matrix55    0.04  0.08  0.13  0.17  0.16  0.11  0.06
rotate3     0.00  0.01  0.06  0.30  0.75  0.49  0.12
rotate4     0.00  0.01  0.05  0.31  1.00  0.56  0.10
rotate6     0.01  0.03  0.15  0.53  0.69  0.25  0.05
rotate8     0.00  0.02  0.08  0.29  0.59  0.41  0.13
Test Info   0.67  2.11  4.73  5.83  5.28  2.55  0.69
SEM         1.22  0.69  0.46  0.41  0.44  0.63  1.20
Reliability -0.49 0.53  0.79  0.83  0.81  0.61 -0.45

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, 
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.91 
The number of observations was  1525  with Chi Square =  2893.45  with prob <  0

The root mean square of the residuals (RMSA) is  0.08 
The df corrected root mean square of the residuals is  0.09 

Tucker Lewis Index of factoring reliability =  0.723
RMSEA index =  0.018  and the 90 % confidence intervals are  0.018 0.137
BIC =  2131.15

6.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for scoreIrt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items per person sampled from scales with 10-20 items. Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The scoreIrt function applies basic IRT to this problem.


> iq.irt <- irt.fa(ability)


Figure 34: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


> plot(iq.irt, type="test")


Figure 35: A graphic analysis of 16 ability items sampled from the SAPA project. The total test information at all levels of difficulty may be shown by specifying the type='test' option in the plot function.


> om <- omega(iq.irt$rho, 4)


Figure 36: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.



Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9, 1000, -2, 2, mod="normal") #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- scoreIrt(test, items)
> unitweighted <- scoreIrt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 37.

6.3.1 1 versus 2 parameter IRT scoring

In Item Response Theory, items can be assumed to be equally discriminating but to differ in their difficulty (the Rasch model), or to vary in their discriminability. Two functions (scoreIrt.1pl and scoreIrt.2pl) are meant to find multiple IRT based scales using the Rasch model or the 2 parameter model. Both allow for negatively keyed as well as positively keyed items. Consider the bfi data set with scoring keys keys.list and items listed as an item.list. (This is the same as the keys.list, but with the negative signs removed.)

> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+       conscientious=c("C1","C2","C3","-C4","-C5"),
+       extraversion=c("-E1","-E2","E3","E4","E5"),
+       neuroticism=c("N1","N2","N3","N4","N5"),
+       openness = c("O1","-O2","O3","O4","-O5"))
> item.list <- list(agree=c("A1","A2","A3","A4","A5"),
+       conscientious=c("C1","C2","C3","C4","C5"),
+       extraversion=c("E1","E2","E3","E4","E5"),
+       neuroticism=c("N1","N2","N3","N4","N5"),
+       openness = c("O1","O2","O3","O4","O5"))
> bfi.1pl <- scoreIrt.1pl(keys.list, bfi)  #the one parameter solution
> bfi.2pl <- scoreIrt.2pl(item.list, bfi)  #the two parameter solution
> bfi.ctt <- scoreFast(keys.list, bfi)     #fast scoring function


> pairs.panels(scores.df, pch='.', gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)

Figure 37: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported, and then the IRT and total scoring systems.


We can compare these three ways of doing the analysis using the cor2 function, which correlates two separate data frames. All three models produce very similar results for the case of almost complete data. It is when we have massively missing completely at random data (MMCAR) that the results show the superiority of the irt scoring.

> #compare the solutions using the cor2 function
> cor2(bfi.1pl, bfi.ctt)

              agree conscientious extraversion neuroticism openness
agree          0.93          0.27         0.44       -0.20     0.19
conscientious  0.25          0.97         0.25       -0.22     0.17
extraversion   0.43          0.25         0.94       -0.23     0.23
neuroticism   -0.19         -0.23        -0.22        1.00    -0.09
openness       0.12          0.19         0.19       -0.07     0.95

> cor2(bfi.2pl, bfi.ctt)

              agree conscientious extraversion neuroticism openness
agree          0.98          0.25         0.49       -0.17     0.15
conscientious  0.26          0.97         0.25       -0.23     0.18
extraversion   0.43          0.24         0.94       -0.25     0.21
neuroticism   -0.19         -0.22        -0.18        0.98    -0.08
openness       0.17          0.21         0.27       -0.11     0.98

> cor2(bfi.2pl, bfi.1pl)

              agree conscientious extraversion neuroticism openness
agree          0.88          0.25         0.45       -0.17     0.12
conscientious  0.26          1.00         0.24       -0.23     0.16
extraversion   0.43          0.23         0.99       -0.25     0.20
neuroticism   -0.21         -0.21        -0.19        0.98    -0.07
openness       0.20          0.19         0.28       -0.11     0.92

7 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.


7.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

   rxy = ηxwg · ηywg · rxywg + ηxbg · ηybg · rxybg        (3)

where rxy is the normal correlation, which may be decomposed into a within group and a between group correlation, rxywg and rxybg, and η (eta) is the correlation of the data with the within group values, or the group means.

7.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5, and V8, and V3, V6, and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6, and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5, and V6, and V7, V8, and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)), or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
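A minimal sketch for the first case, assuming the sat.act data set from psych:

sb <- statsBy(sat.act, group="education", cors=TRUE)
sb$rwg   #the pooled within group correlations
sb$rbg   #the correlations of the group means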


7.3 Factor analysis by groups

Confirmatory factor analysis comparing the structures in multiple groups can be done in the lavaan package. However, for exploratory analyses of the structure within each of multiple groups, the faBy function may be used in combination with the statsBy function. First run statsBy with the correlation option set to TRUE, and then run faBy on the resulting output.

sb <- statsBy(bfi[c(1:25,27)], group="education", cors=TRUE)
faBy(sb, nfactors=5)  #find the 5 factor solution for each education level

8 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

   R² = 1 − ∏(1 − λi)   (product over i = 1 to n)

where λi is the ith eigenvalue of the eigenvalue decomposition of the matrix

   R = Rxx⁻¹ Rxy Ryy⁻¹ Rxy′

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized or raw β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.2458  -3.2133   0.7769   3.5921   9.2630 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110    
education    0.47890    0.15235   3.143  0.00174 ** 
age          0.01623    0.02278   0.712  0.47650    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272,    Adjusted R-squared:  0.02301 
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setCor:

> #compare with setCor
> setCor(c(4:6), c(1:3), C, n.obs=700)

Call: setCor(y = c(4:6), x = c(1:3), data = C, n.obs = 700)

Multiple Regression from matrix input 

 Beta weights 
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

 Multiple R 
 ACT SATV SATQ 
0.16 0.10 0.19 

 multiple R2 
   ACT   SATV   SATQ 
0.0272 0.0096 0.0359 

 Unweighted multiple R 
 ACT SATV SATQ 
0.15 0.05 0.11 

 Unweighted multiple R2 
 ACT SATV SATQ 
0.02 0.00 0.01 

 SE of Beta weights 
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

 t of Beta Weights 
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

 Probability of t < 
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

 Shrunken R2 
   ACT   SATV   SATQ 
0.0230 0.0054 0.0317 

 Standard Error of R2 
   ACT   SATV   SATQ 
0.0120 0.0073 0.0137 

 F 
 ACT SATV SATQ 
6.49 2.26 8.63 

 Probability of F < 
     ACT     SATV     SATQ 
2.48e-04 8.08e-02 1.24e-05 

 degrees of freedom of regression 
[1]   3 696

 Various estimates of between set correlations
Squared Canonical Correlations 
[1] 0.050 0.033 0.008

Chisq of canonical correlations 
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03
 Cohen's Set Correlation R2 =  0.09
 Shrunken Set Correlation R2 =  0.08
 F and df of Cohen's Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets =  0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.


9 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and rⁿ for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

P(x|θ_i, δ_j, γ_j, ζ_j) = γ_j + (ζ_j − γ_j) / (1 + e^(α_j(δ_j − θ_i)))    (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, γ = 0, ζ = 1, α = 1, and item difficulty is the only free parameter to specify.
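
Equation 4 can be transcribed directly into base R (a sketch, not a psych function; the default arguments give the Rasch case):

p4pl <- function(theta, delta, alpha = 1, gamma = 0, zeta = 1) {
  gamma + (zeta - gamma)/(1 + exp(alpha * (delta - theta)))   # Equation 4
}
p4pl(theta = seq(-3, 3, 1), delta = 0)   # a Rasch item of difficulty 0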

(Graphics of these may be seen in the demonstrations for the logistic function)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same.


With the α parameter set to 1.702 in the logistic model, the two models are practically identical.
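
This equivalence is easy to verify in base R (plogis is the logistic cumulative distribution function):

theta <- seq(-3, 3, 0.01)
max(abs(plogis(1.702 * theta) - pnorm(theta)))   # maximum discrepancy is less than .01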

10 Graphical Displays

Many of the functions in the psych package include graphic output and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)                   # a pretty boring plot
iclust.diagram(c)         # a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)                  # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.

11 Converting output to APA style tables using LaTeX

Although for most purposes using the Sweave or KnitR packages produces clean output, some prefer output pre formatted for APA style tables. This can be done using the xtable package for almost anything, but there are a few simple functions in psych for the most common tables. fa2latex will convert a factor analysis or components analysis output to a LaTeX table, cor2latex will take a correlation matrix and show the lower (or upper) diagonal, irt2latex converts the item statistics from the irt.fa function to more convenient LaTeX output, and finally, df2latex converts a generic data frame to LaTeX.

An example of converting the output from fa to LaTeX appears in Table 11.
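
A minimal sketch of such a call, using only default arguments, would be:

f3 <- fa(Thurstone, 3)
fa2latex(f3)   # writes the LaTeX source for a table like Table 11 to the console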


> xlim = c(0,10)
> ylim = c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels="left to right")
> dia.curve(ll$top,ul$bottom,"double",-1)   # for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")              # but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$top,scale=-1)
> dia.curved.arrow(mr,lr$top)


Figure 38: The basic graphic capabilities of the dia functions are shown in this figure.

Table 11: fa2latex: A factor analysis table from the psych package in R

Variable          MR1   MR2   MR3   h2   u2  com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.01
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.01
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.00
First.Letters    0.00  0.86  0.00 0.73 0.27 1.00
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.04
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.20
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.00
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.93
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.23

SS loadings      2.64  1.86  1.50

MR1              1.00  0.59  0.54
MR2              0.59  1.00  0.52
MR3              0.54  0.52  1.00

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score (see the sketch after this list).


geometric.mean and harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail: combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys or when forming example problems (see the sketch below).
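
Two small sketches of these helpers (the commented values are assumptions based on the formulas, e.g., Fisher z = atanh(r) = .5*log((1+r)/(1-r))):

fisherz(0.64)          # about 0.76, the same as atanh(0.64)
A <- matrix(1, 2, 2)
B <- matrix(2, 3, 3)
superMatrix(A, B)      # A on the top left, B on the lower right, 0s elsewhere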

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as McDonald (1999). The original data are from Thurstone and Thurstone (1941) and reanalyzed by Bechtoldt (1961). Personality item data representing five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.


Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtholdt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1984), this data set allows for examples of basic scaling techniques.


14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.5.0", package="psych")

15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings

> sessionInfo()
R Under development (unstable) (2017-01-04 r71889)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GPArotation_2014.11-1 psych_1.6.12

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      lattice_0.20-34  matrixcalc_1.0-3 MASS_7.3-45
 [5] grid_3.4.0       arm_1.8-6        nlme_3.1-128     stats4_3.4.0
 [9] coda_0.18-1      minqa_1.2.4      mi_1.0           nloptr_1.0.4
[13] Matrix_1.2-7.1   boot_1.3-18      splines_3.4.0    lme4_1.1-12
[17] sem_3.1-7        tools_3.4.0      foreign_0.8-67   abind_1.4-3
[21] parallel_3.4.0   compiler_3.4.0   mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel modeling in R (2.3): a brief introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2012). sem: Structural Equation Models.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, pages 1–13. 10.1007/s11336-011-9218-4.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1984). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

Preacher, K. J. and Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4):717–731.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2015). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. R package version 1.5.8.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W. and Condon, D. M. (2014). Reliability. In Irwing, P., Booth, T., and Hughes, D., editors, Wiley-Blackwell Handbook of Psychometric Testing. Wiley-Blackwell (in press).

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Smillie, L. D., Cooper, A., Wilt, J., and Revelle, W. (2012). Do extraverts get more bang for the buck? Refining the affective-reactivity hypothesis of extraversion. Journal of Personality and Social Psychology, 103(2):306–326.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2):245–251.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.



they are used for data description, dimension reduction, and scale construction. The extended user manual at psych_manual.pdf includes examples of graphic output and more extensive demonstrations than are found in the help menus. (Also available at http://personality-project.org/r/psych_manual.pdf.) The vignette, psych for sem, at psych_for_sem.pdf, discusses how to use psych as a front end to the sem package of John Fox (Fox et al., 2012). (The vignette is also available at http://personality-project.org/r/book/psych_for_sem.pdf.)

For a step by step tutorial in the use of the psych package and the base functions in R for basic personality research, see the guide for using R for personality research at http://personalitytheory.org/r/r.short.html. For an introduction to psychometric theory with applications in R, see the draft chapters at http://personality-project.org/r/book.

2 Getting started

Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa, factor.minres, factor.pa, factor.wls, or principal) or hierarchical factor models using omega or schmid is the GPArotation package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, but doing it this way is not necessary.

install.packages("ctv")
library(ctv)
task.views("Psychometrics")

The "Psychometrics" task view will install a large number of useful packages. To install the bare minimum for the examples in this vignette, it is necessary to install just 3 packages:

install.packages(c("GPArotation","mnormt"))

Because of the difficulty of installing the package Rgraphviz, alternative graphics have been developed and are available as diagram functions. If Rgraphviz is available, some functions will take advantage of it. An alternative is to use "dot" output of commands for any external graphics package that uses the dot language.


3 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions, it is necessary to make the package active by using the library command:

> library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function, e.g.,

.First <- function(x) {library(psych)}

3.1 Data input from the clipboard

There are of course many ways to enter data into R. Reading from a local file using read.table is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command:

> my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.

> my.data <- read.clipboard(sep="\t")   #define the tab option, or

> my.tab.data <- read.clipboard.tab()   #just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column, and the last 4 are 3 columns).

> my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))

3.2 Basic descriptive statistics

Once the data are read in, then describe or describeBy will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

describeBy reports descriptive statistics broken down by some categorizing variable (e.g., gender, age, etc.).

> library(psych)
> data(sat.act)
> describe(sat.act)   #basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41

These data may then be analyzed by groups defined in a logical statement or by some other variable, e.g., break down the descriptive data for males or females. These descriptive data can also be seen graphically using the error.bars.by function (Figure 5). By setting skew=FALSE and ranges=FALSE, the output is limited to the most basic statistics.


> #basic descriptive statistics by a grouping variable
> describeBy(sat.act,sat.act$gender,skew=FALSE,ranges=FALSE)

$`1`
          vars   n   mean     sd   se
gender       1 247   1.00   0.00 0.00
education    2 247   3.00   1.54 0.10
age          3 247  25.86   9.74 0.62
ACT          4 247  28.79   5.06 0.32
SATV         5 247 615.11 114.16 7.26
SATQ         6 245 635.87 116.02 7.41

$`2`
          vars   n   mean     sd   se
gender       1 453   2.00   0.00 0.00
education    2 453   3.26   1.35 0.06
age          3 453  25.45   9.37 0.44
ACT          4 453  28.42   4.69 0.22
SATV         5 453 610.66 112.31 5.28
SATQ         6 442 596.00 113.07 5.38

attr(,"call")
by.data.frame(data = x, INDICES = group, FUN = describe, type = type,
    skew = FALSE, ranges = FALSE)

The output from the describeBy function can be forced into a matrix form for easy analysis by other programs. In addition, describeBy can group by several grouping variables at the same time.

> sa.mat <- describeBy(sat.act,list(sat.act$gender,sat.act$education),
+                      skew=FALSE,ranges=FALSE,mat=TRUE)
> headTail(sa.mat)

        item group1 group2 vars   n   mean     sd    se
gender1    1      1      0    1  27      1      0     0
gender2    2      2      0    1  30      2      0     0
gender3    3      1      1    1  20      1      0     0
gender4    4      2      1    1  25      2      0     0
...      ...   <NA>   <NA>  ... ...    ...    ...   ...
SATQ9     69      1      4    6  51  635.9 104.12 14.58
SATQ10    70      2      4    6  86 597.59 106.24 11.46
SATQ11    71      1      5    6  46 657.83  89.61 13.21
SATQ12    72      2      5    6  93 606.72 105.55 10.95

3.2.1 Outlier detection using outlier

One way to detect unusual data is to consider how far each data point is from the multivariate centroid of the data. That is, find the squared Mahalanobis distance for each data point and then compare these to the expected values of χ2. This produces a Q-Q (quantile-quantile) plot with the n most extreme data points labeled (Figure 1). The outlier values are in the vector d2.


> png('outlier.png')
> d2 <- outlier(sat.act,cex=.8)
> dev.off()
null device
          1

Figure 1: Using the outlier function to graphically show outliers. The y axis is the Mahalanobis D2, the X axis is the distribution of χ2 for the same number of degrees of freedom. The outliers detected here may be shown graphically using pairs.panels (see Figure 2), and may be found by sorting d2.


3.2.2 Basic data cleaning using scrub

If, after describing the data, it is apparent that there were data entry errors that need to be globally replaced with NA, or only certain ranges of data will be analyzed, the data can be "cleaned" using the scrub function.

Consider a data set of 12 rows of 10 columns with values from 1 - 120. All values of columns 3 - 5 that are less than 30, 40, or 50 respectively, or greater than 70 in any of the three columns, will be replaced with NA. In addition, any value exactly equal to 45 will be set to NA. (max and isvalue are set to one value here, but they could be a different value for every column.)

> x <- matrix(1:120,ncol=10,byrow=TRUE)
> colnames(x) <- paste("V",1:10,sep="")
> new.x <- scrub(x,3:5,min=c(30,40,50),max=70,isvalue=45,newvalue=NA)
> new.x

       V1  V2 V3 V4 V5  V6  V7  V8  V9 V10
 [1,]   1   2 NA NA NA   6   7   8   9  10
 [2,]  11  12 NA NA NA  16  17  18  19  20
 [3,]  21  22 NA NA NA  26  27  28  29  30
 [4,]  31  32 33 NA NA  36  37  38  39  40
 [5,]  41  42 43 44 NA  46  47  48  49  50
 [6,]  51  52 53 54 55  56  57  58  59  60
 [7,]  61  62 63 64 65  66  67  68  69  70
 [8,]  71  72 NA NA NA  76  77  78  79  80
 [9,]  81  82 NA NA NA  86  87  88  89  90
[10,]  91  92 NA NA NA  96  97  98  99 100
[11,] 101 102 NA NA NA 106 107 108 109 110
[12,] 111 112 NA NA NA 116 117 118 119 120

Note that the number of subjects for those columns has decreased, and the minimums have gone up but the maximums down. Data cleaning and examination for outliers should be a routine part of any data analysis.

3.2.3 Recoding categorical variables into dummy coded variables

Sometimes categorical variables (e.g., college major, occupation, ethnicity) are to be analyzed using correlation or regression. To do this, one can form "dummy codes" which are merely binary variables for each category. This may be done using dummy.code, as shown below. Subsequent analyses using these dummy coded variables may use biserial or point biserial (regular Pearson r) correlations to show effect sizes, and may be plotted in e.g., spider plots.
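
A minimal sketch (the major variable here is hypothetical):

> major <- c("psych", "econ", "psych", "bio")
> dummy.code(major)   # returns one binary (0/1) column per category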

3.3 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries. By default, error.bars.by and error.bars will show "cats eyes" to graphically show the confidence limits (Figure 5). This may be turned off by specifying eyes=FALSE. densityBy or violinBy may be used to show the distribution of the data in "violin" plots (Figure 4). (These are sometimes called "lava-lamp" plots.)

3.3.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 2). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.'. (See Figure 2 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic. However, in this figure, to show the outliers, we use colors and a larger plot character.

Another example of pairs.panels is to show differences between experimental groups. Consider the data in the affect data set. The scores reflect post test scores on positive and negative affect and energetic and tense arousal. The colors show the results for four movie conditions: depressing, frightening movie, neutral, and a comedy.

3.3.2 Density or violin plots

Graphical presentation of data may be shown using box plots to show the median and 25th and 75th percentiles. A powerful alternative is to show the density distribution using the violinBy function (Figure 4).


> png('pairspanels.png')
> sat.d2 <- data.frame(sat.act,d2)   #combine the d2 statistics from before with the sat.act data.frame
> pairs.panels(sat.d2,bg=c("yellow","blue")[(d2 > 25)+1],pch=21)
> dev.off()
null device
          1

Figure 2: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. If the plot character were set to a period (pch='.') it would make a cleaner graphic, but in order to show the outliers in color we use the plot characters 21 and 22.


> png('affect.png')
> pairs.panels(affect[14:17],bg=c("red","black","white","blue")[affect$Film],pch=21,
+              main="Affect varies by movies")
> dev.off()
null device
          1

Figure 3: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. The coloring represents four different movie conditions.


> data(sat.act)
> violinBy(sat.act[5:6],sat.act$gender,grp.name=c("M","F"),main="Density Plot by gender for SAT V and Q")


Figure 4: Using the violinBy function to show the distribution of SAT V and Q for males and females. The plot shows the medians, the 25th and 75th percentiles, as well as the entire range and the density distribution.


3.3.3 Means and error bars

Additional descriptive graphics include the ability to draw error bars on sets of data, as well as to draw error bars in both the x and y directions for paired data. These are the functions error.bars, error.bars.by, error.bars.tab, and error.crosses.

error.bars show the 95% confidence intervals for each variable in a data frame or matrix. These errors are based upon normal theory and the standard errors of the mean. Alternative options include +/- one standard deviation or 1 standard error. If the data are repeated measures, the error bars will reflect the between variable correlations. By default, the confidence intervals are displayed using a "cats eyes" plot which emphasizes the distribution of confidence within the confidence interval.

error.bars.by does the same, but grouping the data by some condition.

error.bars.tab draws bar graphs from tabular data with error bars based upon the standard error of proportion (σ_p = √(pq/N)).

error.crosses draw the confidence intervals for an x set and a y set of the same size.

The use of the error.bars.by function allows for graphic comparisons of different groups (see Figure 5). Five personality measures are shown as a function of high versus low scores on a "lie" scale. People with higher lie scores tend to report being more agreeable, conscientious and less neurotic than people with lower lie scores. The error bars are based upon normal theory and thus are symmetric rather than reflecting any skewing in the data.

Although not recommended, it is possible to use the error.bars function to draw bar graphs with associated error bars. (This kind of dynamite plot (Figure 7) can be very misleading in that the scale is arbitrary. Go to a discussion of the problems in presenting data this way at http://emdbolker.wikidot.com/blog:dynamite.) In the example shown, note that the graph starts at 0, although that is out of the range. This is a function of using bars, which always are assumed to start at zero. Consider other ways of showing your data.

3.3.4 Error bars for tabular data

However, it is sometimes useful to show error bars for tabular data, either found by the table function or just directly input. These may be found using the error.bars.tab function.


> data(epi.bfi)
> error.bars.by(epi.bfi[6:10],epi.bfi$epilie<4)


Figure 5: Using the error.bars.by function shows that self reported personality scales on the Big Five Inventory vary as a function of the Lie scale on the EPI. The "cats eyes" show the distribution of the confidence.


> error.bars.by(sat.act[5:6],sat.act$gender,bars=TRUE,
+               labels=c("Male","Female"),ylab="SAT score",xlab="")


Figure 6: A "Dynamite plot" of SAT scores as a function of gender is one way of misleading the reader. By using a bar graph, the range of scores is ignored. Bar graphs start from 0.


> T <- with(sat.act,table(gender,education))
> rownames(T) <- c("M","F")
> error.bars.tab(T,way="both",ylab="Proportion of Education Level",xlab="Level of Education",
+                main="Proportion of sample by education level")


Figure 7: The proportion of each education level that is Male or Female. By using the way="both" option, the percentages and errors are based upon the grand total. Alternatively, way="columns" finds column wise percentages, way="rows" finds rowwise percentages. The data can be converted to percentages (as shown) or by total count (raw=TRUE). The function invisibly returns the probabilities and standard errors. See the help menu for an example of entering the data as a data.frame.


3.3.5 Two dimensional displays of means and errors

Yet another way to display data for different conditions is to use the errorCrosses function. For instance, the effect of various movies on both "Energetic Arousal" and "Tense Arousal" can be seen in one graph and compared to the same movie manipulations on "Positive Affect" and "Negative Affect". Note how Energetic Arousal is increased by three of the movie manipulations, but that Positive Affect increases following the Happy movie only.


> op <- par(mfrow=c(1,2))
> data(affect)
> colors <- c("black","red","white","blue")
> films <- c("Sad","Horror","Neutral","Happy")
> affect.stats <- errorCircles("EA2","TA2",data=affect[-c(1,20)],group="Film",labels=films,
+    xlab="Energetic Arousal", ylab="Tense Arousal",ylim=c(10,22),xlim=c(8,20),pch=16,
+    cex=2,colors=colors, main ="Movies effect on arousal")
> errorCircles("PA2","NA2",data=affect.stats,labels=films,xlab="Positive Affect",
+    ylab="Negative Affect", pch=16,cex=2,colors=colors, main ="Movies effect on affect")
> op <- par(mfrow=c(1,1))


Figure 8: The use of the errorCircles function allows for two dimensional displays of means and error bars. The first call to errorCircles finds descriptive statistics for the affect data.frame based upon the grouping variable of Film. These data are returned and then used by the second call, which examines the effect of the same grouping variable upon different measures. The size of the circles represent the relative sample sizes for each group. The data are from the PMC lab and reported in Smillie et al. (2012).


3.3.6 Back to back histograms

The bibars function summarizes the characteristics of two groups (e.g., males and females) on a second variable (e.g., age) by drawing back to back histograms (see Figure 9).

> data(bfi)
> with(bfi,bibars(age,gender,ylab="Age",main="Age by males and females"))


Figure 9: A bar plot of the age distribution for males and females shows the use of bibars. The data are males and females from 2800 cases collected using the SAPA procedure and are available as part of the bfi data set.


3.3.7 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.

> lowerCor(sat.act)

          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act,sat.act$gender==2)
> male <- subset(sat.act,sat.act$gender==1)
> lower <- lowerCor(male[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00

> upper <- lowerCor(female[-1])

          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00

> both <- lowerUpper(lower,upper)
> round(both,2)

          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:


> diffs <- lowerUpper(lower,upper,diff=TRUE)
> round(diffs,2)

          education   age  ACT  SATV  SATQ
education        NA  0.09 0.00 -0.05  0.05
age            0.61    NA 0.07 -0.03  0.13
ACT            0.16  0.15   NA  0.08  0.02
SATV           0.02 -0.06 0.61    NA  0.05
SATQ           0.08  0.04 0.60  0.68    NA

3.3.8 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 10), or a simulated data set of 24 variables with a circumplex structure (Figure 11). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

Yet another way to show structure is to use "spider" plots. Particularly if variables are ordered in some meaningful way (e.g., in a circumplex), a spider plot will show this structure easily. This is just a plot of the magnitude of the correlation as a radial line, with length ranging from 0 (for a correlation of -1) to 1 (for a correlation of 1). (See Figure 12.)

3.4 Testing correlations

Correlations are wonderful descriptive statistics of the data, but some people like to test whether these correlations differ from zero or differ from each other. The cor.test function (in the stats package) will test the significance of a single correlation, and the rcorr function in the Hmisc package will do this for many correlations. In the psych package, the corr.test function reports the correlation (Pearson, Spearman, or Kendall) between all variables in either one or two data frames or matrices, as well as the number of observations for each case and the (two-tailed) probability for each correlation. Unfortunately, these probability values have not been corrected for multiple comparisons and so should be taken with a great deal of salt. Thus, in corr.test and corr.p the raw probabilities are reported below the diagonal and the probabilities adjusted for multiple comparisons using (by default) the Holm correction are reported above the diagonal (Table 1). (See the p.adjust function for a discussion of Holm (1979) and other corrections.)

Testing the difference between any two correlations can be done using the r.test function. The function actually does four different tests (based upon an article by Steiger (1980)),


> png('corplot.png')
> corPlot(Thurstone, numbers=TRUE, upper=FALSE, diag=FALSE,
+         main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 10: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well. By default, the complete matrix is shown. Setting upper=FALSE and diag=FALSE shows a cleaner figure.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ, main="24 variables in a circumplex")
> dev.off()
null device
          1

Figure 11: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect. For circumplex structures it is perhaps useful to show the complete matrix.


> png('spider.png')
> op <- par(mfrow=c(2,2))
> spider(y=c(1,6,12,18), x=1:24, data=r.circ, fill=TRUE,
+        main="Spider plot of 24 circumplex variables")
> op <- par(mfrow=c(1,1))
> dev.off()
null device
          1

Figure 12: A spider plot can show circumplex structure very clearly. Circumplex structures are common in the study of affect.


Table 1: The corr.test function reports correlations, cell sizes, and raw and adjusted probability values. corr.p reports the probability values for a correlation matrix. By default, the adjustment used is that of Holm (1979).

> corr.test(sat.act)
Call:corr.test(x = sat.act)
Correlation matrix
          gender education   age   ACT  SATV  SATQ
gender      1.00      0.09 -0.02 -0.04 -0.02 -0.17
education   0.09      1.00  0.55  0.15  0.05  0.03
age        -0.02      0.55  1.00  0.11 -0.04 -0.03
ACT        -0.04      0.15  0.11  1.00  0.56  0.59
SATV       -0.02      0.05 -0.04  0.56  1.00  0.64
SATQ       -0.17      0.03 -0.03  0.59  0.64  1.00
Sample Size
          gender education age ACT SATV SATQ
gender       700       700 700 700  700  687
education    700       700 700 700  700  687
age          700       700 700 700  700  687
ACT          700       700 700 700  700  687
SATV         700       700 700 700  700  687
SATQ         687       687 687 687  687  687
Probability values (Entries above the diagonal are adjusted for multiple tests.)
          gender education  age  ACT SATV SATQ
gender      0.00      0.17 1.00 1.00    1    0
education   0.02      0.00 0.00 0.00    1    1
age         0.58      0.00 0.00 0.03    1    1
ACT         0.33      0.00 0.00 0.00    0    0
SATV        0.62      0.22 0.26 0.00    0    0
SATQ        0.00      0.36 0.37 0.00    0    0

To see confidence intervals of the correlations, print with the short=FALSE option.


depending upon the input:

1) For a sample size n, find the t and p value for a single correlation as well as the confidence interval.

> r.test(50, .3)
Correlation tests
Call:r.test(n = 50, r12 = 0.3)
Test of significance of a correlation
 t value 2.18    with probability < 0.034
 and confidence interval 0.02  0.53

2) For sample sizes of n and n2 (n2 = n if not specified), find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores.

> r.test(30, .4, .6)
Correlation tests
Call:r.test(n = 30, r12 = 0.4, r34 = 0.6)
Test of difference between two independent correlations
 z value 0.99    with probability  0.32

3) For sample size n, and correlations ra = r12, rb = r23 and r13 specified, test for the difference of two dependent correlations (Steiger case A).

> r.test(103, .4, .5, .1)
Correlation tests
Call:[1] "r.test(n = 103, r12 = 0.4, r23 = 0.1, r13 = 0.5)"
Test of difference between two correlated correlations
 t value -0.89    with probability < 0.37

4) For sample size n, test for the difference between two dependent correlations involving different variables (Steiger case B).

> r.test(103, .5, .6, .7, .5, .5, .8)   # Steiger Case B
Correlation tests
Call:r.test(n = 103, r12 = 0.5, r34 = 0.6, r23 = 0.7, r13 = 0.5, r14 = 0.5,
    r24 = 0.8)
Test of difference between two dependent correlations
 z value -1.2    with probability  0.23

To test whether a matrix of correlations differs from what would be expected if the population correlations were all zero, the function cortest follows Steiger (1980), who pointed out that the sum of the squared elements of a correlation matrix, or the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix). This is particularly useful for examining whether correlations in a single matrix differ from zero or for comparing two matrices. Although obvious, cortest can be used to test whether the sat.act data matrix produces non-zero correlations (it does). This is a much more appropriate test when testing whether a residual matrix differs from zero.

> cortest(sat.act)
Tests of correlation matrices
Call:cortest(R1 = sat.act)
 Chi Square value 1325.42  with df =  15   with probability < 1.8e-273

3.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process (Figure 13). This is also shown in terms of dichotomizing the bivariate normal density function using the draw.cor function (Figure 14). A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
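A minimal sketch of such a mixed analysis (the interface shown here, with continuous, polytomous, and dichotomous variables passed separately, is an assumption about the mixed.cor call, and the column choices are purely illustrative; newer releases rename this function mixedCor, so check the current help page):

> # Hedged sketch: treat age as continuous, the first five bfi items as
> # polytomous, and gender as dichotomous.
> mixed <- mixed.cor(x = bfi[, "age", drop=FALSE], p = bfi[, 1:5],
+                    d = bfi[, "gender", drop=FALSE])
> lowerMat(mixed$rho)   # display the resulting mixed correlation matrix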

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will sometimes happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
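A brief sketch of smoothing in action (the eigen value check shown here is an illustrative verification, not output from the text):

> # The burt correlation matrix (included in psych) has a small negative
> # eigen value; cor.smooth adjusts the eigen values to make them all
> # positive and rescales them to sum to the number of variables.
> burt.smooth <- cor.smooth(burt)
> min(eigen(burt)$values)          # slightly negative
> min(eigen(burt.smooth)$values)   # non-negative after smoothing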

3.6 Multiple regression from data or correlation matrices

The typical application of the lm function is to do a linear model of one Y variable as a function of multiple X variables. Because lm is designed to analyze complex interactions, it requires raw data as input. It is, however, sometimes convenient to do multiple regression from a correlation or covariance matrix. The setCor function will do this, taking a set of y variables predicted from a set of x variables, perhaps with a set of z covariates removed from both x and y. Consider the Thurstone correlation matrix and find the multiple correlation of the last five variables as a function of the first 4.

> setCor(y = 5:9, x = 1:4, data = Thurstone)


> draw.tetra()


Figure 13: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values.


> draw.cor(expand=20, cuts=c(0,0))


Figure 14: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values. It is found (laboriously) by optimizing the fit of the bivariate normal for various values of the correlation to the observed cell frequencies.


Call: setCor(y = 5:9, x = 1:4, data = Thurstone)

Multiple Regression from matrix input

Beta weights
                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sentences                    0.09     0.07          0.25      0.21         0.20
Vocabulary                   0.09     0.17          0.09      0.16        -0.02
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31

Multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.69              0.63              0.50              0.58              0.48

multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.48              0.40              0.25              0.34              0.23

Unweighted multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.59              0.58              0.49              0.58              0.45

Unweighted multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.34              0.34              0.24              0.33              0.20

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.6280 0.1478 0.0076 0.0049
Average squared canonical correlation =  0.2
Cohen's Set Correlation R2  =  0.69
Unweighted correlation between the two sets =  0.73

By specifying the number of subjects in the correlation matrix, appropriate estimates of standard errors, t-values, and probabilities are also found. The next example finds the regressions with variables 1 and 2 used as covariates. The β weights for variables 3 and 4 do not change, but the multiple correlation is much less. It also shows how to find the residual correlations between variables 5-9 with variables 1-4 removed.

> sc <- setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)
Call: setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)

Multiple Regression from matrix input

Beta weights
                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31

Multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.58              0.46              0.21              0.18              0.30

multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
            0.331             0.210             0.043             0.032             0.092

Unweighted multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.44              0.35              0.17              0.14              0.26

Unweighted multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.19              0.12              0.03              0.02              0.07

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.405 0.023
Average squared canonical correlation =  0.21
Cohen's Set Correlation R2  =  0.42
Unweighted correlation between the two sets =  0.48

> round(sc$residual, 2)
                  Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Four.Letter.Words              0.52     0.11          0.09      0.06         0.13
Suffixes                       0.11     0.60         -0.01      0.01         0.03
Letter.Series                  0.09    -0.01          0.75      0.28         0.37
Pedigrees                      0.06     0.01          0.28      0.66         0.20
Letter.Group                   0.13     0.03          0.37      0.20         0.77

3.7 Mediation and Moderation analysis

Although multiple regression is a straightforward method for determining the effect of multiple predictors (x1, x2, ... xi) on a criterion variable y, some prefer to think of the effect of one predictor, x, as mediated by another variable, m (Preacher and Hayes, 2004). Thus, we may find the indirect path from x to m, and then from m to y, as well as the direct path from x to y. Call these paths a, b, and c, respectively. Then the indirect effect of x


on y through m is just ab and the direct effect is c. Statistical tests of the ab effect are best done by bootstrapping.

Consider the example from Preacher and Hayes (2004) as analyzed using the mediate function and the subsequent graphic from mediate.diagram. The data are found in the example for mediate.

Call: mediate(y = 1, x = 2, m = 3, data = sobel)

The DV (Y) was SATIS. The IV (X) was THERAPY. The mediating variable(s) = ATTRIB.

Total Direct effect(c) of THERAPY on SATIS = 0.76   S.E. = 0.31  t direct = 2.5  with probability = 0.019
Direct effect (c') of THERAPY on SATIS removing ATTRIB = 0.43   S.E. = 0.32  t direct = 1.35  with probability = 0.19
Indirect effect (ab) of THERAPY on SATIS through ATTRIB = 0.33
Mean bootstrapped indirect effect = 0.33  with standard error = 0.17  Lower CI = 0.04  Upper CI = 0.71
R2 of model = 0.31

To see the longer output, specify short = FALSE in the print statement

Full output

Total effect estimates (c)
        SATIS   se   t   Prob
THERAPY  0.76 0.31 2.5 0.0186

Direct effect estimates (c')
        SATIS   se    t  Prob
THERAPY  0.43 0.32 1.35 0.190
ATTRIB   0.40 0.18 2.23 0.034

'a' effect estimates
       THERAPY  se    t   Prob
ATTRIB    0.82 0.3 2.74 0.0106

'b' effect estimates
       SATIS   se    t  Prob
ATTRIB   0.4 0.18 2.23 0.034

'ab' effect estimates
        SATIS boot   sd lower upper
THERAPY  0.33 0.33 0.17  0.04  0.71

4 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.


> mediate.diagram(preacher)


Figure 15: A mediated model taken from Preacher and Hayes (2004) and solved using the mediate function. The direct path from Therapy to Satisfaction has an effect of 0.76, while the indirect path through Attribution has an effect of 0.33. Compare this to the normal regression graphic created by setCor.diagram.


> preacher <- setCor(1, c(2,3), sobel, std=FALSE)
> setCor.diagram(preacher)


Figure 16: The conventional regression model for the Preacher and Hayes (2004) data set solved using the setCor function. Compare this to the previous figure.


4.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

2 Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
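A small sketch of this equivalence (illustrative, and assuming the sat.act data set with missing values removed): the unrotated principal component loadings of a standardized data matrix can be recovered from its singular value decomposition.

> # For standardized data X = U D V', the correlation matrix is
> # R = V D^2 V'/(n-1), so the component loadings are V D/sqrt(n-1).
> X <- scale(na.omit(sat.act))
> sv <- svd(X)
> round(sv$v[, 1:2] %*% diag(sv$d[1:2]/sqrt(nrow(X) - 1)), 2)  # from the SVD
> pc <- principal(na.omit(sat.act), nfactors=2, rotate="none")
> round(unclass(pc$loadings), 2)  # matches, up to the sign of each column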

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

fa.poly is useful when finding the factor structure of categorical items. fa.poly first finds the tetrachoric or polychoric correlations between the categorical variables and then proceeds to do a normal factor analysis. By setting the n.iter option to be greater than 1, it will also find confidence intervals for the factor solution. Warning: finding polychoric correlations is very slow, so think carefully before doing so.

factor.minres (deprecated) Minimum residual factor analysis is a least squares, iterative solution to the factor problem. minres attempts to minimize the residual (off-diagonal) correlation matrix. It produces solutions similar to maximum likelihood solutions, but will work even if the matrix is singular.

factor.pa (deprecated) Principal Axis factor analysis is a least squares, iterative solution to the factor problem. PA will work for cases where maximum likelihood techniques (factanal) will not work. The original communality estimates are either the squared multiple correlations (smc) for each item or 1.

factor.wls (deprecated) Weighted least squares factor analysis is a least squares, iterative solution to the factor problem. It minimizes the (weighted) squared residual matrix. The weights are based upon the independent contribution of each variable.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values. Note that PCA is not the same as factor analysis and the two should not be confused.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss (Very Simple Structure) (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

nfactors combines VSS, MAP, and a number of other fit statistics. The depressing reality is that frequently these conventional fit estimates of the number of factors do not agree.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and is thus probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm, specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

4.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix _nR_n be approximated by the product of a factor matrix _nF_k and its transpose plus a diagonal matrix of uniquenesses:

R = FF' + U^2                                                        (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
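As a quick sketch (using the built-in Thurstone correlation matrix and the 213 observations assumed throughout this section), the same model may be fit with each method by changing only the fm argument:

> # Fit the 3 factor model with three of the five methods; fm is the only
> # argument that changes.
> f.minres <- fa(Thurstone, 3, n.obs=213)         # the default, fm="minres"
> f.pa <- fa(Thurstone, 3, n.obs=213, fm="pa")    # principal axis
> f.ml <- fa(Thurstone, 3, n.obs=213, fm="ml")    # maximum likelihood
> factor.congruence(list(f.minres, f.pa, f.ml))   # compare the solutions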

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 2). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", "varimin" and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ" and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

Note that if extracting more than one factor and doing any oblique rotation, it is necessary to have the GPArotation package installed. This is checked for in the appropriate functions.

4.1.2 Principal Axis Factor Analysis

An alternative, least squares algorithm (included in fa with the fm="pa" option, or as a standalone function, factor.pa) does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.

The solutions from the fa, the factor.minres and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax, quartimax, varimin and bifactor. Oblique transformations include oblimin, quartimin, biquartimin and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 3).

4.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 4).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 17). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 17) or by a structure diagram (Figure 18).

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 2: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> if(!require(GPArotation)) {stop("GPArotation must be installed to do rotations")} else {
+   f3t <- fa(Thurstone, 3, n.obs=213)
+   f3t
+ }

Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    MR1   MR2   MR3   h2   u2 com
Sentences          0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary         0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion    0.83  0.04  0.00 0.73 0.27 1.0
First.Letters      0.00  0.86  0.00 0.73 0.27 1.0
Four.Letter.Words -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes           0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series      0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees          0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group      -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                                MR1  MR2  MR3
Correlation of scores with factors             0.96 0.92 0.90
Multiple R square of scores with factors       0.93 0.85 0.81
Minimum correlation of possible factor scores  0.86 0.71 0.63


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> if(!require(GPArotation)) {stop("GPArotation must be installed to do rotations")} else {
+   f3 <- fa(Thurstone, 3, n.obs = 213, fm="pa")
+   f3o <- target.rot(f3)
+   f3o
+ }

Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                    PA1   PA2   PA3   h2   u2
Sentences          0.89 -0.03  0.07 0.81 0.19
Vocabulary         0.89  0.07  0.00 0.80 0.20
Sent.Completion    0.83  0.04  0.03 0.70 0.30
First.Letters     -0.02  0.85 -0.01 0.73 0.27
Four.Letter.Words -0.05  0.74  0.09 0.57 0.43
Suffixes           0.17  0.63 -0.09 0.43 0.57
Letter.Series     -0.06 -0.08  0.84 0.69 0.31
Pedigrees          0.33 -0.09  0.48 0.37 0.63
Letter.Group      -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


Table 4: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone, 3, n.obs = 213, fm="wls")
> print(f3w, cut=0, digits=3)

Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                    WLS1   WLS2   WLS3    h2    u2  com
Sentences          0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary         0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion    0.833  0.034  0.007 0.735 0.265 1.00
First.Letters     -0.002  0.855  0.003 0.731 0.269 1.00
Four.Letter.Words -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes           0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series      0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees          0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group      -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                                WLS1  WLS2  WLS3
Correlation of scores with factors             0.964 0.923 0.902
Multiple R square of scores with factors       0.929 0.853 0.814
Minimum correlation of possible factor scores  0.858 0.706 0.627


> plot(f3t)


Figure 17: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.


> fa.diagram(f3t)


Figure 18: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

4.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Some psychologists use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 5 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.

4.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 18) can themselves be factored. This results in a higher order factor model (Figure 19). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 20). Yet another solution is to use structural equation modeling to directly solve for the general and group factors.


Table 5: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 3. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone, 3, n.obs = 213, rotate="Promax")
> p3p

Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    RC1   RC2   RC3   h2   u2 com
Sentences          0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary         0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion    0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters      0.01  0.84  0.07 0.78 0.22 1.0
Four.Letter.Words -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes           0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series      0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees          0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group      -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


> om.h <- omega(Thurstone, n.obs=213, sl=FALSE)
> op <- par(mfrow=c(1,1))


Figure 19: A higher order factor solution to the Thurstone 9 variable problem.


> om <- omega(Thurstone, n.obs=213)


Figure 20: A bifactor solution to the Thurstone 9 variable problem.


Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011).

4.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxonmetric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1. Find the proximity (e.g. correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or, in iclust, if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.


iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 21). Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 8). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])


Figure 21: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.


Table 6: The summary statistics from an iclust analysis show three large clusters and a smaller cluster.

> summary(ic)  # show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61

54

The previous analysis (Figure 21) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 22). Note that finding the polychoric correlations takes some time the first time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using progressBar.

Table 7: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

4.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ2 and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 9). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

Bootstrapped confidence intervals can also be found for the loadings of a factoring of a polychoric matrix. fa.poly will find the polychoric correlation matrix and, if the n.iter option is greater than 1, will then randomly resample the data (case wise) to give bootstrapped samples. This will take a long time for a large number of items or iterations.
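A minimal sketch (the small item subset and the 5 iterations are illustrative choices to keep the example fast; real analyses would use many more iterations):

> # Bootstrap a 2 factor polychoric solution for the first ten bfi items;
> # setting n.iter greater than 1 triggers the case wise resampling.
> fp <- fa.poly(bfi[1:10], nfactors=2, n.iter=5)
> fp   # printing shows the loadings with their bootstrapped intervals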

4.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can, in turn, be compared in terms of Burt's congruence coefficient (also known as Tucker's


> ic.poly <- iclust(r.poly$rho, title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)


Figure 22: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 21), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho, 5, title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)


Figure 23: ICLUST of the BFI data set using polychoric correlations, with the solution set to 5 clusters. Compare this solution to the previous one (Figure 22), which was done without specifying the number of clusters, and to the next one (Figure 24), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho, beta.size=3, title="ICLUST beta.size=3")


Figure 24: ICLUST of the BFI data set using polychoric correlations, with the beta criterion set to 3. Compare this solution to the previous three (Figures 21, 22, 23).


Table 8: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic, cut=.3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1  C20 C20
A2  C20 C20  0.59
A3  C20 C20  0.65
A4  C20 C20  0.43
A5  C20 C20  0.65
C1  C15 C15               0.54
C2  C15 C15               0.62
C3  C15 C15               0.54
C4  C15 C15        0.31 -0.66
C5  C15 C15 -0.30  0.36 -0.59
E1  C20 C20 -0.50
E2  C20 C20 -0.61  0.34
E3  C20 C20  0.59             -0.39
E4  C20 C20  0.66
E5  C20 C20  0.50        0.40 -0.32
N1  C16 C16         0.76
N2  C16 C16         0.75
N3  C16 C16         0.74
N4  C16 C16 -0.34   0.62
N5  C16 C16         0.55
O1  C20 C21                   -0.53
O2  C21 C21                    0.44
O3  C20 C21  0.39             -0.62
O4  C21 C21                   -0.33
O5  C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit =  0.68   Pattern fit =  0.96  RMSR =  0.05
NULL


Table 9: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter=20)

Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity =  1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with Likelihood Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.006 and the 90% confidence intervals are 0.006 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                                MR2  MR1
Correlation of scores with factors             0.86 0.88
Multiple R square of scores with factors       0.74 0.77
Minimum correlation of possible factor scores  0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.44 -0.40 -0.37
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.07 -0.03  0.00  0.73  0.77  0.80
A4  0.10  0.15  0.20  0.40  0.44  0.48
A5 -0.02  0.02  0.06  0.57  0.62  0.67
C1  0.54  0.57  0.60 -0.09 -0.05 -0.02
C2  0.58  0.62  0.67 -0.04 -0.01  0.02
C3  0.48  0.54  0.58  0.00  0.03  0.08
C4 -0.69 -0.66 -0.60 -0.03  0.01  0.04
C5 -0.62 -0.57 -0.52 -0.08 -0.05 -0.02

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.27     0.32  0.37


coefficient), which is just the cosine of the angle between the dimensions:

c(fi, fj) = Σ f_ik f_jk / √(Σ f_ik² Σ f_jk²),

where the sums are taken over the n loadings, k = 1, ..., n.
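A hedged illustration of this formula (the loading vectors here are hypothetical, not from the text):

> # Compute the cosine between two loading vectors by hand and compare
> # with factor.congruence.
> f1 <- c(.8, .7, .6, .1, .0)
> f2 <- c(.7, .6, .5, .2, .1)
> sum(f1*f2)/sqrt(sum(f1^2)*sum(f2^2))       # the congruence coefficient
> factor.congruence(matrix(f1), matrix(f2))  # the same value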

Consider the case of a four factor solution and a four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25], 4, fm="pa")
> factor.congruence(f4, ic)
      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 10).

Table 10: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))
     MR1  MR2  MR3  PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01 1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00 0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52 0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51 0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06 1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02 0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

4.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).


Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis), fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects, and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples the eigen values of random factors will all tend towards 1. The scree test is quite appealing, but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigators creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nevertheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
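A brief sketch of the problem (all parameter values are illustrative assumptions): simulate data with 3 major factors plus minor factors, then see how many factors parallel analysis suggests.

> # sim.minor adds small 'minor' factors to a 3 factor structure; with
> # n=500 simulated subjects, the misfit induced by the minor factors
> # will often lead criteria to suggest more than the 3 major factors.
> set.seed(17)   # arbitrary seed for reproducibility
> sim <- sim.minor(nvar=12, nfact=3, n=500)
> fa.parallel(sim$observed)   # assumes the sim.structure style output list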


4.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 25). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25], title="Very Simple Structure of a Big 5 inventory")


Figure 25: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58  with  4  factors
VSS complexity 2 achieves a maximum of 0.74  with  4  factors

The Velicer MAP achieves a minimum of 0.01  with  5  factors
BIC achieves a minimum of  -524.26  with  8  factors
Sample Size adjusted BIC achieves a minimum of  -117.56  with  8  factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit  RMSEA  BIC   SABIC complex
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.0150 9648 10522.1     1.0
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.0100 5287  6084.5     1.2
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.0075 3200  3924.3     1.3
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.0055 1731  2385.1     1.4
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.0030  281   869.3     1.6
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.0018 -296   228.5     1.7
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.0013 -463     1.2     1.9
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.0010 -524  -117.6     1.9
  eChisq  SRMR eCRMS  eBIC
1  23881 0.119 0.125 21698
2  12432 0.086 0.094 10440
3   7232 0.066 0.075  5422
4   3750 0.047 0.057  2115
5   1495 0.030 0.038    27
6    670 0.020 0.027  -639
7    448 0.016 0.023  -711
8    289 0.013 0.020  -727

4.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data, as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 26). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial, minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly. By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
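A minimal sketch of such an analysis (here x stands for any matrix of dichotomous or polytomous item responses; the object name is hypothetical):

> fa.parallel.poly(x, n.iter=20)   #parallel analysis using polychoric correlations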


> fa.parallel(bfi[1:25], main="Parallel Analysis of a Big 5 inventory")

Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Figure: “Parallel Analysis of a Big 5 inventory” — eigenvalues of principal components and factor analysis plotted against Factor/Component Number for the actual, simulated, and resampled data.]

Figure 26: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors, MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


4.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 27).

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).

4.6 Exploratory Structural Equation Modeling (ESEM)

Generalizing the procedures of factor extension, we can do Exploratory Structural Equation Modeling (ESEM). Traditional Exploratory Factor Analysis (EFA) examines how latent variables can account for the correlations within a data set. All loadings and cross loadings are found and rotation is done to some approximation of simple structure. Traditional Confirmatory Factor Analysis (CFA) tests such models by fitting just a limited number of loadings and typically does not allow any (or many) cross loadings. Structural Equation Modeling then applies two such measurement models, one to a set of X variables, another to a set of Y variables, and then tries to estimate the correlation between these two sets of latent variables. (Some SEM procedures estimate all the parameters from the same model, thus making the loadings in set Y affect those in set X.) It is possible to do a similar exploratory modeling (ESEM) by conducting two Exploratory Factor Analyses, one in set X, one in set Y, and then finding the correlations of the X factors with the Y factors, as well as the correlations of the Y variables with the X factors and the X variables with the Y factors.

Consider the simulated data set of three ability variables, two motivational variables, and three outcome variables:
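The generating code is not shown in the text; the following is a plausible reconstruction (the fx, Phi, and fy values are assumptions, chosen to reproduce the printed population matrix), sketching how sim.structural would be used:

> fx <- matrix(c(.9, .8, .6,  0,  0,
+                 0,  0, .6, .8, -.7), ncol=2)   #2 factors for the X set
> fy <- matrix(c(.6, .5, .4), ncol=1)            #1 factor for the Y set
> rownames(fx) <- c("V", "Q", "A", "nach", "Anx")
> rownames(fy) <- c("gpa", "Pre", "MA")
> Phi <- matrix(c( 1,  0, .7,
+                  0,  1, .7,
+                 .7, .7,  1), ncol=3)           #both X factors correlate .7 with the Y factor
> gre.gpa <- sim.structural(fx, Phi, fy)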

Call: sim.structural(fx = fx, Phi = Phi, fy = fy)

$model (Population correlation matrix)
        V    Q     A  nach   Anx   gpa   Pre    MA
V    1.00 0.72  0.54  0.00  0.00  0.38  0.32  0.25
Q    0.72 1.00  0.48  0.00  0.00  0.34  0.28  0.22
A    0.54 0.48  1.00  0.48 -0.42  0.50  0.42  0.34
nach 0.00 0.00  0.48  1.00 -0.56  0.34  0.28  0.22


> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[,s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe = fe)

[Figure: “Factor analysis and extension” — path diagram showing the two factors (MR1, MR2) defined by the odd numbered variables on the left, extended to the even numbered variables on the right, with loadings of roughly ±0.5 to ±0.7.]

Figure 27: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


Anx  0.00 0.00 -0.42 -0.56  1.00 -0.29 -0.24 -0.20
gpa  0.38 0.34  0.50  0.34 -0.29  1.00  0.30  0.24
Pre  0.32 0.28  0.42  0.28 -0.24  0.30  1.00  0.20
MA   0.25 0.22  0.34  0.22 -0.20  0.24  0.20  1.00

$reliability (population reliability)
   V    Q    A nach  Anx  gpa  Pre   MA
0.81 0.64 0.72 0.64 0.49 0.36 0.25 0.16

We can fit this by using the esem function and then draw the solution (see Figure 28) using the esem.diagram function (which is normally called automatically by esem).
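Judging from the Call line printed below, the fitting code would have been of the form (gre.gpa is the simulated structure sketched above):

> esem.fit <- esem(gre.gpa$model, varsX=1:5, varsY=6:8, nfX=2, nfY=1,
+                  n.obs=1000, plot=FALSE)
> esem.fit
> esem.diagram(esem.fit)   #draws Figure 28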

Exploratory Structural Equation Modeling Analysis using method = minres

Call: esem(r = gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
    n.obs = 1000, plot = FALSE)

For the 'X' set:
      MR1   MR2
V    0.91 -0.06
Q    0.81 -0.05
A    0.53  0.57
nach -0.10  0.81
Anx  0.08 -0.71

For the 'Y' set:
    MR1
gpa 0.6
Pre 0.5
MA  0.4

Correlations between the X and Y sets.
     X1   X2   Y1
X1 1.00 0.19 0.68
X2 0.19 1.00 0.67
Y1 0.68 0.67 1.00

The degrees of freedom for the null model are 56 and the empirical chi square function was 6930.29
The degrees of freedom for the model are 7 and the empirical chi square function was 21.83
with prob < 0.0027
The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.04


with the empirical chi square 21.83 with prob < 0.0027
The total number of observations was 1000 with fitted Chi Square = 2175.06 with prob < 0
Empirical BIC = -26.53
ESABIC = -4.29
Fit based upon off diagonal values = 1

To see the item loadings for the X and Y sets combined, and the associated fa output, print with short=FALSE.

[Figure: “Exploratory Structural Model” — path diagram with X factors MR1 (V, Q; loadings 0.9, 0.8) and MR2 (A, nach, Anx; loadings 0.6, 0.8, -0.7) predicting the Y factor MR1 (gpa, Pre, MA; loadings 0.6, 0.5, 0.4), with structural paths of 0.7 from each X factor.]

Figure 28: An example of an Exploratory Structural Equation Model.


5 Classical Test Theory and Reliability

Surprisingly, 110 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and over estimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V}_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha$$

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. Alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is “lumpy”) coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum(1 - r_{smc}^2)}{V_x}$$
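As a check on the formula, λ6 may be computed directly from a correlation matrix using the smc function (a minimal sketch; for routine work use the alpha, guttman, or splitHalf functions):

> lambda6 <- function(R) { 1 - sum(1 - smc(R))/sum(R) }   #R is a correlation matrix
> lambda6(cor(attitude))   #should approximate the G6(smc) reported by alpha(attitude)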

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.


More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. “Standardized” alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item - whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 . . . μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha, coefficient G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

5.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates when one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500, raw=TRUE)$observed
> round(cor(r9), 2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that items 2 and 6 (complaints and critical) are to be scored (incorrectly) negatively.

> alpha(attitude, keys=c("complaints","critical"))

Reliability analysis
Call: alpha(x = attitude, keys = c("complaints", "critical"))

  raw_alpha std.alpha G6(smc) average_r  S/N  ase mean  sd
       0.18      0.27    0.66      0.05 0.37 0.19   53 4.7

 lower alpha upper     95% confidence boundaries
 -0.18  0.18  0.55

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r    S/N alpha se
rating         -0.023     0.090    0.52    0.0162  0.099    0.265
complaints-     0.689     0.666    0.76    0.2496  1.995    0.078
privileges     -0.133     0.021    0.58    0.0036  0.022    0.282
learning       -0.459    -0.251    0.41   -0.0346 -0.201    0.363
raises         -0.178    -0.062    0.47   -0.0098 -0.058    0.295
critical-       0.364     0.473    0.76    0.1299  0.896    0.139
advance        -0.137    -0.016    0.52   -0.0026 -0.016    0.258

 Item statistics
             n  raw.r  std.r r.cor r.drop mean   sd
rating      30  0.600  0.601  0.63   0.28   65 12.2
complaints- 30 -0.542 -0.559 -0.78  -0.75   50 13.3
privileges  30  0.679  0.663  0.57   0.39   53 12.2
learning    30  0.857  0.853  0.90   0.70   56 11.7
raises      30  0.719  0.730  0.77   0.50   65 10.4
critical-   30  0.015  0.036 -0.29  -0.27   42  9.9
advance     30  0.688  0.694  0.66   0.46   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if items 2 and 6 are dropped, then the reliability is improved substantially. This suggests that items 2 and 6 were incorrectly scored. Doing the analysis again with the items positively scored produces much more favorable results.

> alpha(attitude)

Reliability analysis
Call: alpha(x = attitude)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500, short=FALSE, low=-2, high=2, categorical=TRUE)  #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0


5.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 29) or omegaSem for a confirmatory analysis using the sem package based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf .)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.
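The schmid function carries out the Schmid-Leiman step directly, and omega does the entire sequence; a minimal sketch using the r9 data from above:

> sl <- schmid(cor(r9), nfactors=3)   #Schmid-Leiman transformation of an oblique solution
> round(sl$sl, 2)                     #general factor loadings in the g column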

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm=“pa” does a principal axes factor analysis (factor.pa), fm=“mle” uses the factanal function, and fm=“pc” does a principal components analysis (principal).

For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that “A recommendation that should be heeded regardless of the method chosen to estimate ωh is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here.” (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be “equal” (which will be the $\sqrt{r_{ab}}$ where $r_{ab}$ is the oblique correlation between the factors) or to “first” or “second” in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option=“first” or option=“second” to the call.

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness $u^2$ from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, $V_x$, into four parts: that due to a general factor, $\vec{g}$, that due to a set of group factors, $\vec{f}$ (factors common to some but not all of the items), specific factors, $\vec{s}$, unique to each item, and $\vec{e}$, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting $\vec{x} = \vec{c}g + \vec{A}f + \vec{D}s + \vec{e}$, then the communality of item j, based upon general as well as group factors, $h_j^2 = c_j^2 + \sum f_{ij}^2$, and the unique variance for the item, $u_j^2 = \sigma_j^2(1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}$$

Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor.

$$\omega_h = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{\inf} = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}$$

> om9 <- omega(r9, title="9 simulated variables")

9 simulated variables

V1

V3

V2

V7

V8

V9

V4

V5

V6

F1

05

04

04

F2

05

04

02

F304

03

03

g

07

06

06

05

04

03

06

05

04

Figure 29 A bifactor solution for 9 simulated variables with a hierarchical structure

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.


> om9
9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.001  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09
RMSEA index =  0.01  and the 90 % confidence intervals are  0.01 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                  g   F1*   F2*   F3*
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                  g  F1*  F2*  F3*
Omega total for total scores and subscales     0.83 0.79 0.56 0.65
Omega general for total scores and subscales   0.69 0.55 0.30 0.46
Omega group for total scores and subscales     0.12 0.24 0.26 0.19


5.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9, n.obs=500, lavaan=FALSE)

Call: omegaSem(m = r9, n.obs = 500, lavaan = FALSE)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.001  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09
RMSEA index =  0.01  and the 90 % confidence intervals are  0.01 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                  g   F1*   F2*   F3*
Correlation of scores with factors             0.84  0.59  0.61  0.54
Multiple R square of scores with factors       0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates  0.43 -0.30 -0.25 -0.41

Total, General and Subset omega for each subset
                                                  g  F1*  F2*  F3*
Omega total for total scores and subscales     0.83 0.79 0.56 0.65
Omega general for total scores and subscales   0.69 0.55 0.30 0.46
Omega group for total scores and subscales     0.12 0.24 0.26 0.19

The following analyses were done using the sem package

Omega Hierarchical from a confirmatory model using sem = 0.72
Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.74 0.40           0.71 0.29 0.77
V2 0.58 0.36           0.47 0.53 0.72
V3 0.56 0.42           0.50 0.50 0.63
V4 0.57      0.45      0.53 0.47 0.61
V5 0.58      0.25      0.40 0.60 0.84
V6 0.43      0.26      0.26 0.74 0.71
V7 0.49           0.38 0.38 0.62 0.63
V8 0.44           0.45 0.39 0.61 0.50
V9 0.34           0.24 0.17 0.83 0.68

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33

general/max  5.51   max/min =   1.41
mean percent general =  0.68    with sd =  0.1 and cv of  0.15
Explained Common Variance of the general factor =  0.68

Measures of factor score adequacy
                                                  g   F1*   F2*   F3*
Correlation of scores with factors             0.87  0.59  0.59  0.55
Multiple R square of scores with factors       0.75  0.35  0.34  0.30
Minimum correlation of factor score estimates  0.50 -0.31 -0.31 -0.39

Total, General and Subset omega for each subset
                                                  g  F1*  F2*  F3*
Omega total for total scores and subscales     0.83 0.79 0.57 0.65
Omega general for total scores and subscales   0.72 0.57 0.33 0.48
Omega group for total scores and subscales     0.11 0.22 0.24 0.18

To get the standard sem fit statistics, ask for summary on the fitted object.

5.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf and guttman functions. These are described in more detail in Revelle and Zinbarg (2009) and in Revelle and Condon (2014). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72

5.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

5.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function and a list of items to be scored on each scale (a keys.list). Items may be listed by location (convenient but dangerous), or name (probably safer).

Make a keys.list by specifying the items for each scale, preceding items to be negatively keyed with a - sign.

> #the newer way is probably preferred
>
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+    conscientious=c("C1","C2","C2","-C4","-C5"),
+    extraversion=c("-E1","-E2","E3","E4","E5"),
+    neuroticism=c("N1","N2","N3","N4","N5"),
+    openness = c("O1","-O2","O3","O4","-O5"))
> #this can also be done by location--
> keys.list <- list(Agree=c(-1,2:5),Conscientious=c(6:8,-9,-10),
+    Extraversion=c(-11,-12,13:15),Neuroticism=c(16:20),
+    Openness = c(21,-22,23,24,-25))
> #These two approaches can be mixed if desired
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),conscientious=c("C1","C2","C2","-C4","-C5"),
+    extraversion=c("-E1","-E2","E3","E4","E5"),
+    neuroticism=c(16:20),openness = c(21,-22,23,24,-25))
> keys.list

$agree
[1] "-A1" "A2"  "A3"  "A4"  "A5"

$conscientious
[1] "C1"  "C2"  "C2"  "-C4" "-C5"

$extraversion
[1] "-E1" "-E2" "E3"  "E4"  "E5"

$neuroticism
[1] 16 17 18 19 20

$openness
[1] 21 -22 23 24 -25

In the past (prior to version 1.6.9), the keys.list was then converted to a keys matrix using the helper function make.keys. This is no longer necessary. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening. There are two ways to make up the keys. You can specify the items by location (the old way) or by name (the newer and probably preferred way). To use the newer way you must specify the file on which you will use the keys. The example below shows how to construct keys either way.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. This is done automatically in the new way of forming keys, but if using the older way, the number must be specified. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
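For reference, the older approach looked something like this (a sketch; as noted above, make.keys is no longer required):

> keys <- make.keys(nvars=28, keys.list=keys.list, item.labels=colnames(bfi))
> scores <- scoreItems(keys, bfi)   #equivalent to scoring with the keys.list directly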

Then use this keys list to score the items:

> scores <- scoreItems(keys.list, bfi)
> scores

Call: scoreItems(keys = keys.list, items = bfi)

(Unstandardized) Alpha:
      agree conscientious extraversion neuroticism openness
alpha   0.7          0.69         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    agree conscientious extraversion neuroticism openness
ASE 0.014         0.016        0.013       0.011    0.017

Average item correlation:
          agree conscientious extraversion neuroticism openness
average.r  0.32          0.35         0.39        0.46     0.23

Guttman 6* reliability:
         agree conscientious extraversion neuroticism openness
Lambda.6   0.7          0.69         0.76        0.81      0.6

Signal/Noise based upon av.r:
             agree conscientious extraversion neuroticism openness
Signal/Noise   2.3           2.2          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.70          0.36         0.63      -0.245     0.23
conscientious  0.25          0.69         0.37      -0.334     0.33
extraversion   0.46          0.27         0.76      -0.284     0.32
neuroticism   -0.18         -0.25        -0.22       0.812    -0.12
openness       0.15          0.21         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data
print with the short option = FALSE

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 30.)

5.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use scoreItems.

> r.bfi <- cor(bfi, use="pairwise")
> scales <- scoreItems(keys.list, r.bfi)
> summary(scales)

Call: scoreItems(keys = keys.list, items = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.71          0.35         0.64      -0.242     0.25
conscientious  0.24          0.69         0.38      -0.314     0.33
extraversion   0.47          0.27         0.76      -0.278     0.35
neuroticism   -0.18         -0.24        -0.22       0.815    -0.11
openness       0.16          0.22         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the “structure” matrix) or the correlations of the items controlling for the other scales (the “pattern” matrix), use the cluster.loadings function.


> png('scores.png')
> pairs.panels(scores$scores, pch='.', jiggle=TRUE)
> dev.off()

pdf
  2

Figure 30: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 items scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were “jittered” by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.

5.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys, iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
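For example, the true/false ability items could be formed into rational subscales (a sketch; this two-scale grouping is an assumption for illustration, not a validated key):

> ability.keys <- list(verbal=c(1:8), spatial=c(9:16))   #reason/letter vs. matrix/rotate
> ability.scales <- scoreItems(ability.keys, iq.tf)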


5.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 31).

5.6.1 Exploring the item structure of scales

The Big Five scales found above can be understood in terms of the item - whole correlations, but it is also useful to think of the endorsement frequency of the items. The item.lookup function will sort items by their factor loading/item-whole correlation, and then resort those above a certain threshold in terms of the item means. Item content is shown by using the dictionary developed for those items. This allows one to see the structure of each scale in terms of its endorsement range. This is a simple way of thinking of items that is also possible to do using the various IRT approaches discussed later.

> m <- colMeans(bfi, na.rm=TRUE)
> item.lookup(scales$item.corrected[,1:3], m, dictionary=bfi.dictionary[1:2])

     agree conscientious extraversion means ItemLabel
A1   -0.40         -0.06        -0.10  2.41     q_146
E1   -0.31         -0.07        -0.59  2.97     q_712
E2   -0.39         -0.26        -0.70  3.14     q_901
E3    0.45          0.21         0.61  4.00    q_1205
E4    0.51          0.24         0.68  4.42    q_1410
E5    0.35          0.40         0.56  4.42    q_1768
A5    0.62          0.22         0.55  4.56    q_1419
A3    0.70          0.22         0.48  4.60    q_1206
A4    0.49          0.30         0.30  4.70    q_1364
A2    0.67          0.21         0.40  4.80    q_1162
C4   -0.23         -0.67        -0.23  2.55     q_626
N4   -0.22         -0.31        -0.39  3.19    q_1479
C5   -0.26         -0.58        -0.29  3.30    q_1949
C3    0.21          0.56         0.15  4.30     q_619
C2    0.21          0.61         0.18  4.37     q_530
E5.1  0.35          0.40         0.56  4.42    q_1768
C1    0.13          0.54         0.20  4.50     q_124
E1.1 -0.31         -0.07        -0.59  2.97     q_712
E2.1 -0.39         -0.26        -0.70  3.14     q_901
N4.1 -0.22         -0.31        -0.39  3.19    q_1479
E3.1  0.45          0.21         0.61  4.00    q_1205
E4.1  0.51          0.24         0.68  4.42    q_1410


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)

[Figure: four panels (reason.4, reason.16, reason.17, reason.19) plotting P(response) for each alternative (0 through 6) against theta, the latent trait.]

Figure 31: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


E5.2  0.35          0.40         0.56  4.42    q_1768
O3    0.26          0.22         0.43  4.44     q_492
A5.1  0.62          0.22         0.55  4.56    q_1419
A3.1  0.70          0.22         0.48  4.60    q_1206
A2.1  0.67          0.21         0.40  4.80    q_1162
O1    0.17          0.21         0.33  4.82     q_128

     Item
A1   Am indifferent to the feelings of others.
E1   Don't talk a lot.
E2   Find it difficult to approach others.
E3   Know how to captivate people.
E4   Make friends easily.
E5   Take charge.
A5   Make people feel at ease.
A3   Know how to comfort others.
A4   Love children.
A2   Inquire about others' well-being.
C4   Do things in a half-way manner.
N4   Often feel blue.
C5   Waste my time.
C3   Do things according to a plan.
C2   Continue until everything is perfect.
E5.1 Take charge.
C1   Am exacting in my work.
E1.1 Don't talk a lot.
E2.1 Find it difficult to approach others.
N4.1 Often feel blue.
E3.1 Know how to captivate people.
E4.1 Make friends easily.
E5.2 Take charge.
O3   Carry the conversation to a higher level.
A5.1 Make people feel at ease.
A3.1 Know how to comfort others.
A2.1 Inquire about others' well-being.
O1   Am full of ideas.

5.6.2 Empirical scale construction

There are some situations where one wants to identify those items that most relate to a particular criterion. Although this will capitalize on chance and the results should be interpreted cautiously, it does give a feel for what is being measured. Consider the following example from the bfi data set. The items that best predicted gender, education, and age may be found using the bestScales function. This also shows the use of a dictionary that has the item content.

> data(bfi)
> bestScales(bfi, criteria=c("gender","education","age"), cut=.1, dictionary=bfi.dictionary[1:3])

The items most correlated with the criteria yield r's of
          correlation n.items
gender           0.32       9
education        0.14       1
age              0.24       9

The best items, their correlations and content are

$gender
  Row.names     gender ItemLabel                                      Item     Giant3
1        N5  0.2106171    q_1505                              Panic easily  Stability
2        A2  0.1820202    q_1162          Inquire about others' well-being   Cohesion
3        A1 -0.1571373     q_146 Am indifferent to the feelings of others   Cohesion
4        A3  0.1401320    q_1206               Know how to comfort others   Cohesion
5        A4  0.1261747    q_1364                             Love children   Cohesion
6        E1 -0.1261018     q_712                          Don't talk a lot Plasticity
7        N3  0.1207718    q_1099                 Have frequent mood swings  Stability
8        O1 -0.1031059     q_128                          Am full of ideas Plasticity
9        A5  0.1007091    q_1419                  Make people feel at ease   Cohesion

$education
  Row.names  education ItemLabel                                      Item   Giant3
1        A1 -0.1415734     q_146 Am indifferent to the feelings of others Cohesion

$age
  Row.names        age ItemLabel                                      Item     Giant3
1        A1 -0.1609637     q_146 Am indifferent to the feelings of others   Cohesion
2        C4 -0.1482659     q_626            Do things in a half-way manner  Stability
3        A4  0.1442473    q_1364                             Love children   Cohesion
4        A5  0.1290935    q_1419                  Make people feel at ease   Cohesion
5        E5  0.1146922    q_1768                               Take charge Plasticity
6        A2  0.1142767    q_1162          Inquire about others' well-being   Cohesion
7        N3 -0.1108648    q_1099                 Have frequent mood swings  Stability
8        E2 -0.1051612     q_901      Find it difficult to approach others Plasticity
9        N5 -0.1043322    q_1505                              Panic easily  Stability

6 Item Response Theory analysis

The use of Item Response Theory has become what is said to be the “new psychometrics”. The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 32).


6.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are of course just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function.

$$\delta_j = \frac{D\tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \qquad (2)$$

where D is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}.$$
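These conversions are easy to do directly (a minimal sketch, where x stands for any matrix of dichotomous items; irt.fa does all of this automatically):

> tet <- tetrachoric(x)                    #tetrachoric rho and thresholds tau
> lambda <- fa(tet$rho)$loadings[, 1]      #loadings on the first factor
> tau <- tet$tau
> alpha.j <- lambda/sqrt(1 - lambda^2)     #item discrimination
> delta.j <- tau/sqrt(1 - lambda^2)        #item difficulty (normal metric, D = 1)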

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)

> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor = 1
                -3    -2    -1     0     1     2     3
V1            0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2            0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3            0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4            0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5            0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6            0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7            0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8            0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9            0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info     0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM           1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability  -0.02  0.53  0.65  0.68  0.64  0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.043 and the 90 % confidence intervals are 0.043 0.218
BIC = 1009.07

Similar analyses can be done for polytomous items such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor = 1
               -3    -2    -1     0     1     2     3
E1           0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2           0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3           0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4           0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5           0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info    2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM          0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability  0.53  0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.009 and the 90 % confidence intervals are 0.009 0.111
BIC = 96.23

The item information functions show that not all of the items are equally good (Figure 33).

These procedures can be generalized to more than one factor by specifying the number of factors in irt.fa.

> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))

[Figure: three panels — “Item parameters from factor analysis” (probability of response for V1–V9 against the latent trait, normal scale), “Item information from factor analysis”, and “Test information — item parameters from factor analysis” (test information and reliability against the latent trait).]

Figure 32: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt, type="IIC")

[Figure: “Item information from factor analysis for factor 1” — item information curves for -E1, -E2, E3, E4, and E5 plotted against the latent trait (normal scale).]

Figure 33: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info, sort=TRUE)

Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor = 1
               -3    -2    -1     0     1     2     3
E2           0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4           0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1           0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3           0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5           0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info    2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM          0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability  0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm and should be used for serious Item Response Theory analysis.

6.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
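For example (a sketch; the second call re-uses the correlations saved in the first):

> iq.irt <- irt.fa(ability)      #slow: finds the tetrachoric correlations
> same.irt <- irt.fa(iq.irt)     #fast: re-uses the saved correlation matrix
> om <- omega(iq.irt$rho, 4)     #as is done below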

In addition, recent releases of psych take advantage of the parallel package and use multi-cores. The default for Macs and Unix machines is to use two cores, but this can be increased using the options command. The biggest step up in improvement is from 1 to 2 cores, but for large problems using polychoric correlations, the more cores available, the better.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 34) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 36). We can also show the total test information (merely the sum of the item information; Figure 35). This shows that even with just 16 items, the test is very reliable for most of the range of ability. The irt.fa function saves the correlation matrix and item statistics so that they can be redrawn with other options.

detectCores()         #how many are available (from the parallel package)
options("mc.cores")   #how many have been set to be used
options(mc.cores=4)   #set to use 4 cores

> iq.irt

Item Response Analysis using Factor Analysis

Call: irt.fa(x = ability)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor = 1
               -3    -2    -1     0     1     2     3
reason.4     0.05  0.24  0.64  0.53  0.16  0.03  0.01
reason.16    0.08  0.22  0.38  0.31  0.14  0.05  0.01
reason.17    0.08  0.33  0.69  0.42  0.11  0.02  0.00
reason.19    0.06  0.17  0.35  0.36  0.19  0.07  0.02
letter.7     0.05  0.18  0.41  0.44  0.20  0.06  0.02
letter.33    0.05  0.15  0.31  0.36  0.20  0.08  0.02
letter.34    0.05  0.19  0.45  0.46  0.20  0.06  0.01
letter.58    0.02  0.09  0.30  0.53  0.35  0.12  0.03
matrix.45    0.05  0.11  0.19  0.23  0.17  0.09  0.04
matrix.46    0.05  0.12  0.22  0.26  0.18  0.09  0.04
matrix.47    0.06  0.17  0.33  0.34  0.19  0.07  0.02
matrix.55    0.04  0.08  0.13  0.17  0.16  0.11  0.06
rotate.3     0.00  0.01  0.06  0.30  0.75  0.49  0.12
rotate.4     0.00  0.01  0.05  0.31  1.00  0.56  0.10
rotate.6     0.01  0.03  0.15  0.53  0.69  0.25  0.05
rotate.8     0.00  0.02  0.08  0.29  0.59  0.41  0.13
Test Info    0.67  2.11  4.73  5.83  5.28  2.55  0.69
SEM          1.22  0.69  0.46  0.41  0.44  0.63  1.20
Reliability -0.49  0.53  0.79  0.83  0.81  0.61 -0.45

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.91
The number of observations was 1525 with Chi Square = 2893.45 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.723
RMSEA index = 0.018 and the 90% confidence intervals are 0.018 0.137
BIC = 2131.15

6.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for scoreIrt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the


> iq.irt <- irt.fa(ability)

[Plot: item information curves for the 16 ability items (reason.4 through rotate.8). Title: Item information from factor analysis. X axis: Latent Trait (normal scale), -3 to 3. Y axis: Item Information, 0.0 to 1.0.]

Figure 34: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


> plot(iq.irt, type="test")

[Plot: total test information curve. Title: Test information -- item parameters from factor analysis. X axis: Latent Trait (normal scale), -3 to 3. Left axis: Test Information, 0 to 6. Right axis: Reliability, labeled 0.31, 0.66, 0.77, 0.83.]

Figure 35: A graphic analysis of 16 ability items sampled from the SAPA project. The total test information at all levels of difficulty may be shown by specifying the type="test" option in the plot function.


> om <- omega(iq.irt$rho, 4)

[Diagram: Omega solution for the 16 ability items, showing a general factor g (loadings of roughly .4 to .7 on all items) and four group factors F1-F4 corresponding roughly to the rotation, letter, matrix, and reasoning items.]

Figure 36: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.


average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The scoreIrt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.

> v9 <- sim.irt(9, 1000, -2, 2, mod="normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- scoreIrt(test, items)
> unitweighted <- scoreIrt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 37.

6.3.1 1 versus 2 parameter IRT scoring

In Item Response Theory, items can be assumed to be equally discriminating but to differ in their difficulty (the Rasch model) or to vary in their discriminability. Two functions (scoreIrt.1pl and scoreIrt.2pl) are meant to find multiple IRT based scales using the Rasch model or the 2 parameter model. Both allow for negatively keyed as well as positively keyed items. Consider the bfi data set with scoring keys key.list and items listed as an item.list. (This is the same as the key.list, but with the negative signs removed.)

> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+   conscientious=c("C1","C2","C3","-C4","-C5"),
+   extraversion=c("-E1","-E2","E3","E4","E5"),
+   neuroticism=c("N1","N2","N3","N4","N5"),
+   openness = c("O1","-O2","O3","O4","-O5"))
> item.list <- list(agree=c("A1","A2","A3","A4","A5"),
+   conscientious=c("C1","C2","C3","C4","C5"),
+   extraversion=c("E1","E2","E3","E4","E5"),
+   neuroticism=c("N1","N2","N3","N4","N5"),
+   openness = c("O1","O2","O3","O4","O5"))
> bfi.1pl <- scoreIrt.1pl(keys.list, bfi)  #the one parameter solution
> bfi.2pl <- scoreIrt.2pl(item.list, bfi)  #the two parameter solution
> bfi.ctt <- scoreFast(keys.list, bfi)  #fast scoring function


Figure 37: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.
> pairs.panels(scores.df, pch=".", gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)

[Plot: pairs.panels scatter plot matrix of True theta, irt theta, total, fit, rasch, total, and fit, with the title: Comparing true theta for IRT, Rasch and classically based scoring.]


We can compare these three ways of doing the analysis using the cor2 function, which correlates two separate data frames. All three models produce very similar results for the case of almost complete data. It is when we have massively missing completely at random data (MMCAR) that the results show the superiority of the irt scoring.

> #compare the solutions using the cor2 function
> cor2(bfi.1pl, bfi.ctt)

              agree conscientious extraversion neuroticism openness
agree          0.93          0.27         0.44       -0.20     0.19
conscientious  0.25          0.97         0.25       -0.22     0.17
extraversion   0.43          0.25         0.94       -0.23     0.23
neuroticism   -0.19         -0.23        -0.22        1.00    -0.09
openness       0.12          0.19         0.19       -0.07     0.95

> cor2(bfi.2pl, bfi.ctt)

              agree conscientious extraversion neuroticism openness
agree          0.98          0.25         0.49       -0.17     0.15
conscientious  0.26          0.97         0.25       -0.23     0.18
extraversion   0.43          0.24         0.94       -0.25     0.21
neuroticism   -0.19         -0.22        -0.18        0.98    -0.08
openness       0.17          0.21         0.27       -0.11     0.98

> cor2(bfi.2pl, bfi.1pl)

              agree conscientious extraversion neuroticism openness
agree          0.88          0.25         0.45       -0.17     0.12
conscientious  0.26          1.00         0.24       -0.23     0.16
extraversion   0.43          0.23         0.99       -0.25     0.20
neuroticism   -0.21         -0.21        -0.19        0.98    -0.07
openness       0.20          0.19         0.28       -0.11     0.92

7 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.


7.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function that gives some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

$$r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \quad (3)$$

where $r_{xy}$ is the normal correlation, which may be decomposed into the within group and between group correlations $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and $\eta$ (eta) is the correlation of the data with the within group values or the group means.

7.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, and -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5 and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
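A minimal sketch of the first of these analyses (the rwg and rbg element names follow the statsBy output documentation; verify them against your installed version):

> sb <- statsBy(sat.act, group="education", cors=TRUE)  #within and between statistics by education
> round(sb$rwg, 2)  #the pooled within group correlations
> round(sb$rbg, 2)  #the between group correlations of the group means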


7.3 Factor analysis by groups

Confirmatory factor analysis comparing the structures in multiple groups can be done in the lavaan package. However, for exploratory analyses of the structure within each of multiple groups, the faBy function may be used in combination with the statsBy function. First run statsBy with the correlation option set to TRUE, and then run faBy on the resulting output.

sb <- statsBy(bfi[c(1:25,27)], group="education", cors=TRUE)
faBy(sb, nfactors=5)  #find the 5 factor solution for each education level

8 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

$$R^2 = 1 - \prod_{i=1}^{n} (1 - \lambda_i)$$

where $\lambda_i$ is the ith eigenvalue of the eigenvalue decomposition of the matrix

$$R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}.$$
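These formulas can be checked by hand from any correlation matrix. A minimal sketch, assuming the first three variables of sat.act form the x set and the last three the y set:

> R <- cor(sat.act, use="pairwise")
> Rxx <- R[1:3,1:3]; Rxy <- R[1:3,4:6]; Ryy <- R[4:6,4:6]
> lambda <- eigen(solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy))$values  #the squared canonical correlations
> 1 - prod(1 - lambda)  #the set correlation R2, about .09 for these data (see below)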

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized or raw β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476

Compare this with the output from setCor:

> #compare with sector
> setCor(c(4:6), c(1:3), C, n.obs=700)

Call: setCor(y = c(4:6), x = c(1:3), data = C, n.obs = 700)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation = 0.03
 Cohen's Set Correlation R2 = 0.09
 Shrunken Set Correlation R2 = 0.08
 F and df of Cohen's Set Correlation: 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
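That symmetry is easy to check by swapping the two sets (a sketch, reusing the covariance matrix C from above):

> setCor(c(1:3), c(4:6), C, n.obs=700)  #predictors and criteria reversed; the set correlation R2 is unchanged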


9 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.


sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default for sim.structure is to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors. A sketch of such a matrix follows.
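Here is a minimal sketch of a simplex correlation matrix built directly from this definition (the value r = .8 and the choice of 6 variables are arbitrary, for illustration only):

> r <- .8  #the correlation between adjacent variables
> R.simplex <- r^abs(outer(1:6, 1:6, "-"))  #r^n for variables n apart
> round(R.simplex, 2)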

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals (see the sketch below).
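A sketch of this use of sim.minor (the argument names follow its help page; the assumption that the simulated raw scores are returned in the observed element, as with sim.structure, should be verified against your installed version):

> set.seed(17)  #an arbitrary seed, for reproducibility
> minor <- sim.minor(nvar=12, nfact=3, n=500)  #3 major and 6 minor factors
> fa.parallel(minor$observed)  #will parallel analysis report just the 3 major factors?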

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four irt simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

$$P(x | \theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \quad (4)$$

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.
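Equation (4) is easy to express directly as R code (the function name p4pl is ours, purely for illustration):

> p4pl <- function(theta, delta=0, alpha=1, gamma=0, zeta=1)
+    gamma + (zeta - gamma)/(1 + exp(alpha*(delta - theta)))
> theta <- seq(-3, 3, .1)
> plot(theta, p4pl(theta), type="l", ylab="P(endorsement)")  #the 1PL (Rasch) curve
> lines(theta, p4pl(theta, alpha=2, gamma=.2), lty=2)  #a steeper item with guessing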

(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same.


With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.
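This near equivalence is easy to verify numerically (a one line sketch):

> theta <- seq(-4, 4, .01)
> max(abs(pnorm(theta) - 1/(1 + exp(-1.702*theta))))  #less than .01 over the whole range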

10 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)  #a pretty boring plot
iclust.diagram(c)  #a better diagram
c3 <- iclust(Thurstone, 3)
plot(c3)  #a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.

11 Converting output to APA style tables using LaTeX

Although for most purposes using the Sweave or KnitR packages produces clean output, some prefer output pre-formatted for APA style tables. This can be done using the xtable package for almost anything, but there are a few simple functions in psych for the most common tables. fa2latex will convert a factor analysis or components analysis output to a LaTeX table, cor2latex will take a correlation matrix and show the lower (or upper) diagonal, irt2latex converts the item statistics from the irt.fa function to more convenient LaTeX output, and finally, df2latex converts a generic data frame to LaTeX.

An example of converting the output from fa to LaTeX appears in Table 11; a sketch of such a call follows.
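A minimal sketch of a call that could produce such a table (the heading argument follows the fa2latex defaults; treat the exact argument names as assumptions to check):

> f3 <- fa(Thurstone, 3)
> fa2latex(f3, heading="A factor analysis table from the psych package in R")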


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double", -1)  #for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")  #but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$top, scale=-1)
> dia.curved.arrow(mr, lr$top)

[Plot: Demonstration of dia functions, showing the labeled rectangles, ellipses, arrows, and curves drawn by the code above.]

Figure 38: The basic graphic capabilities of the dia functions are shown in this figure.

Table 11: fa2latex. A factor analysis table from the psych package in R.

Variable          MR1   MR2   MR3   h2   u2  com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.01
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.01
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.00
First.Letters    0.00  0.86  0.00 0.73 0.27 1.00
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.04
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.20
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.00
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.93
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.23

SS loadings      2.64  1.86  1.50

     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.


geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail: combines the head and tail functions to show the first and last lines of a data set or output, but does not add ellipsis between.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also set.cor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems (see the sketch below).
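For example (the toy matrices here are ours, for illustration only):

> A <- diag(2)  #a 2 x 2 identity matrix
> B <- matrix(1, 2, 3)  #a 2 x 3 matrix of 1s
> superMatrix(A, B)  #a 4 x 5 super matrix: A top left, B lower right, 0s elsewhere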

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.


Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtholdt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1984), this data set allows for examples of basic scaling techniques.


14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib/ and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.5.0", package="psych")

15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html: A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()

R Under development (unstable) (2017-01-04 r71889)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GPArotation_2014.11-1 psych_1.6.12

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      lattice_0.20-34  matrixcalc_1.0-3 MASS_7.3-45
 [5] grid_3.4.0       arm_1.8-6        nlme_3.1-128     stats4_3.4.0
 [9] coda_0.18-1      minqa_1.2.4      mi_1.0           nloptr_1.0.4
[13] Matrix_1.2-7.1   boot_1.3-18      splines_3.4.0    lme4_1.1-12
[17] sem_3.1-7        tools_3.4.0      foreign_0.8-67   abind_1.4-3
[21] parallel_3.4.0   compiler_3.4.0   mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.
Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.
Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.
Bliese, P. D. (2009). Multilevel modeling in R (2.3): a brief introduction to R, the multilevel package and the nlme package.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.
Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.
Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).
Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.
Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.
Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.
Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England.
Fox, J., Nie, Z., and Byrnes, J. (2012). sem: Structural Equation Models.
Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.
Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.
Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.
Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.
Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.
Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.
Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.
Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, pages 1–13. 10.1007/s11336-011-9218-4.
Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.
Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.
MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.
Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.
McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.
Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.
Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.
Nunnally, J. C. and Bernstein, I. H. (1984). Psychometric theory. McGraw-Hill, New York, 3rd edition.
Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.
Preacher, K. J. and Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4):717–731.
Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.
Revelle, W. (2015). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. R package version 1.5.8.
Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.
Revelle, W. and Condon, D. M. (2014). Reliability. In Irwing, P., Booth, T., and Hughes, D., editors, Wiley-Blackwell Handbook of Psychometric Testing. Wiley-Blackwell (in press).
Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.
Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.
Revelle, W., Wilt, J., and Rosenthal, A. (2010). Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer.
Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.
Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.
Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.
Smillie, L. D., Cooper, A., Wilt, J., and Revelle, W. (2012). Do extraverts get more bang for the buck? Refining the affective-reactivity hypothesis of extraversion. Journal of Personality and Social Psychology, 103(2):306–326.
Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, San Francisco.
Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. W. H. Freeman, San Francisco.
Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.
Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2):245–251.
Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.
Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.
Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors". Psychological Review, 42(5):425–454.
Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.
Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.
Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.
Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.



Page 8: An overview of the psych package - The Personality Project

3 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptivestatistics

Remember to run any of the psych functions it is necessary to make the package activeby using the library command

gt library(psych)

The other packages once installed will be called automatically by psych

It is possible to automatically load psych and other functions by creating and then savinga ldquoFirstrdquo function eg

First lt- function(x) library(psych)

31 Data input from the clipboard

There are of course many ways to enter data into R Reading from a local file usingreadtable is perhaps the most preferred However many users will enter their datain a text editor or spreadsheet program and then want to copy and paste into R Thismay be done by using readtable and specifying the input file as ldquoclipboardrdquo (PCs) orldquopipe(pbpaste)rdquo (Macs) Alternatively the readclipboard set of functions are perhapsmore user friendly

readclipboard is the base function for reading data from the clipboard

readclipboardcsv for reading text that is comma delimited

readclipboardtab for reading text that is tab delimited (eg copied directly from anExcel file)

readclipboardlower for reading input of a lower triangular matrix with or without adiagonal The resulting object is a square matrix

readclipboardupper for reading input of an upper triangular matrix

readclipboardfwf for reading in fixed width fields (some very old data sets)

For example given a data set copied to the clipboard from a spreadsheet just enter thecommand

gt mydata lt- readclipboard()

This will work if every data field has a value and even missing data are given some values(eg NA or -999) If the data were entered in a spreadsheet and the missing values

8

were just empty cells then the data should be read in as a tab delimited or by using thereadclipboardtab function

gt mydata lt- readclipboard(sep=t) define the tab option or

gt mytabdata lt- readclipboardtab() just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format)copy to the clipboard and then specify the width of each field (in the example below thefirst variable is 5 columns the second is 2 columns the next 5 are 1 column the last 4 are3 columns)

gt mydata lt- readclipboardfwf(widths=c(52rep(15)rep(34))

32 Basic descriptive statistics

Once the data are read in then describe or describeBy will provide basic descriptivestatistics arranged in a data frame format Consider the data set satact which in-cludes data from 700 web based participants on 3 demographic variables and 3 abilitymeasures

describe reports means standard deviations medians min max range skew kurtosisand standard errors for integer or real data Non-numeric data although the statisticsare meaningless will be treated as if numeric (based upon the categorical coding ofthe data) and will be flagged with an

describeBy reports descriptive statistics broken down by some categorizing variable (eggender age etc)

gt library(psych)

gt data(satact)

gt describe(satact) basic descriptive statistics

vars n mean sd median trimmed mad min max range skew

gender 1 700 165 048 2 168 000 1 2 1 -061

education 2 700 316 143 3 331 148 0 5 5 -068

age 3 700 2559 950 22 2386 593 13 65 52 164

ACT 4 700 2855 482 29 2884 445 3 36 33 -066

SATV 5 700 61223 11290 620 61945 11861 200 800 600 -064

SATQ 6 687 61022 11564 620 61725 11861 200 800 600 -059

kurtosis se

gender -162 002

education -007 005

age 242 036

ACT 053 018

SATV 033 427

SATQ -002 441

These data may then be analyzed by groups defined in a logical statement or by some other variable. E.g., break down the descriptive data for males or females. These descriptive data can also be seen graphically using the error.bars.by function (Figure 5). By setting skew=FALSE and ranges=FALSE, the output is limited to the most basic statistics.

> #basic descriptive statistics by a grouping variable
> describeBy(sat.act,sat.act$gender,skew=FALSE,ranges=FALSE)

$`1`
          vars   n   mean     sd   se
gender       1 247   1.00   0.00 0.00
education    2 247   3.00   1.54 0.10
age          3 247  25.86   9.74 0.62
ACT          4 247  28.79   5.06 0.32
SATV         5 247 615.11 114.16 7.26
SATQ         6 245 635.87 116.02 7.41

$`2`
          vars   n   mean     sd   se
gender       1 453   2.00   0.00 0.00
education    2 453   3.26   1.35 0.06
age          3 453  25.45   9.37 0.44
ACT          4 453  28.42   4.69 0.22
SATV         5 453 610.66 112.31 5.28
SATQ         6 442 596.00 113.07 5.38

attr(,"call")
by.data.frame(data = x, INDICES = group, FUN = describe, type = type,
    skew = FALSE, ranges = FALSE)

The output from the describeBy function can be forced into a matrix form for easy analysis by other programs. In addition, describeBy can group by several grouping variables at the same time.

> sa.mat <- describeBy(sat.act,list(sat.act$gender,sat.act$education),
+    skew=FALSE,ranges=FALSE,mat=TRUE)
> headTail(sa.mat)
        item group1 group2 vars   n   mean     sd    se
gender1    1      1      0    1  27      1      0     0
gender2    2      2      0    1  30      2      0     0
gender3    3      1      1    1  20      1      0     0
gender4    4      2      1    1  25      2      0     0
...      ...   <NA>   <NA>  ... ...    ...    ...   ...
SATQ9     69      1      4    6  51  635.9 104.12 14.58
SATQ10    70      2      4    6  86 597.59 106.24 11.46
SATQ11    71      1      5    6  46 657.83  89.61 13.21
SATQ12    72      2      5    6  93 606.72 105.55 10.95

3.2.1 Outlier detection using outlier

One way to detect unusual data is to consider how far each data point is from the multivariate centroid of the data. That is, find the squared Mahalanobis distance for each data point and then compare these to the expected values of χ2. This produces a Q-Q (quantile-quantile) plot with the n most extreme data points labeled (Figure 1). The outlier values are in the vector d2.

> png( 'outlier.png' )
> d2 <- outlier(sat.act,cex=.8)
> dev.off()
null device
          1

Figure 1: Using the outlier function to graphically show outliers. The y axis is the Mahalanobis D2, the X axis is the distribution of χ2 for the same number of degrees of freedom. The outliers detected here may be shown graphically using pairs.panels (see Figure 2), and may be found by sorting d2.

3.2.2 Basic data cleaning using scrub

If, after describing the data, it is apparent that there were data entry errors that need to be globally replaced with NA, or only certain ranges of data will be analyzed, the data can be "cleaned" using the scrub function.

Consider a data set of 12 rows of 10 columns with values from 1 - 120. All values of columns 3 - 5 that are less than 30, 40, or 50 respectively, or greater than 70 in any of the three columns, will be replaced with NA. In addition, any value exactly equal to 45 will be set to NA (max and isvalue are set to one value here, but they could be a different value for every column).

> x <- matrix(1:120,ncol=10,byrow=TRUE)
> colnames(x) <- paste("V",1:10,sep="")
> new.x <- scrub(x,3:5,min=c(30,40,50),max=70,isvalue=45,newvalue=NA)

> new.x
       V1  V2 V3 V4 V5  V6  V7  V8  V9 V10
 [1,]   1   2 NA NA NA   6   7   8   9  10
 [2,]  11  12 NA NA NA  16  17  18  19  20
 [3,]  21  22 NA NA NA  26  27  28  29  30
 [4,]  31  32 33 NA NA  36  37  38  39  40
 [5,]  41  42 43 44 NA  46  47  48  49  50
 [6,]  51  52 53 54 55  56  57  58  59  60
 [7,]  61  62 63 64 65  66  67  68  69  70
 [8,]  71  72 NA NA NA  76  77  78  79  80
 [9,]  81  82 NA NA NA  86  87  88  89  90
[10,]  91  92 NA NA NA  96  97  98  99 100
[11,] 101 102 NA NA NA 106 107 108 109 110
[12,] 111 112 NA NA NA 116 117 118 119 120

Note that the number of subjects for those columns has decreased, and the minimums have gone up but the maximums down. Data cleaning and examination for outliers should be a routine part of any data analysis.

3.2.3 Recoding categorical variables into dummy coded variables

Sometimes categorical variables (e.g., college major, occupation, ethnicity) are to be analyzed using correlation or regression. To do this, one can form "dummy codes", which are merely binary variables for each category. This may be done using dummy.code. Subsequent analyses using these dummy coded variables may use biserial or point biserial (regular Pearson r) correlations to show effect sizes, and may be plotted in, e.g., spider plots.
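As a minimal sketch (with made-up data, not from the sat.act examples), a vector of hypothetical college majors may be dummy coded and then correlated with a criterion:

> major <- c("psych","bio","psych","chem","bio","psych")   #hypothetical categories
> new <- dummy.code(major)          #one binary (0/1) column per major
> gpa <- c(3.2,3.7,3.1,3.9,3.5,3.0) #hypothetical criterion
> cor(new,gpa)                      #point biserial correlations of each dummy code with the criterion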

3.3 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries. By default, error.bars.by and error.bars will show "cats eyes" to graphically show the confidence limits (Figure 5). This may be turned off by specifying eyes=FALSE. densityBy or violinBy may be used to show the distribution of the data in "violin" plots (Figure 4). (These are sometimes called "lava-lamp" plots.)

3.3.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 2). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (See Figure 2 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic. However, in this figure, to show the outliers, we use colors and a larger plot character.

Another example of pairs.panels is to show differences between experimental groups. Consider the data in the affect data set. The scores reflect post test scores on positive and negative affect and energetic and tense arousal. The colors show the results for four movie conditions: a depressing movie, a frightening movie, a neutral movie, and a comedy.

3.3.2 Density or violin plots

Graphical presentation of data may be shown using box plots to show the median and 25th and 75th percentiles. A powerful alternative is to show the density distribution using the violinBy function (Figure 4).

> png( 'pairspanels.png' )
> sat.d2 <- data.frame(sat.act,d2)   #combine the d2 statistics from before with the sat.act data.frame
> pairs.panels(sat.d2,bg=c("yellow","blue")[(d2 > 25)+1],pch=21)
> dev.off()
null device
          1

Figure 2: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. If the plot character were set to a period (pch='.'), it would make a cleaner graphic, but in order to show the outliers in color we use the plot characters 21 and 22.

> png('affect.png')
> pairs.panels(affect[14:17],bg=c("red","black","white","blue")[affect$Film],pch=21,
+    main="Affect varies by movies ")
> dev.off()
null device
          1

Figure 3: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. The colors represent four different movie conditions.

> data(sat.act)
> violinBy(sat.act[5:6],sat.act$gender,grp.name=c("M","F"),main="Density Plot by gender for SAT V and Q")


Figure 4: Using the violinBy function to show the distribution of SAT V and Q for males and females. The plot shows the medians and the 25th and 75th percentiles, as well as the entire range and the density distribution.

3.3.3 Means and error bars

Additional descriptive graphics include the ability to draw error bars on sets of data, as well as to draw error bars in both the x and y directions for paired data. These are the functions error.bars, error.bars.by, error.bars.tab, and error.crosses.

error.bars show the 95% confidence intervals for each variable in a data frame or matrix. These errors are based upon normal theory and the standard errors of the mean. Alternative options include +/- one standard deviation or 1 standard error. If the data are repeated measures, the error bars will reflect the between variable correlations. By default, the confidence intervals are displayed using a "cats eyes" plot which emphasizes the distribution of confidence within the confidence interval.

error.bars.by does the same, but grouping the data by some condition.

error.bars.tab draws bar graphs from tabular data with error bars based upon the standard error of proportion (σp = √(pq/N)).

error.crosses draw the confidence intervals for an x set and a y set of the same size.

The use of the error.bars.by function allows for graphic comparisons of different groups (see Figure 5). Five personality measures are shown as a function of high versus low scores on a "lie" scale. People with higher lie scores tend to report being more agreeable, conscientious, and less neurotic than people with lower lie scores. The error bars are based upon normal theory and thus are symmetric rather than reflecting any skewing in the data.

Although not recommended, it is possible to use the error.bars function to draw bar graphs with associated error bars. (This kind of "dynamite plot" (Figure 7) can be very misleading in that the scale is arbitrary.) Go to a discussion of the problems in presenting data this way at http://emdbolker.wikidot.com/blog:dynamite. In the example shown, note that the graph starts at 0, although 0 is out of the range of the data. This is a function of using bars, which always are assumed to start at zero. Consider other ways of showing your data.

3.3.4 Error bars for tabular data

However, it is sometimes useful to show error bars for tabular data, either found by the table function or just directly input. These may be found using the error.bars.tab function.

> data(epi.bfi)
> error.bars.by(epi.bfi[6:10],epi.bfi$epilie<4)


Figure 5: Using the error.bars.by function shows that self reported personality scales on the Big Five Inventory vary as a function of the Lie scale on the EPI. The "cats eyes" show the distribution of the confidence.

> error.bars.by(sat.act[5:6],sat.act$gender,bars=TRUE,
+    labels=c("Male","Female"),ylab="SAT score",xlab="")


Figure 6: A "Dynamite plot" of SAT scores as a function of gender is one way of misleading the reader. By using a bar graph, the range of scores is ignored. Bar graphs start from 0.

> T <- with(sat.act,table(gender,education))
> rownames(T) <- c("M","F")
> error.bars.tab(T,way="both",ylab="Proportion of Education Level",xlab="Level of Education",
+    main="Proportion of sample by education level")


Figure 7: The proportion of each education level that is Male or Female. By using the way="both" option, the percentages and errors are based upon the grand total. Alternatively, way="columns" finds column wise percentages and way="rows" finds rowwise percentages. The data can be shown as percentages (as shown) or as raw counts (raw=TRUE). The function invisibly returns the probabilities and standard errors. See the help menu for an example of entering the data as a data.frame.

3.3.5 Two dimensional displays of means and errors

Yet another way to display data for different conditions is to use the errorCircles function. For instance, the effect of various movies on both "Energetic Arousal" and "Tense Arousal" can be seen in one graph and compared to the same movie manipulations on "Positive Affect" and "Negative Affect". Note how Energetic Arousal is increased by three of the movie manipulations, but that Positive Affect increases following the Happy movie only.

> op <- par(mfrow=c(1,2))
> data(affect)
> colors <- c("black","red","white","blue")
> films <- c("Sad","Horror","Neutral","Happy")
> affect.stats <- errorCircles("EA2","TA2",data=affect[-c(1,20)],group="Film",labels=films,
+    xlab="Energetic Arousal", ylab="Tense Arousal",ylim=c(10,22),xlim=c(8,20),pch=16,
+    cex=2,colors=colors, main ="Movies effect on arousal")
> errorCircles("PA2","NA2",data=affect.stats,labels=films,xlab="Positive Affect",
+    ylab="Negative Affect", pch=16,cex=2,colors=colors, main ="Movies effect on affect")
> op <- par(mfrow=c(1,1))


Figure 8: The use of the errorCircles function allows for two dimensional displays of means and error bars. The first call to errorCircles finds descriptive statistics for the affect data.frame based upon the grouping variable of Film. These data are returned and then used by the second call, which examines the effect of the same grouping variable upon different measures. The size of the circles represents the relative sample sizes for each group. The data are from the PMC lab and reported in Smillie et al. (2012).

3.3.6 Back to back histograms

The bi.bars function summarizes the characteristics of two groups (e.g., males and females) on a second variable (e.g., age) by drawing back to back histograms (see Figure 9).

> data(bfi)
> with(bfi,bi.bars(age,gender,ylab="Age",main="Age by males and females"))


Figure 9: A bar plot of the age distribution for males and females shows the use of bi.bars. The data are males and females from 2800 cases collected using the SAPA procedure and are available as part of the bfi data set.

3.3.7 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values, and returns (invisibly) the full correlation matrix and displays the lower off diagonal matrix.

> lowerCor(sat.act)
          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act,sat.act$gender==2)
> male <- subset(sat.act,sat.act$gender==1)
> lower <- lowerCor(male[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00
> upper <- lowerCor(female[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower,upper)
> round(both,2)
          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences, displaying one below the diagonal and the difference of the second from the first above the diagonal:

> diffs <- lowerUpper(lower,upper,diff=TRUE)
> round(diffs,2)
          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

3.3.8 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 10), or a simulated data set of 24 variables with a circumplex structure (Figure 11). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

Yet another way to show structure is to use "spider" plots. Particularly if variables are ordered in some meaningful way (e.g., in a circumplex), a spider plot will show this structure easily. This is just a plot of the magnitude of the correlation as a radial line, with length ranging from 0 (for a correlation of -1) to 1 (for a correlation of 1). (See Figure 12.)

3.4 Testing correlations

Correlations are wonderful descriptive statistics of the data, but some people like to test whether these correlations differ from zero or differ from each other. The cor.test function (in the stats package) will test the significance of a single correlation, and the rcorr function in the Hmisc package will do this for many correlations. In the psych package, the corr.test function reports the correlation (Pearson, Spearman, or Kendall) between all variables in either one or two data frames or matrices, as well as the number of observations for each case and the (two-tailed) probability for each correlation. Unfortunately, these probability values have not been corrected for multiple comparisons and so should be taken with a great deal of salt. Thus, in corr.test and corr.p, the raw probabilities are reported below the diagonal and the probabilities adjusted for multiple comparisons using (by default) the Holm correction are reported above the diagonal (Table 1). (See the p.adjust function for a discussion of Holm (1979) and other corrections.)
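To see what the Holm correction does, the p.adjust function (in the stats package, not psych) may be applied to a vector of hypothetical raw probabilities: the smallest p value is multiplied by the number of tests, the next by one fewer, and so on, with the results then made monotone.

> p.raw <- c(.001,.01,.02,.03,.05)   #hypothetical raw p values, already sorted
> p.adjust(p.raw,method="holm")
[1] 0.005 0.040 0.060 0.060 0.060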

Testing the difference between any two correlations can be done using the r.test function. The function actually does four different tests (based upon an article by Steiger (1980),

> png('corplot.png')
> corPlot(Thurstone,numbers=TRUE,upper=FALSE,diag=FALSE,main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 10: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well. By default, the complete matrix is shown. Setting upper=FALSE and diag=FALSE shows a cleaner figure.

> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ,main="24 variables in a circumplex")
> dev.off()
null device
          1

Figure 11: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect. For circumplex structures, it is perhaps useful to show the complete matrix.

> png('spider.png')
> op <- par(mfrow=c(2,2))
> spider(y=c(1,6,12,18),x=1:24,data=r.circ,fill=TRUE,main="Spider plot of 24 circumplex variables")
> op <- par(mfrow=c(1,1))
> dev.off()
null device
          1

Figure 12: A spider plot can show circumplex structure very clearly. Circumplex structures are common in the study of affect.

Table 1: The corr.test function reports correlations, cell sizes, and raw and adjusted probability values. corr.p reports the probability values for a correlation matrix. By default, the adjustment used is that of Holm (1979).
> corr.test(sat.act)
Call:corr.test(x = sat.act)
Correlation matrix
          gender education   age   ACT  SATV  SATQ
gender      1.00      0.09 -0.02 -0.04 -0.02 -0.17
education   0.09      1.00  0.55  0.15  0.05  0.03
age        -0.02      0.55  1.00  0.11 -0.04 -0.03
ACT        -0.04      0.15  0.11  1.00  0.56  0.59
SATV       -0.02      0.05 -0.04  0.56  1.00  0.64
SATQ       -0.17      0.03 -0.03  0.59  0.64  1.00
Sample Size
          gender education age ACT SATV SATQ
gender       700       700 700 700  700  687
education    700       700 700 700  700  687
age          700       700 700 700  700  687
ACT          700       700 700 700  700  687
SATV         700       700 700 700  700  687
SATQ         687       687 687 687  687  687
Probability values (Entries above the diagonal are adjusted for multiple tests.)
          gender education  age  ACT SATV SATQ
gender      0.00      0.17 1.00 1.00    1    0
education   0.02      0.00 0.00 0.00    1    1
age         0.58      0.00 0.00 0.03    1    1
ACT         0.33      0.00 0.00 0.00    0    0
SATV        0.62      0.22 0.26 0.00    0    0
SATQ        0.00      0.36 0.37 0.00    0    0

 To see confidence intervals of the correlations, print with the short=FALSE option.

depending upon the input:

1) For a sample size n, find the t and p value for a single correlation as well as the confidence interval.

> r.test(50,.3)
Correlation tests
Call:r.test(n = 50, r12 = 0.3)
Test of significance of a correlation
 t value 2.18  with probability < 0.034
 and confidence interval 0.02  0.53

2) For sample sizes of n and n2 (n2 = n if not specified), find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores.

> r.test(30,.4,.6)
Correlation tests
Call:r.test(n = 30, r12 = 0.4, r34 = 0.6)
Test of difference between two independent correlations
 z value 0.99  with probability  0.32

3) For sample size n, and correlations ra = r12, rb = r23, and r13 specified, test for the difference of two dependent correlations (Steiger case A).

> r.test(103,.4,.5,.1)
Correlation tests
Call:[1] "r.test(n = 103 r12 = 0.4 r23 = 0.1 r13 = 0.5 )"
Test of difference between two correlated correlations
 t value -0.89  with probability < 0.37

4) For sample size n, test for the difference between two dependent correlations involving different variables (Steiger case B).

> r.test(103,.5,.6,.7,.5,.5,.8)   #steiger Case B
Correlation tests
Call:r.test(n = 103, r12 = 0.5, r34 = 0.6, r23 = 0.7, r13 = 0.5, r14 = 0.5,
    r24 = 0.8)
Test of difference between two dependent correlations
 z value -1.2  with probability  0.23

To test whether a matrix of correlations differs from what would be expected if the population correlations were all zero, the function cortest follows Steiger (1980), who pointed out that the sum of the squared elements of a correlation matrix, or the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix). This is particularly useful for examining whether correlations in a single matrix differ from zero or for comparing two matrices. Although obvious, cortest can be used to test whether the sat.act data matrix produces non-zero correlations (it does). This is a much more appropriate test when testing whether a residual matrix differs from zero.

> cortest(sat.act)

Tests of correlation matrices
Call:cortest(R1 = sat.act)
 Chi Square value 1325.42  with df =  15   with probability < 1.8e-273

3.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process (Figure 13). This is also shown in terms of dichotomizing the bivariate normal density function using the draw.cor function (Figure 14). A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

A correlation matrix formed from a number of tetrachoric or polychoric correlations will sometimes not be positive semi-definite. This will sometimes happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigenvalues of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
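A minimal sketch of the smoothing step, using the burt data set just mentioned (the comments state the expected pattern: the original matrix has a negative eigenvalue, the smoothed one does not):

> data(burt)
> smoothed <- cor.smooth(burt)    #adjust the offending eigenvalues and rescale
> min(eigen(burt)$values)         #slightly negative for the original matrix
> min(eigen(smoothed)$values)     #non-negative after smoothing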

3.6 Multiple regression from data or correlation matrices

The typical application of the lm function is to do a linear model of one Y variable as a function of multiple X variables. Because lm is designed to analyze complex interactions, it requires raw data as input. It is, however, sometimes convenient to do multiple regression from a correlation or covariance matrix. The setCor function will do this, taking a set of y variables predicted from a set of x variables, perhaps with a set of z covariates removed from both x and y. Consider the Thurstone correlation matrix and find the multiple correlation of the last five variables as a function of the first 4.

> setCor(y = 5:9, x = 1:4, data=Thurstone)

> draw.tetra()


Figure 13: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values.

> draw.cor(expand=20,cuts=c(0,0))


Figure 14: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values. It is found (laboriously) by optimizing the fit of the bivariate normal for various values of the correlation to the observed cell frequencies.

Call: setCor(y = 5:9, x = 1:4, data = Thurstone)

Multiple Regression from matrix input

 Beta weights
                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sentences                    0.09     0.07          0.25      0.21         0.20
Vocabulary                   0.09     0.17          0.09      0.16        -0.02
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31

 Multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.69              0.63              0.50              0.58              0.48

 multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.48              0.40              0.25              0.34              0.23

 Unweighted multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.59              0.58              0.49              0.58              0.45

 Unweighted multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.34              0.34              0.24              0.33              0.20

 Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.6280 0.1478 0.0076 0.0049
Average squared canonical correlation =  0.2
Cohen's Set Correlation R2  =  0.69
Unweighted correlation between the two sets =  0.73

By specifying the number of subjects in the correlation matrix, appropriate estimates of standard errors, t-values, and probabilities are also found. The next example finds the regressions with variables 1 and 2 used as covariates. The β weights for variables 3 and 4 do not change, but the multiple correlation is much less. It also shows how to find the residual correlations between variables 5-9 with variables 1-4 removed.

> sc <- setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)
Call: setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)

Multiple Regression from matrix input

 Beta weights
                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31

 Multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.58              0.46              0.21              0.18              0.30

 multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
            0.331             0.210             0.043             0.032             0.092

 Unweighted multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.44              0.35              0.17              0.14              0.26

 Unweighted multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.19              0.12              0.03              0.02              0.07

 Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.405 0.023
Average squared canonical correlation =  0.21
Cohen's Set Correlation R2  =  0.42
Unweighted correlation between the two sets =  0.48

> round(sc$residual,2)
                  Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Four.Letter.Words              0.52     0.11          0.09      0.06         0.13
Suffixes                       0.11     0.60         -0.01      0.01         0.03
Letter.Series                  0.09    -0.01          0.75      0.28         0.37
Pedigrees                      0.06     0.01          0.28      0.66         0.20
Letter.Group                   0.13     0.03          0.37      0.20         0.77

3.7 Mediation and Moderation analysis

Although multiple regression is a straightforward method for determining the effect of multiple predictors (x1, x2, ... xi) on a criterion variable y, some prefer to think of the effect of one predictor, x, as mediated by another variable, m (Preacher and Hayes, 2004). Thus, we may find the indirect path from x to m and then from m to y, as well as the direct path from x to y. Call these paths a, b, and c, respectively. Then the indirect effect of x on y through m is just ab, and the direct effect is c. Statistical tests of the ab effect are best done by bootstrapping.

Consider the example from Preacher and Hayes (2004) as analyzed using the mediate function and the subsequent graphic from mediate.diagram. The data are found in the example for mediate.

Call: mediate(y = 1, x = 2, m = 3, data = sobel)

The DV (Y) was SATIS. The IV (X) was THERAPY. The mediating variable(s) = ATTRIB.

Total Direct effect (c) of THERAPY on SATIS = 0.76   S.E. = 0.31   t direct = 2.5  with probability = 0.019
Direct effect (c') of THERAPY on SATIS removing ATTRIB = 0.43   S.E. = 0.32   t direct = 1.35  with probability = 0.19
Indirect effect (ab) of THERAPY on SATIS through ATTRIB = 0.33
Mean bootstrapped indirect effect = 0.33  with standard error = 0.17   Lower CI = 0.04   Upper CI = 0.71
R2 of model = 0.31

To see the longer output, specify short = FALSE in the print statement

 Full output

 Total effect estimates (c)
        SATIS   se   t   Prob
THERAPY  0.76 0.31 2.5 0.0186

 Direct effect estimates (c')
        SATIS   se    t  Prob
THERAPY  0.43 0.32 1.35 0.190
ATTRIB   0.40 0.18 2.23 0.034

 'a' effect estimates
       THERAPY  se    t   Prob
ATTRIB    0.82 0.3 2.74 0.0106

 'b' effect estimates
       SATIS   se    t  Prob
ATTRIB   0.4 0.18 2.23 0.034

 'ab' effect estimates
        SATIS boot   sd lower upper
THERAPY  0.33 0.33 0.17  0.04  0.71
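As a check on the output above, note that the total effect is the sum of the direct and indirect effects: c = c' + ab = 0.43 + 0.33 = 0.76.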

4 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.

> mediate.diagram(preacher)


Figure 15: A mediated model taken from Preacher and Hayes (2004) and solved using the mediate function. The direct path from Therapy to Satisfaction has an effect of 0.76, while the indirect path through Attribution has an effect of 0.33. Compare this to the normal regression graphic created by setCor.diagram.

> preacher <- setCor(1,c(2,3),sobel,std=FALSE)
> setCor.diagram(preacher)


Figure 16: The conventional regression model for the Preacher and Hayes (2004) data set, solved using the setCor function. Compare this to the previous figure.

4.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
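As a minimal illustration of this point (not an example from the text), the unrotated principal component loadings may be found directly from an eigen decomposition of the correlation matrix; the signs of the columns are arbitrary, and the result may be compared with principal(sat.act,3,rotate="none"):

> R <- cor(sat.act,use="pairwise")                    #the correlation matrix
> ev <- eigen(R)                                      #its eigen decomposition
> loadings <- ev$vectors %*% diag(sqrt(ev$values))    #rescale the eigen vectors
> round(loadings[,1:3],2)                             #the first three (unrotated) component loadings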

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will be eventually phased out.

fa.poly is useful when finding the factor structure of categorical items. fa.poly first finds the tetrachoric or polychoric correlations between the categorical variables and then proceeds to do a normal factor analysis. By setting the n.iter option to be greater than 1, it will also find confidence intervals for the factor solution. Warning: finding polychoric correlations is very slow, so think carefully before doing so.

¹ Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

² Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.

factor.minres (deprecated) Minimum residual factor analysis is a least squares, iterative solution to the factor problem. minres attempts to minimize the residual (off-diagonal) correlation matrix. It produces solutions similar to maximum likelihood solutions, but will work even if the matrix is singular.

factor.pa (deprecated) Principal Axis factor analysis is a least squares, iterative solution to the factor problem. PA will work for cases where maximum likelihood techniques (factanal) will not work. The original communality estimates are either the squared multiple correlations (smc) for each item or 1.

factor.wls (deprecated) Weighted least squares factor analysis is a least squares, iterative solution to the factor problem. It minimizes the (weighted) squared residual matrix. The weights are based upon the independent contribution of each variable.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values. Note that PCA is not the same as factor analysis, and the two should not be confused.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the square root of the product of the sums of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings. (A formula is given following this list.)

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

nfactors combines VSS, MAP, and a number of other fit statistics. The depressing reality is that frequently these conventional fit estimates of the number of factors do not agree.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.
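Stated as a formula (a restatement of the verbal definition of factor.congruence given above, with x and y the two vectors of loadings):

rc = Σ xi yi / √(Σ xi² Σ yi²)

The Pearson correlation would instead subtract the mean loading from each vector before forming the same ratio.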

4.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix R (of n variables) be approximated by the product of a factor matrix F (n variables by k factors) and its transpose plus a diagonal matrix of uniquenesses?

R = FF′ + U²                  (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 2). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", "varimin", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

Note that if extracting more than one factor and doing any oblique rotation, it is necessary to have the GPArotation package installed. This is checked for in the appropriate functions.

4.1.2 Principal Axis Factor Analysis

An alternative, least squares algorithm (included in fa with the fm="pa" option, or as a standalone function, factor.pa) does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.

The solutions from the fa, the factor.minres, and factor.pa, as well as the principal functions, can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax, quartimax, varimin, and bifactor. Oblique transformations include oblimin, quartimin, biquartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 3).

4.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 4).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 17). Factors were transformed obliquely using an oblimin. These solutions may be shown as item by factor plots (Figure 17) or by a structure diagram (Figure 18).

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but

Table 2: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> if(!require('GPArotation')) {stop('GPArotation must be installed to do rotations')} else {
+   f3t <- fa(Thurstone,3,n.obs=213)
+   f3t }
Factor Analysis using method =  minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    MR1   MR2   MR3   h2   u2 com
Sentences          0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary         0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion    0.83  0.04  0.00 0.73 0.27 1.0
First.Letters      0.00  0.86  0.00 0.73 0.27 1.0
Four.Letter.Words -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes           0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series      0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees          0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group      -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63

Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> if(!require('GPArotation')) {stop('GPArotation must be installed to do rotations')} else {
+   f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
+   f3o <- target.rot(f3)
+   f3o }
Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                    PA1   PA2   PA3   h2   u2
Sentences          0.89 -0.03  0.07 0.81 0.19
Vocabulary         0.89  0.07  0.00 0.80 0.20
Sent.Completion    0.83  0.04  0.03 0.70 0.30
First.Letters     -0.02  0.85 -0.01 0.73 0.27
Four.Letter.Words -0.05  0.74  0.09 0.57 0.43
Suffixes           0.17  0.63 -0.09 0.43 0.57
Letter.Series     -0.06 -0.08  0.84 0.69 0.31
Pedigrees          0.33 -0.09  0.48 0.37 0.63
Letter.Group      -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

    PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00

Table 4: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)
Factor Analysis using method =  wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                    WLS1   WLS2   WLS3    h2    u2  com
Sentences          0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary         0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion    0.833  0.034  0.007 0.735 0.265 1.00
First.Letters     -0.002  0.855  0.003 0.731 0.269 1.00
Four.Letter.Words -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes           0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series      0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees          0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group      -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90% confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627

> plot(f3t)


Figure 17: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

> fa.diagram(f3t)


Figure 18: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.

that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

4.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Some psychologists use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 5 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.

4.1.5 Hierarchical and bi-factor solutions

For a long time, structural analyses of the ability domain have considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 18) can themselves be factored. This results in a higher order factor model (Figure 19). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 20). Yet another solution is to use structural equation modeling to directly solve for the general and group factors.

Table 5: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 3. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p
Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    RC1   RC2   RC3   h2   u2 com
Sentences          0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary         0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion    0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters      0.01  0.84  0.07 0.78 0.22 1.0
Four.Letter.Words -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes           0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series      0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees          0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group      -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
 with the empirical chi square 56.17  with prob < 1.1e-07

Fit based upon off diagonal values = 0.98

> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))


Figure 19: A higher order factor solution to the Thurstone 9 variable problem.

> om <- omega(Thurstone,n.obs=213)


Figure 20: A bifactor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011).
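A minimal sketch, assuming that "bifactor" is accepted by the rotate option of fa (it is listed with the orthogonal rotations earlier; check the help page of the installed version):

> f3b <- fa(Thurstone,3,n.obs=213,rotate="bifactor")   #orthogonal bifactor rotation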

4.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979). The basic steps are listed below (a rough sketch of the generic procedure, in code, follows the list):

1. Find the proximity (e.g., correlation) matrix,

2. Identify the most similar pair of items,

3. Combine this most similar pair of items to form a new variable (cluster),

4. Find the similarity of this cluster to all other items and clusters,

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains, or in iclust, if there is a failure to increase reliability coefficients α or β),

6. Purify the solution by reassigning items to the most similar cluster center.
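The following is a rough sketch of the generic agglomerative procedure (steps 1 through 5) using base R's hclust on a correlation-based distance. It is not the iclust algorithm itself, which instead uses the α and β criteria described below:

> R <- cor(bfi[1:25],use="pairwise")   #step 1: the proximity (correlation) matrix
> d <- as.dist(1-abs(R))               #similar items are given small distances
> hc <- hclust(d,method="average")     #steps 2-5: repeatedly merge the closest pair
> plot(hc)                             #show the resulting tree (dendrogram)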


iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 21). Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 8). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])


Figure 21: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.


Table 6: The summary statistics from an iclust analysis shows three large clusters and a smaller cluster.

> summary(ic)  #show the results
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


The previous analysis (Figure 21) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 22). Note that finding the polychoric correlations the first time takes some time, but the next three analyses were done using that correlation matrix (r.poly$rho). When using the console for input, polychoric will report on its progress while working using progressBar.

Table 7: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> r.poly <- polychoric(bfi[1:25])  #the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

4.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 9). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

Bootstrapped confidence intervals can also be found for the loadings of a factoring of a polychoric matrix. fa.poly will find the polychoric correlation matrix and, if the n.iter option is greater than 1, will then randomly resample the data (case wise) to give bootstrapped samples. This will take a long time for large numbers of items or iterations.
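For example, an illustrative (and, even with few iterations, slow) call might be

> fp <- fa.poly(bfi[1:10], 2, n.iter = 5)  #bootstrapped factoring of polychoric correlations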

4.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of Burt's congruence coefficient (also known as Tucker's


> ic.poly <- iclust(r.poly$rho, title = "ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[ICLUST dendrogram: "ICLUST using polychoric correlations" (see the caption of Figure 22 below)]

Figure 22: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 21), which was done using Pearson correlations.


> ic.poly <- iclust(r.poly$rho, 5, title = "ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[ICLUST dendrogram: "ICLUST using polychoric correlations for nclusters=5" (see the caption of Figure 23 below)]

Figure 23: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 22), which was done without specifying the number of clusters, and to the next one (Figure 24), which was done by changing the beta criterion.


> ic.poly <- iclust(r.poly$rho, beta.size = 3, title = "ICLUST, beta.size=3")

[ICLUST dendrogram: "ICLUST, beta.size=3" (see the caption of Figure 24 below)]

Figure 24: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 21, 22, 23).


Table 8: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.0.

> print(ic, cut = .3)
ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P   C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL


Table 9: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10], 2, n.iter = 20)
Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method =  minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with Likelihood Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.006 and the 90 % confidence intervals are 0.006 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                              MR2  MR1
Correlation of scores with factors           0.86 0.88
Multiple R square of scores with factors     0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.44 -0.40 -0.37
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.07 -0.03  0.00  0.73  0.77  0.80
A4  0.10  0.15  0.20  0.40  0.44  0.48
A5 -0.02  0.02  0.06  0.57  0.62  0.67
C1  0.54  0.57  0.60 -0.09 -0.05 -0.02
C2  0.58  0.62  0.67 -0.04 -0.01  0.02
C3  0.48  0.54  0.58  0.00  0.03  0.08
C4 -0.69 -0.66 -0.60 -0.03  0.01  0.04
C5 -0.62 -0.57 -0.52 -0.08 -0.05 -0.02

 Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.27     0.32  0.37
>


coefficient), which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \, \sum f_{jk}^2}}$$
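As a sketch of the definition (the factor.congruence function does this for entire loading matrices), the coefficient for two loading vectors is simply

> congruence <- function(f1, f2) { sum(f1 * f2)/sqrt(sum(f1^2) * sum(f2^2)) }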

Consider the case of a four factor solution and four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25], 4, fm = "pa")
> factor.congruence(f4, ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 10).

Table 10: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t, f3o, om, p3p))
     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

4.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" Horn and Engstrom (1979).


Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples the eigen values of random factors will all tend towards 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that nonetheless account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
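For instance, one might simulate a structure with three major factors plus several minor ones and then see what the various rules suggest (the parameter values here are arbitrary illustrations; see ?sim.minor):

> set.seed(42)
> minor <- sim.minor(nvar = 12, nfact = 3, n = 500)  #3 major factors plus minor factors
> fa.parallel(minor$observed)                        #how many factors are now suggested?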


4.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Average Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 25). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")

[Plot: "Very Simple Structure of a Big 5 inventory" -- Very Simple Structure Fit (y axis) against Number of Factors (x axis), with separate curves for item complexities 1 through 4 (see the caption of Figure 25 below)]

Figure 25: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss
Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximimum of 0.58 with 4 factors
VSS complexity 2 achieves a maximimum of 0.74 with 4 factors
The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit  RMSEA  BIC   SABIC complex
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.0150 9648 10522.1     1.0
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.0100 5287  6084.5     1.2
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.0075 3200  3924.3     1.3
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.0055 1731  2385.1     1.4
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.0030  281   869.3     1.6
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.0018 -296   228.5     1.7
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.0013 -463     1.2     1.9
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.0010 -524  -117.6     1.9
  eChisq  SRMR eCRMS  eBIC
1  23881 0.119 0.125 21698
2  12432 0.086 0.094 10440
3   7232 0.066 0.075  5422
4   3750 0.047 0.057  2115
5   1495 0.030 0.038    27
6    670 0.020 0.027  -639
7    448 0.016 0.023  -711
8    289 0.013 0.020  -727

4.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data, as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 26). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly. By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
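An illustrative call (this can take several minutes for many items) might be

> fa.parallel.poly(bfi[1:10], n.iter = 10)  #parallel analysis of polychoric correlations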


> fa.parallel(bfi[1:25], main = "Parallel Analysis of a Big 5 inventory")

Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Scree plot: "Parallel Analysis of a Big 5 inventory" -- eigen values of principal components and factor analysis (y axis) against Factor/Component Number (x axis), for PC and FA solutions of the actual, simulated, and resampled data (see the caption of Figure 26 below)]

Figure 26: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


4.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 27).

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).

4.6 Exploratory Structural Equation Modeling (ESEM)

Generalizing the procedures of factor extension, we can do Exploratory Structural Equation Modeling (ESEM). Traditional Exploratory Factor Analysis (EFA) examines how latent variables can account for the correlations within a data set. All loadings and cross loadings are found, and rotation is done to some approximation of simple structure. Traditional Confirmatory Factor Analysis (CFA) tests such models by fitting just a limited number of loadings and typically does not allow any (or many) cross loadings. Structural Equation Modeling then applies two such measurement models, one to a set of X variables, another to a set of Y variables, and then tries to estimate the correlation between these two sets of latent variables. (Some SEM procedures estimate all the parameters from the same model, thus making the loadings in set Y affect those in set X.) It is possible to do a similar exploratory modeling (ESEM) by conducting two Exploratory Factor Analyses, one in set X, one in set Y, and then finding the correlations of the X factors with the Y factors, as well as the correlations of the Y variables with the X factors and the X variables with the Y factors.

Consider the simulated data set of three ability variables, two motivational variables, and three outcome variables:

Call: sim.structural(fx = fx, Phi = Phi, fy = fy)

$model (Population correlation matrix)
        V    Q    A nach   Anx  gpa  Pre   MA
V    1.00 0.72 0.54 0.00  0.00 0.38 0.32 0.25
Q    0.72 1.00 0.48 0.00  0.00 0.34 0.28 0.22
A    0.54 0.48 1.00 0.48 -0.42 0.50 0.42 0.34
nach 0.00 0.00 0.48 1.00 -0.56 0.34 0.28 0.22


> v16 <- sim.item(16)
> s <- c(1, 3, 5, 7, 9, 11, 13, 15)
> f2 <- fa(v16[, s], 2)
> fe <- fa.extension(cor(v16)[s, -s], f2)
> fa.diagram(f2, fe = fe)

[Diagram: "Factor analysis and extension" -- the two factors (MR1, MR2) of the odd-numbered items extended to the even-numbered items (see the caption of Figure 27 below)]

Figure 27: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


Anx  0.00 0.00 -0.42 -0.56  1.00 -0.29 -0.24 -0.20
gpa  0.38 0.34  0.50  0.34 -0.29  1.00  0.30  0.24
Pre  0.32 0.28  0.42  0.28 -0.24  0.30  1.00  0.20
MA   0.25 0.22  0.34  0.22 -0.20  0.24  0.20  1.00

$reliability (population reliability)
   V    Q    A nach  Anx  gpa  Pre   MA
0.81 0.64 0.72 0.64 0.49 0.36 0.25 0.16

We can fit this by using the esem function and then draw the solution (see Figure 28) using the esem.diagram function (which is normally called automatically by esem).

Exploratory Structural Equation Modeling Analysis using method =  minres
Call: esem(r = gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
    n.obs = 1000, plot = FALSE)

For the X set:
      MR1   MR2
V    0.91 -0.06
Q    0.81 -0.05
A    0.53  0.57
nach -0.10 0.81
Anx  0.08 -0.71

For the Y set:
    MR1
gpa 0.6
Pre 0.5
MA  0.4

Correlations between the X and Y sets.
     X1   X2   Y1
X1 1.00 0.19 0.68
X2 0.19 1.00 0.67
Y1 0.68 0.67 1.00

The degrees of freedom for the null model are 56 and the empirical chi square function was 6930.29
The degrees of freedom for the model are 7 and the empirical chi square function was 21.83
with prob < 0.0027

The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.04

with the empirical chi square 21.83 with prob < 0.0027
The total number of observations was 1000 with fitted Chi Square = 2175.06 with prob < 0

Empirical BIC = -26.53
ESABIC = -4.29
Fit based upon off diagonal values = 1
To see the item loadings for the X and Y sets combined, and the associated fa output, print with short=FALSE.

[Diagram: "Exploratory Structural Model" -- two X factors (MR1, MR2, defined by V, Q, A, nach, Anx) predicting one Y factor (defined by gpa, Pre, MA) (see the caption of Figure 28 below)]

Figure 28: An example of an Exploratory Structure Equation Model.


5 Classical Test Theory and Reliability

Surprisingly, 110 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V}_x)}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha$$
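The formula is easy to verify numerically; a minimal sketch that finds α directly from a covariance matrix (the alpha function, of course, reports much more):

> alpha.from.cov <- function(V) {        #V is the covariance matrix of the items
+     n <- ncol(V)
+     (n/(n - 1)) * (1 - tr(V)/sum(V))   #tr is the trace function in psych
+ }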

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. Alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j², and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum (1 - r_{smc}^2)}{V_x}$$
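A corresponding sketch for λ6, using the smc function to find the squared multiple correlations (for standardized items Vx is just the sum of the correlation matrix):

> lambda6 <- function(R) { 1 - sum(1 - smc(R))/sum(R) }  #R is a correlation matrix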

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.
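Standardized alpha may be sketched directly from the average inter-item correlation rbar as n*rbar/(1 + (n-1)*rbar):

> alpha.std <- function(R) {          #R is the item correlation matrix
+     n <- ncol(R)
+     rbar <- mean(R[lower.tri(R)])   #the average inter-item correlation
+     n * rbar/(1 + (n - 1) * rbar)
+ }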


More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha: Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman: Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (μ0 … μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega: Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor: Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and an n x n correlation matrix, find the correlations of the composite clusters.

scoreItems: Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice: Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

5.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as if one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n = 500, raw = TRUE)$observed
> round(cor(r9), 2)
     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
0.77 0.8 0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reversed scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that items 2 and 6 (complaints and critical) are to be scored (incorrectly) negatively.

> alpha(attitude, keys = c("complaints", "critical"))

Reliability analysis
Call: alpha(x = attitude, keys = c("complaints", "critical"))

  raw_alpha std.alpha G6(smc) average_r  S/N  ase mean  sd
       0.18      0.27    0.66      0.05 0.37 0.19   53 4.7

 lower alpha upper     95% confidence boundaries
-0.18 0.18 0.55

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r    S/N alpha se
rating         -0.023     0.090    0.52    0.0162  0.099    0.265
complaints-     0.689     0.666    0.76    0.2496  1.995    0.078
privileges     -0.133     0.021    0.58    0.0036  0.022    0.282
learning       -0.459    -0.251    0.41   -0.0346 -0.201    0.363
raises         -0.178    -0.062    0.47   -0.0098 -0.058    0.295
critical-       0.364     0.473    0.76    0.1299  0.896    0.139
advance        -0.137    -0.016    0.52   -0.0026 -0.016    0.258

 Item statistics
             n  raw.r  std.r r.cor r.drop mean   sd
rating      30  0.600  0.601  0.63   0.28   65 12.2
complaints- 30 -0.542 -0.559 -0.78  -0.75   50 13.3
privileges  30  0.679  0.663  0.57   0.39   53 12.2
learning    30  0.857  0.853  0.90   0.70   56 11.7
raises      30  0.719  0.730  0.77   0.50   65 10.4
critical-   30  0.015  0.036 -0.29  -0.27   42  9.9
advance     30  0.688  0.694  0.66   0.46   43 10.3
>

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if items 2 and 6 are dropped, then the reliability is improved substantially. This suggests that items 2 and 6 were incorrectly scored. Doing the analysis again with the items positively scored produces much more favorable results.

> alpha(attitude)

Reliability analysis
Call: alpha(x = attitude)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
0.76 0.84 0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N = 500, short = FALSE, low = -2, high = 2, categorical = TRUE)  #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
0.68 0.72 0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0


5.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 29) or omegaSem for a confirmatory analysis using the sem package based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh.

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal).

For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the √rab, where rab is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness u² from factor analysis to find e_j². This is based on a decomposition of the variance of a test score, Vx, into four parts: that due to a general factor, g, that due to a set of group factors, f (factors common to some but not all of the items), specific factors, s, unique to each item, and e, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting $\vec{x} = \vec{c}g + \vec{A}f + \vec{D}s + \vec{e}$, then the communality of item j, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$, and the unique variance for the item, $u_j^2 = \sigma_j^2 (1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u^2}{V_x}$$

Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

$$\omega_h = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{\infty} = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}$$
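These formulas may be sketched directly from a Schmid-Leiman solution before calling omega itself (a sketch only; the column names follow the schmid output, and r9 is the simulated data set created above):

> sl <- schmid(cor(r9), 3)$sl          #a general factor plus 3 group factors
> g <- sl[, "g"]                       #the general factor loadings
> A <- sl[, c("F1*", "F2*", "F3*")]    #the group factor loadings
> Vx <- sum(cor(r9))                   #the variance of the standardized total score
> omega.h <- sum(g)^2/Vx               #general factor saturation
> omega.t <- (sum(g)^2 + sum(colSums(A)^2))/Vx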

> om9 <- omega(r9, title = "9 simulated variables")

[Diagram: "9 simulated variables" -- a bifactor solution with a general factor g and three group factors F1-F3 for items V1-V9 (see the caption of Figure 29 below)]

Figure 29: A bifactor solution for 9 simulated variables with a hierarchical structure.

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.


> om9
9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.001 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09
RMSEA index = 0.01 and the 90 % confidence intervals are 0.01 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19


5.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9, n.obs = 500, lavaan = FALSE)

Call: omegaSem(m = r9, n.obs = 500, lavaan = FALSE)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =  1.46
mean percent general = 0.64   with sd = 0.08 and cv of 0.13
Explained Common Variance of the general factor = 0.65

The degrees of freedom are 12 and the fit is 0.03
The number of observations was 500 with Chi Square = 15.86 with prob < 0.2
The root mean square of the residuals is 0.02
The df corrected root mean square of the residuals is 0.03
RMSEA index = 0.001 and the 90 % confidence intervals are NA 0.055
BIC = -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27 and the fit is 0.32
The number of observations was 500 with Chi Square = 159.82 with prob < 8.3e-21
The root mean square of the residuals is 0.08
The df corrected root mean square of the residuals is 0.09
RMSEA index = 0.01 and the 90 % confidence intervals are 0.01 0.114
BIC = -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 The following analyses were done using the sem package

 Omega Hierarchical from a confirmatory model using sem = 0.72
 Omega Total from a confirmatory model using sem = 0.83
With loadings of
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.74 0.40           0.71 0.29 0.77
V2 0.58 0.36           0.47 0.53 0.72
V3 0.56 0.42           0.50 0.50 0.63
V4 0.57      0.45      0.53 0.47 0.61
V5 0.58      0.25      0.40 0.60 0.84
V6 0.43      0.26      0.26 0.74 0.71
V7 0.49           0.38 0.38 0.62 0.63
V8 0.44           0.45 0.39 0.61 0.50
V9 0.34           0.24 0.17 0.83 0.68

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33

general/max  5.51   max/min =  1.41
mean percent general = 0.68   with sd = 0.1 and cv of 0.15
Explained Common Variance of the general factor = 0.68

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.87  0.59  0.59  0.55
Multiple R square of scores with factors      0.75  0.35  0.34  0.30
Minimum correlation of factor score estimates 0.50 -0.31 -0.31 -0.39

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.57 0.65
Omega general for total scores and subscales  0.72 0.57 0.33 0.48
Omega group for total scores and subscales    0.11 0.22 0.24 0.18

To get the standard sem fit statistics, ask for summary on the fitted object.

5.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf and guttman functions. These are described in more detail in Revelle and Zinbarg (2009) and in Revelle and Condon (2014). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) = 0.84
Guttman lambda 6                          = 0.8
Average split half reliability            = 0.79
Guttman lambda 3 (alpha)                  = 0.8
Minimum split half reliability (beta)     = 0.72

5.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

5.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function and a list of items to be scored on each scale (a keys.list). Items may be listed by location (convenient, but dangerous) or name (probably safer).

Make a keys.list by specifying the items for each scale, preceding items to be negatively keyed with a - sign.

> #the newer way is probably preferred
>
> keys.list <- list(agree = c("-A1", "A2", "A3", "A4", "A5"),
+    conscientious = c("C1", "C2", "C2", "-C4", "-C5"),
+    extraversion = c("-E1", "-E2", "E3", "E4", "E5"),
+    neuroticism = c("N1", "N2", "N3", "N4", "N5"),
+    openness = c("O1", "-O2", "O3", "O4", "-O5"))
> #this can also be done by location--
> keys.list <- list(Agree = c(-1, 2:5), Conscientious = c(6:8, -9, -10),
+    Extraversion = c(-11, -12, 13:15), Neuroticism = c(16:20),
+    Openness = c(21, -22, 23, 24, -25))
> #These two approaches can be mixed if desired
> keys.list <- list(agree = c("-A1", "A2", "A3", "A4", "A5"), conscientious = c("C1", "C2", "C2", "-C4", "-C5"),
+    extraversion = c("-E1", "-E2", "E3", "E4", "E5"),
+    neuroticism = c(16:20), openness = c(21, -22, 23, 24, -25))
> keys.list

$agree
[1] "-A1" "A2"  "A3"  "A4"  "A5"

$conscientious
[1] "C1"  "C2"  "C2"  "-C4" "-C5"

$extraversion
[1] "-E1" "-E2" "E3"  "E4"  "E5"

$neuroticism
[1] 16 17 18 19 20

$openness
[1] 21 -22 23 24 -25

In the past (prior to version 1.6.9), the keys.list was then converted to a keys matrix using the helper function make.keys. This is no longer necessary. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
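For example, to report total rather than average scores, set the totals option in the call (using the keys.list created above):

> scores.total <- scoreItems(keys.list, bfi, totals = TRUE)  #sum scores rather than item means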

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copied into R.) Although not normally necessary, show the keys to understand what is happening. There are two ways to make up the keys: you can specify the items by location (the old way) or by name (the newer and probably preferred way). To use the newer way, you must specify the file on which you will use the keys. The example below shows how to construct keys either way.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. This is done automatically in the new way of forming keys, but if using the older way, the number must be specified. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

Then use this keys list to score the items:

> scores <- scoreItems(keys.list, bfi)
> scores

Call: scoreItems(keys = keys.list, items = bfi)

(Unstandardized) Alpha:
      agree conscientious extraversion neuroticism openness
alpha   0.7          0.69         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    agree conscientious extraversion neuroticism openness
ASE 0.014         0.016        0.013       0.011    0.017

Average item correlation:
          agree conscientious extraversion neuroticism openness
average.r  0.32          0.35         0.39        0.46     0.23

Guttman 6* reliability:
         agree conscientious extraversion neuroticism openness
Lambda.6   0.7          0.69         0.76        0.81      0.6

Signal/Noise based upon av.r:
             agree conscientious extraversion neuroticism openness
Signal/Noise   2.3           2.2          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, alpha on the diagonal
 corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.70          0.36         0.63      -0.245     0.23
conscientious  0.25          0.69         0.37      -0.334     0.33
extraversion   0.46          0.27         0.76      -0.284     0.32
neuroticism   -0.18         -0.25        -0.22       0.812    -0.12
openness       0.15          0.21         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data
print with the short option = FALSE

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores (see Figure 30).

5.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use scoreItems.

> r.bfi <- cor(bfi, use = "pairwise")
> scales <- scoreItems(keys.list, r.bfi)
> summary(scales)

Call: scoreItems(keys = keys.list, items = r.bfi)

Scale intercorrelations corrected for attenuation
 raw correlations below the diagonal, (standardized) alpha on the diagonal
 corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.71          0.35         0.64      -0.242     0.25
conscientious  0.24          0.69         0.38      -0.314     0.33
extraversion   0.47          0.27         0.76      -0.278     0.35
neuroticism   -0.18         -0.24        -0.22       0.815    -0.11
openness       0.16          0.22         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the


> png('scores.png')
> pairs.panels(scores$scores, pch = '.', jiggle = TRUE)
> dev.off()
pdf
  2

Figure 30: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.

5.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors) and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4, 4, 4, 6,  6, 3, 4, 4,  5, 2, 2, 4,  3, 2, 6, 7)
> score.multiple.choice(iq.keys, iqitems)

Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
         key    0    1    2    3    4    5    6    7    8 miss    r    n mean
reason4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64
reason16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70
reason17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70
reason19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62
letter7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60
letter33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57
letter34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61
letter58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44
matrix45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53
matrix46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55
matrix47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61
matrix55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37
rotate3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19
rotate4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21
rotate6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30
rotate8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19
           sd
reason4  0.48
reason16 0.46
reason17 0.46
reason19 0.49
letter7  0.49
letter33 0.50
letter34 0.49
letter58 0.50
matrix45 0.50
matrix46 0.50
matrix47 0.49
matrix55 0.48
rotate3  0.40
rotate4  0.41
rotate6  0.46
rotate8  0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score = FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63
            se
reason.4  0.01
reason.16 0.01
reason.17 0.01
reason.19 0.01
letter.7  0.01
letter.33 0.01
letter.34 0.01
letter.58 0.01
matrix.45 0.01
matrix.46 0.01
matrix.47 0.01
matrix.55 0.01
rotate.3  0.01
rotate.4  0.01
rotate.6  0.01
rotate.8  0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
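As a minimal sketch (the single "ability" key lumping all 16 items into one scale is purely illustrative):

ability.keys <- make.keys(16, list(ability=1:16))  #an illustrative one scale key
ability.scale <- scoreItems(ability.keys, iq.tf)   #score the true/false items
ability.scale$alpha   #compare with the alpha reported by score.multiple.choice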


5.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss) or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches.
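A minimal sketch of this sequence, using the first five bfi items (the agreeableness scale) purely for illustration:

describe(bfi[1:5])                 #basic descriptive statistics
fa.parallel(bfi[1:5])              #how many dimensions?
alpha(bfi[1:5], check.keys=TRUE)   #item-whole statistics for a single scale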

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 31).

5.6.1 Exploring the item structure of scales

The Big Five scales found above can be understood in terms of the item - whole correlations, but it is also useful to think of the endorsement frequency of the items. The item.lookup function will sort items by their factor loading/item-whole correlation, and then resort those above a certain threshold in terms of the item means. Item content is shown by using the dictionary developed for those items. This allows one to see the structure of each scale in terms of its endorsement range. This is a simple way of thinking of items that is also possible to do using the various IRT approaches discussed later.

> m <- colMeans(bfi,na.rm=TRUE)
> item.lookup(scales$item.corrected[,1:3],m,dictionary=bfi.dictionary[1:2])

     agree conscientious extraversion means ItemLabel
A1   -0.40         -0.06        -0.10  2.41     q_146
E1   -0.31         -0.07        -0.59  2.97     q_712
E2   -0.39         -0.26        -0.70  3.14     q_901
E3    0.45          0.21         0.61  4.00    q_1205
E4    0.51          0.24         0.68  4.42    q_1410
E5    0.35          0.40         0.56  4.42    q_1768
A5    0.62          0.22         0.55  4.56    q_1419
A3    0.70          0.22         0.48  4.60    q_1206
A4    0.49          0.30         0.30  4.70    q_1364
A2    0.67          0.21         0.40  4.80    q_1162
C4   -0.23         -0.67        -0.23  2.55     q_626
N4   -0.22         -0.31        -0.39  3.19    q_1479
C5   -0.26         -0.58        -0.29  3.30    q_1949
C3    0.21          0.56         0.15  4.30     q_619
C2    0.21          0.61         0.18  4.37     q_530
E5.1  0.35          0.40         0.56  4.42    q_1768
C1    0.13          0.54         0.20  4.50     q_124
E1.1 -0.31         -0.07        -0.59  2.97     q_712
E2.1 -0.39         -0.26        -0.70  3.14     q_901
N4.1 -0.22         -0.31        -0.39  3.19    q_1479
E3.1  0.45          0.21         0.61  4.00    q_1205
E4.1  0.51          0.24         0.68  4.42    q_1410


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys,iqitems,score=TRUE,short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2)) #set this to see the output for multiple items
> irt.responses(scores$scores,iqitems[1:4],breaks=11)

[Figure 31 shows four panels (reason.4, reason.16, reason.17, reason.19), each plotting P(response) for response alternatives 0 through 6 against theta.]

Figure 31: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


E5.2  0.35          0.40         0.56  4.42    q_1768
O3    0.26          0.22         0.43  4.44     q_492
A5.1  0.62          0.22         0.55  4.56    q_1419
A3.1  0.70          0.22         0.48  4.60    q_1206
A2.1  0.67          0.21         0.40  4.80    q_1162
O1    0.17          0.21         0.33  4.82     q_128

Item

A1 Am indifferent to the feelings of others

E1 Don't talk a lot

E2 Find it difficult to approach others

E3 Know how to captivate people

E4 Make friends easily

E5 Take charge

A5 Make people feel at ease

A3 Know how to comfort others

A4 Love children

A2 Inquire about others' well-being

C4 Do things in a half-way manner

N4 Often feel blue

C5 Waste my time

C3 Do things according to a plan

C2 Continue until everything is perfect

E51 Take charge

C1 Am exacting in my work

E1.1 Don't talk a lot

E21 Find it difficult to approach others

N41 Often feel blue

E31 Know how to captivate people

E41 Make friends easily

E52 Take charge

O3 Carry the conversation to a higher level

A51 Make people feel at ease

A31 Know how to comfort others

A2.1 Inquire about others' well-being

O1 Am full of ideas

5.6.2 Empirical scale construction

There are some situations where one wants to identify those items that most relate to a particular criterion. Although this will capitalize on chance and the results should be interpreted cautiously, it does give a feel for what is being measured. Consider the following example from the bfi data set. The items that best predicted gender, education, and age may be found using the bestScales function. This also shows the use of a dictionary that has the item content.

> data(bfi)
> bestScales(bfi,criteria=c("gender","education","age"),cut=.1,dictionary=bfi.dictionary[1:3])

The items most correlated with the criteria yield r's of:
          correlation n.items
gender           0.32       9
education        0.14       1
age              0.24       9

The best items, their correlations and content are:

$gender
  Row.names     gender ItemLabel                                      Item
1        N5  0.2106171    q_1505                              Panic easily
2        A2  0.1820202    q_1162          Inquire about others' well-being
3        A1 -0.1571373     q_146 Am indifferent to the feelings of others
4        A3  0.1401320    q_1206               Know how to comfort others
5        A4  0.1261747    q_1364                             Love children
6        E1 -0.1261018     q_712                          Don't talk a lot
7        N3  0.1207718    q_1099                 Have frequent mood swings
8        O1 -0.1031059     q_128                          Am full of ideas
9        A5  0.1007091    q_1419                  Make people feel at ease

Giant3

1 Stability

2 Cohesion

3 Cohesion

4 Cohesion

5 Cohesion

6 Plasticity

7 Stability

8 Plasticity

9 Cohesion

$education

  Row.names  education ItemLabel                                      Item
1        A1 -0.1415734     q_146 Am indifferent to the feelings of others

Giant3

1 Cohesion

$age

  Row.names        age ItemLabel                                      Item
1        A1 -0.1609637     q_146 Am indifferent to the feelings of others
2        C4 -0.1482659     q_626            Do things in a half-way manner
3        A4  0.1442473    q_1364                             Love children
4        A5  0.1290935    q_1419                  Make people feel at ease
5        E5  0.1146922    q_1768                               Take charge
6        A2  0.1142767    q_1162          Inquire about others' well-being
7        N3 -0.1108648    q_1099                 Have frequent mood swings
8        E2 -0.1051612     q_901      Find it difficult to approach others
9        N5 -0.1043322    q_1505                              Panic easily

Giant3

1 Cohesion

2 Stability

3 Cohesion

4 Cohesion

5 Plasticity

6 Cohesion

7 Stability

8 Plasticity

9 Stability

6 Item Response Theory analysis

Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location as well as item and test information (see Figure 32).


6.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δ_j, and item discrimination, α_j, may be found from factor analysis of the tetrachoric correlations, where λ_j is just the factor loading on the first factor and τ_j is the normal threshold reported by the tetrachoric function:

$$\delta_j = \frac{D\tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \tag{2}$$

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λ_j) and item thresholds (τ_j) are just

$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}.$$
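A small numerical sketch of the conversion in equation 2, with hypothetical values for the loading and threshold:

lambda <- 0.7   #hypothetical loading on the first factor
tau <- 0.5      #hypothetical threshold from tetrachoric
D <- 1          #1 for the normal ogive model, 1.702 for the logistic
d.j <- D * tau / sqrt(1 - lambda^2)  #item difficulty (delta)
a.j <- lambda / sqrt(1 - lambda^2)   #item discrimination (alpha)
round(c(difficulty = d.j, discrimination = a.j), 2)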

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty:

> set.seed(17)
> d9 <- sim.irt(9,1000,-2.5,2.5,mod="normal") #dichotomous items
> test <- irt.fa(d9$items)
> test

Item Response Analysis using Factor Analysis
Call: irt.fa(x = d9$items)

Item Response Analysis using Factor Analysis

Summary information by factor and item

Factor = 1

-3 -2 -1 0 1 2 3

V1          0.48 0.59 0.24 0.06 0.01 0.00  0.00
V2          0.30 0.68 0.45 0.12 0.02 0.00  0.00
V3          0.12 0.50 0.74 0.29 0.06 0.01  0.00
V4          0.05 0.26 0.74 0.55 0.14 0.03  0.00
V5          0.01 0.07 0.44 1.05 0.41 0.06  0.01
V6          0.00 0.03 0.15 0.60 0.74 0.24  0.04
V7          0.00 0.01 0.04 0.22 0.73 0.63  0.16
V8          0.00 0.00 0.02 0.12 0.45 0.69  0.31
V9          0.00 0.01 0.02 0.08 0.25 0.47  0.36
Test Info   0.98 2.14 2.85 3.09 2.81 2.14  0.89
SEM         1.01 0.68 0.59 0.57 0.60 0.68  1.06
Reliability -0.02 0.53 0.65 0.68 0.64 0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27 and the objective function was 1.2
The number of observations was 1000 with Chi Square = 1195.58 with prob < 7.3e-235

The root mean square of the residuals (RMSA) is 0.09
The df corrected root mean square of the residuals is 0.1

Tucker Lewis Index of factoring reliability = 0.699
RMSEA index = 0.043 and the 90 % confidence intervals are 0.043 0.218
BIC = 1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis
Call: irt.fa(x = bfi[11:15])

Item Response Analysis using Factor Analysis

Summary information by factor and item

Factor = 1

-3 -2 -1 0 1 2 3

E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5 and the objective function was 0.05
The number of observations was 2800 with Chi Square = 135.92 with prob < 1.3e-27

The root mean square of the residuals (RMSA) is 0.04
The df corrected root mean square of the residuals is 0.06

Tucker Lewis Index of factoring reliability = 0.932
RMSEA index = 0.009 and the 90 % confidence intervals are 0.009 0.111
BIC = 96.23

The item information functions show that not all of the items are equally good (Figure 33). These procedures can be generalized to more than one factor by specifying the number of factors in irt.fa.


> op <- par(mfrow=c(3,1))
> plot(test,type="ICC")
> plot(test,type="IIC")
> plot(test,type="test")
> op <- par(mfrow=c(1,1))

[Three panels against the Latent Trait (normal scale): "Item parameters from factor analysis" (probability of response for V1-V9), "Item information from factor analysis" (item information curves for V1-V9), and "Test information: item parameters from factor analysis" (test information with a reliability axis).]

Figure 32: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt,type="IIC")

[Plot: "Item information from factor analysis for factor 1": item information curves for E1-E5 against the Latent Trait (normal scale).]

Figure 33: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info,sort=TRUE)

Item Response Analysis using Factor Analysis

Summary information by factor and item

Factor = 1

-3 -2 -1 0 1 2 3

E2          0.48 0.89 1.18 1.24 0.99 0.59 0.25
E4          0.63 0.98 1.11 0.96 0.67 0.37 0.16
E1          0.37 0.58 0.75 0.76 0.61 0.39 0.21
E3          0.34 0.49 0.58 0.57 0.49 0.36 0.23
E5          0.33 0.42 0.47 0.45 0.37 0.27 0.17
Test Info   2.15 3.36 4.09 3.98 3.13 1.97 1.02
SEM         0.68 0.55 0.49 0.50 0.56 0.71 0.99
Reliability 0.53 0.70 0.76 0.75 0.68 0.49 0.02

More extensive IRT packages include ltm and eRm, and should be used for serious Item Response Theory analysis.

6.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.
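A minimal sketch of this reuse, using the ability data set (the saved tetrachoric correlations are in the $rho element of the output):

first <- irt.fa(ability)    #the slow step: finds the tetrachoric correlations
second <- irt.fa(first)     #fast: reuses the correlations saved in the first analysis
om <- omega(first$rho, 4)   #omega can likewise work from the saved matrix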

In addition, recent releases of psych take advantage of the parallel package and use multi-cores. The default for Macs and Unix machines is to use two cores, but this can be increased using the options command. The biggest step up in improvement is from 1 to 2 cores, but for large problems using polychoric correlations, the more cores available, the better.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 36) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items. We can also show the total test information (merely the sum of the item information). This shows that even with just 16 items, the test is very reliable for most of the range of ability. The irt.fa function saves the correlation matrix and item statistics so that they can be redrawn with other options.

detectCores() #how many are available
options("mc.cores") #how many have been set to be used


options("mc.cores"=4) #set to use 4 cores

> iq.irt

Item Response Analysis using Factor Analysis
Call: irt.fa(x = ability)

Item Response Analysis using Factor Analysis

Summary information by factor and item

Factor = 1

-3 -2 -1 0 1 2 3

reason.4    0.05 0.24 0.64 0.53 0.16 0.03  0.01
reason.16   0.08 0.22 0.38 0.31 0.14 0.05  0.01
reason.17   0.08 0.33 0.69 0.42 0.11 0.02  0.00
reason.19   0.06 0.17 0.35 0.36 0.19 0.07  0.02
letter.7    0.05 0.18 0.41 0.44 0.20 0.06  0.02
letter.33   0.05 0.15 0.31 0.36 0.20 0.08  0.02
letter.34   0.05 0.19 0.45 0.46 0.20 0.06  0.01
letter.58   0.02 0.09 0.30 0.53 0.35 0.12  0.03
matrix.45   0.05 0.11 0.19 0.23 0.17 0.09  0.04
matrix.46   0.05 0.12 0.22 0.26 0.18 0.09  0.04
matrix.47   0.06 0.17 0.33 0.34 0.19 0.07  0.02
matrix.55   0.04 0.08 0.13 0.17 0.16 0.11  0.06
rotate.3    0.00 0.01 0.06 0.30 0.75 0.49  0.12
rotate.4    0.00 0.01 0.05 0.31 1.00 0.56  0.10
rotate.6    0.01 0.03 0.15 0.53 0.69 0.25  0.05
rotate.8    0.00 0.02 0.08 0.29 0.59 0.41  0.13
Test Info   0.67 2.11 4.73 5.83 5.28 2.55  0.69
SEM         1.22 0.69 0.46 0.41 0.44 0.63  1.20
Reliability -0.49 0.53 0.79 0.83 0.81 0.61 -0.45

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104 and the objective function was 1.91
The number of observations was 1525 with Chi Square = 2893.45 with prob < 0

The root mean square of the residuals (RMSA) is 0.08
The df corrected root mean square of the residuals is 0.09

Tucker Lewis Index of factoring reliability = 0.723
RMSEA index = 0.018 and the 90 % confidence intervals are 0.018 0.137
BIC = 2131.15

6.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for scoreIrt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items.


> iq.irt <- irt.fa(ability)

[Plot: "Item information from factor analysis": item information curves for the 16 ability items against the Latent Trait (normal scale).]

Figure 34: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


> plot(iq.irt,type="test")

[Plot: "Test information: item parameters from factor analysis": total test information and reliability against the Latent Trait (normal scale).]

Figure 35: A graphic analysis of 16 ability items sampled from the SAPA project. The total test information at all levels of difficulty may be shown by specifying the type='test' option in the plot function.


> om <- omega(iq.irt$rho,4)

[Omega diagram for the 16 ability items: each item loads on a general factor g (loadings 0.4 to 0.7) and on one of four group factors F1-F4 (the rotate, letter, matrix, and reason items respectively).]

Figure 36: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.


Simply finding the average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The scoreIrt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects.

> v9 <- sim.irt(9,1000,-2,2,mod="normal") #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333,5:9] <- NA
> items[334:666,3:7] <- NA
> items[667:1000,1:4] <- NA
> scores <- scoreIrt(test,items)
> unitweighted <- scoreIrt(items=items,keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord],scores,unitweighted)
> colnames(scores.df) <- c("True theta","irt theta","total","fit","rasch","total","fit")

These results are seen in Figure 37

6.3.1 1 versus 2 parameter IRT scoring

In Item Response Theory, items can be assumed to be equally discriminating but to differ in their difficulty (the Rasch model), or to vary in their discriminability. Two functions (scoreIrt.1pl and scoreIrt.2pl) are meant to find multiple IRT based scales using the Rasch model or the 2 parameter model. Both allow for negatively keyed as well as positively keyed items. Consider the bfi data set with scoring keys keys.list and items listed as an item.list. (This is the same as the keys.list, but with the negative signs removed.)

> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+   conscientious=c("C1","C2","C3","-C4","-C5"),
+   extraversion=c("-E1","-E2","E3","E4","E5"),
+   neuroticism=c("N1","N2","N3","N4","N5"),
+   openness = c("O1","-O2","O3","O4","-O5"))
> item.list <- list(agree=c("A1","A2","A3","A4","A5"),
+   conscientious=c("C1","C2","C3","C4","C5"),
+   extraversion=c("E1","E2","E3","E4","E5"),
+   neuroticism=c("N1","N2","N3","N4","N5"),
+   openness = c("O1","O2","O3","O4","O5"))
> bfi.1pl <- scoreIrt.1pl(keys.list,bfi) #the one parameter solution
> bfi.2pl <- scoreIrt.2pl(item.list,bfi) #the two parameter solution
> bfi.ctt <- scoreFast(keys.list,bfi) #fast scoring function


Figure 37: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.

> pairs.panels(scores.df,pch='.',gap=0)
> title('Comparing true theta for IRT, Rasch and classically based scoring',line=3)

[pairs.panels plot titled "Comparing true theta for IRT, Rasch and classically based scoring": scatterplots and correlations among True theta, irt theta, total, fit, rasch, total, and fit.]


We can compare these three ways of doing the analysis using the cor2 function, which correlates two separate data frames. All three models produce very similar results for the case of almost complete data. It is when we have massively missing completely at random data (MMCAR) that the results show the superiority of the irt scoring.

> #compare the solutions using the cor2 function
> cor2(bfi.1pl,bfi.ctt)
              agree conscientious extraversion neuroticism openness
agree          0.93          0.27         0.44       -0.20     0.19
conscientious  0.25          0.97         0.25       -0.22     0.17
extraversion   0.43          0.25         0.94       -0.23     0.23
neuroticism   -0.19         -0.23        -0.22        1.00    -0.09
openness       0.12          0.19         0.19       -0.07     0.95

> cor2(bfi.2pl,bfi.ctt)
              agree conscientious extraversion neuroticism openness
agree          0.98          0.25         0.49       -0.17     0.15
conscientious  0.26          0.97         0.25       -0.23     0.18
extraversion   0.43          0.24         0.94       -0.25     0.21
neuroticism   -0.19         -0.22        -0.18        0.98    -0.08
openness       0.17          0.21         0.27       -0.11     0.98

> cor2(bfi.2pl,bfi.1pl)
              agree conscientious extraversion neuroticism openness
agree          0.88          0.25         0.45       -0.17     0.12
conscientious  0.26          1.00         0.24       -0.23     0.16
extraversion   0.43          0.23         0.99       -0.25     0.20
neuroticism   -0.21         -0.21        -0.19        0.98    -0.07
openness       0.20          0.19         0.28       -0.11     0.92

7 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.


7.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (r_wg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

$$r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \tag{3}$$

where r_xy is the normal correlation, which may be decomposed into a within group and a between group correlation, r_xy_wg and r_xy_bg, and η (eta) is the correlation of the data with the within group values or the group means.
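A minimal sketch of this decomposition, using the sat.act data grouped by education (the rwg and rbg elements of the statsBy output hold the within and between group correlations):

sb <- statsBy(sat.act, group="education", cors=TRUE)
round(sb$rwg, 2)  #pooled within group correlations
round(sb$rbg, 2)  #correlations of the group means (between groups)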

7.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second, and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)), or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).


7.3 Factor analysis by groups

Confirmatory factor analysis comparing the structures in multiple groups can be done in the lavaan package. However, for exploratory analyses of the structure within each of multiple groups, the faBy function may be used in combination with the statsBy function. First run statsBy with the correlation option set to TRUE, and then run faBy on the resulting output.

sb <- statsBy(bfi[c(1:25,27)], group="education",cors=TRUE)
faBy(sb,nfactors=5) #find the 5 factor solution for each education level

8 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

$$R^2 = 1 - \prod_{i=1}^{n} (1 - \lambda_i)$$

where $\lambda_i$ is the ith eigenvalue of the eigenvalue decomposition of the matrix

$$R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}.$$

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic based upon the average canonical correlation might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized or raw β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act,use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared: 0.0272, Adjusted R-squared: 0.02301
F-statistic: 6.487 on 3 and 696 DF, p-value: 0.0002476

Compare this with the output from setCor.

> #compare with setCor
> setCor(c(4:6),c(1:3),C, n.obs=700)

Call: setCor(y = c(4:6), x = c(1:3), data = C, n.obs = 700)

Multiple Regression from matrix input

Beta weights

            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01


SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008
Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohen's Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohen's Set Correlation: 7.26 9 1681.86
Unweighted correlation between the two sets = 0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same, independent of the direction of the relationship.
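The symmetry is easy to check by exchanging the two sets; a minimal sketch using the covariance matrix C from above:

setCor(c(1:3), c(4:6), C, n.obs=700)  #the set correlation R2 should again be 0.09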


9 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, and sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.


sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette psych forsem

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.
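A minimal sketch, assuming the defaults just described (and assuming that with n > 0 the simulated raw data are returned in the $observed element):

set.seed(42)                    #arbitrary seed for reproducibility
sim.df <- sim.structure(n=500)  #4 factors, 12 variables, simplex between factors
fa(sim.df$observed, 4)          #can the structure be recovered?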

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic situation, is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
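A minimal sketch, assuming the sim.minor defaults described above (and, as in the previous sketch, that the simulated data sit in $observed):

set.seed(17)                              #arbitrary seed
sm <- sim.minor(nvar=12, nfact=3, n=500)  #3 major and 6 minor factors
fa.parallel(sm$observed)                  #are just the 3 major factors recovered?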

Although coefficient ω_h is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four irt simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models), depending upon the specification of the model.

The logistic model is

$$P(x|\theta_i,\delta_j,\gamma_j,\zeta_j) = \gamma_j + \frac{\zeta_j-\gamma_j}{1+e^{\alpha_j(\delta_j-\theta_i)}} \tag{4}$$

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), α_j is item discrimination and δ_j is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.

(Graphics of these may be seen in the demonstrations for the logistic function)
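A small numerical sketch of equation 4 for a hypothetical 2PL item (so γ = 0 and ζ = 1):

theta <- seq(-3, 3, 0.5)   #latent trait values
a <- 1.5; d <- 0           #hypothetical discrimination and difficulty
P <- 0 + (1 - 0)/(1 + exp(a*(d - theta)))  #equation 4 with gamma=0, zeta=1
round(P, 2)                #endorsement probability rises with theta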

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same.


With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.

10 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)

c <- iclust(Thurstone)
plot(c)             #a pretty boring plot
iclust.diagram(c)   #a better diagram

c3 <- iclust(Thurstone,3)
plot(c3)            #a more interesting plot

data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)

ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.

11 Converting output to APA style tables using LaTeX

Although for most purposes using the Sweave or KnitR packages produces clean output, some prefer output pre formatted for APA style tables. This can be done using the xtable package for almost anything, but there are a few simple functions in psych for the most common tables. fa2latex will convert a factor analysis or components analysis output to a LaTeX table, cor2latex will take a correlation matrix and show the lower (or upper) diagonal, irt2latex converts the item statistics from the irtfa function to more convenient LaTeX output, and finally, df2latex converts a generic data frame to LaTeX.

An example of converting the output from fa to LaTeX appears in Table 11.
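A minimal sketch of these conversion functions (the fa call reproduces the analysis shown in Table 11; the cor2latex and df2latex calls are shown with plausible inputs):

f3 <- fa(Thurstone, 3, n.obs=213)   #the factor analysis reported in Table 11
fa2latex(f3)                        #an APA style LaTeX factor table
cor2latex(Thurstone)                #a lower diagonal correlation table
df2latex(describe(sat.act))         #any data frame, e.g., describe output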


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels ="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels ="left to right")
> dia.curve(ll$top,ul$bottom,"double",-1) #for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up") #but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$top,scale=-1)
> dia.curved.arrow(mr,lr$top)

[The resulting figure, "Demonstration of dia functions", shows the labeled rectangles and ellipses connected by the arrows and curves drawn above.]

Figure 38: The basic graphic capabilities of the dia functions are shown in this figure.

Table 11: fa2latex. A factor analysis table from the psych package in R.

Variable          MR1    MR2    MR3    h2    u2   com
Sentences         0.91  -0.04   0.04  0.82  0.18  1.01
Vocabulary        0.89   0.06  -0.03  0.84  0.16  1.01
Sent.Completion   0.83   0.04   0.00  0.73  0.27  1.00
First.Letters     0.00   0.86   0.00  0.73  0.27  1.00
4.Letter.Words   -0.01   0.74   0.10  0.63  0.37  1.04
Suffixes          0.18   0.63  -0.08  0.50  0.50  1.20
Letter.Series     0.03  -0.01   0.84  0.72  0.28  1.00
Pedigrees         0.37  -0.05   0.47  0.50  0.50  1.93
Letter.Group     -0.06   0.21   0.64  0.53  0.47  1.23

SS loadings       2.64   1.86   1.50

      MR1   MR2   MR3
MR1  1.00  0.59  0.54
MR2  0.59  1.00  0.52
MR3  0.54  0.52  1.00

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score

111

geometric.mean also harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a dataset or output

topBottom Same as headtail Combines the head and tail functions to show the first andlast lines of a data set or output but does not add ellipsis between

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also setCor.)

rangeCorrection will correct correlations for restriction of range

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
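A minimal sketch of superMatrix:

A <- diag(2)           #any two matrices will do
B <- matrix(1, 3, 3)
superMatrix(A, B)      #A on the top left, B on the lower right, 0s elsewhere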

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data representing five personality factors on 25 items (bfi), 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems) are also included. The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.


Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 × 14 matrix from their paper. The Thurstone correlation matrix is a 9 × 9 matrix of correlations of ability items. The Reise data set is a 16 × 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 × 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iq 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1984), this data set allows for examples of basic scaling techniques.


14 Development version and a users guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g. ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news,

> news(Version > "1.5.0",package="psych")

15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book), An introduction to Psychometric Theory with Applications in R (Revelle, prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings

> sessionInfo()
R Under development (unstable) (2017-01-04 r71889)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GPArotation_2014.11-1 psych_1.6.12

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      lattice_0.20-34  matrixcalc_1.0-3 MASS_7.3-45
 [5] grid_3.4.0       arm_1.8-6        nlme_3.1-128     stats4_3.4.0
 [9] coda_0.18-1      minqa_1.2.4      mi_1.0           nloptr_1.0.4
[13] Matrix_1.2-7.1   boot_1.3-18      splines_3.4.0    lme4_1.1-12
[17] sem_3.1-7        tools_3.4.0      foreign_0.8-67   abind_1.4-3
[21] parallel_3.4.0   compiler_3.4.0   mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel modeling in R (2.3): a brief introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd ed edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2012). sem: Structural Equation Models.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, pages 1–13. doi:10.1007/s11336-011-9218-4.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1984). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

Preacher, K. J. and Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4):717–731.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2015). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. R package version 1.5.8.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W. and Condon, D. M. (2014). Reliability. In Irwing, P., Booth, T., and Hughes, D., editors, Wiley-Blackwell Handbook of Psychometric Testing. Wiley-Blackwell (in press).

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Smillie, L. D., Cooper, A., Wilt, J., and Revelle, W. (2012). Do extraverts get more bang for the buck? Refining the affective-reactivity hypothesis of extraversion. Journal of Personality and Social Psychology, 103(2):306–326.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2):245–251.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components - an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.




were just empty cells, then the data should be read in as tab delimited, or by using the read.clipboard.tab function:

> my.data <- read.clipboard(sep="\t")   #define the tab option, or
> my.tab.data <- read.clipboard.tab()   #just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column, and the last 4 are 3 columns):

> my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))

3.2 Basic descriptive statistics

Once the data are read in, then describe or describeBy will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act which includes data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis and standard errors for integer or real data. Non-numeric data, although the statistics are meaningless, will be treated as if numeric (based upon the categorical coding of the data) and will be flagged with an *.

describeBy reports descriptive statistics broken down by some categorizing variable (e.g., gender, age, etc.).

> library(psych)
> data(sat.act)
> describe(sat.act)   #basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41

These data may then be analyzed by groups defined in a logical statement or by some other variable, e.g., breaking down the descriptive data for males or females. These descriptive data can also be seen graphically using the error.bars.by function (Figure 5). By setting skew=FALSE and ranges=FALSE, the output is limited to the most basic statistics.


> #basic descriptive statistics by a grouping variable
> describeBy(sat.act,sat.act$gender,skew=FALSE,ranges=FALSE)

$`1`
          vars   n   mean     sd   se
gender       1 247   1.00   0.00 0.00
education    2 247   3.00   1.54 0.10
age          3 247  25.86   9.74 0.62
ACT          4 247  28.79   5.06 0.32
SATV         5 247 615.11 114.16 7.26
SATQ         6 245 635.87 116.02 7.41

$`2`
          vars   n   mean     sd   se
gender       1 453   2.00   0.00 0.00
education    2 453   3.26   1.35 0.06
age          3 453  25.45   9.37 0.44
ACT          4 453  28.42   4.69 0.22
SATV         5 453 610.66 112.31 5.28
SATQ         6 442 596.00 113.07 5.38

attr(,"call")
by.data.frame(data = x, INDICES = group, FUN = describe, type = type,
    skew = FALSE, ranges = FALSE)

The output from the describeBy function can be forced into a matrix form for easy analysis by other programs. In addition, describeBy can group by several grouping variables at the same time.

> sa.mat <- describeBy(sat.act,list(sat.act$gender,sat.act$education),
+      skew=FALSE,ranges=FALSE,mat=TRUE)
> headTail(sa.mat)

        item group1 group2 vars   n   mean     sd    se
gender1    1      1      0    1  27      1      0     0
gender2    2      2      0    1  30      2      0     0
gender3    3      1      1    1  20      1      0     0
gender4    4      2      1    1  25      2      0     0
...      ...   <NA>   <NA>  ... ...    ...    ...   ...
SATQ9     69      1      4    6  51  635.9 104.12 14.58
SATQ10    70      2      4    6  86 597.59 106.24 11.46
SATQ11    71      1      5    6  46 657.83  89.61 13.21
SATQ12    72      2      5    6  93 606.72 105.55 10.95

3.2.1 Outlier detection using outlier

One way to detect unusual data is to consider how far each data point is from the multivariate centroid of the data. That is, find the squared Mahalanobis distance for each data point and then compare these to the expected values of χ2. This produces a Q-Q (quantile-quantile) plot with the n most extreme data points labeled (Figure 1). The outlier values are in the vector d2.


> png('outlier.png')
> d2 <- outlier(sat.act,cex=.8)
> dev.off()
null device
          1

Figure 1: Using the outlier function to graphically show outliers. The y axis is the Mahalanobis D2, the X axis is the distribution of χ2 for the same number of degrees of freedom. The outliers detected here may be shown graphically using pairs.panels (see Figure 2), and may be found by sorting d2.


3.2.2 Basic data cleaning using scrub

If, after describing the data, it is apparent that there were data entry errors that need to be globally replaced with NA, or if only certain ranges of data will be analyzed, the data can be "cleaned" using the scrub function.

Consider a data set of 12 rows of 10 columns with values from 1-120. All values of columns 3-5 that are less than 30, 40, or 50 respectively, or greater than 70 in any of the three columns, will be replaced with NA. In addition, any value exactly equal to 45 will be set to NA. (max and isvalue are set to one value here, but they could be a different value for every column.)

> x <- matrix(1:120,ncol=10,byrow=TRUE)
> colnames(x) <- paste("V",1:10,sep="")
> new.x <- scrub(x,3:5,min=c(30,40,50),max=70,isvalue=45,newvalue=NA)
> new.x

       V1  V2 V3 V4 V5  V6  V7  V8  V9 V10
 [1,]   1   2 NA NA NA   6   7   8   9  10
 [2,]  11  12 NA NA NA  16  17  18  19  20
 [3,]  21  22 NA NA NA  26  27  28  29  30
 [4,]  31  32 33 NA NA  36  37  38  39  40
 [5,]  41  42 43 44 NA  46  47  48  49  50
 [6,]  51  52 53 54 55  56  57  58  59  60
 [7,]  61  62 63 64 65  66  67  68  69  70
 [8,]  71  72 NA NA NA  76  77  78  79  80
 [9,]  81  82 NA NA NA  86  87  88  89  90
[10,]  91  92 NA NA NA  96  97  98  99 100
[11,] 101 102 NA NA NA 106 107 108 109 110
[12,] 111 112 NA NA NA 116 117 118 119 120

Note that the number of subjects for those columns has decreased, and the minimums have gone up but the maximums gone down. Data cleaning and examination for outliers should be a routine part of any data analysis.

3.2.3 Recoding categorical variables into dummy coded variables

Sometimes categorical variables (e.g., college major, occupation, ethnicity) are to be analyzed using correlation or regression. To do this, one can form "dummy codes", which are merely binary variables for each category. This may be done using dummy.code. Subsequent analyses using these dummy coded variables may use biserial or point biserial (regular Pearson r) correlations to show effect sizes, and may be plotted in, e.g., spider plots (a minimal sketch follows).
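
As a minimal sketch (the choice of variable is ours, not from the text, and the renaming step is purely cosmetic), education in the sat.act data set may be dummy coded and then correlated with ACT scores:

> data(sat.act)
> edu <- dummy.code(sat.act$education)               #one binary column for each education level
> colnames(edu) <- paste("ed",colnames(edu),sep="")  #cosmetic renaming of the dummy columns
> lowerCor(data.frame(ACT=sat.act$ACT,edu))          #point biserial correlations of ACT with each level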

3.3 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries. By default, error.bars.by and error.bars will show "cats eyes" to graphically show the confidence limits (Figure 5). This may be turned off by specifying eyes=FALSE. densityBy or violinBy may be used to show the distribution of the data in "violin" plots (Figure 4). (These are sometimes called "lava-lamp" plots.)

3.3.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 2). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.'. (See Figure 2 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic. However, in this figure, to show the outliers, we use colors and a larger plot character.

Another example of pairs.panels is to show differences between experimental groups. Consider the data in the affect data set. The scores reflect post test scores on positive and negative affect and energetic and tense arousal. The colors show the results for four movie conditions: a depressing movie, a frightening movie, a neutral movie, and a comedy.

3.3.2 Density or violin plots

Graphical presentation of data may be shown using box plots to show the median and 25th and 75th percentiles. A powerful alternative is to show the density distribution using the violinBy function (Figure 4).


> png('pairspanels.png')
> sat.d2 <- data.frame(sat.act,d2)   #combine the d2 statistics from before with the sat.act data.frame
> pairs.panels(sat.d2,bg=c("yellow","blue")[(d2 > 25)+1],pch=21)
> dev.off()
null device
          1

Figure 2: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. If the plot character were set to a period (pch='.'), it would make a cleaner graphic, but in order to show the outliers in color we use the plot characters 21 and 22.


> png('affect.png')
> pairs.panels(affect[14:17],bg=c("red","black","white","blue")[affect$Film],pch=21,
+      main="Affect varies by movies")
> dev.off()
null device
          1

Figure 3: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. The coloring represents four different movie conditions.


> data(sat.act)
> violinBy(sat.act[5:6],sat.act$gender,grp.name=c("M","F"),main="Density Plot by gender for SAT V and Q")


Figure 4: Using the violinBy function to show the distribution of SAT V and Q for males and females. The plot shows the medians and the 25th and 75th percentiles, as well as the entire range and the density distribution.


3.3.3 Means and error bars

Additional descriptive graphics include the ability to draw error bars on sets of data, as well as to draw error bars in both the x and y directions for paired data. These are the functions error.bars, error.bars.by, error.bars.tab, and error.crosses.

error.bars shows the 95% confidence intervals for each variable in a data frame or matrix. These errors are based upon normal theory and the standard errors of the mean. Alternative options include +/- one standard deviation or one standard error (a short sketch of these options follows this list). If the data are repeated measures, the error bars will reflect the between variable correlations. By default, the confidence intervals are displayed using a "cats eyes" plot which emphasizes the distribution of confidence within the confidence interval.

error.bars.by does the same, but grouping the data by some condition.

error.bars.tab draws bar graphs from tabular data with error bars based upon the standard error of proportion (σp = √(pq/N)).

error.crosses draws the confidence intervals for an x set and a y set of the same size.
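
A minimal sketch of the error.bars options described above (the sd and eyes argument names come from the error.bars help page; treat the exact calls as illustrative):

> data(sat.act)
> error.bars(sat.act[5:6])                      #95% confidence intervals with "cats eyes"
> error.bars(sat.act[5:6],sd=TRUE,eyes=FALSE)   #plot +/- one standard deviation instead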

The use of the error.bars.by function allows for graphic comparisons of different groups (see Figure 5). Five personality measures are shown as a function of high versus low scores on a "lie" scale. People with higher lie scores tend to report being more agreeable, more conscientious and less neurotic than people with lower lie scores. The error bars are based upon normal theory and thus are symmetric, rather than reflecting any skewing in the data.

Although not recommended, it is possible to use the error.bars function to draw bar graphs with associated error bars. (This kind of dynamite plot (Figure 7) can be very misleading in that the scale is arbitrary. Go to a discussion of the problems in presenting data this way at http://emdbolker.wikidot.com/blog:dynamite.) In the example shown, note that the graph starts at 0, although 0 is out of the range of the observed scores. This is a function of using bars, which always are assumed to start at zero. Consider other ways of showing your data.

3.3.4 Error bars for tabular data

However, it is sometimes useful to show error bars for tabular data, either found by the table function or just directly input. These may be found using the error.bars.tab function.


> data(epi.bfi)
> error.bars.by(epi.bfi[6:10],epi.bfi$epilie<4)


Figure 5: Using the error.bars.by function shows that self reported personality scales on the Big Five Inventory vary as a function of the Lie scale on the EPI. The "cats eyes" show the distribution of the confidence interval.


> error.bars.by(sat.act[5:6],sat.act$gender,bars=TRUE,
+      labels=c("Male","Female"),ylab="SAT score",xlab="")


Figure 6: A "Dynamite plot" of SAT scores as a function of gender is one way of misleading the reader. By using a bar graph, the range of scores is ignored. Bar graphs start from 0.


> T <- with(sat.act,table(gender,education))
> rownames(T) <- c("M","F")
> error.bars.tab(T,way="both",ylab="Proportion of Education Level",xlab="Level of Education",
+      main="Proportion of sample by education level")


Figure 7: The proportion of each education level that is Male or Female. By using the way="both" option, the percentages and errors are based upon the grand total. Alternatively, way="columns" finds column wise percentages and way="rows" finds rowwise percentages. The data can be converted to percentages (as shown) or plotted by total count (raw=TRUE). The function invisibly returns the probabilities and standard errors. See the help menu for an example of entering the data as a data.frame.


3.3.5 Two dimensional displays of means and errors

Yet another way to display data for different conditions is to use the errorCircles function. For instance, the effect of various movies on both "Energetic Arousal" and "Tense Arousal" can be seen in one graph and compared to the same movie manipulations on "Positive Affect" and "Negative Affect". Note how Energetic Arousal is increased by three of the movie manipulations, but Positive Affect increases following the Happy movie only.


> op <- par(mfrow=c(1,2))
> data(affect)
> colors <- c("black","red","white","blue")
> films <- c("Sad","Horror","Neutral","Happy")
> affect.stats <- errorCircles("EA2","TA2",data=affect[-c(1,20)],group="Film",labels=films,
+      xlab="Energetic Arousal", ylab="Tense Arousal",ylim=c(10,22),xlim=c(8,20),pch=16,
+      cex=2,colors=colors, main = "Movies effect on arousal")
> errorCircles("PA2","NA2",data=affect.stats,labels=films,xlab="Positive Affect",
+      ylab="Negative Affect", pch=16,cex=2,colors=colors, main ="Movies effect on affect")
> op <- par(mfrow=c(1,1))


Figure 8: The use of the errorCircles function allows for two dimensional displays of means and error bars. The first call to errorCircles finds descriptive statistics for the affect data frame based upon the grouping variable of Film. These data are returned and then used by the second call, which examines the effect of the same grouping variable upon different measures. The size of the circles represents the relative sample sizes for each group. The data are from the PMC lab and reported in Smillie et al. (2012).


3.3.6 Back to back histograms

The bibars function summarizes the characteristics of two groups (e.g., males and females) on a second variable (e.g., age) by drawing back to back histograms (see Figure 9).

> data(bfi)
> with(bfi,bibars(age,gender,ylab="Age",main="Age by males and females"))


Figure 9: A bar plot of the age distribution for males and females shows the use of bibars. The data are males and females from 2800 cases collected using the SAPA procedure, and are available as part of the bfi data set.


3.3.7 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values, and returns (invisibly) the full correlation matrix while displaying the lower off diagonal matrix.

> lowerCor(sat.act)
          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act,sat.act$gender==2)
> male <- subset(sat.act,sat.act$gender==1)
> lower <- lowerCor(male[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00
> upper <- lowerCor(female[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower,upper)
> round(both,2)
          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one (below the diagonal) and the difference of the second from the first above the diagonal:


> diffs <- lowerUpper(lower,upper,diff=TRUE)
> round(diffs,2)
          education   age  ACT  SATV  SATQ
education        NA  0.09 0.00 -0.05  0.05
age            0.61    NA 0.07 -0.03  0.13
ACT            0.16  0.15   NA  0.08  0.02
SATV           0.02 -0.06 0.61    NA  0.05
SATQ           0.08  0.04 0.60  0.68    NA

3.3.8 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 10), or a simulated data set of 24 variables with a circumplex structure (Figure 11). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

Yet another way to show structure is to use "spider" plots. Particularly if variables are ordered in some meaningful way (e.g., in a circumplex), a spider plot will show this structure easily. This is just a plot of the magnitude of the correlation as a radial line, with length ranging from 0 (for a correlation of -1) to 1 (for a correlation of 1). (See Figure 12.)

3.4 Testing correlations

Correlations are wonderful descriptive statistics of the data, but some people like to test whether these correlations differ from zero or differ from each other. The cor.test function (in the stats package) will test the significance of a single correlation, and the rcorr function in the Hmisc package will do this for many correlations. In the psych package, the corr.test function reports the correlation (Pearson, Spearman, or Kendall) between all variables in either one or two data frames or matrices, as well as the number of observations for each case and the (two-tailed) probability for each correlation. Unfortunately, these probability values have not been corrected for multiple comparisons and so should be taken with a great deal of salt. Thus, in corr.test and corr.p the raw probabilities are reported below the diagonal and the probabilities adjusted for multiple comparisons using (by default) the Holm correction are reported above the diagonal (Table 1). (See the p.adjust function for a discussion of Holm (1979) and other corrections.)

Testing the difference between any two correlations can be done using the r.test function. The function actually does four different tests (based upon an article by Steiger (1980)),


> png('corplot.png')
> corPlot(Thurstone,numbers=TRUE,upper=FALSE,diag=FALSE,main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 10: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well. By default, the complete matrix is shown. Setting upper=FALSE and diag=FALSE shows a cleaner figure.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ,main='24 variables in a circumplex')
> dev.off()
null device
          1

Figure 11: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect. For circumplex structures, it is perhaps useful to show the complete matrix.


> png('spider.png')
> op <- par(mfrow=c(2,2))
> spider(y=c(1,6,12,18),x=1:24,data=r.circ,fill=TRUE,main='Spider plot of 24 circumplex variables')
> op <- par(mfrow=c(1,1))
> dev.off()
null device
          1

Figure 12: A spider plot can show circumplex structure very clearly. Circumplex structures are common in the study of affect.


Table 1: The corr.test function reports correlations, cell sizes, and raw and adjusted probability values. corr.p reports the probability values for a correlation matrix. By default, the adjustment used is that of Holm (1979).

> corr.test(sat.act)
Call:corr.test(x = sat.act)
Correlation matrix
          gender education   age   ACT  SATV  SATQ
gender      1.00      0.09 -0.02 -0.04 -0.02 -0.17
education   0.09      1.00  0.55  0.15  0.05  0.03
age        -0.02      0.55  1.00  0.11 -0.04 -0.03
ACT        -0.04      0.15  0.11  1.00  0.56  0.59
SATV       -0.02      0.05 -0.04  0.56  1.00  0.64
SATQ       -0.17      0.03 -0.03  0.59  0.64  1.00
Sample Size
          gender education age ACT SATV SATQ
gender       700       700 700 700  700  687
education    700       700 700 700  700  687
age          700       700 700 700  700  687
ACT          700       700 700 700  700  687
SATV         700       700 700 700  700  687
SATQ         687       687 687 687  687  687
Probability values (Entries above the diagonal are adjusted for multiple tests.)
          gender education  age  ACT SATV SATQ
gender      0.00      0.17 1.00 1.00    1    0
education   0.02      0.00 0.00 0.00    1    1
age         0.58      0.00 0.00 0.03    1    1
ACT         0.33      0.00 0.00 0.00    0    0
SATV        0.62      0.22 0.26 0.00    0    0
SATQ        0.00      0.36 0.37 0.00    0    0

 To see confidence intervals of the correlations, print with the short=FALSE option


depending upon the input:

1) For a sample size n, find the t and p value for a single correlation, as well as the confidence interval.

> r.test(50,.3)
Correlation tests
Call:r.test(n = 50, r12 = 0.3)
Test of significance of a correlation
 t value 2.18    with probability < 0.034
 and confidence interval 0.02   0.53

2) For sample sizes of n and n2 (n2 = n if not specified), find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores.

> r.test(30,.4,.6)
Correlation tests
Call:r.test(n = 30, r12 = 0.4, r34 = 0.6)
Test of difference between two independent correlations
 z value 0.99    with probability  0.32

3) For sample size n, and correlations ra = r12, rb = r23 and r13 specified, test for the difference of two dependent correlations (Steiger case A).

> r.test(103,.4,.5,.1)
Correlation tests
Call:[1] "r.test(n = 103 ,  r12 =  0.4 ,  r23 = 0.1 ,  r13 = 0.5 )"
Test of difference between two correlated correlations
 t value -0.89    with probability < 0.37

4) For sample size n, test for the difference between two dependent correlations involving different variables (Steiger case B).

> r.test(103,.5,.6,.7,.5,.5,.8)   #steiger Case B
Correlation tests
Call:r.test(n = 103, r12 = 0.5, r34 = 0.6, r23 = 0.7, r13 = 0.5, r14 = 0.5,
    r24 = 0.8)
Test of difference between two dependent correlations
 z value -1.2    with probability  0.23

To test whether a matrix of correlations differs from what would be expected if the population correlations were all zero, the function cortest follows Steiger (1980), who pointed out that the sum of the squared elements of a correlation matrix, or the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix). This is particularly useful for examining whether correlations in a single matrix differ from zero or for comparing two matrices. Although obvious, cortest can be used to test whether the sat.act data matrix produces non-zero correlations (it does). This is a much more appropriate test when testing whether a residual matrix differs from zero.

> cortest(sat.act)


Tests of correlation matrices
Call:cortest(R1 = sat.act)
 Chi Square value 1325.42  with df =  15   with probability < 1.8e-273
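
To compare two matrices, cortest may be given both. A minimal sketch (our addition, reusing the male and female subsets formed in the section on correlational structure; the n1 and n2 values are the group sizes reported earlier by describeBy):

> R.male <- cor(male[-1],use="pairwise")
> R.female <- cor(female[-1],use="pairwise")
> cortest(R.male,R.female,n1=247,n2=453)   #do the two correlation matrices differ?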

3.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process (Figure 13). This is also shown in terms of dichotomizing the bivariate normal density function using the draw.cor function (Figure 14). A simple generalization of this to the case of multiple cuts is the polychoric correlation.
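
As a minimal sketch (the ability data set of dichotomously scored iq items in psych is our choice of example, not one used in this section):

> data(ability)
> tetrachoric(ability[,1:4])   #tetrachoric correlations and thresholds (tau) for four items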

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.
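
A minimal sketch (our example subset of the bfi data; variable types are detected automatically from the number of response levels, so the polytomous items and continuous age are treated appropriately):

> mc <- mixed.cor(bfi[c(1:5,28)])   #five polytomous items plus age (column 28, continuous)
> lowerMat(mc$rho)                  #the resulting mixed correlation matrix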

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will sometimes happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigen values of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
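
A minimal sketch using the burt data set just mentioned (the eigen value check is our addition):

> data(burt)
> round(min(eigen(burt)$values),3)   #the smallest eigen value is (slightly) negative
> burt.smooth <- cor.smooth(burt)    #adjust the offending eigen values and rescale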

3.6 Multiple regression from data or correlation matrices

The typical application of the lm function is to do a linear model of one Y variable as a function of multiple X variables. Because lm is designed to analyze complex interactions, it requires raw data as input. It is, however, sometimes convenient to do multiple regression from a correlation or covariance matrix. The setCor function will do this, taking a set of y variables predicted from a set of x variables, perhaps with a set of z covariates removed from both x and y. Consider the Thurstone correlation matrix and find the multiple correlation of the last five variables as a function of the first 4.

> setCor(y = 5:9,x=1:4,data=Thurstone)


> draw.tetra()


Figure 13: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values.


> draw.cor(expand=20,cuts=c(0,0))


Figure 14: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values. It is found (laboriously) by optimizing the fit of the bivariate normal for various values of the correlation to the observed cell frequencies.


Call: setCor(y = 5:9, x = 1:4, data = Thurstone)

Multiple Regression from matrix input

 Beta weights
                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sentences                    0.09     0.07          0.25      0.21         0.20
Vocabulary                   0.09     0.17          0.09      0.16        -0.02
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31

 Multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.69              0.63              0.50              0.58              0.48

 multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.48              0.40              0.25              0.34              0.23

 Unweighted multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.59              0.58              0.49              0.58              0.45

 Unweighted multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.34              0.34              0.24              0.33              0.20

 Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.6280 0.1478 0.0076 0.0049

 Average squared canonical correlation =  0.2
 Cohen's Set Correlation R2 =  0.69
 Unweighted correlation between the two sets =  0.73

By specifying the number of subjects in the correlation matrix, appropriate estimates of standard errors, t-values, and probabilities are also found. The next example finds the regressions with variables 1 and 2 used as covariates. The β weights for variables 3 and 4 do not change, but the multiple correlation is much less. It also shows how to find the residual correlations between variables 5-9 with variables 1-4 removed.

> sc <- setCor(y = 5:9,x=3:4,data=Thurstone,z=1:2)

Call: setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)

Multiple Regression from matrix input

 Beta weights
                Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Sent.Completion              0.02     0.05          0.04      0.21         0.08
First.Letters                0.58     0.45          0.21      0.08         0.31

 Multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.58              0.46              0.21              0.18              0.30

 multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
            0.331             0.210             0.043             0.032             0.092

 Unweighted multiple R
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.44              0.35              0.17              0.14              0.26

 Unweighted multiple R2
Four.Letter.Words          Suffixes     Letter.Series         Pedigrees      Letter.Group
             0.19              0.12              0.03              0.02              0.07

 Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.405 0.023

 Average squared canonical correlation =  0.21
 Cohen's Set Correlation R2 =  0.42
 Unweighted correlation between the two sets =  0.48

> round(sc$residual,2)
                  Four.Letter.Words Suffixes Letter.Series Pedigrees Letter.Group
Four.Letter.Words              0.52     0.11          0.09      0.06         0.13
Suffixes                       0.11     0.60         -0.01      0.01         0.03
Letter.Series                  0.09    -0.01          0.75      0.28         0.37
Pedigrees                      0.06     0.01          0.28      0.66         0.20
Letter.Group                   0.13     0.03          0.37      0.20         0.77

3.7 Mediation and Moderation analysis

Although multiple regression is a straightforward method for determining the effect of multiple predictors (x1,2,...,i) on a criterion variable, y, some prefer to think of the effect of one predictor, x, as mediated by another variable, m (Preacher and Hayes, 2004). Thus, we may find the indirect path from x to m, and then from m to y, as well as the direct path from x to y. Call these paths a, b, and c, respectively. Then the indirect effect of x on y through m is just ab, and the direct effect is c. Statistical tests of the ab effect are best done by bootstrapping.

Consider the example from Preacher and Hayes (2004) as analyzed using the mediate function, with the subsequent graphic from mediate.diagram. The data are found in the example for mediate.

Call: mediate(y = 1, x = 2, m = 3, data = sobel)

The DV (Y) was SATIS. The IV (X) was THERAPY. The mediating variable(s) = ATTRIB.

Total Direct effect (c) of THERAPY on SATIS = 0.76   S.E. = 0.31  t direct = 2.5  with probability = 0.019
Direct effect (c') of THERAPY on SATIS removing ATTRIB = 0.43   S.E. = 0.32  t direct = 1.35  with probability = 0.19
Indirect effect (ab) of THERAPY on SATIS through ATTRIB = 0.33
Mean bootstrapped indirect effect = 0.33  with standard error = 0.17  Lower CI = 0.04   Upper CI = 0.71
R2 of model = 0.31

 To see the longer output, specify short = FALSE in the print statement

 Full output

 Total effect estimates (c)
        SATIS   se   t   Prob
THERAPY  0.76 0.31 2.5 0.0186

 Direct effect estimates (c')
        SATIS   se    t  Prob
THERAPY  0.43 0.32 1.35 0.190
ATTRIB   0.40 0.18 2.23 0.034

 'a' effect estimates
       THERAPY  se    t   Prob
ATTRIB    0.82 0.3 2.74 0.0106

 'b' effect estimates
       SATIS   se    t  Prob
ATTRIB   0.4 0.18 2.23 0.034

 'ab' effect estimates
        SATIS boot   sd lower upper
THERAPY  0.33 0.33 0.17  0.04  0.71

4 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.


> mediate.diagram(preacher)


Figure 15: A mediated model taken from Preacher and Hayes (2004) and solved using the mediate function. The direct path from Therapy to Satisfaction has an effect of .76, while the indirect path through Attribution has an effect of .33. Compare this to the normal regression graphic created by setCor.diagram.


> preacher <- setCor(1,c(2,3),sobel,std=FALSE)
> setCor.diagram(preacher)


Figure 16: The conventional regression model for the Preacher and Hayes (2004) data set, solved using the setCor function. Compare this to the previous figure.


4.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity1. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs2. At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented, and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level, the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components, or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores, or to estimate (for factors) factor scores.
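
For example (a sketch with our choice of data set; fa returns regression based factor score estimates by default when given raw data):

> f5 <- fa(bfi[1:25],5)   #five factors of the 25 bfi items
> head(f5$scores)         #estimated factor scores for the first few subjects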

Several of the functions in psych address the problem of data reduction:

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis, and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

fa.poly is useful when finding the factor structure of categorical items. fa.poly first finds the tetrachoric or polychoric correlations between the categorical variables and then proceeds to do a normal factor analysis. By setting the n.iter option to be greater than 1, it will also find confidence intervals for the factor solution. Warning: Finding polychoric correlations is very slow, so think carefully before doing so.

1 Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

2 Cattell (1978), as well as MacCallum et al. (2007), argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.



factor.minres (deprecated) Minimum residual factor analysis is a least squares, iterative solution to the factor problem. minres attempts to minimize the residual (off-diagonal) correlation matrix. It produces solutions similar to maximum likelihood solutions, but will work even if the matrix is singular.

factor.pa (deprecated) Principal Axis factor analysis is a least squares, iterative solution to the factor problem. PA will work for cases where maximum likelihood techniques (factanal) will not work. The original communality estimates are either the squared multiple correlations (smc) for each item or 1.

factor.wls (deprecated) Weighted least squares factor analysis is a least squares, iterative solution to the factor problem. It minimizes the (weighted) squared residual matrix. The weights are based upon the independent contribution of each variable.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values. Note that PCA is not the same as factor analysis, and the two should not be confused.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

nfactors combines VSS, MAP, and a number of other fit statistics. The depressing reality is that frequently these conventional fit estimates of the number of factors do not agree.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

4.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the correlation matrix R (n by n) be approximated by the product of a factor matrix F (n by k, with k < n) and its transpose, plus a diagonal matrix of uniquenesses?

R = FF' + U^2                                  (1)
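
To see equation 1 in action, the model implied matrix may be reconstructed from a solution and compared to the original correlations (a sketch of our own; for an oblique solution the factor intercorrelations would also enter the product, so a one factor model keeps it simple):

> f1 <- fa(Thurstone,1,n.obs=213)   #a one factor minres solution
> Rhat <- f1$loadings %*% t(f1$loadings) + diag(f1$uniquenesses)   #FF' + U^2
> round(max(abs(Thurstone - Rhat)),2)   #the largest absolute residual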

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych; all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls", and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate, which uses an oblimin transformation (Table 2). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", "varimin", and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and "cluster", which are possible oblique transformations of the solution. The default is to do an oblimin transformation. The measures of factor adequacy reflect the


multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

Note that if extracting more than one factor and doing any oblique rotation, it is necessary to have the GPArotation package installed. This is checked for in the appropriate functions.

4.1.2 Principal Axis Factor Analysis

An alternative, least squares algorithm (included in fa with the fm="pa" option, or as the standalone function factor.pa) does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration; setting the max.iter parameter to 1 produces the SAS solution.
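
A sketch of that last point (max.iter is the fa iteration limit; treat the call as illustrative rather than canonical):

> f3.sas <- fa(Thurstone,3,n.obs=213,fm="pa",max.iter=1)   #one iteration mimics the SAS PA example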

The solutions from the fa, the factor.minres, and factor.pa, as well as the principal functions, can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax, quartimax, varimin, and bifactor. Oblique transformations include oblimin, quartimin, and biquartimin, and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 3).

4.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 4).

The unweighted least squares solution may be shown graphically using the fa.plot function, which is called by the generic plot function (Figure 17). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 17) or by a structure diagram (Figure 18).

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but

Table 2: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.

> if(!require('GPArotation')) {stop('GPArotation must be installed to do rotations')} else {
+ f3t <- fa(Thurstone,3,n.obs=213)
+ f3t }

Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    MR1   MR2   MR3   h2   u2 com
Sentences          0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary         0.89  0.06 -0.03 0.84 0.16 1.0
Sent.Completion    0.83  0.04  0.00 0.73 0.27 1.0
First.Letters      0.00  0.86  0.00 0.73 0.27 1.0
Four.Letter.Words -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes           0.18  0.63 -0.08 0.50 0.50 1.2
Letter.Series      0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees          0.37 -0.05  0.47 0.50 0.50 1.9
Letter.Group      -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

 With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63

Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.

> if(!require('GPArotation')) {stop('GPArotation must be installed to do rotations')} else {
+ f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
+ f3o <- target.rot(f3)
+ f3o }

Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                    PA1   PA2   PA3   h2   u2
Sentences          0.89 -0.03  0.07 0.81 0.19
Vocabulary         0.89  0.07  0.00 0.80 0.20
Sent.Completion    0.83  0.04  0.03 0.70 0.30
First.Letters     -0.02  0.85 -0.01 0.73 0.27
Four.Letter.Words -0.05  0.74  0.09 0.57 0.43
Suffixes           0.17  0.63 -0.09 0.43 0.57
Letter.Series     -0.06 -0.08  0.84 0.69 0.31
Pedigrees          0.33 -0.09  0.48 0.37 0.63
Letter.Group      -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00

Table 4: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.

> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")
> print(f3w,cut=0,digits=3)

Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                    WLS1   WLS2   WLS3    h2    u2  com
Sentences          0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary         0.890  0.066 -0.031 0.835 0.165 1.01
Sent.Completion    0.833  0.034  0.007 0.735 0.265 1.00
First.Letters     -0.002  0.855  0.003 0.731 0.269 1.00
Four.Letter.Words -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes           0.180  0.626 -0.082 0.496 0.504 1.20
Letter.Series      0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees          0.381 -0.051  0.464 0.505 0.495 1.95
Letter.Group      -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

 With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627

> plot(f3t)

[Figure: an item by factor plot of the three oblique factors MR1, MR2 and MR3, with loadings from 0.0 to 0.8 on each axis.]

Figure 17: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.

> fa.diagram(f3t)

[Figure: a factor structure diagram in which the nine Thurstone variables (Sentences, Vocabulary, Sent.Completion, First.Letters, Four.Letter.Words, Suffixes, Letter.Series, Letter.Group, Pedigrees) load 0.5 to 0.9 on the three correlated factors MR1, MR2 and MR3.]

Figure 18: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.

that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.
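A hedged sketch of such a call (the argument name normalize is an assumption, taken from the underlying GPArotation routines to which the preceding note refers):

> f3n <- fa(Thurstone, nfactors = 3, n.obs = 213, rotate = "varimax", normalize = TRUE)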

4.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Some psychologists use PCA in a manner similar to factor analysis, and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 5 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.

4.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 18) can themselves be factored. This results in a higher order factor model (Figure 19). An alternative solution is to take this higher order model and then solve for the general factor loadings, as well as the loadings on the residualized lower order factors, using the Schmid-Leiman procedure (Figure 20). Yet another solution is to use structural equation modeling to directly solve for the general and group factors.


Table 5: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 3. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.

> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")
> p3p

Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                    RC1   RC2   RC3   h2   u2 com
Sentences          0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary         0.90  0.10 -0.05 0.86 0.14 1.0
Sent.Completion    0.91  0.04 -0.04 0.83 0.17 1.0
First.Letters      0.01  0.84  0.07 0.78 0.22 1.0
Four.Letter.Words -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes           0.18  0.79 -0.15 0.70 0.30 1.2
Letter.Series      0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees          0.45 -0.16  0.57 0.67 0.33 2.1
Letter.Group      -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

 With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98

> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)
> op <- par(mfrow=c(1,1))

[Figure: an omega diagram in which the nine Thurstone variables load on three group factors F1, F2 and F3, which in turn load 0.8, 0.8 and 0.7 on a higher order general factor g.]

Figure 19: A higher order factor solution to the Thurstone 9 variable problem.

> om <- omega(Thurstone,n.obs=213)

[Figure: a bifactor (Schmid-Leiman) diagram in which each of the nine Thurstone variables loads directly on the general factor g (loadings 0.5 to 0.7) as well as on one of three residualized group factors F1, F2 and F3.]

Figure 20: A bifactor factor solution to the Thurstone 9 variable problem.

Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011).

4.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006), where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by, e.g., Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) reviews why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust, and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979), which proceeds as follows:

1. Find the proximity (e.g. correlation) matrix.

2. Identify the most similar pair of items.

3. Combine this most similar pair of items to form a new variable (cluster).

4. Find the similarity of this cluster to all other items and clusters.

5. Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or, in iclust, if there is a failure to increase reliability coefficients α or β).

6. Purify the solution by reassigning items to the most similar cluster center.

iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 21). Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 8). The pattern matrix is available as an object in the results.

> data(bfi)
> ic <- iclust(bfi[1:25])

[Figure: an ICLUST dendrogram of the 25 personality items, showing how the items combine into clusters C1 through C21, with α and β reported for each cluster (e.g., C20: α = 0.81, β = 0.63; C16: α = 0.81, β = 0.76; C15: α = 0.73, β = 0.67; C21: α = 0.41, β = 0.27).]

Figure 21: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.

Table 6: The summary statistics from an iclust analysis show three large clusters and a smaller cluster.

> summary(ic)  #show the results

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])
ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6*
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16   C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61

The previous analysis (Figure 21) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 22). Note that the first time, finding the polychoric correlations takes some time, but the next three analyses were done using that correlation matrix (rpoly$rho). When using the console for input, polychoric will report on its progress while working, using progressBar.

Table 7: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.

> data(bfi)
> rpoly <- polychoric(bfi[1:25])  #the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

4.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 9). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

Bootstrapped confidence intervals can also be found for the loadings of a factoring of a polychoric matrix. fa.poly will find the polychoric correlation matrix and, if the n.iter option is greater than 1, will then randomly resample the data (case wise) to give bootstrapped samples. This will take a long time for a large number of items or iterations.
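A hedged sketch of such a call (with an unrealistically small number of bootstrap iterations, to keep the run time short):

> fp <- fa.poly(bfi[1:10], nfactors = 2, n.iter = 20)  #factor the polychoric correlations, bootstrapping 20 times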

4.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of Burt's congruence coefficient (also known as Tucker's

> ic.poly <- iclust(rpoly$rho,title="ICLUST using polychoric correlations")
> iclust.diagram(ic.poly)

[Figure: an ICLUST dendrogram of the 25 items based upon polychoric correlations, with cluster reliabilities such as C23: α = 0.83, β = 0.58 and C20: α = 0.83, β = 0.66.]

Figure 22: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 21), which was done using Pearson correlations.

> ic.poly <- iclust(rpoly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")
> iclust.diagram(ic.poly)

[Figure: the ICLUST solution based upon polychoric correlations when the number of clusters is set to 5; the Openness items O2, O4 and O5 remain a separate small cluster.]

Figure 23: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 22), which was done without specifying the number of clusters, and to the next one (Figure 24), which was done by changing the beta criterion.

> ic.poly <- iclust(rpoly$rho,beta.size=3,title="ICLUST beta.size=3")

[Figure: the ICLUST solution based upon polychoric correlations with the beta.size criterion set to 3; the items again combine into the clusters C21 through C23.]

Figure 24: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 21, 22, 23).

Table 8: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.

> print(ic,cut=.3)

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL

Table 9: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.

> fa(bfi[1:10],2,n.iter=20)

Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

 With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with Likelihood Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.006 and the 90 % confidence intervals are 0.006 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

 Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.44 -0.40 -0.37
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.07 -0.03  0.00  0.73  0.77  0.80
A4  0.10  0.15  0.20  0.40  0.44  0.48
A5 -0.02  0.02  0.06  0.57  0.62  0.67
C1  0.54  0.57  0.60 -0.09 -0.05 -0.02
C2  0.58  0.62  0.67 -0.04 -0.01  0.02
C3  0.48  0.54  0.58  0.00  0.03  0.08
C4 -0.69 -0.66 -0.60 -0.03  0.01  0.04
C5 -0.62 -0.57 -0.52 -0.08 -0.05 -0.02

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.27     0.32  0.37

coefficient), which is just the cosine of the angle between the dimensions:

$$c_{f_i f_j} = \frac{\sum_{k=1}^{n} f_{ik} f_{jk}}{\sqrt{\sum f_{ik}^2 \sum f_{jk}^2}}.$$
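To make the formula concrete, here is a minimal sketch (assuming the f3t and f3o solutions from the earlier analyses) that computes the congruence coefficients directly from the loadings; it should agree with factor.congruence:

> L1 <- unclass(f3t$loadings)   #the 9 x 3 loading matrices
> L2 <- unclass(f3o$loadings)
> cc <- t(L1) %*% L2 / sqrt(outer(colSums(L1^2), colSums(L2^2)))  #cosines between factors
> round(cc,2)                   #compare with factor.congruence(f3t,f3o)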

Consider the case of a four factor solution and a four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")
> factor.congruence(f4,ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 10).

Table 10: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.

> factor.congruence(list(f3t,f3o,om,p3p))

     MR1  MR2  MR3   PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2   RC3
MR1 1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
MR2 0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
MR3 0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
PA1 1.00 0.03 0.01  1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01  0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99  0.05
PA3 0.13 0.06 1.00  0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01  0.99
g   0.72 0.60 0.52  0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58  0.50
F1  1.00 0.06 0.09  1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08  0.04
F2  0.06 1.00 0.08  0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99  0.12
F3  0.09 0.08 1.00  0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02  0.99
h2  0.74 0.57 0.51  0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56  0.49
RC1 1.00 0.04 0.06  1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06  0.00
RC2 0.08 0.99 0.02  0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00  0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05  1.00

4.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).

Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure Criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples the eigen values of random factors will all tend towards 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigators creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria. Several of these criteria may be compared at once, as in the sketch below.
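A minimal sketch, assuming the nfactors convenience function (which reports the VSS, MAP and several fit based criteria in one call); the choice of 8 as the maximum number of factors to examine is an assumption:

> nf <- nfactors(bfi[1:25], n = 8)  #compare solutions with 1 to 8 factors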

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.

4.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Average Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 25). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[Figure: Very Simple Structure fit (0.0 to 1.0) plotted against the number of factors (1 to 8), with separate curves for item complexities 1 through 4.]

Figure 25: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.

> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors

The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit  RMSEA  BIC   SABIC complex
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.0150 9648 10522.1     1.0
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.0100 5287  6084.5     1.2
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.0075 3200  3924.3     1.3
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.0055 1731  2385.1     1.4
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.0030  281   869.3     1.6
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.0018 -296   228.5     1.7
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.0013 -463     1.2     1.9
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.0010 -524  -117.6     1.9
  eChisq  SRMR eCRMS  eBIC
1  23881 0.119 0.125 21698
2  12432 0.086 0.094 10440
3   7232 0.066 0.075  5422
4   3750 0.047 0.057  2115
5   1495 0.030 0.038    27
6    670 0.020 0.027  -639
7    448 0.016 0.023  -711
8    289 0.013 0.020  -727

4.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data, as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 26). It is interesting to compare fa.parallel with paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial, minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly. By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices, as in the sketch below.

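A hedged one-line sketch (the n.iter value is an assumption; the default number of replications would be used if it were omitted):

> pa.poly <- fa.parallel.poly(bfi[1:25], n.iter = 20)  #parallel analysis of polychoric correlations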

> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")
Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Figure: eigen values of principal components and factor analysis (y axis, 0 to 5) for the first 25 components/factors (x axis), comparing actual, simulated, and resampled data for both the PC and FA solutions.]

Figure 26: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.

4.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 27).

Another way to examine the overlap between two sets is the use of set correlation found by set.cor (discussed later).

4.6 Exploratory Structural Equation Modeling (ESEM)

Generalizing the procedures of factor extension, we can do Exploratory Structural Equation Modeling (ESEM). Traditional Exploratory Factor Analysis (EFA) examines how latent variables can account for the correlations within a data set. All loadings and cross loadings are found and rotation is done to some approximation of simple structure. Traditional Confirmatory Factor Analysis (CFA) tests such models by fitting just a limited number of loadings and typically does not allow any (or many) cross loadings. Structural Equation Modeling then applies two such measurement models, one to a set of X variables, another to a set of Y variables, and then tries to estimate the correlation between these two sets of latent variables. (Some SEM procedures estimate all the parameters from the same model, thus making the loadings in set Y affect those in set X.) It is possible to do a similar exploratory modeling (ESEM) by conducting two Exploratory Factor Analyses, one in set X, one in set Y, and then finding the correlations of the X factors with the Y factors, as well as the correlations of the Y variables with the X factors and the X variables with the Y factors.

Consider the simulated data set of three ability variables, two motivational variables, and three outcome variables.
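A hedged sketch of how such a data set might be generated with sim.structural (the loading and factor correlation values here are assumptions, chosen to reproduce the population matrix printed below):

> fx <- matrix(c(.9,.8,.6,rep(0,4),.6,.8,-.7),ncol=2)  #loadings of V, Q, A, nach and Anx on two X factors
> fy <- c(.6,.5,.4)                                    #loadings of gpa, Pre and MA on the Y factor
> Phi <- matrix(c(1,0,.7,0,1,.7,.7,.7,1),ncol=3)       #factor intercorrelations
> gre.gpa <- sim.structural(fx,Phi,fy)

The resulting population model and reliabilities are printed below.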

Call: sim.structural(fx = fx, Phi = Phi, fy = fy)

 $model (Population correlation matrix)
        V    Q     A  nach   Anx   gpa   Pre    MA
V    1.00 0.72  0.54  0.00  0.00  0.38  0.32  0.25
Q    0.72 1.00  0.48  0.00  0.00  0.34  0.28  0.22
A    0.54 0.48  1.00  0.48 -0.42  0.50  0.42  0.34
nach 0.00 0.00  0.48  1.00 -0.56  0.34  0.28  0.22

> v16 <- sim.item(16)
> s <- c(1,3,5,7,9,11,13,15)
> f2 <- fa(v16[,s],2)
> fe <- fa.extension(cor(v16)[s,-s],f2)
> fa.diagram(f2,fe=fe)

[Figure: a factor analysis and extension diagram; the odd numbered variables define the factors MR1 and MR2 (loadings about ±0.5 to 0.7), and the even numbered variables are shown on the right with their extension loadings on those factors.]

Figure 27: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.

Anx  0.00 0.00 -0.42 -0.56  1.00 -0.29 -0.24 -0.20
gpa  0.38 0.34  0.50  0.34 -0.29  1.00  0.30  0.24
Pre  0.32 0.28  0.42  0.28 -0.24  0.30  1.00  0.20
MA   0.25 0.22  0.34  0.22 -0.20  0.24  0.20  1.00

 $reliability (population reliability)
   V    Q    A nach  Anx  gpa  Pre   MA
0.81 0.64 0.72 0.64 0.49 0.36 0.25 0.16

We can fit this by using the esem function and then draw the solution (see Figure 28) using the esem.diagram function (which is normally called automatically by esem).
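The call itself, reconstructed from the Call line of the output that follows (the object name is arbitrary):

> esem.fit <- esem(gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
+                  n.obs = 1000, plot = FALSE)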

Exploratory Structural Equation Modeling Analysis using method = minres
Call: esem(r = gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
    n.obs = 1000, plot = FALSE)

For the X set:
      MR1   MR2
V    0.91 -0.06
Q    0.81 -0.05
A    0.53  0.57
nach -0.10  0.81
Anx  0.08 -0.71

For the Y set:
    MR1
gpa 0.6
Pre 0.5
MA  0.4

Correlations between the X and Y sets.
     X1   X2   Y1
X1 1.00 0.19 0.68
X2 0.19 1.00 0.67
Y1 0.68 0.67 1.00

The degrees of freedom for the null model are 56 and the empirical chi square function was 6930.29
The degrees of freedom for the model are 7 and the empirical chi square function was 21.83
with prob < 0.0027

The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.04

with the empirical chi square 21.83 with prob < 0.0027
The total number of observations was 1000 with fitted Chi Square = 2175.06 with prob < 0

Empirical BIC = -26.53
ESABIC = -4.29
Fit based upon off diagonal values = 1
To see the item loadings for the X and Y sets combined, and the associated fa output, print with short=FALSE.

[Figure: the exploratory structural model; V, Q, A, nach and Anx load on the two X factors MR1 and MR2, gpa, Pre and MA load on the Y factor, and the X factors each correlate 0.7 with the Y factor.]

Figure 28: An example of an Exploratory Structure Equation Model.

5 Classical Test Theory and Reliability

Surprisingly, 110 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and overestimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

$$\lambda_3 = \frac{n}{n-1}\left(1 - \frac{tr(\vec{V})_x}{V_x}\right) = \frac{n}{n-1}\,\frac{V_x - tr(\vec{V}_x)}{V_x} = \alpha$$

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. Alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy"), coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, $e_j^2$, and is

$$\lambda_6 = 1 - \frac{\sum e_j^2}{V_x} = 1 - \frac{\sum(1 - r_{smc}^2)}{V_x}$$

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.

Alpha, G6 and G6* are positive functions of the number of items in a test, as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.

More complete reliability analyses of a single scale can be done using the omega function, which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include:

alpha: Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman: Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zergers (1978) (µ0 ... µ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lowest bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega: Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor: Given a n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and a n x n correlation matrix, find the correlations of the composite clusters.

scoreItems: Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of items scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items.

score.multiple.choice: Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.

5.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, and McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure and the following correlation matrix. Then, using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates when one item is dropped at a time.

> set.seed(17)
> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed
> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reversed scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that items 2 and 6 (complaints and critical) are to be scored (incorrectly) negatively.

> alpha(attitude,keys=c("complaints","critical"))

Reliability analysis
Call: alpha(x = attitude, keys = c("complaints", "critical"))

  raw_alpha std.alpha G6(smc) average_r  S/N  ase mean  sd
       0.18      0.27    0.66      0.05 0.37 0.19   53 4.7

 lower alpha upper     95% confidence boundaries
 -0.18  0.18  0.55

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r    S/N alpha se
rating         -0.023     0.090    0.52    0.0162  0.099    0.265
complaints-     0.689     0.666    0.76    0.2496  1.995    0.078
privileges     -0.133     0.021    0.58    0.0036  0.022    0.282
learning       -0.459    -0.251    0.41   -0.0346 -0.201    0.363
raises         -0.178    -0.062    0.47   -0.0098 -0.058    0.295
critical-       0.364     0.473    0.76    0.1299  0.896    0.139
advance        -0.137    -0.016    0.52   -0.0026 -0.016    0.258

 Item statistics
             n  raw.r  std.r r.cor r.drop mean   sd
rating      30  0.600  0.601  0.63   0.28   65 12.2
complaints- 30 -0.542 -0.559 -0.78  -0.75   50 13.3
privileges  30  0.679  0.663  0.57   0.39   53 12.2
learning    30  0.857  0.853  0.90   0.70   56 11.7
raises      30  0.719  0.730  0.77   0.50   65 10.4
critical-   30  0.015  0.036 -0.29  -0.27   42  9.9
advance     30  0.688  0.694  0.66   0.46   43 10.3

Note how the reliability of the 7 item scale with an incorrectly reversed item is very poor, but if items 2 and 6 are dropped, then the reliability is improved substantially. This suggests that items 2 and 6 were incorrectly scored. Doing the analysis again with the items positively scored produces much more favorable results.

> alpha(attitude)

Reliability analysis
Call: alpha(x = attitude)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500,short=FALSE,low=-2,high=2,categorical=TRUE)  #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0

5.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (see Figure 29) or omegaSem for a confirmatory analysis using the sem package, based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) (http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf) compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009), http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf.)

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh, as in the sketch below.
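A minimal sketch of that manual route (assuming the r9 data simulated earlier; the column name "g" in the Schmid-Leiman output is an assumption):

> f3 <- fa(r9, nfactors = 3, rotate = "oblimin")  #oblique first level factors
> sl <- schmid(cor(r9), nfactors = 3)             #Schmid-Leiman transformation
> g <- sl$sl[, "g"]                               #general factor loadings
> sum(g)^2 / sum(cor(r9))                         #omega_h: general factor variance over total variance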

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal).

For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be the $\sqrt{r_{ab}}$, where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call.

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness, u², from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, $V_x$, into four parts: that due to a general factor, $\vec{g}$, that due to a set of group factors, $\vec{f}$ (factors common to some but not all of the items), specific factors, $\vec{s}$, unique to each item, and $\vec{e}$, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting $\vec{x} = \vec{c}g + \vec{A}f + \vec{D}s + \vec{e}$, then the communality of item j, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$ and the unique variance for the item, $u_j^2 = \sigma_j^2(1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item j, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u_j^2}{V_x}$$

Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

$$\omega_h = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{\inf} = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}$$

> om9 <- omega(r9,title="9 simulated variables")

[Figure: a bifactor diagram of the 9 simulated variables; each variable loads on the general factor g (loadings 0.3 to 0.7) and on one of three group factors F1, F2 and F3.]

Figure 29: A bifactor solution for 9 simulated variables with a hierarchical structure.

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.

> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than 0.2
      g  F1*  F2*  F3*   h2   u2   p2
V1 0.72 0.45           0.72 0.28 0.71
V2 0.58 0.37           0.47 0.53 0.71
V3 0.57 0.41           0.49 0.51 0.66
V4 0.57      0.41      0.50 0.50 0.66
V5 0.55      0.32      0.41 0.59 0.73
V6 0.43      0.27      0.28 0.72 0.66
V7 0.47           0.47 0.44 0.56 0.50
V8 0.43           0.42 0.36 0.64 0.51
V9 0.32           0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.001  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09

RMSEA index =  0.01  and the 90 % confidence intervals are  0.01 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19


5.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9, n.obs=500, lavaan=FALSE)
Call: omegaSem(m = r9, n.obs = 500, lavaan = FALSE)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.72  0.45             0.72 0.28 0.71
V2 0.58  0.37             0.47 0.53 0.71
V3 0.57  0.41             0.49 0.51 0.66
V4 0.57              0.41 0.50 0.50 0.66
V5 0.55              0.32 0.41 0.59 0.73
V6 0.43              0.27 0.28 0.72 0.66
V7 0.47        0.47       0.44 0.56 0.50
V8 0.43        0.42       0.36 0.64 0.51
V9 0.32        0.23       0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.001  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09
RMSEA index =  0.01  and the 90 % confidence intervals are  0.01 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 The following analyses were done using the  sem  package

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.74  0.40             0.71 0.29 0.77
V2 0.58  0.36             0.47 0.53 0.72
V3 0.56  0.42             0.50 0.50 0.63
V4 0.57              0.45 0.53 0.47 0.61
V5 0.58              0.25 0.40 0.60 0.84
V6 0.43              0.26 0.26 0.74 0.71
V7 0.49        0.38       0.38 0.62 0.63
V8 0.44        0.45       0.39 0.61 0.50
V9 0.34        0.24       0.17 0.83 0.68

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33

general/max  5.51   max/min =   1.41
mean percent general =  0.68    with sd =  0.1 and cv of  0.15
Explained Common Variance of the general factor =  0.68

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.87  0.59  0.59  0.55
Multiple R square of scores with factors      0.75  0.35  0.34  0.30
Minimum correlation of factor score estimates 0.50 -0.31 -0.31 -0.39

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.57 0.65
Omega general for total scores and subscales  0.72 0.57 0.33 0.48
Omega group for total scores and subscales    0.11 0.22 0.24 0.18

To get the standard sem fit statistics, ask for summary on the fitted object.
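A minimal sketch of that step (assuming the confirmatory fit is returned in a sem element of the omegaSem output; check str() if the slot is named differently in your version):

> om.cfa <- omegaSem(r9, n.obs=500, lavaan=FALSE)  #EFA followed by CFA
> summary(om.cfa$sem)   #standard sem fit statistics (assumed slot name)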

5.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf and guttman functions. These are described in more detail in Revelle and Zinbarg (2009) and in Revelle and Condon (2014). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72
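The guttman function mentioned above reports the six Guttman estimates directly; a minimal sketch on the same simulated correlation matrix:

> guttman(r9)   #lambda 1 through lambda 6 for the 9 simulated variables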

5.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness, and Openness). The data may either be the raw data or a correlation matrix (scoreItems) or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6.

5.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function and a list of items to be scored on each scale (a keys.list). Items may be listed by location (convenient, but dangerous) or name (probably safer).

Make a keys.list by specifying the items for each scale, preceding items to be negatively keyed with a - sign.

> #the newer way is probably preferred
>
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+       conscientious=c("C1","C2","C3","-C4","-C5"),
+       extraversion=c("-E1","-E2","E3","E4","E5"),
+       neuroticism=c("N1","N2","N3","N4","N5"),
+       openness = c("O1","-O2","O3","O4","-O5"))

> #this can also be done by location--
> keys.list <- list(Agree=c(-1,2:5), Conscientious=c(6:8,-9,-10),
+       Extraversion=c(-11,-12,13:15), Neuroticism=c(16:20),
+       Openness = c(21,-22,23,24,-25))

> #These two approaches can be mixed if desired
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+       conscientious=c("C1","C2","C3","-C4","-C5"),
+       extraversion=c("-E1","-E2","E3","E4","E5"),
+       neuroticism=c(16:20), openness = c(21,-22,23,24,-25))

> keys.list
$agree
[1] "-A1" "A2"  "A3"  "A4"  "A5"

$conscientious
[1] "C1"  "C2"  "C3"  "-C4" "-C5"

$extraversion
[1] "-E1" "-E2" "E3"  "E4"  "E5"

$neuroticism
[1] 16 17 18 19 20

$openness
[1]  21 -22  23  24 -25

In the past (prior to version 1.6.9), the keys.list was then converted to a keys matrix using the helper function make.keys. This is no longer necessary. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.
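A minimal sketch of both conventions (totals is a documented scoreItems option):

> mean.scores <- scoreItems(keys.list, bfi)                #the default: average item score
> total.scores <- scoreItems(keys.list, bfi, totals=TRUE)  #total item score instead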

The printed output includes coefficients α and G6, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening. There are two ways to make up the keys. You can specify the items by location (the old way) or by name (the newer, and probably preferred, way). To use the newer way you must specify the file on which you will use the keys. The example below shows how to construct keys either way.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. This is done automatically in the new way of forming keys, but if using the older way, the number must be specified. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.
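A minimal sketch of that older, explicit step, using the keys.list defined above:

> keys <- make.keys(28, keys.list)  #convert the list to a 28 x 5 matrix of -1/0/1 weights
> keys                              #show the scoring matrix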

Then use this keys list to score the items:

> scores <- scoreItems(keys.list, bfi)
> scores
Call: scoreItems(keys = keys.list, items = bfi)

(Unstandardized) Alpha:
      agree conscientious extraversion neuroticism openness
alpha   0.7          0.69         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
      agree conscientious extraversion neuroticism openness
ASE   0.014         0.016        0.013       0.011    0.017

Average item correlation:
          agree conscientious extraversion neuroticism openness
average.r  0.32          0.35         0.39        0.46     0.23

 Guttman 6* reliability:
         agree conscientious extraversion neuroticism openness
Lambda.6   0.7          0.69         0.76        0.81      0.6

Signal/Noise based upon av.r :
             agree conscientious extraversion neuroticism openness
Signal/Noise   2.3           2.2          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.70          0.36         0.63      -0.245     0.23
conscientious  0.25          0.69         0.37      -0.334     0.33
extraversion   0.46          0.27         0.76      -0.284     0.32
neuroticism   -0.18         -0.25        -0.22       0.812    -0.12
openness       0.15          0.21         0.22      -0.086     0.60

In order to see the item by scale loadings and frequency counts of the data, print with the short option = FALSE.

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See figure 30.)

5.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use scoreItems.

> r.bfi <- cor(bfi, use="pairwise")
> scales <- scoreItems(keys.list, r.bfi)
> summary(scales)
Call: scoreItems(keys = keys.list, items = r.bfi)

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.71          0.35         0.64      -0.242     0.25
conscientious  0.24          0.69         0.38      -0.314     0.33
extraversion   0.47          0.27         0.76      -0.278     0.35
neuroticism   -0.18         -0.24        -0.22       0.815    -0.11
openness       0.16          0.22         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the cluster.loadings function.


> png("scores.png")
> pairs.panels(scores$scores, pch='.', jiggle=TRUE)
> dev.off()
pdf
  2

Figure 30: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.
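A minimal sketch of the cluster.loadings route, assuming the keys.list and the item correlation matrix r.bfi defined above:

> keys <- make.keys(28, keys.list, item.labels=colnames(r.bfi))  #the keys, now as a matrix
> cl <- cluster.loadings(keys, r.bfi)   #item by scale correlations from the correlation matrix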

5.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors), and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys, iqitems)
Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)  #compare to previous results
          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
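A minimal sketch (the four subscale groupings here are hypothetical, formed simply from the item types):

> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)  #true/false (0/1) scores
> tf.keys <- list(reasoning=c(1:4), letters=c(5:8),              #hypothetical keys list
+                 matrices=c(9:12), rotation=c(13:16))
> iq.scales <- scoreItems(tf.keys, iq.tf, min=0, max=1)          #score the four subscales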


5.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or for many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches. A minimal sketch of such a sequence is given below.
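For example (a hedged sketch on the first five bfi items, which form the Agreeableness scale):

> describe(bfi[1:5])                #basic descriptive statistics
> fa.parallel(bfi[1:5])             #how many dimensions?
> alpha(bfi[1:5], check.keys=TRUE)  #item-whole statistics for a single scale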

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 31).

5.6.1 Exploring the item structure of scales

The Big Five scales found above can be understood in terms of the item-whole correlations, but it is also useful to think of the endorsement frequency of the items. The item.lookup function will sort items by their factor loading/item-whole correlation, and then resort those above a certain threshold in terms of the item means. Item content is shown by using the dictionary developed for those items. This allows one to see the structure of each scale in terms of its endorsement range. This is a simple way of thinking of items that is also possible to do using the various IRT approaches discussed later.

> m <- colMeans(bfi, na.rm=TRUE)
> item.lookup(scales$item.corrected[,1:3], m, dictionary=bfi.dictionary[1:2])

     agree conscientious extraversion means ItemLabel
A1   -0.40         -0.06        -0.10  2.41     q_146
E1   -0.31         -0.07        -0.59  2.97     q_712
E2   -0.39         -0.26        -0.70  3.14     q_901
E3    0.45          0.21         0.61  4.00    q_1205
E4    0.51          0.24         0.68  4.42    q_1410
E5    0.35          0.40         0.56  4.42    q_1768
A5    0.62          0.22         0.55  4.56    q_1419
A3    0.70          0.22         0.48  4.60    q_1206
A4    0.49          0.30         0.30  4.70    q_1364
A2    0.67          0.21         0.40  4.80    q_1162
C4   -0.23         -0.67        -0.23  2.55     q_626
N4   -0.22         -0.31        -0.39  3.19    q_1479
C5   -0.26         -0.58        -0.29  3.30    q_1949
C3    0.21          0.56         0.15  4.30     q_619
C2    0.21          0.61         0.18  4.37     q_530
E5.1  0.35          0.40         0.56  4.42    q_1768
C1    0.13          0.54         0.20  4.50     q_124
E1.1 -0.31         -0.07        -0.59  2.97     q_712
E2.1 -0.39         -0.26        -0.70  3.14     q_901
N4.1 -0.22         -0.31        -0.39  3.19    q_1479
E3.1  0.45          0.21         0.61  4.00    q_1205
E4.1  0.51          0.24         0.68  4.42    q_1410


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)


Figure 31: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


E5.2  0.35          0.40         0.56  4.42    q_1768
O3    0.26          0.22         0.43  4.44     q_492
A5.1  0.62          0.22         0.55  4.56    q_1419
A3.1  0.70          0.22         0.48  4.60    q_1206
A2.1  0.67          0.21         0.40  4.80    q_1162
O1    0.17          0.21         0.33  4.82     q_128
     Item
A1   Am indifferent to the feelings of others.
E1   Don't talk a lot.
E2   Find it difficult to approach others.
E3   Know how to captivate people.
E4   Make friends easily.
E5   Take charge.
A5   Make people feel at ease.
A3   Know how to comfort others.
A4   Love children.
A2   Inquire about others' well-being.
C4   Do things in a half-way manner.
N4   Often feel blue.
C5   Waste my time.
C3   Do things according to a plan.
C2   Continue until everything is perfect.
E5.1 Take charge.
C1   Am exacting in my work.
E1.1 Don't talk a lot.
E2.1 Find it difficult to approach others.
N4.1 Often feel blue.
E3.1 Know how to captivate people.
E4.1 Make friends easily.
E5.2 Take charge.
O3   Carry the conversation to a higher level.
A5.1 Make people feel at ease.
A3.1 Know how to comfort others.
A2.1 Inquire about others' well-being.
O1   Am full of ideas.

5.6.2 Empirical scale construction

There are some situations where one wants to identify those items that most relate to a particular criterion. Although this will capitalize on chance, and the results should be interpreted cautiously, it does give a feel for what is being measured. Consider the following example from the bfi data set. The items that best predicted gender, education, and age may be found using the bestScales function. This also shows the use of a dictionary that has the item content.

> data(bfi)
> bestScales(bfi, criteria=c("gender","education","age"), cut=.1,
+            dictionary=bfi.dictionary[1:3])

The items most correlated with the criteria yield r's of:
          correlation n.items
gender           0.32       9
education        0.14       1
age              0.24       9

The best items, their correlations and content are:

$gender
  Row.names     gender ItemLabel                                       Item     Giant3
1        N5  0.2106171    q_1505                              Panic easily.  Stability
2        A2  0.1820202    q_1162          Inquire about others' well-being.   Cohesion
3        A1 -0.1571373     q_146  Am indifferent to the feelings of others.   Cohesion
4        A3  0.1401320    q_1206               Know how to comfort others.    Cohesion
5        A4  0.1261747    q_1364                             Love children.   Cohesion
6        E1 -0.1261018     q_712                          Don't talk a lot. Plasticity
7        N3  0.1207718    q_1099                 Have frequent mood swings.  Stability
8        O1 -0.1031059     q_128                          Am full of ideas. Plasticity
9        A5  0.1007091    q_1419                  Make people feel at ease.   Cohesion

$education
  Row.names  education ItemLabel                                       Item   Giant3
1        A1 -0.1415734     q_146  Am indifferent to the feelings of others. Cohesion

$age
  Row.names        age ItemLabel                                       Item     Giant3
1        A1 -0.1609637     q_146  Am indifferent to the feelings of others.   Cohesion
2        C4 -0.1482659     q_626            Do things in a half-way manner.  Stability
3        A4  0.1442473    q_1364                             Love children.   Cohesion
4        A5  0.1290935    q_1419                  Make people feel at ease.   Cohesion
5        E5  0.1146922    q_1768                               Take charge. Plasticity
6        A2  0.1142767    q_1162          Inquire about others' well-being.   Cohesion
7        N3 -0.1108648    q_1099                 Have frequent mood swings.  Stability
8        E2 -0.1051612     q_901      Find it difficult to approach others. Plasticity
9        N5 -0.1043322    q_1505                              Panic easily.  Stability

6 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty (or location) and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this, and then graphically displays item discrimination and item location as well as item and test information (see Figure 32).


6.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function.

\[ \delta_j = \frac{D\tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \tag{2} \]

where $D$ is a scaling factor used when converting to the parameterization of the logistic model and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings ($\lambda_j$) and item thresholds ($\tau_j$) are just

\[ \lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}. \]
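As a quick numerical check of these identities, a minimal sketch (the loading and threshold values are hypothetical):

> lambda <- 0.6                    #hypothetical factor loading from a tetrachoric analysis
> tau <- 0.5                       #hypothetical item threshold
> a <- lambda/sqrt(1 - lambda^2)   #discrimination = 0.75 (normal metric, D = 1)
> d <- tau/sqrt(1 - lambda^2)      #difficulty = 0.625
> a/sqrt(1 + a^2)                  #recovers lambda = 0.6
> d/sqrt(1 + a^2)                  #recovers tau = 0.5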

Consider 9 dichotomous items representing one factor but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)

> test
Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
             -3    -2    -1     0     1     2     3
V1         0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2         0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3         0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4         0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5         0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6         0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7         0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8         0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9         0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info  0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM        1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09
The df corrected root mean square of the residuals is  0.1

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.043  and the 90 % confidence intervals are  0.043 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt
Item Response Analysis using Factor Analysis

Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
             -3    -2    -1     0     1     2     3
E1         0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2         0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3         0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4         0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5         0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info  2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM        0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53 0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04
The df corrected root mean square of the residuals is  0.06

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.009  and the 90 % confidence intervals are  0.009 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 33):

These procedures can be generalized to more than one factor by specifying the number of


> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))


Figure 32: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt, type="IIC")


Figure 33: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut:

> print(e.info, sort=TRUE)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
             -3    -2    -1     0     1     2     3
E2         0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4         0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1         0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3         0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5         0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info  2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM        0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53 0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm; these should be used for serious Item Response Theory analysis.

6.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

In addition, recent releases of psych take advantage of the parallel package and use multiple cores. The default for Macs and Unix machines is to use two cores, but this can be increased using the options command. The biggest step up in improvement is from 1 to 2 cores, but for large problems using polychoric correlations, the more cores available, the better.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 36), and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items. We can also show the total test information (merely the sum of the item information). This shows that even with just 16 items, the test is very reliable for most of the range of ability. The irt.fa function saves the correlation matrix and item statistics so that they can be redrawn with other options.

detectCores()        #how many are available
options("mc.cores")  #how many have been set to be used
options(mc.cores=4)  #set to use 4 cores

> iq.irt
Item Response Analysis using Factor Analysis

Call: irt.fa(x = ability)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3    -2    -1     0     1     2     3
reason.4    0.05  0.24  0.64  0.53  0.16  0.03  0.01
reason.16   0.08  0.22  0.38  0.31  0.14  0.05  0.01
reason.17   0.08  0.33  0.69  0.42  0.11  0.02  0.00
reason.19   0.06  0.17  0.35  0.36  0.19  0.07  0.02
letter.7    0.05  0.18  0.41  0.44  0.20  0.06  0.02
letter.33   0.05  0.15  0.31  0.36  0.20  0.08  0.02
letter.34   0.05  0.19  0.45  0.46  0.20  0.06  0.01
letter.58   0.02  0.09  0.30  0.53  0.35  0.12  0.03
matrix.45   0.05  0.11  0.19  0.23  0.17  0.09  0.04
matrix.46   0.05  0.12  0.22  0.26  0.18  0.09  0.04
matrix.47   0.06  0.17  0.33  0.34  0.19  0.07  0.02
matrix.55   0.04  0.08  0.13  0.17  0.16  0.11  0.06
rotate.3    0.00  0.01  0.06  0.30  0.75  0.49  0.12
rotate.4    0.00  0.01  0.05  0.31  1.00  0.56  0.10
rotate.6    0.01  0.03  0.15  0.53  0.69  0.25  0.05
rotate.8    0.00  0.02  0.08  0.29  0.59  0.41  0.13
Test Info   0.67  2.11  4.73  5.83  5.28  2.55  0.69
SEM         1.22  0.69  0.46  0.41  0.44  0.63  1.20
Reliability -0.49 0.53  0.79  0.83  0.81  0.61 -0.45

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.91
The number of observations was  1525  with Chi Square =  2893.45  with prob <  0

The root mean square of the residuals (RMSA) is  0.08
The df corrected root mean square of the residuals is  0.09

Tucker Lewis Index of factoring reliability =  0.723
RMSEA index =  0.018  and the 90 % confidence intervals are  0.018 0.137
BIC =  2131.15

6.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for scoreIrt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the


> iq.irt <- irt.fa(ability)


Figure 34: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


> plot(iq.irt, type="test")


Figure 35: A graphic analysis of 16 ability items sampled from the SAPA project. The total test information at all levels of difficulty may be shown by specifying the type='test' option in the plot function.


> om <- omega(iq.irt$rho, 4)


Figure 36: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when examining the item response functions.


average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The scoreIrt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9, 1000, -2., 2., mod="normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- scoreIrt(test, items)
> unit.weighted <- scoreIrt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unit.weighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 37

6.3.1 1 versus 2 parameter IRT scoring

In Item Response Theory, items can be assumed to be equally discriminating but to differ in their difficulty (the Rasch model), or to vary in their discriminability. Two functions (scoreIrt.1pl and scoreIrt.2pl) are meant to find multiple IRT based scales using the Rasch model or the 2 parameter model. Both allow for negatively keyed as well as positively keyed items. Consider the bfi data set with scoring keys keys.list and items listed as an item.list (this is the same as the keys.list, but with the negative signs removed):

> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+    conscientious=c("C1","C2","C3","-C4","-C5"),
+    extraversion=c("-E1","-E2","E3","E4","E5"),
+    neuroticism=c("N1","N2","N3","N4","N5"),
+    openness = c("O1","-O2","O3","O4","-O5"))
> item.list <- list(agree=c("A1","A2","A3","A4","A5"),
+    conscientious=c("C1","C2","C3","C4","C5"),
+    extraversion=c("E1","E2","E3","E4","E5"),
+    neuroticism=c("N1","N2","N3","N4","N5"),
+    openness = c("O1","O2","O3","O4","O5"))
> bfi.1pl <- scoreIrt.1pl(keys.list, bfi)   #the one parameter solution
> bfi.2pl <- scoreIrt.2pl(item.list, bfi)   #the two parameter solution
> bfi.ctt <- scoreFast(keys.list, bfi)      #fast scoring function
>


Figure 37: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported, and then the IRT and total scoring systems.

> pairs.panels(scores.df, pch='.', gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)



We can compare these three ways of doing the analysis using the cor2 function, which correlates two separate data frames. All three models produce very similar results for the case of almost complete data. It is when we have massively missing completely at random data (MMCAR) that the results show the superiority of the IRT scoring.

> #compare the solutions using the cor2 function
> cor2(bfi.1pl, bfi.ctt)
              agree conscientious extraversion neuroticism openness
agree          0.93          0.27         0.44       -0.20     0.19
conscientious  0.25          0.97         0.25       -0.22     0.17
extraversion   0.43          0.25         0.94       -0.23     0.23
neuroticism   -0.19         -0.23        -0.22        1.00    -0.09
openness       0.12          0.19         0.19       -0.07     0.95

> cor2(bfi.2pl, bfi.ctt)
              agree conscientious extraversion neuroticism openness
agree          0.98          0.25         0.49       -0.17     0.15
conscientious  0.26          0.97         0.25       -0.23     0.18
extraversion   0.43          0.24         0.94       -0.25     0.21
neuroticism   -0.19         -0.22        -0.18        0.98    -0.08
openness       0.17          0.21         0.27       -0.11     0.98

> cor2(bfi.2pl, bfi.1pl)
              agree conscientious extraversion neuroticism openness
agree          0.88          0.25         0.45       -0.17     0.12
conscientious  0.26          1.00         0.24       -0.23     0.16
extraversion   0.43          0.23         0.99       -0.25     0.20
neuroticism   -0.21         -0.21        -0.19        0.98    -0.07
openness       0.20          0.19         0.28       -0.11     0.92

7 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (overall descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.


7.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

\[ r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \tag{3} \]

where $r_{xy}$ is the normal correlation, which may be decomposed into a within group and between group correlation, $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and $\eta$ (eta) is the correlation of the data with the within group values or the group means.

7.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)), or when affect data are analyzed within and between an affect manipulation (statsBy(affect)).
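A minimal sketch of the first case (rwg and rbg are documented elements of the statsBy output):

> sb <- statsBy(sat.act, group="education")  #two level descriptive statistics
> sb$rwg   #the pooled within group correlations
> sb$rbg   #the between group correlations (of the group means)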


7.3 Factor analysis by groups

Confirmatory factor analysis comparing the structures in multiple groups can be done in the lavaan package. However, for exploratory analyses of the structure within each of multiple groups, the faBy function may be used in combination with the statsBy function. First run statsBy with the correlation option set to TRUE, and then run faBy on the resulting output.

sb <- statsBy(bfi[c(1:25,27)], group="education", cors=TRUE)
faBy(sb, nfactors=5)  #find the 5 factor solution for each education level

8 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

\[ R^2 = 1 - \prod_{i=1}^{n} (1 - \lambda_i) \]

where $\lambda_i$ is the $i$th eigenvalue of the eigenvalue decomposition of the matrix

\[ R = R_{xx}^{-1} R_{xy} R_{yy}^{-1} R_{yx}. \]

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized or raw β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")
> model1 <- lm(ACT ~ gender + education + age, data=sat.act)
> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272,	Adjusted R-squared:  0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setCor:

> #compare with setCor
> setCor(c(4:6), c(1:3), C, n.obs=700)

Call: setCor(y = c(4:6), x = c(1:3), data = C, n.obs = 700)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.050 0.033 0.008

Chisq of canonical correlations
[1] 35.8 23.1  5.6

 Average squared canonical correlation =  0.03
 Cohen's Set Correlation R2 =  0.09
Shrunken Set Correlation R2 =  0.08
F and df of Cohen's Set Correlation  7.26 9 1681.86

Unweighted correlation between the two sets =  0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
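A minimal sketch of this symmetry, reversing the roles of the two sets from the example above:

> setCor(y = c(1:3), x = c(4:6), data = C, n.obs=700)  #same set correlation R2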


9 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.vss. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.minor Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default values for sim.structure are to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals. A minimal sketch follows.
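For example (a hedged sketch; the seed is arbitrary, and the simulated scores are assumed to be returned in the observed element):

> set.seed(42)                                 #arbitrary seed for reproducibility
> minor <- sim.minor(nvar=12, nfact=3, n=500)  #3 major and (by default) 6 minor factors
> fa.parallel(minor$observed)                  #can we recover just the 3 major factors?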

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple, independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

\[ P(x|\theta_i, \delta_j, \gamma_j, \zeta_j) = \gamma_j + \frac{\zeta_j - \gamma_j}{1 + e^{\alpha_j(\delta_j - \theta_i)}} \tag{4} \]

where $\gamma$ is the lower asymptote or guessing parameter, $\zeta$ is the upper asymptote (normally 1), $\alpha_j$ is item discrimination, and $\delta_j$ is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.
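To make Equation 4 concrete, a minimal sketch written as a small helper function (the function name is hypothetical, not part of psych):

> p4pl <- function(theta, delta, alpha=1, gamma=0, zeta=1)  #probability under Equation 4
+    gamma + (zeta - gamma)/(1 + exp(alpha*(delta - theta)))
> p4pl(theta=0, delta=-1)   #an easy item at average ability: 0.73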

(Graphics of these may be seen in the demonstrations for the logistic function)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same.


With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.

10 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone, 3)
plot(f3)
fa.diagram(f3)

c <- iclust(Thurstone)
plot(c)            #a pretty boring plot
iclust.diagram(c)  #a better diagram

c3 <- iclust(Thurstone, 3)
plot(c3)           #a more interesting plot

data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)

ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in the dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.

11 Converting output to APA style tables using LaTeX

Although for most purposes, using the Sweave or KnitR packages produces clean output, some prefer output pre formatted for APA style tables. This can be done using the xtable package for almost anything, but there are a few simple functions in psych for the most common tables: fa2latex will convert a factor analysis or components analysis output to a LaTeX table, cor2latex will take a correlation matrix and show the lower (or upper) diagonal, irt2latex converts the item statistics from the irt.fa function to more convenient LaTeX output, and finally, df2latex converts a generic data frame to LaTeX.

An example of converting the output from fa to LaTeX appears in Table 11.
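A minimal sketch of a call that could produce such a table (the heading text here is an assumption):

f3 <- fa(Thurstone, 3)   #factor the Thurstone correlation matrix
fa2latex(f3, heading="A factor analysis table from the psych package in R")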


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA, xlim=xlim, ylim=ylim, main="Demonstration of dia functions", axes=FALSE, xlab="", ylab="")
> ul <- dia.rect(1, 9, labels="upper left", xlim=xlim, ylim=ylim)
> ll <- dia.rect(1, 3, labels="lower left", xlim=xlim, ylim=ylim)
> lr <- dia.ellipse(9, 3, "lower right", xlim=xlim, ylim=ylim)
> ur <- dia.ellipse(9, 9, "upper right", xlim=xlim, ylim=ylim)
> ml <- dia.ellipse(3, 6, "middle left", xlim=xlim, ylim=ylim)
> mr <- dia.ellipse(7, 6, "middle right", xlim=xlim, ylim=ylim)
> bl <- dia.ellipse(1, 1, "bottom left", xlim=xlim, ylim=ylim)
> br <- dia.rect(9, 1, "bottom right", xlim=xlim, ylim=ylim)
> dia.arrow(from=lr, to=ul, labels="right to left")
> dia.arrow(from=ul, to=ur, labels="left to right")
> dia.curved.arrow(from=lr, to=ll$right, labels="right to left")
> dia.curved.arrow(to=ur, from=ul$right, labels="left to right")
> dia.curve(ll$top, ul$bottom, "double", -1)  #for rectangles, specify where to point
> dia.curved.arrow(mr, ur, "up")              #but for ellipses, just point to it
> dia.curve(ml, mr, "across")
> dia.arrow(ur, lr, "top down")
> dia.curved.arrow(br$top, lr$bottom, "up")
> dia.curved.arrow(bl, br, "left to right")
> dia.arrow(bl, ll$bottom)
> dia.curved.arrow(ml, ll$top, scale=-1)
> dia.curved.arrow(mr, lr$top)


Figure 38: The basic graphic capabilities of the dia functions are shown in this figure.

Table 11: fa2latex. A factor analysis table from the psych package in R.

Variable          MR1   MR2   MR3   h2   u2  com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.01
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.01
Sent.Completion  0.83  0.04  0.00 0.73 0.27 1.00
First.Letters    0.00  0.86  0.00 0.73 0.27 1.00
4.Letter.Words  -0.01  0.74  0.10 0.63 0.37 1.04
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.20
Letter.Series    0.03 -0.01  0.84 0.72 0.28 1.00
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.93
Letter.Group    -0.06  0.21  0.64 0.53 0.47 1.23

SS loadings      2.64  1.86  1.5

      MR1  MR2  MR3
MR1  1.00 0.59 0.54
MR2  0.59 1.00 0.52
MR3  0.54 0.52 1.00

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

blockrandom Creates a block randomized structure for n independent variables Usefulfor teaching block randomization for experimental design

df2latex is useful for taking tabular output (such as a correlation matrix or that of de-

scribe and converting it to a LATEX table May be used when Sweave is not conve-nient

cor2latex Will format a correlation matrix in APA style in a LATEX table See alsofa2latex and irt2latex

cosinor One of several functions for doing circular statistics This is important whenstudying mood effects over the day which show a diurnal pattern See also circa-

dianmean circadiancor and circadianlinearcor for finding circular meanscircular correlations and correlations of circular with linear data

fisherz Convert a correlation to the corresponding Fisher z score

111

geometricmean also harmonicmean find the appropriate mean for working with differentkinds of data

ICC and cohenkappa are typically used to find the reliability for raters

headtail combines the head and tail functions to show the first and last lines of a dataset or output

topBottom Same as headtail Combines the head and tail functions to show the first andlast lines of a data set or output but does not add ellipsis between

mardia calculates univariate or multivariate (Mardiarsquos test) skew and kurtosis for a vectormatrix or dataframe

prep finds the probability of replication for an F t or r and estimate effect size

partialr partials a y set of variables out of an x set and finds the resulting partialcorrelations (See also setcor)

rangeCorrection will correct correlations for restriction of range

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.
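A minimal sketch, assuming three items scored from 1 to 6, with the second and third items to be reversed (a reversed score becomes maxi + mini - score, here 7 - score):

> items <- matrix(c(2,5,1,
+                   6,1,3),nrow=2,byrow=TRUE)  #two subjects, three items
> reverse.code(c(1,-1,-1),items,mini=rep(1,3),maxi=rep(6,3))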

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems.
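For example, a minimal sketch:

> A <- matrix(1,2,2)  #a 2 x 2 matrix of 1s
> B <- matrix(2,3,3)  #a 3 x 3 matrix of 2s
> superMatrix(A,B)    #a 5 x 5 matrix with A top left, B lower right, 0s elsewhere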

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi) or 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.


Thurstone Holzinger-Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton heights. peas is the data set Francis Galton used to introduce the correlation coefficient, with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1984), this data set allows for examples of basic scaling techniques.


14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.5.0",package="psych")

15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book): An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular and R in general, consult http://personality-project.org/r/r.guide.html: A short guide to R.

16 SessionInfo

This document was prepared using the following settings

> sessionInfo()
R Under development (unstable) (2017-01-04 r71889)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GPArotation_2014.11-1 psych_1.6.12

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      lattice_0.20-34  matrixcalc_1.0-3 MASS_7.3-45
 [5] grid_3.4.0       arm_1.8-6        nlme_3.1-128     stats4_3.4.0
 [9] coda_0.18-1      minqa_1.2.4      mi_1.0           nloptr_1.0.4
[13] Matrix_1.2-7.1   boot_1.3-18      splines_3.4.0    lme4_1.1-12
[17] sem_3.1-7        tools_3.4.0      foreign_0.8-67   abind_1.4-3
[21] parallel_3.4.0   compiler_3.4.0   mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel modeling in R (2.3): a brief introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England. 122 pp.

Fox, J., Nie, Z., and Byrnes, J. (2012). sem: Structural Equation Models.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, pages 1–13. 10.1007/s11336-011-9218-4.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1984). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

Preacher, K. J. and Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4):717–731.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2015). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. R package version 1.5.8.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W. and Condon, D. M. (2014). Reliability. In Irwing, P., Booth, T., and Hughes, D., editors, Wiley-Blackwell Handbook of Psychometric Testing. Wiley-Blackwell (in press).

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Smillie, L. D., Cooper, A., Wilt, J., and Revelle, W. (2012). Do extraverts get more bang for the buck? Refining the affective-reactivity hypothesis of extraversion. Journal of Personality and Social Psychology, 103(2):306–326.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2):245–251.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors." Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.



> #basic descriptive statistics by a grouping variable
> describeBy(sat.act,sat.act$gender,skew=FALSE,ranges=FALSE)
$`1`
          vars   n   mean     sd   se
gender       1 247   1.00   0.00 0.00
education    2 247   3.00   1.54 0.10
age          3 247  25.86   9.74 0.62
ACT          4 247  28.79   5.06 0.32
SATV         5 247 615.11 114.16 7.26
SATQ         6 245 635.87 116.02 7.41

$`2`
          vars   n   mean     sd   se
gender       1 453   2.00   0.00 0.00
education    2 453   3.26   1.35 0.06
age          3 453  25.45   9.37 0.44
ACT          4 453  28.42   4.69 0.22
SATV         5 453 610.66 112.31 5.28
SATQ         6 442 596.00 113.07 5.38

attr(,"call")
by.data.frame(data = x, INDICES = group, FUN = describe, type = type,
    skew = FALSE, ranges = FALSE)

The output from the describeBy function can be forced into a matrix form for easy analysis by other programs. In addition, describeBy can group by several grouping variables at the same time.

> sa.mat <- describeBy(sat.act,list(sat.act$gender,sat.act$education),
+                      skew=FALSE,ranges=FALSE,mat=TRUE)
> headTail(sa.mat)
        item group1 group2 vars   n   mean     sd    se
gender1    1      1      0    1  27      1      0     0
gender2    2      2      0    1  30      2      0     0
gender3    3      1      1    1  20      1      0     0
gender4    4      2      1    1  25      2      0     0
...      ...   <NA>   <NA>   <NA> ...   ...    ...   ...
SATQ9     69      1      4    6  51  635.9 104.12 14.58
SATQ10    70      2      4    6  86 597.59 106.24 11.46
SATQ11    71      1      5    6  46 657.83  89.61 13.21
SATQ12    72      2      5    6  93 606.72 105.55 10.95

3.2.1 Outlier detection using outlier

One way to detect unusual data is to consider how far each data point is from the multivariate centroid of the data. That is, find the squared Mahalanobis distance for each data point and then compare these to the expected values of χ2. This produces a Q-Q (quantile-quantile) plot with the n most extreme data points labeled (Figure 1). The outlier values are in the vector d2.


> png( 'outlier.png' )
> d2 <- outlier(sat.act,cex=.8)
> dev.off()
null device
          1

Figure 1: Using the outlier function to graphically show outliers. The y axis is the Mahalanobis D2, the X axis is the distribution of χ2 for the same number of degrees of freedom. The outliers detected here may be shown graphically using pairs.panels (see Figure 2), and may be found by sorting d2.


3.2.2 Basic data cleaning using scrub

If, after describing the data, it is apparent that there were data entry errors that need to be globally replaced with NA, or only certain ranges of data will be analyzed, the data can be "cleaned" using the scrub function.

Consider a data set of 12 rows of 10 columns with values from 1 - 120. All values of columns 3 - 5 that are less than 30, 40, or 50, respectively, or greater than 70 in any of the three columns, will be replaced with NA. In addition, any value exactly equal to 45 will be set to NA. (max and isvalue are set to one value here, but they could be a different value for every column.)

> x <- matrix(1:120,ncol=10,byrow=TRUE)
> colnames(x) <- paste("V",1:10,sep="")
> new.x <- scrub(x,3:5,min=c(30,40,50),max=70,isvalue=45,newvalue=NA)
> new.x
       V1  V2 V3 V4 V5  V6  V7  V8  V9 V10
 [1,]   1   2 NA NA NA   6   7   8   9  10
 [2,]  11  12 NA NA NA  16  17  18  19  20
 [3,]  21  22 NA NA NA  26  27  28  29  30
 [4,]  31  32 33 NA NA  36  37  38  39  40
 [5,]  41  42 43 44 NA  46  47  48  49  50
 [6,]  51  52 53 54 55  56  57  58  59  60
 [7,]  61  62 63 64 65  66  67  68  69  70
 [8,]  71  72 NA NA NA  76  77  78  79  80
 [9,]  81  82 NA NA NA  86  87  88  89  90
[10,]  91  92 NA NA NA  96  97  98  99 100
[11,] 101 102 NA NA NA 106 107 108 109 110
[12,] 111 112 NA NA NA 116 117 118 119 120

Note that the number of subjects for those columns has decreased, and the minimums have gone up but the maximums down. Data cleaning and examination for outliers should be a routine part of any data analysis.

3.2.3 Recoding categorical variables into dummy coded variables

Sometimes categorical variables (e.g., college major, occupation, ethnicity) are to be analyzed using correlation or regression. To do this, one can form "dummy codes", which are merely binary variables for each category. This may be done using dummy.code. Subsequent analyses using these dummy coded variables may use biserial or point biserial (regular Pearson r) correlations to show effect sizes, and may be plotted in, e.g., spider plots.
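A minimal sketch, assuming a small hypothetical vector of college majors:

> major <- c("psych","bio","psych","econ","bio")
> dummy.code(major)  #returns one 0/1 column for each category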

3.3 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries. By default, error.bars.by and error.bars will show "cats eyes" to graphically show the confidence limits (Figure 5). This may be turned off by specifying eyes=FALSE. densityBy or violinBy may be used to show the distribution of the data in "violin" plots (Figure 4). (These are sometimes called "lava-lamp" plots.)

3.3.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with the axis length reflecting one standard deviation of the x and y variables, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 2). When plotting many subjects, it is both faster and cleaner to set the plot character (pch) to be '.' (See Figure 2 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation. When plotting many data points (as in the case of the sat.act data), it is possible to specify that the plot character is a period to get a somewhat cleaner graphic. However, in this figure, to show the outliers, we use colors and a larger plot character.

Another example of pairs.panels is to show differences between experimental groups. Consider the data in the affect data set. The scores reflect post test scores on positive and negative affect and energetic and tense arousal. The colors show the results for four movie conditions: depressing, frightening movie, neutral, and a comedy.

3.3.2 Density or violin plots

Graphical presentation of data may be shown using box plots to show the median and 25th and 75th percentiles. A powerful alternative is to show the density distribution using the violinBy function (Figure 4).


> png( 'pairspanels.png' )
> sat.d2 <- data.frame(sat.act,d2)  #combine the d2 statistics from before with the sat.act data.frame
> pairs.panels(sat.d2,bg=c("yellow","blue")[(d2 > 25)+1],pch=21)
> dev.off()
null device
          1

Figure 2: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. Note the extreme outlier for the ACT. If the plot character were set to a period (pch='.') it would make a cleaner graphic, but in order to show the outliers in color we use the plot characters 21 and 22.


> png('affect.png')
> pairs.panels(affect[14:17],bg=c("red","black","white","blue")[affect$Film],pch=21,
+              main="Affect varies by movies")
> dev.off()
null device
          1

Figure 3: Using the pairs.panels function to graphically show relationships. The x axis in each scatter plot represents the column variable, the y axis the row variable. The coloring represents four different movie conditions.


> data(sat.act)
> violinBy(sat.act[5:6],sat.act$gender,grp.name=c("M","F"),main="Density Plot by gender for SAT V and Q")


Figure 4: Using the violinBy function to show the distribution of SAT V and Q for males and females. The plot shows the medians and the 25th and 75th percentiles, as well as the entire range and the density distribution.


3.3.3 Means and error bars

Additional descriptive graphics include the ability to draw error bars on sets of data, as well as to draw error bars in both the x and y directions for paired data. These are the functions error.bars, error.bars.by, error.bars.tab, and error.crosses.

error.bars show the 95% confidence intervals for each variable in a data frame or matrix. These errors are based upon normal theory and the standard errors of the mean. Alternative options include +/- one standard deviation or 1 standard error. If the data are repeated measures, the error bars will reflect the between variable correlations. By default, the confidence intervals are displayed using a "cats eyes" plot which emphasizes the distribution of confidence within the confidence interval.

error.bars.by does the same, but grouping the data by some condition.

error.bars.tab draws bar graphs from tabular data, with error bars based upon the standard error of a proportion (σp = √(pq/N)); a worked example follows this list.

error.crosses draw the confidence intervals for an x set and a y set of the same size.
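For example, for a cell based upon N = 100 observations with p = q = .5, the standard error of the proportion is σp = √(.5 × .5/100) = .05.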

The use of the error.bars.by function allows for graphic comparisons of different groups (see Figure 5). Five personality measures are shown as a function of high versus low scores on a "lie" scale. People with higher lie scores tend to report being more agreeable, conscientious, and less neurotic than people with lower lie scores. The error bars are based upon normal theory and thus are symmetric, rather than reflecting any skewing in the data.

Although not recommended, it is possible to use the error.bars function to draw bar graphs with associated error bars. This kind of dynamite plot (Figure 7) can be very misleading in that the scale is arbitrary. Go to a discussion of the problems in presenting data this way at http://emdbolker.wikidot.com/blog:dynamite. In the example shown, note that the graph starts at 0, although that is out of the range. This is a function of using bars, which always are assumed to start at zero. Consider other ways of showing your data.

3.3.4 Error bars for tabular data

However, it is sometimes useful to show error bars for tabular data, either found by the table function or just directly input. These may be found using the error.bars.tab function.


> data(epi.bfi)
> error.bars.by(epi.bfi[6:10],epi.bfi$epilie<4)


Figure 5: Using the error.bars.by function shows that self reported personality scales on the Big Five Inventory vary as a function of the Lie scale on the EPI. The "cats eyes" show the distribution of the confidence.


> error.bars.by(sat.act[5:6],sat.act$gender,bars=TRUE,
+               labels=c("Male","Female"),ylab="SAT score",xlab="")


Figure 6: A "Dynamite plot" of SAT scores as a function of gender is one way of misleading the reader. By using a bar graph, the range of scores is ignored. Bar graphs start from 0.


> T <- with(sat.act,table(gender,education))
> rownames(T) <- c("M","F")
> error.bars.tab(T,way="both",ylab="Proportion of Education Level",xlab="Level of Education",
+                main="Proportion of sample by education level")


Figure 7: The proportion of each education level that is Male or Female. By using the way="both" option, the percentages and errors are based upon the grand total. Alternatively, way="columns" finds column wise percentages and way="rows" finds rowwise percentages. The data can be converted to percentages (as shown) or total counts (raw=TRUE). The function invisibly returns the probabilities and standard errors. See the help menu for an example of entering the data as a data.frame.


3.3.5 Two dimensional displays of means and errors

Yet another way to display data for different conditions is to use the errorCrosses function. For instance, the effect of various movies on both "Energetic Arousal" and "Tense Arousal" can be seen in one graph and compared to the same movie manipulations on "Positive Affect" and "Negative Affect". Note how Energetic Arousal is increased by three of the movie manipulations, but that Positive Affect increases following the Happy movie only.


> op <- par(mfrow=c(1,2))
> data(affect)
> colors <- c("black","red","white","blue")
> films <- c("Sad","Horror","Neutral","Happy")
> affect.stats <- errorCircles("EA2","TA2",data=affect[-c(1,20)],group="Film",labels=films,
+    xlab="Energetic Arousal", ylab="Tense Arousal",ylim=c(10,22),xlim=c(8,20),pch=16,
+    cex=2,colors=colors, main = "Movies effect on arousal")
> errorCircles("PA2","NA2",data=affect.stats,labels=films,xlab="Positive Affect",
+    ylab="Negative Affect", pch=16,cex=2,colors=colors, main = "Movies effect on affect")
> op <- par(mfrow=c(1,1))


Figure 8: The use of the errorCircles function allows for two dimensional displays of means and error bars. The first call to errorCircles finds descriptive statistics for the affect data.frame based upon the grouping variable of Film. These data are returned and then used by the second call, which examines the effect of the same grouping variable upon different measures. The size of the circles represents the relative sample sizes for each group. The data are from the PMC lab and reported in Smillie et al. (2012).


3.3.6 Back to back histograms

The bibars function summarizes the characteristics of two groups (e.g., males and females) on a second variable (e.g., age) by drawing back to back histograms (see Figure 9).

> data(bfi)
> with(bfi,bibars(age,gender,ylab="Age",main="Age by males and females"))


Figure 9: A bar plot of the age distribution for males and females shows the use of bibars. The data are males and females from 2800 cases collected using the SAPA procedure, and are available as part of the bfi data set.


3.3.7 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most common. The output from the cor function in core R is a rectangular matrix. lowerMat will round this to (2) digits and then display it as a lower off diagonal matrix. lowerCor calls cor with use='pairwise', method='pearson' as default values, and returns (invisibly) the full correlation matrix while displaying the lower off diagonal matrix.

> lowerCor(sat.act)
          gendr edctn age   ACT   SATV  SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64  1.00

When comparing results from two different groups, it is convenient to display them as one matrix, with the results from one group below the diagonal and the other group above the diagonal. Use lowerUpper to do this:

> female <- subset(sat.act,sat.act$gender==2)
> male <- subset(sat.act,sat.act$gender==1)
> lower <- lowerCor(male[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68  1.00
> upper <- lowerCor(female[-1])
          edctn age   ACT   SATV  SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63  1.00
> both <- lowerUpper(lower,upper)
> round(both,2)
          education   age  ACT  SATV  SATQ
education        NA  0.52 0.16  0.07  0.03
age            0.61    NA 0.08 -0.03 -0.09
ACT            0.16  0.15   NA  0.53  0.58
SATV           0.02 -0.06 0.61    NA  0.63
SATQ           0.08  0.04 0.60  0.68    NA

It is also possible to compare two matrices by taking their differences, displaying one (below the diagonal) and the difference of the second from the first above the diagonal:


> diffs <- lowerUpper(lower,upper,diff=TRUE)
> round(diffs,2)
          education   age  ACT  SATV SATQ
education        NA  0.09 0.00 -0.05 0.05
age            0.61    NA 0.07 -0.03 0.13
ACT            0.16  0.15   NA  0.08 0.02
SATV           0.02 -0.06 0.61    NA 0.05
SATQ           0.08  0.04 0.60  0.68   NA

3.3.8 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map of the correlations. This is just a matrix color coded to represent the magnitude of the correlation. This is useful when considering the number of factors in a data set. Consider the Thurstone data set, which has a clear 3 factor solution (Figure 10), or a simulated data set of 24 variables with a circumplex structure (Figure 11). The color coding represents a "heat map" of the correlations, with darker shades of red representing stronger negative and darker shades of blue stronger positive correlations. As an option, the value of the correlation can be shown.

Yet another way to show structure is to use "spider" plots. Particularly if variables are ordered in some meaningful way (e.g., in a circumplex), a spider plot will show this structure easily. This is just a plot of the magnitude of the correlation as a radial line, with length ranging from 0 (for a correlation of -1) to 1 (for a correlation of 1). (See Figure 12.)

3.4 Testing correlations

Correlations are wonderful descriptive statistics of the data, but some people like to test whether these correlations differ from zero or differ from each other. The cor.test function (in the stats package) will test the significance of a single correlation, and the rcorr function in the Hmisc package will do this for many correlations. In the psych package, the corr.test function reports the correlation (Pearson, Spearman, or Kendall) between all variables in either one or two data frames or matrices, as well as the number of observations for each case and the (two-tailed) probability for each correlation. Unfortunately, these probability values have not been corrected for multiple comparisons and so should be taken with a great deal of salt. Thus, in corr.test and corr.p, the raw probabilities are reported below the diagonal, and the probabilities adjusted for multiple comparisons using (by default) the Holm correction are reported above the diagonal (Table 1). (See the p.adjust function for a discussion of Holm (1979) and other corrections.)

Testing the difference between any two correlations can be done using the r.test function. The function actually does four different tests (based upon an article by Steiger (1980)),


> png('corplot.png')
> corPlot(Thurstone,numbers=TRUE,upper=FALSE,diag=FALSE,main="9 cognitive variables from Thurstone")
> dev.off()
null device
          1

Figure 10: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color. By using the 'numbers' option, the values are displayed as well. By default, the complete matrix is shown. Setting upper=FALSE and diag=FALSE shows a cleaner figure.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> corPlot(r.circ,main="24 variables in a circumplex")
> dev.off()
null device
          1

Figure 11: Using the corPlot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect. For circumplex structures, it is perhaps useful to show the complete matrix.


> png('spider.png')
> op <- par(mfrow=c(2,2))
> spider(y=c(1,6,12,18),x=1:24,data=r.circ,fill=TRUE,main="Spider plot of 24 circumplex variables")
> op <- par(mfrow=c(1,1))
> dev.off()
null device
          1

Figure 12: A spider plot can show circumplex structure very clearly. Circumplex structures are common in the study of affect.


Table 1: The corr.test function reports correlations, cell sizes, and raw and adjusted probability values. corr.p reports the probability values for a correlation matrix. By default, the adjustment used is that of Holm (1979).

> corr.test(sat.act)
Call:corr.test(x = sat.act)
Correlation matrix
          gender education   age   ACT  SATV  SATQ
gender      1.00      0.09 -0.02 -0.04 -0.02 -0.17
education   0.09      1.00  0.55  0.15  0.05  0.03
age        -0.02      0.55  1.00  0.11 -0.04 -0.03
ACT        -0.04      0.15  0.11  1.00  0.56  0.59
SATV       -0.02      0.05 -0.04  0.56  1.00  0.64
SATQ       -0.17      0.03 -0.03  0.59  0.64  1.00
Sample Size
          gender education age ACT SATV SATQ
gender       700       700 700 700  700  687
education    700       700 700 700  700  687
age          700       700 700 700  700  687
ACT          700       700 700 700  700  687
SATV         700       700 700 700  700  687
SATQ         687       687 687 687  687  687
Probability values (Entries above the diagonal are adjusted for multiple tests.)
          gender education  age  ACT SATV SATQ
gender      0.00      0.17 1.00 1.00    1    0
education   0.02      0.00 0.00 0.00    1    1
age         0.58      0.00 0.00 0.03    1    1
ACT         0.33      0.00 0.00 0.00    0    0
SATV        0.62      0.22 0.26 0.00    0    0
SATQ        0.00      0.36 0.37 0.00    0    0

To see confidence intervals of the correlations, print with the short=FALSE option.


depending upon the input:

1) For a sample size n, find the t and p value for a single correlation as well as the confidence interval.

> r.test(50,.3)
Correlation tests
Call:r.test(n = 50, r12 = 0.3)
Test of significance of a correlation
t value 2.18 with probability < 0.034
and confidence interval 0.02 0.53

2) For sample sizes of n and n2 (n2 = n if not specified), find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores.

> r.test(30,.4,.6)
Correlation tests
Call:r.test(n = 30, r12 = 0.4, r34 = 0.6)
Test of difference between two independent correlations
z value 0.99 with probability 0.32
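This value may be verified directly from the Fisher z transformed correlations and their standard error:

> (fisherz(.6) - fisherz(.4))/sqrt(1/(30-3) + 1/(30-3))
[1] 0.99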

3) For sample size n, and correlations ra = r12, rb = r23, and r13 specified, test for the difference of two dependent correlations (Steiger case A).

> r.test(103,.4,.5,.1)
Correlation tests
Call:[1] "r.test(n = 103, r12 = 0.4, r23 = 0.1, r13 = 0.5)"
Test of difference between two correlated correlations
t value -0.89 with probability < 0.37

4) For sample size n, test for the difference between two dependent correlations involving different variables (Steiger case B).

> r.test(103,.5,.6,.7,.5,.5,.8)  #Steiger Case B
Correlation tests
Call:r.test(n = 103, r12 = 0.5, r34 = 0.6, r23 = 0.7, r13 = 0.5, r14 = 0.5,
    r24 = 0.8)
Test of difference between two dependent correlations
z value -1.2 with probability 0.23

To test whether a matrix of correlations differs from what would be expected if the population correlations were all zero, the function cortest follows Steiger (1980), who pointed out that the sum of the squared elements of a correlation matrix, or the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix). This is particularly useful for examining whether correlations in a single matrix differ from zero or for comparing two matrices. Although obvious, cortest can be used to test whether the sat.act data matrix produces non-zero correlations (it does). This is a much more appropriate test when testing whether a residual matrix differs from zero.

> cortest(sat.act)


Tests of correlation matrices
Call:cortest(R1 = sat.act)
Chi Square value 1325.42 with df = 15   with probability < 1.8e-273

3.5 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, the φ will underestimate the value of the Pearson applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process (Figure 13). This is also shown in terms of dichotomizing the bivariate normal density function using the draw.cor function (Figure 14). A simple generalization of this to the case of multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous, and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

The correlation matrix resulting from a number of tetrachoric or polychoric correlations sometimes will not be positive semi-definite. This will sometimes happen if the correlation matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust the smallest eigenvalues of the correlation matrix to make them positive, rescale all of them to sum to the number of variables, and produce a "smoothed" correlation matrix. An example of this problem is the burt data set, which probably had a typo in the original correlation matrix. Smoothing the matrix corrects this problem.
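A minimal sketch, assuming the burt correlation matrix supplied with the package:

> data(burt)
> burt.smoothed <- cor.smooth(burt)  #adjusts the offending eigenvalues and rescales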

3.6 Multiple regression from data or correlation matrices

The typical application of the lm function is to do a linear model of one Y variable as a function of multiple X variables. Because lm is designed to analyze complex interactions, it requires raw data as input. It is, however, sometimes convenient to do multiple regression from a correlation or covariance matrix. The setCor function will do this, taking a set of y variables predicted from a set of x variables, perhaps with a set of z covariates removed from both x and y. Consider the Thurstone correlation matrix and find the multiple correlation of the last five variables as a function of the first 4:

> setCor(y = 5:9,x=1:4,data=Thurstone)


> draw.tetra()


Figure 13: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values.


> draw.cor(expand=20,cuts=c(0,0))


Figure 14: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values. It is found (laboriously) by optimizing the fit of the bivariate normal for various values of the correlation to the observed cell frequencies.


Call: setCor(y = 5:9, x = 1:4, data = Thurstone)

Multiple Regression from matrix input

Beta weights
                FourLetterWords Suffixes LetterSeries Pedigrees LetterGroup
Sentences                  0.09     0.07         0.25      0.21        0.20
Vocabulary                 0.09     0.17         0.09      0.16       -0.02
Sent.Completion            0.02     0.05         0.04      0.21        0.08
First.Letters              0.58     0.45         0.21      0.08        0.31

Multiple R
FourLetterWords        Suffixes    LetterSeries       Pedigrees
           0.69            0.63            0.50            0.58
    LetterGroup
           0.48

multiple R2
FourLetterWords        Suffixes    LetterSeries       Pedigrees
           0.48            0.40            0.25            0.34
    LetterGroup
           0.23

Unweighted multiple R
FourLetterWords        Suffixes    LetterSeries       Pedigrees
           0.59            0.58            0.49            0.58
    LetterGroup
           0.45

Unweighted multiple R2
FourLetterWords        Suffixes    LetterSeries       Pedigrees
           0.34            0.34            0.24            0.33
    LetterGroup
           0.20

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.6280 0.1478 0.0076 0.0049
Average squared canonical correlation = 0.2
Cohen's Set Correlation R2 = 0.69
Unweighted correlation between the two sets = 0.73

By specifying the number of subjects in the correlation matrix, appropriate estimates of standard errors, t-values, and probabilities are also found. The next example finds the regressions with variables 1 and 2 used as covariates. The β weights for variables 3 and 4 do not change, but the multiple correlation is much less. It also shows how to find the residual correlations between variables 5-9 with variables 1-4 removed.

> sc <- setCor(y = 5:9,x=3:4,data=Thurstone,z=1:2)
Call: setCor(y = 5:9, x = 3:4, data = Thurstone, z = 1:2)

Multiple Regression from matrix input

Beta weights
                FourLetterWords Suffixes LetterSeries Pedigrees LetterGroup
Sent.Completion            0.02     0.05         0.04      0.21        0.08
First.Letters              0.58     0.45         0.21      0.08        0.31

Multiple R
FourLetterWords        Suffixes    LetterSeries       Pedigrees
           0.58            0.46            0.21            0.18
    LetterGroup
           0.30

multiple R2
FourLetterWords        Suffixes    LetterSeries       Pedigrees
          0.331           0.210           0.043           0.032
    LetterGroup
          0.092

Unweighted multiple R
FourLetterWords        Suffixes    LetterSeries       Pedigrees
           0.44            0.35            0.17            0.14
    LetterGroup
           0.26

Unweighted multiple R2
FourLetterWords        Suffixes    LetterSeries       Pedigrees
           0.19            0.12            0.03            0.02
    LetterGroup
           0.07

Various estimates of between set correlations
Squared Canonical Correlations
[1] 0.405 0.023
Average squared canonical correlation = 0.21
Cohen's Set Correlation R2 = 0.42
Unweighted correlation between the two sets = 0.48

> round(sc$residual,2)
                FourLetterWords Suffixes LetterSeries Pedigrees LetterGroup
FourLetterWords            0.52     0.11         0.09      0.06        0.13
Suffixes                   0.11     0.60        -0.01      0.01        0.03
LetterSeries               0.09    -0.01         0.75      0.28        0.37
Pedigrees                  0.06     0.01         0.28      0.66        0.20
LetterGroup                0.13     0.03         0.37      0.20        0.77

3.7 Mediation and Moderation analysis

Although multiple regression is a straightforward method for determining the effect of multiple predictors (x1, x2, ... xi) on a criterion variable y, some prefer to think of the effect of one predictor, x, as mediated by another variable, m (Preacher and Hayes, 2004). Thus, we may find the indirect path from x to m, and then from m to y, as well as the direct path from x to y. Call these paths a, b, and c, respectively. Then the indirect effect of x on y through m is just ab, and the direct effect is c. Statistical tests of the ab effect are best done by bootstrapping.

Consider the example from Preacher and Hayes (2004) as analyzed using the mediate

function and the subsequent graphic from mediate.diagram. The data are found in the example for mediate.
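The output below may be produced by a call of the following form (a sketch; sobel is the example data set from the mediate help page, with SATIS, THERAPY, and ATTRIB as columns 1, 2 and 3):

> preacher <- mediate(y = 1, x = 2, m = 3, data = sobel)  # bootstraps the ab effect
> preacher  # print the mediation results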

Call: mediate(y = 1, x = 2, m = 3, data = sobel)

The DV (Y) was SATIS. The IV (X) was THERAPY. The mediating variable(s) = ATTRIB.

Total Direct effect (c) of THERAPY on SATIS = 0.76   S.E. = 0.31  t direct = 2.5  with probability = 0.019
Direct effect (c') of THERAPY on SATIS removing ATTRIB = 0.43   S.E. = 0.32  t direct = 1.35  with probability = 0.19
Indirect effect (ab) of THERAPY on SATIS through ATTRIB = 0.33
Mean bootstrapped indirect effect = 0.33 with standard error = 0.17  Lower CI = 0.04  Upper CI = 0.71
R2 of model = 0.31

To see the longer output, specify short = FALSE in the print statement.

Full output

Total effect estimates (c)
        SATIS   se   t   Prob
THERAPY  0.76 0.31 2.5 0.0186

Direct effect estimates (c')
        SATIS   se    t  Prob
THERAPY  0.43 0.32 1.35 0.190
ATTRIB   0.40 0.18 2.23 0.034

a effect estimates
       THERAPY  se    t   Prob
ATTRIB    0.82 0.3 2.74 0.0106

b effect estimates
       SATIS   se    t  Prob
ATTRIB   0.4 0.18 2.23 0.034

ab effect estimates
        SATIS boot   sd lower upper
THERAPY  0.33 0.33 0.17  0.04  0.71

4 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.


> mediate.diagram(preacher)

[Figure: mediation model diagram — THERAPY to ATTRIB (a = 0.82), ATTRIB to SATIS (b = 0.4), and THERAPY to SATIS (c = 0.76, c' = 0.43)]

Figure 15: A mediated model taken from Preacher and Hayes (2004) and solved using the mediate function. The direct path from Therapy to Satisfaction has an effect of 0.76, while the indirect path through Attribution has an effect of 0.33. Compare this to the normal regression graphic created by setCor.diagram.


> preacher <- setCor(1,c(2,3),sobel,std=FALSE)

> setCor.diagram(preacher)

[Figure: "Regression Models" — THERAPY and ATTRIB predicting SATIS with paths 0.43 and 0.4 and predictor correlation 0.21; unweighted matrix correlation = 0.56]

Figure 16: The conventional regression model for the Preacher and Hayes (2004) data set solved using the setCor function. Compare this to the previous figure.


4.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.
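As a minimal sketch of this data level approach, the unrotated principal components of a correlation matrix may be recovered directly from its eigendecomposition (using base R and the bfi data set from psych; the object names are illustrative):

> R <- cor(bfi[1:25], use = "pairwise")        # correlation matrix of 25 personality items
> ev <- eigen(R)                               # eigendecomposition R = V D V'
> pc1 <- ev$vectors[, 1] * sqrt(ev$values[1])  # loadings of the first principal component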

Several of the functions in psych address the problem of data reduction

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor analysis, weighted least squares factor analysis, generalized least squares factor analysis and maximum likelihood factor analysis. That is, it includes the functionality of three other functions that will eventually be phased out.

fa.poly is useful when finding the factor structure of categorical items. fa.poly first finds the tetrachoric or polychoric correlations between the categorical variables and then proceeds to do a normal factor analysis. By setting the n.iter option to be greater

¹ Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

² Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.


than 1, it will also find confidence intervals for the factor solution. Warning: Finding polychoric correlations is very slow, so think carefully before doing so.

factor.minres (deprecated) Minimum residual factor analysis is a least squares, iterative solution to the factor problem. minres attempts to minimize the residual (off-diagonal) correlation matrix. It produces solutions similar to maximum likelihood solutions, but will work even if the matrix is singular.

factor.pa (deprecated) Principal Axis factor analysis is a least squares, iterative solution to the factor problem. PA will work for cases where maximum likelihood techniques (factanal) will not work. The original communality estimates are either the squared multiple correlations (smc) for each item or 1.

factor.wls (deprecated) Weighted least squares factor analysis is a least squares, iterative solution to the factor problem. It minimizes the (weighted) squared residual matrix. The weights are based upon the independent contribution of each variable.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by the square root of their eigen values. Note that PCA is not the same as factor analysis and the two should not be confused.

factor.congruence The congruence between two factors is the cosine of the angle between them. This is just the cross products of the loadings divided by the sum of the squared loadings. This differs from the correlation coefficient in that the mean loading is not subtracted before taking the products. factor.congruence will find the cosines between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to determine the optimal number of factors to extract. It can be thought of as a quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c loadings per item are set to zero, where c is the level of complexity of the item) of a factor pattern matrix to the original correlation matrix. For items where the model is usually of complexity one, this is equivalent to making all except the largest loading for each item 0. This is typically the solution that the user wants to interpret. The analysis includes the MAP criterion of Velicer (1976) and a χ² estimate.

nfactors combines VSS, MAP, and a number of other fit statistics. The depressing reality is that frequently these conventional fit estimates of the number of factors do not agree.

fa.parallel The parallel factors technique compares the observed eigen values of a correlation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis (just a call to plot will suffice). If there are more than two factors, then a SPLOM of the loadings is generated.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor structure. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor structure. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm specifically asking questions about the reliability of the clusters (Revelle, 1979). Clusters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fail to increase.

4.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank. That is, can the n x n correlation matrix R be approximated by the product of an n x k factor matrix F and its transpose, plus a diagonal matrix of uniquenesses?

   R = FF' + U²    (1)

The maximum likelihood solution to this equation is found by factanal in the stats package. Five alternatives are provided in psych, all of them are included in the fa function and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls", fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to the fa function specifying the appropriate method.
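For example, the same three factor solution may be requested with each fitting function, and the approximation of equation 1 may be checked by reconstructing the correlation matrix (a sketch; an unrotated solution is used so that factor intercorrelations do not enter the reconstruction):

> f3.min  <- fa(Thurstone, 3, n.obs = 213, fm = "minres")  # the default
> f3.pa   <- fa(Thurstone, 3, n.obs = 213, fm = "pa")
> f3.wls  <- fa(Thurstone, 3, n.obs = 213, fm = "wls")
> f3.none <- fa(Thurstone, 3, n.obs = 213, rotate = "none")
> Rhat <- f3.none$loadings %*% t(f3.none$loadings) + diag(f3.none$uniquenesses)
> round(Thurstone - Rhat, 2)  # residuals should be near zero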

factor.minres attempts to minimize the off diagonal residual correlation matrix by adjusting the eigen values of the original correlation matrix. This is similar to what is done in factanal, but uses an ordinary least squares instead of a maximum likelihood fit function. The solutions tend to be more similar to the MLE solutions than are the factor.pa solutions. minres is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution was transformed into an oblique solution using the default option on rotate which uses an oblimin transformation (Table 2). Alternative rotations and transformations include "none", "varimax", "quartimax", "bentlerT", "varimin" and "geominT" (all of which are orthogonal rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ" and "cluster" which are possible oblique transformations of the solution. The default is to do an oblimin transformation. The measures of factor adequacy reflect the multiple correlations of the factors with the best fitting linear regression estimates of the factor scores (Grice, 2001).

Note that if extracting more than one factor and doing any oblique rotation, it is necessary to have the GPArotation package installed. This is checked for in the appropriate functions.

4.1.2 Principal Axis Factor Analysis

An alternative, least squares algorithm (included in fa with the fm="pa" option, or as a standalone function factor.pa) does a Principal Axis factor analysis by iteratively doing an eigen value decomposition of the correlation matrix with the diagonal replaced by the values estimated by the factors of the previous iteration. This OLS solution is not as sensitive to improper matrices as is the maximum likelihood method, and will sometimes produce more interpretable results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter parameter to 1 produces the SAS solution.
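A sketch of the single iteration (SAS style) and fully iterated principal axis solutions:

> f3.pa1 <- fa(Thurstone, 3, n.obs = 213, fm = "pa", max.iter = 1)  # the SAS-like solution
> f3.pa  <- fa(Thurstone, 3, n.obs = 213, fm = "pa")                # iterated to convergence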

The solutions from the fa, the factor.minres and factor.pa as well as the principal functions can be rotated or transformed with a number of options. Some of these call the GPArotation package. Orthogonal rotations include varimax, quartimax, varimin, and bifactor. Oblique transformations include oblimin, quartimin, biquartimin and then two targeted rotation functions, Promax and target.rot. The latter of these will transform a loadings matrix towards an arbitrary target matrix. The default is to transform towards an independent cluster solution.

Using the Thurstone data set, three factors were requested and then transformed into an independent clusters solution using target.rot (Table 3).

4.1.3 Weighted Least Squares Factor Analysis

Similar to the minres approach of minimizing the squared residuals, factor method "wls" weights the squared residuals by their uniquenesses. This tends to produce slightly smaller overall residuals. In the example of weighted least squares, the output is shown by using the print function with the cut option set to 0. That is, all loadings are shown (Table 4).

The unweighted least squares solution may be shown graphically using the fa.plot function which is called by the generic plot function (Figure 17). Factors were transformed obliquely using oblimin. These solutions may be shown as item by factor plots (Figure 17) or by a structure diagram (Figure 18).

A comparison of these three approaches suggests that the minres solution is more similar to a maximum likelihood solution and fits slightly better than the pa or wls solutions. Comparisons with SPSS suggest that the pa solution matches the SPSS OLS solution, but


Table 2: Three correlated factors from the Thurstone 9 variable problem. By default, the solution is transformed obliquely using oblimin. The extraction method is (by default) minimum residual.
> if(!require('GPArotation')) {stop('GPArotation must be installed to do rotations')} else {
+ f3t <- fa(Thurstone,3,n.obs=213)
+ f3t }

Factor Analysis using method = minres
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  MR1   MR2   MR3   h2   u2 com
Sentences        0.91 -0.04  0.04 0.82 0.18 1.0
Vocabulary       0.89  0.06 -0.03 0.84 0.16 1.0
SentCompletion   0.83  0.04  0.00 0.73 0.27 1.0
FirstLetters     0.00  0.86  0.00 0.73 0.27 1.0
FourLetterWords -0.01  0.74  0.10 0.63 0.37 1.0
Suffixes         0.18  0.63 -0.08 0.50 0.50 1.2
LetterSeries     0.03 -0.01  0.84 0.72 0.28 1.0
Pedigrees        0.37 -0.05  0.47 0.50 0.50 1.9
LetterGroup     -0.06  0.21  0.64 0.53 0.47 1.2

                       MR1  MR2  MR3
SS loadings           2.64 1.86 1.50
Proportion Var        0.29 0.21 0.17
Cumulative Var        0.29 0.50 0.67
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

With factor correlations of
     MR1  MR2  MR3
MR1 1.00 0.59 0.54
MR2 0.59 1.00 0.52
MR3 0.54 0.52 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.2 with Chi Square of 1081.97
The degrees of freedom for the model are 12 and the objective function was 0.01

The root mean square of the residuals (RMSR) is 0.01
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.58 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.82 with prob < 1

Tucker Lewis Index of factoring reliability = 1.027
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.51
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               MR1  MR2  MR3
Correlation of scores with factors            0.96 0.92 0.90
Multiple R square of scores with factors      0.93 0.85 0.81
Minimum correlation of possible factor scores 0.86 0.71 0.63


Table 3: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The extraction method was principal axis. The transformation was a targeted transformation to a simple cluster solution.
> if(!require('GPArotation')) {stop('GPArotation must be installed to do rotations')} else {
+ f3 <- fa(Thurstone,3,n.obs = 213,fm="pa")
+ f3o <- target.rot(f3)
+ f3o }

Call: NULL
Standardized loadings (pattern matrix) based upon correlation matrix
                  PA1   PA2   PA3   h2   u2
Sentences        0.89 -0.03  0.07 0.81 0.19
Vocabulary       0.89  0.07  0.00 0.80 0.20
SentCompletion   0.83  0.04  0.03 0.70 0.30
FirstLetters    -0.02  0.85 -0.01 0.73 0.27
FourLetterWords -0.05  0.74  0.09 0.57 0.43
Suffixes         0.17  0.63 -0.09 0.43 0.57
LetterSeries    -0.06 -0.08  0.84 0.69 0.31
Pedigrees        0.33 -0.09  0.48 0.37 0.63
LetterGroup     -0.14  0.16  0.64 0.45 0.55

                       PA1  PA2  PA3
SS loadings           2.45 1.72 1.37
Proportion Var        0.27 0.19 0.15
Cumulative Var        0.27 0.46 0.62
Proportion Explained  0.44 0.31 0.25
Cumulative Proportion 0.44 0.75 1.00

     PA1  PA2  PA3
PA1 1.00 0.02 0.08
PA2 0.02 1.00 0.09
PA3 0.08 0.09 1.00


Table 4: The 9 variable problem from Thurstone is a classic example of factoring where there is a higher order factor, g, that accounts for the correlation between the factors. The factors were extracted using a weighted least squares algorithm. All loadings are shown by using the cut=0 option in the print.psych function.
> f3w <- fa(Thurstone,3,n.obs = 213,fm="wls")

> print(f3w,cut=0,digits=3)

Factor Analysis using method = wls
Call: fa(r = Thurstone, nfactors = 3, n.obs = 213, fm = "wls")
Standardized loadings (pattern matrix) based upon correlation matrix
                  WLS1   WLS2   WLS3    h2    u2  com
Sentences        0.905 -0.034  0.040 0.822 0.178 1.01
Vocabulary       0.890  0.066 -0.031 0.835 0.165 1.01
SentCompletion   0.833  0.034  0.007 0.735 0.265 1.00
FirstLetters    -0.002  0.855  0.003 0.731 0.269 1.00
FourLetterWords -0.016  0.743  0.106 0.629 0.371 1.04
Suffixes         0.180  0.626 -0.082 0.496 0.504 1.20
LetterSeries     0.033 -0.015  0.838 0.719 0.281 1.00
Pedigrees        0.381 -0.051  0.464 0.505 0.495 1.95
LetterGroup     -0.062  0.209  0.632 0.527 0.473 1.24

                       WLS1  WLS2  WLS3
SS loadings           2.647 1.864 1.488
Proportion Var        0.294 0.207 0.165
Cumulative Var        0.294 0.501 0.667
Proportion Explained  0.441 0.311 0.248
Cumulative Proportion 0.441 0.752 1.000

With factor correlations of
      WLS1  WLS2  WLS3
WLS1 1.000 0.591 0.535
WLS2 0.591 1.000 0.516
WLS3 0.535 0.516 1.000

Mean item complexity = 1.2
Test of the hypothesis that 3 factors are sufficient.

The degrees of freedom for the null model are 36 and the objective function was 5.198 with Chi Square of 1081.968
The degrees of freedom for the model are 12 and the objective function was 0.014

The root mean square of the residuals (RMSR) is 0.006
The df corrected root mean square of the residuals is 0.01

The harmonic number of observations is 213 with the empirical chi square 0.531 with prob < 1
The total number of observations was 213 with Likelihood Chi Square = 2.886 with prob < 0.996

Tucker Lewis Index of factoring reliability = 1.0264
RMSEA index = 0 and the 90 % confidence intervals are NA NA
BIC = -61.45
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                               WLS1  WLS2  WLS3
Correlation of scores with factors            0.964 0.923 0.902
Multiple R square of scores with factors      0.929 0.853 0.814
Minimum correlation of possible factor scores 0.858 0.706 0.627


> plot(f3t)

[Figure: "Factor Analysis" — SPLOM of the loadings on MR1, MR2, and MR3, each axis running from 0.0 to 0.8]

Figure 17: A graphic representation of the 3 oblique factors from the Thurstone data using plot. Factors were transformed to an oblique solution using the oblimin function from the GPArotation package.


> fa.diagram(f3t)

[Figure: "Factor Analysis" — path diagram of the nine Thurstone variables loading on the three oblique factors MR1, MR2, and MR3, with the factor intercorrelations shown]

Figure 18: A graphic representation of the 3 oblique factors from the Thurstone data using fa.diagram. Factors were transformed to an oblique solution using oblimin.


that the minres solution is slightly better. At least in one test data set, the weighted least squares solutions, although fitting equally well, had slightly different structure loadings. Note that the rotations used by SPSS will sometimes use the "Kaiser Normalization". By default, the rotations used in psych do not normalize, but this can be specified as an option in fa.

4.1.4 Principal Components analysis (PCA)

An alternative to factor analysis, which is unfortunately frequently confused with factor analysis, is principal components analysis. Although the goals of PCA and FA are similar, PCA is a descriptive model of the data, while FA is a structural model. Some psychologists use PCA in a manner similar to factor analysis and thus the principal function produces output that is perhaps more understandable than that produced by princomp in the stats package. Table 5 shows a PCA of the Thurstone 9 variable problem rotated using the Promax function. Note how the loadings from the factor model are similar but smaller than the principal component loadings. This is because the PCA model attempts to account for the entire variance of the correlation matrix, while FA accounts for just the common variance. This distinction becomes most important for small correlation matrices. Also note how the goodness of fit statistics, based upon the residual off diagonal elements, is much worse than the fa solution.

4.1.5 Hierarchical and bi-factor solutions

For a long time, structural analysis of the ability domain has considered the problem of factors that are themselves correlated. These correlations may themselves be factored to produce a higher order, general factor. An alternative (Holzinger and Swineford, 1937; Jensen and Weng, 1994) is to consider the general factor affecting each item, and then to have group factors account for the residual variance. Exploratory factor solutions to produce a hierarchical or a bifactor solution are found using the omega function. This technique has more recently been applied to the personality domain to consider such things as the structure of neuroticism (treated as a general factor, with lower order factors of anxiety, depression, and aggression).

Consider the 9 Thurstone variables analyzed in the prior factor analyses. The correlations between the factors (as shown in Figure 18) can themselves be factored. This results in a higher order factor model (Figure 19). An alternative solution is to take this higher order model and then solve for the general factor loadings as well as the loadings on the residualized lower order factors using the Schmid-Leiman procedure (Figure 20). Yet another solution is to use structural equation modeling to directly solve for the general and group factors.


Table 5: The Thurstone problem can also be analyzed using Principal Components Analysis. Compare this to Table 3. The loadings are higher for the PCA because the model accounts for the unique as well as the common variance. The fit of the off diagonal elements, however, is much worse than the fa results.
> p3p <- principal(Thurstone,3,n.obs = 213,rotate="Promax")

> p3p

Principal Components Analysis
Call: principal(r = Thurstone, nfactors = 3, rotate = "Promax", n.obs = 213)
Standardized loadings (pattern matrix) based upon correlation matrix
                  RC1   RC2   RC3   h2   u2 com
Sentences        0.92  0.01  0.01 0.86 0.14 1.0
Vocabulary       0.90  0.10 -0.05 0.86 0.14 1.0
SentCompletion   0.91  0.04 -0.04 0.83 0.17 1.0
FirstLetters     0.01  0.84  0.07 0.78 0.22 1.0
FourLetterWords -0.05  0.81  0.17 0.75 0.25 1.1
Suffixes         0.18  0.79 -0.15 0.70 0.30 1.2
LetterSeries     0.03 -0.03  0.88 0.78 0.22 1.0
Pedigrees        0.45 -0.16  0.57 0.67 0.33 2.1
LetterGroup     -0.19  0.19  0.86 0.75 0.25 1.2

                       RC1  RC2  RC3
SS loadings           2.83 2.19 1.96
Proportion Var        0.31 0.24 0.22
Cumulative Var        0.31 0.56 0.78
Proportion Explained  0.41 0.31 0.28
Cumulative Proportion 0.41 0.72 1.00

With component correlations of
     RC1  RC2  RC3
RC1 1.00 0.51 0.53
RC2 0.51 1.00 0.44
RC3 0.53 0.44 1.00

Mean item complexity = 1.2
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 56.17 with prob < 1.1e-07

Fit based upon off diagonal values = 0.98


> om.h <- omega(Thurstone,n.obs=213,sl=FALSE)

> op <- par(mfrow=c(1,1))

[Figure: "Omega" — higher order solution: the nine Thurstone variables load on group factors F1, F2, and F3, which in turn load 0.8, 0.8, and 0.7 on a general factor g]

Figure 19: A higher order factor solution to the Thurstone 9 variable problem.


> om <- omega(Thurstone,n.obs=213)

[Figure: "Omega" — bifactor (Schmid-Leiman) solution: each of the nine variables loads on the general factor g (loadings 0.5 to 0.7) as well as on one of the residualized group factors F1, F2, or F3]

Figure 20: A bifactor solution to the Thurstone 9 variable problem.


Yet another approach to the bifactor structure is to use the bifactor rotation function in either psych or in GPArotation. This does the rotation discussed in Jennrich and Bentler (2011).

4.1.6 Item Cluster Analysis: iclust

An alternative to factor or components analysis is cluster analysis. The goal of cluster analysis is the same as factor or components analysis (reduce the complexity of the data and attempt to identify homogeneous subgroupings). Mainly used for clustering people or objects (e.g., projectile points if an anthropologist, DNA if a biologist, galaxies if an astronomer), clustering may be used for clustering items or tests as well. Introduced to psychologists by Tryon (1939) in the 1930s, the cluster analytic literature exploded in the 1970s and 1980s (Blashfield, 1980; Blashfield and Aldenderfer, 1988; Everitt, 1974; Hartigan, 1975). Much of the research is in taxometric applications in biology (Sneath and Sokal, 1973; Sokal and Sneath, 1963) and marketing (Cooksey and Soutar, 2006) where clustering remains very popular. It is also used for taxonomic work in forming clusters of people in family (Henry et al., 2005) and clinical psychology (Martinent and Ferrand, 2007; Mun et al., 2008). Interestingly enough, it has had limited applications to psychometrics. This is unfortunate, for as has been pointed out by e.g. Tryon (1935) and Loevinger et al. (1953), the theory of factors, while mathematically compelling, offers little that the geneticist or behaviorist or perhaps even non-specialist finds compelling. Cooksey and Soutar (2006) review why the iclust algorithm is particularly appropriate for scale construction in marketing.

Hierarchical cluster analysis forms clusters that are nested within clusters. The resulting tree diagram (also known somewhat pretentiously as a rooted dendritic structure) shows the nesting structure. Although there are many hierarchical clustering algorithms in R (e.g., agnes, hclust and iclust), the one most applicable to the problems of scale construction is iclust (Revelle, 1979):

1 Find the proximity (eg correlation) matrix

2 Identify the most similar pair of items

3 Combine this most similar pair of items to form a new variable (cluster)

4 Find the similarity of this cluster to all other items and clusters

5 Repeat steps 2 and 3 until some criterion is reached (e.g., typically, if only one cluster remains or, in iclust, if there is a failure to increase reliability coefficients α or β).

6 Purify the solution by reassigning items to the most similar cluster center


iclust forms clusters of items using a hierarchical clustering algorithm until one of two measures of internal consistency fails to increase (Revelle, 1979). The number of clusters may be specified a priori, or found empirically. The resulting statistics include the average split half reliability, α (Cronbach, 1951), as well as the worst split half reliability, β (Revelle, 1979), which is an estimate of the general factor saturation of the resulting scale (Figure 21). Cluster loadings (corresponding to the structure matrix of factor analysis) are reported when printing (Table 8). The pattern matrix is available as an object in the results.

> data(bfi)

> ic <- iclust(bfi[1:25])

[Figure: "ICLUST" — dendrogram of the 25 bfi items forming four clusters: C20 (α = 0.81, β = 0.63), C16 (α = 0.81, β = 0.76), C15 (α = 0.73, β = 0.67), and C21 (α = 0.41, β = 0.27)]

Figure 21: Using the iclust function to find the cluster structure of 25 personality items (the three demographic variables were excluded from this analysis). When analyzing many variables, the tree structure may be seen more clearly if the graphic output is saved as a pdf and then enlarged using a pdf viewer.


Table 6: The summary statistics from an iclust analysis show three large clusters and a smaller cluster.
> summary(ic)  # show the results

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

ICLUST

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

Guttman Lambda6:
 C20  C16  C15  C21
0.82 0.81 0.72 0.61

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20    C16    C15   C21
C20  0.80 -0.291  0.40 -0.33
C16 -0.24  0.815 -0.29  0.11
C15  0.30 -0.221  0.73 -0.30
C21 -0.23  0.074 -0.20  0.61


The previous analysis (Figure 21) was done using the Pearson correlation. A somewhat cleaner structure is obtained when using the polychoric function to find polychoric correlations (Figure 22). Note that finding the polychoric correlations takes some time, but the next three analyses were done using that correlation matrix (rpoly$rho). When using the console for input, polychoric will report on its progress while working, using progressBar.

Table 7: The polychoric and the tetrachoric functions can take a long time to finish and report their progress by a series of dots as they work. The dots are suppressed when creating a Sweave document.
> data(bfi)

> rpoly <- polychoric(bfi[1:25])  # the ... indicate the progress of the function

A comparison of these four cluster solutions suggests both a problem and an advantage of clustering techniques. The problem is that the solutions differ. The advantage is that the structure of the items may be seen more clearly when examining the clusters rather than a simple factor solution.

4.2 Confidence intervals using bootstrapping techniques

Exploratory factoring techniques are sometimes criticized because of the lack of statistical information on the solutions. Overall estimates of goodness of fit, including χ² and RMSEA, are found in the fa and omega functions. Confidence intervals for the factor loadings may be found by doing multiple bootstrapped iterations of the original analysis. This is done by setting the n.iter parameter to the desired number of iterations. This can be done for factoring of Pearson correlation matrices as well as polychoric/tetrachoric matrices (see Table 9). Although the example value for the number of iterations is set to 20, more conventional analyses might use 1000 bootstraps. This will take much longer.

Bootstrapped confidence intervals can also be found for the loadings of a factoring of a polychoric matrix. fa.poly will find the polychoric correlation matrix and, if the n.iter option is greater than 1, will then randomly resample the data (case wise) to give bootstrapped samples. This will take a long time for large numbers of items or iterations.
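A sketch of such a bootstrapped polychoric factoring (slow for many items; n.iter sets the number of resamples):

> fp <- fa.poly(bfi[1:10], 2, n.iter = 20)  # factor the polychorics, bootstrap 20 samples
> fp                                        # loadings are reported with confidence bounds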

4.3 Comparing factor/component/cluster solutions

Cluster analysis, factor analysis, and principal components analysis all produce structure matrices (matrices of correlations between the dimensions and the variables) that can in turn be compared in terms of Burt's congruence coefficient (also known as Tucker's


> ic.poly <- iclust(rpoly$rho,title="ICLUST using polychoric correlations")

> iclust.diagram(ic.poly)

[Figure: "ICLUST using polychoric correlations" — dendrogram of the 25 bfi items based upon polychoric correlations]

Figure 22: ICLUST of the BFI data set using polychoric correlations. Compare this solution to the previous one (Figure 21) which was done using Pearson correlations.


> ic.poly <- iclust(rpoly$rho,5,title="ICLUST using polychoric correlations for nclusters=5")

> iclust.diagram(ic.poly)

[Figure: "ICLUST using polychoric correlations for nclusters=5" — dendrogram of the 25 bfi items with the number of clusters set to 5]

Figure 23: ICLUST of the BFI data set using polychoric correlations with the solution set to 5 clusters. Compare this solution to the previous one (Figure 22), which was done without specifying the number of clusters, and to the next one (Figure 24), which was done by changing the beta criterion.


> ic.poly <- iclust(rpoly$rho,beta.size=3,title="ICLUST beta.size=3")

[Figure: "ICLUST beta.size=3" — dendrogram of the 25 bfi items based upon polychoric correlations with the beta criterion set to 3]

Figure 24: ICLUST of the BFI data set using polychoric correlations with the beta criterion set to 3. Compare this solution to the previous three (Figures 21, 22, 23).


Table 8: The output from iclust includes the loadings of each item on each cluster. These are equivalent to factor structure loadings. By specifying the value of cut, small loadings are suppressed. The default is for cut=0.
> print(ic,cut=.3)

ICLUST (Item Cluster Analysis)
Call: iclust(r.mat = bfi[1:25])

Purified Alpha:
 C20  C16  C15  C21
0.80 0.81 0.73 0.61

G6* reliability:
 C20  C16  C15  C21
0.83 1.00 0.67 0.38

Original Beta:
 C20  C16  C15  C21
0.63 0.76 0.67 0.27

Cluster size:
C20 C16 C15 C21
 10   5   5   5

Item by Cluster Structure matrix:
     O   P    C20   C16   C15   C21
A1 C20 C20
A2 C20 C20  0.59
A3 C20 C20  0.65
A4 C20 C20  0.43
A5 C20 C20  0.65
C1 C15 C15              0.54
C2 C15 C15              0.62
C3 C15 C15              0.54
C4 C15 C15        0.31 -0.66
C5 C15 C15 -0.30  0.36 -0.59
E1 C20 C20 -0.50
E2 C20 C20 -0.61  0.34
E3 C20 C20  0.59             -0.39
E4 C20 C20  0.66
E5 C20 C20  0.50        0.40 -0.32
N1 C16 C16        0.76
N2 C16 C16        0.75
N3 C16 C16        0.74
N4 C16 C16 -0.34  0.62
N5 C16 C16        0.55
O1 C20 C21                   -0.53
O2 C21 C21                    0.44
O3 C20 C21  0.39             -0.62
O4 C21 C21                   -0.33
O5 C21 C21                    0.53

With eigenvalues of:
C20 C16 C15 C21
3.2 2.6 1.9 1.5

Purified scale intercorrelations
 reliabilities on diagonal
 correlations corrected for attenuation above diagonal:
      C20   C16   C15   C21
C20  0.80 -0.29  0.40 -0.33
C16 -0.24  0.81 -0.29  0.11
C15  0.30 -0.22  0.73 -0.30
C21 -0.23  0.07 -0.20  0.61

Cluster fit = 0.68   Pattern fit = 0.96   RMSR = 0.05
NULL


Table 9: An example of bootstrapped confidence intervals on 10 items from the Big 5 inventory. The number of bootstrapped samples was set to 20. More conventional bootstrapping would use 100 or 1000 replications.
> fa(bfi[1:10],2,n.iter=20)

Factor Analysis with confidence intervals using method =  fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Factor Analysis using method = minres
Call: fa(r = bfi[1:10], nfactors = 2, n.iter = 20)
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   h2   u2 com
A1  0.07 -0.40 0.15 0.85 1.1
A2  0.02  0.65 0.44 0.56 1.0
A3 -0.03  0.77 0.57 0.43 1.0
A4  0.15  0.44 0.26 0.74 1.2
A5  0.02  0.62 0.39 0.61 1.0
C1  0.57 -0.05 0.30 0.70 1.0
C2  0.62 -0.01 0.39 0.61 1.0
C3  0.54  0.03 0.30 0.70 1.0
C4 -0.66  0.01 0.43 0.57 1.0
C5 -0.57 -0.05 0.35 0.65 1.0

                       MR2  MR1
SS loadings           1.80 1.77
Proportion Var        0.18 0.18
Cumulative Var        0.18 0.36
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

With factor correlations of
     MR2  MR1
MR2 1.00 0.32
MR1 0.32 1.00

Mean item complexity = 1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are 45 and the objective function was 2.03 with Chi Square of 5664.89
The degrees of freedom for the model are 26 and the objective function was 0.17

The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05

The harmonic number of observations is 2762 with the empirical chi square 403.38 with prob < 2.6e-69
The total number of observations was 2800 with Likelihood Chi Square = 464.04 with prob < 9.2e-82

Tucker Lewis Index of factoring reliability = 0.865
RMSEA index = 0.006 and the 90 % confidence intervals are 0.006 0.084
BIC = 257.67
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
                                               MR2  MR1
Correlation of scores with factors            0.86 0.88
Multiple R square of scores with factors      0.74 0.77
Minimum correlation of possible factor scores 0.49 0.54

Coefficients and bootstrapped confidence intervals
     low   MR2 upper   low   MR1 upper
A1  0.03  0.07  0.11 -0.44 -0.40 -0.37
A2 -0.02  0.02  0.05  0.61  0.65  0.70
A3 -0.07 -0.03  0.00  0.73  0.77  0.80
A4  0.10  0.15  0.20  0.40  0.44  0.48
A5 -0.02  0.02  0.06  0.57  0.62  0.67
C1  0.54  0.57  0.60 -0.09 -0.05 -0.02
C2  0.58  0.62  0.67 -0.04 -0.01  0.02
C3  0.48  0.54  0.58  0.00  0.03  0.08
C4 -0.69 -0.66 -0.60 -0.03  0.01  0.04
C5 -0.62 -0.57 -0.52 -0.08 -0.05 -0.02

Interfactor correlations and bootstrapped confidence intervals
        lower estimate upper
MR2-MR1  0.27     0.32  0.37
>


coefficient), which is just the cosine of the angle between the dimensions:

   c_{f_i f_j} = Σ_{k=1..n} f_ik f_jk / sqrt(Σ f_ik² Σ f_jk²)

Consider the case of a four factor solution and four cluster solution to the Big Five problem.

> f4 <- fa(bfi[1:25],4,fm="pa")

> factor.congruence(f4,ic)

      C20   C16   C15   C21
PA1  0.92 -0.32  0.44 -0.40
PA2 -0.26  0.95 -0.33  0.12
PA3  0.35 -0.24  0.88 -0.37
PA4  0.29 -0.12  0.27 -0.90

A more complete comparison of oblique factor solutions (both minres and principal axis), bifactor and component solutions to the Thurstone data set is done using the factor.congruence function (see Table 10).

Table 10: Congruence coefficients for oblique factor, bifactor and component solutions for the Thurstone problem.
> factor.congruence(list(f3t,f3o,om,p3p))

     MR1  MR2  MR3  PA1  PA2  PA3    g   F1   F2   F3   h2  RC1  RC2  RC3
MR1 1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
MR2 0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
MR3 0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
PA1 1.00 0.03 0.01 1.00 0.04 0.05 0.67 1.00 0.03 0.01 0.69 1.00 0.06 -0.04
PA2 0.06 1.00 0.01 0.04 1.00 0.00 0.57 0.06 1.00 0.01 0.54 0.04 0.99 0.05
PA3 0.13 0.06 1.00 0.05 0.00 1.00 0.54 0.13 0.06 1.00 0.53 0.10 0.01 0.99
g   0.72 0.60 0.52 0.67 0.57 0.54 1.00 0.72 0.60 0.52 0.99 0.69 0.58 0.50
F1  1.00 0.06 0.09 1.00 0.06 0.13 0.72 1.00 0.06 0.09 0.74 1.00 0.08 0.04
F2  0.06 1.00 0.08 0.03 1.00 0.06 0.60 0.06 1.00 0.08 0.57 0.04 0.99 0.12
F3  0.09 0.08 1.00 0.01 0.01 1.00 0.52 0.09 0.08 1.00 0.51 0.06 0.02 0.99
h2  0.74 0.57 0.51 0.69 0.54 0.53 0.99 0.74 0.57 0.51 1.00 0.71 0.56 0.49
RC1 1.00 0.04 0.06 1.00 0.04 0.10 0.69 1.00 0.04 0.06 0.71 1.00 0.06 0.00
RC2 0.08 0.99 0.02 0.06 0.99 0.01 0.58 0.08 0.99 0.02 0.56 0.06 1.00 0.05
RC3 0.04 0.12 0.99 -0.04 0.05 0.99 0.50 0.04 0.12 0.99 0.49 0.00 0.05 1.00

4.4 Determining the number of dimensions to extract

How many dimensions to use to represent a correlation matrix is an unsolved problem in psychometrics. There are many solutions to this problem, none of which is uniformly the best. Henry Kaiser once said that "a solution to the number-of-factors problem in factor analysis is easy, that he used to make up one every morning before breakfast. But the problem, of course, is to find the solution, or at least a solution that others will regard quite highly not as the best" (Horn and Engstrom, 1979).


Techniques most commonly used include:

1) Extracting factors until the chi square of the residual matrix is not significant.

2) Extracting factors until the change in chi square from factor n to factor n+1 is not significant.

3) Extracting factors until the eigen values of the real data are less than the corresponding eigen values of a random data set of the same size (parallel analysis) fa.parallel (Horn, 1965).

4) Plotting the magnitude of the successive eigen values and applying the scree test (a sudden drop in eigen values analogous to the change in slope seen when scrambling up the talus slope of a mountain and approaching the rock face) (Cattell, 1966).

5) Extracting factors as long as they are interpretable.

6) Using the Very Simple Structure criterion (vss) (Revelle and Rocklin, 1979).

7) Using Wayne Velicer's Minimum Average Partial (MAP) criterion (Velicer, 1976).

8) Extracting principal components until the eigen value < 1.

Each of the procedures has its advantages and disadvantages. Using either the chi square test or the change in chi square test is, of course, sensitive to the number of subjects and leads to the nonsensical condition that if one wants to find many factors, one simply runs more subjects. Parallel analysis is partially sensitive to sample size in that for large samples the eigen values of random factors will all tend towards 1. The scree test is quite appealing but can lead to differences of interpretation as to when the scree "breaks". Extracting interpretable factors means that the number of factors reflects the investigator's creativity more than the data. vss, while very simple to understand, will not work very well if the data are very factorially complex. (Simulations suggest it will work fine if the complexities of some of the items are no more than 2.) The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.

An additional problem in determining the number of factors is what is considered a factor. Many treatments of factor analysis assume that the residual correlation matrix after the factors of interest are extracted is composed of just random error. An alternative concept is that the matrix is formed from major factors of interest, but that there are also numerous minor factors of no substantive interest that account for some of the shared covariance between variables. The presence of such minor factors can lead one to extract too many factors and to reject solutions on statistical grounds of misfit that are actually very good fits to the data. This problem is partially addressed later in the discussion of simulating complex structures using sim.structure and of small extraneous factors using the sim.minor function.
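A sketch of simulating such a structure (assuming, as with sim.structure, that the simulated scores are returned in the observed element):

> set.seed(42)
> minor <- sim.minor(nvar = 12, nfact = 3, n = 500)  # 3 major factors plus minor factors
> fa.parallel(minor$observed)  # do the criteria suggest 3 factors, or more?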


4.4.1 Very Simple Structure

The vss function compares the fit of a number of factor analyses with the loading matrix "simplified" by deleting all except the c greatest loadings per item, where c is a measure of factor complexity (Revelle and Rocklin, 1979). Included in vss is the MAP criterion (Minimum Absolute Partial correlation) of Velicer (1976).

Using the Very Simple Structure criterion for the bfi data suggests that 4 factors are optimal (Figure 25). However, the MAP criterion suggests that 5 is optimal.

> vss <- vss(bfi[1:25],title="Very Simple Structure of a Big 5 inventory")

[Figure: "Very Simple Structure of a Big 5 inventory" — Very Simple Structure fit (0.0 to 1.0) by number of factors (1 to 8) for complexities 1 through 4; the complexity 1 and 2 curves peak at four factors]

Figure 25: The Very Simple Structure criterion for the number of factors compares solutions for various levels of item complexity and various numbers of factors. For the Big 5 Inventory, the complexity 1 and 2 solutions both achieve their maxima at four factors. This is in contrast to parallel analysis, which suggests 6, and the MAP criterion, which suggests 5.


> vss

Very Simple Structure of Very Simple Structure of a Big 5 inventory
Call: vss(x = bfi[1:25], title = "Very Simple Structure of a Big 5 inventory")
VSS complexity 1 achieves a maximum of 0.58 with 4 factors
VSS complexity 2 achieves a maximum of 0.74 with 4 factors
The Velicer MAP achieves a minimum of 0.01 with 5 factors
BIC achieves a minimum of -524.26 with 8 factors
Sample Size adjusted BIC achieves a minimum of -117.56 with 8 factors

Statistics by number of factors
  vss1 vss2   map dof chisq     prob sqresid  fit  RMSEA  BIC   SABIC complex
1 0.49 0.00 0.024 275 11831  0.0e+00    26.0 0.49 0.0150 9648 10522.1     1.0
2 0.54 0.63 0.018 251  7279  0.0e+00    18.9 0.63 0.0100 5287  6084.5     1.2
3 0.57 0.69 0.017 228  5010  0.0e+00    14.8 0.71 0.0075 3200  3924.3     1.3
4 0.58 0.74 0.015 206  3366  0.0e+00    11.7 0.77 0.0055 1731  2385.1     1.4
5 0.53 0.73 0.015 185  1750 1.4e-252     9.5 0.81 0.0030  281   869.3     1.6
6 0.54 0.72 0.016 165  1014 4.4e-122     8.4 0.84 0.0018 -296   228.5     1.7
7 0.52 0.70 0.019 146   696  1.4e-72     7.9 0.84 0.0013 -463     1.2     1.9
8 0.52 0.69 0.022 128   492  4.7e-44     7.4 0.85 0.0010 -524  -117.6     1.9
  eChisq  SRMR eCRMS  eBIC
1  23881 0.119 0.125 21698
2  12432 0.086 0.094 10440
3   7232 0.066 0.075  5422
4   3750 0.047 0.057  2115
5   1495 0.030 0.038    27
6    670 0.020 0.027  -639
7    448 0.016 0.023  -711
8    289 0.013 0.020  -727

4.4.2 Parallel Analysis

An alternative way to determine the number of factors is to compare the solution to random data with the same properties as the real data set. If the input is a data matrix, the comparison includes random samples from the real data as well as normally distributed random data with the same number of subjects and variables. For the BFI data, parallel analysis suggests that 6 factors might be most appropriate (Figure 26). It is interesting to compare fa.parallel with the paran from the paran package. This latter uses smcs to estimate communalities. Simulations of known structures with a particular number of major factors, but with the presence of trivial minor (but not zero) factors, show that using smcs will tend to lead to too many factors.

A more tedious problem in terms of computation is to do parallel analysis of polychoric correlation matrices. This is done by fa.parallel.poly. By default the number of replications is 20. This is appropriate when choosing the number of factors from dichotomous or polytomous data matrices.
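A sketch of such a call (with the default of 20 replications this can be slow for many items):

> fa.parallel.poly(bfi[1:25])  # parallel analysis based upon polychoric correlations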


> fa.parallel(bfi[1:25],main="Parallel Analysis of a Big 5 inventory")

Parallel analysis suggests that the number of factors = 6 and the number of components = 6

[Figure: "Parallel Analysis of a Big 5 inventory" — eigen values of principal components and factor analysis for the actual, simulated, and resampled data, plotted against factor/component number]

Figure 26: Parallel analysis compares factor and principal components solutions to the real data as well as resampled data. Although vss suggests 4 factors and MAP 5, parallel analysis suggests 6. One more demonstration of Kaiser's dictum.


4.5 Factor extension

Sometimes we are interested in the relationship of the factors in one space with the variables in a different space. One solution is to find factors in both spaces separately and then find the structural relationships between them. This is the technique of structural equation modeling in packages such as sem or lavaan. An alternative is to use the concept of factor extension developed by Dwyer (1937). Consider the case of 16 variables created to represent one two dimensional space. If factors are found from eight of these variables, they may then be extended to the additional eight variables (see Figure 27).

Another way to examine the overlap between two sets is the use of set correlation found by setCor (discussed later).

4.6 Exploratory Structural Equation Modeling (ESEM)

Generalizing the procedures of factor extension, we can do Exploratory Structural Equation Modeling (ESEM). Traditional Exploratory Factor Analysis (EFA) examines how latent variables can account for the correlations within a data set. All loadings and cross loadings are found and rotation is done to some approximation of simple structure. Traditional Confirmatory Factor Analysis (CFA) tests such models by fitting just a limited number of loadings and typically does not allow any (or many) cross loadings. Structural Equation Modeling then applies two such measurement models, one to a set of X variables, another to a set of Y variables, and then tries to estimate the correlation between these two sets of latent variables. (Some SEM procedures estimate all the parameters from the same model, thus making the loadings in set Y affect those in set X.) It is possible to do a similar exploratory modeling (ESEM) by conducting two Exploratory Factor Analyses, one in set X, one in set Y, and then finding the correlations of the X factors with the Y factors, as well as the correlations of the Y variables with the X factors and the X variables with the Y factors.

Consider the simulated data set of three ability variables, two motivational variables, and three outcome variables:

Call: sim.structural(fx = fx, Phi = Phi, fy = fy)

$model (Population correlation matrix)
        V    Q    A nach   Anx  gpa  Pre   MA
V    1.00 0.72 0.54 0.00  0.00 0.38 0.32 0.25
Q    0.72 1.00 0.48 0.00  0.00 0.34 0.28 0.22
A    0.54 0.48 1.00 0.48 -0.42 0.50 0.42 0.34
nach 0.00 0.00 0.48 1.00 -0.56 0.34 0.28 0.22


> v16 <- sim.item(16)

> s <- c(1,3,5,7,9,11,13,15)

> f2 <- fa(v16[,s],2)

> fe <- fa.extension(cor(v16)[s,-s],f2)

> fa.diagram(f2,fe=fe)

[Figure: "Factor analysis and extension" — factors MR1 and MR2 defined by the eight odd-numbered variables (left), extended to the eight even-numbered variables (right)]

Figure 27: Factor extension applies factors from one set (those on the left) to another set of variables (those on the right). fa.extension is particularly useful when one wants to define the factors with one set of variables and then apply those factors to another set. fa.diagram is used to show the structure.


Anx  0.00 0.00 -0.42 -0.56  1.00 -0.29 -0.24 -0.20
gpa  0.38 0.34  0.50  0.34 -0.29  1.00  0.30  0.24
Pre  0.32 0.28  0.42  0.28 -0.24  0.30  1.00  0.20
MA   0.25 0.22  0.34  0.22 -0.20  0.24  0.20  1.00

$reliability (population reliability)
   V    Q    A nach  Anx  gpa  Pre   MA
0.81 0.64 0.72 0.64 0.49 0.36 0.25 0.16

We can fit this by using the esem function and then draw the solution (see Figure 28) using the esem.diagram function (which is normally called automatically by esem).
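The call that produces the output and figure below (a sketch; gre.gpa is assumed to hold the sim.structural result shown above):

> esem.out <- esem(gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
+                  n.obs = 1000, plot = FALSE)
> esem.diagram(esem.out)  # draw Figure 28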

Exploratory Structural Equation Modeling Analysis using method = minres

Call: esem(r = gre.gpa$model, varsX = 1:5, varsY = 6:8, nfX = 2, nfY = 1,
    n.obs = 1000, plot = FALSE)

For the X set:
      MR1   MR2
V    0.91 -0.06
Q    0.81 -0.05
A    0.53  0.57
nach -0.10  0.81
Anx  0.08 -0.71

For the Y set:
    MR1
gpa 0.6
Pre 0.5
MA  0.4

Correlations between the X and Y sets.
     X1   X2   Y1
X1 1.00 0.19 0.68
X2 0.19 1.00 0.67
Y1 0.68 0.67 1.00

The degrees of freedom for the null model are 56 and the empirical chi square function was 6930.29
The degrees of freedom for the model are 7 and the empirical chi square function was 21.83
with prob < 0.0027

The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.04

with the empirical chi square 21.83 with prob < 0.0027
The total number of observations was 1000 with fitted Chi Square = 2175.06 with prob < 0

Empirical BIC = -26.53
ESABIC = -4.29
Fit based upon off diagonal values = 1
To see the item loadings for the X and Y sets combined, and the associated fa output, print with short=FALSE.

[Figure: "Exploratory Structural Model" — X factors MR1 and MR2 (loadings 0.9, 0.8, 0.6, 0.8, -0.7 on V, Q, A, nach, Anx) and Y factor MR1 (loadings 0.6, 0.5, 0.4 on gpa, Pre, MA), with structural paths of 0.7 from each X factor]

Figure 28: An example of an Exploratory Structural Equation Model.


5 Classical Test Theory and Reliability

Surprisingly, 110 years after Spearman (1904) introduced the concept of reliability to psychologists, there are still multiple approaches for measuring it. Although very popular, Cronbach's α (Cronbach, 1951) underestimates the reliability of a test and over estimates the first factor saturation (Revelle and Zinbarg, 2009).

α (Cronbach, 1951) is the same as Guttman's λ3 (Guttman, 1945) and may be found by

   λ3 = (n/(n-1)) (1 - tr(V)/Vx) = (n/(n-1)) (Vx - tr(V))/Vx = α

where V is the item variance-covariance matrix and Vx is the total test variance (the sum of all of the elements of V).

Perhaps because it is so easy to calculate and is available in most commercial programs, alpha is without doubt the most frequently reported measure of internal consistency reliability. Alpha is the mean of all possible split half reliabilities (corrected for test length). For a unifactorial test, it is a reasonable estimate of the first factor saturation, although if the test has any microstructure (i.e., if it is "lumpy") coefficients β (Revelle, 1979) (see iclust) and ωh (see omega) are more appropriate estimates of the general factor saturation. ωt is a better estimate of the reliability of the total test.

Guttman's λ6 (G6) considers the amount of variance in each item that can be accounted for by the linear regression of all of the other items (the squared multiple correlation or smc), or more precisely, the variance of the errors, e_j², and is

   λ6 = 1 - (Σ e_j²)/Vx = 1 - (Σ (1 - r²smc))/Vx

The squared multiple correlation is a lower bound for the item communality and, as the number of items increases, becomes a better estimate.

G6 is also sensitive to lumpiness in the test and should not be taken as a measure of unifactorial structure. For lumpy tests, it will be greater than alpha. For tests with equal item loadings, alpha > G6, but if the loadings are unequal or if there is a general factor, G6 > alpha. G6 estimates item reliability by the squared multiple correlation of the other items in a scale. A modification of G6, G6*, takes as an estimate of an item reliability the smc with all the items in an inventory, including those not keyed for a particular scale. This will lead to a better estimate of the reliable variance of a particular item.
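Both coefficients may be computed directly from a correlation matrix; a sketch for standardized items (so that Vx is the sum of all of the elements of R and tr(R) = n), using the smc function from psych:

> R <- cor(bfi[1:5], use = "pairwise")  # five Agreeableness items (A1 is reverse keyed)
> n <- ncol(R)
> Vx <- sum(R)                          # total test variance of the standardized items
> lambda3 <- (n/(n - 1)) * (1 - n/Vx)   # alpha, since tr(R) = n
> lambda6 <- 1 - sum(1 - smc(R))/Vx     # G6 from the squared multiple correlations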

Alpha, G6 and G6* are positive functions of the number of items in a test as well as the average intercorrelation of the items in the test. When calculated from the item variances and total test variance, as is done here, raw alpha is sensitive to differences in the item variances. Standardized alpha is based upon the correlations rather than the covariances.


More complete reliability analyses of a single scale can be done using the omega function which finds ωh and ωt based upon a hierarchical factor analysis.

Alternative functions scoreItems and cluster.cor will also score multiple scales and report more useful statistics. "Standardized" alpha is calculated from the inter-item correlations and will differ from raw alpha.

Functions for examining the reliability of a single scale or a set of scales include

alpha Internal consistency measures of reliability range from ωh to α to ωt. The alpha function reports two estimates: Cronbach's coefficient α and Guttman's λ6. Also reported are item-whole correlations, α if an item is omitted, and item means and standard deviations.

guttman Eight alternative estimates of test reliability include the six discussed by Guttman (1945), four discussed by ten Berge and Zegers (1978) (μ0 ... μ3), as well as β (the worst split half, Revelle, 1979), the glb (greatest lower bound) discussed by Bentler and Woodward (1980), and ωh and ωt (McDonald, 1999; Zinbarg et al., 2005).

omega Calculate McDonald's omega estimates of general and total factor saturation. (Revelle and Zinbarg (2009) compare these coefficients with real and artificial data sets.)

cluster.cor Given an n x c cluster definition matrix of -1s, 0s, and 1s (the keys), and an n x n correlation matrix, find the correlations of the composite clusters.

scoreItems Given a matrix or data.frame of k keys for m items (-1, 0, 1), and a matrix or data.frame of item scores for m items and n people, find the sum scores or average scores for each person and each scale. If the input is a square matrix, then it is assumed that correlations or covariances were used, and the raw scores are not available. In addition, report Cronbach's alpha coefficient, G6*, the average r, the scale intercorrelations, and the item by scale correlations (both raw and corrected for item overlap and scale reliability). Replace missing values with the item median or mean if desired. Will adjust scores for reverse scored items. (A minimal scoring sketch follows this list.)

score.multiple.choice Ability tests are typically multiple choice with one right answer. score.multiple.choice takes a scoring key and a data matrix (or data.frame) and finds total or average number right for each participant. Basic test statistics (alpha, average r, item means, item-whole correlations) are also reported.
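A minimal sketch of scoring a single scale with scoreItems (the keys built here with make.keys are illustrative; a minus sign marks a reversed item):

> keys <- make.keys(bfi, list(agree = c("-A1","A2","A3","A4","A5")))
> agree <- scoreItems(keys, bfi)  # returns scores, alpha, G6, average r, and more
> agree$alpha                     # alpha of the illustrative Agreeableness scale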

5.1 Reliability of a single scale

A conventional (but non-optimal) estimate of the internal consistency reliability of a test is coefficient α (Cronbach, 1951). Alternative estimates are Guttman's λ6, Revelle's β, McDonald's ωh and ωt. Consider a simulated data set, representing 9 items with a hierarchical structure and the following correlation matrix. Then using the alpha function, the α and λ6 estimates of reliability may be found for all 9 items, as well as the estimates if one item is dropped at a time.

> set.seed(17)

> r9 <- sim.hierarchical(n=500,raw=TRUE)$observed

> round(cor(r9),2)

     V1   V2   V3   V4   V5   V6   V7   V8   V9
V1 1.00 0.58 0.59 0.41 0.44 0.30 0.40 0.31 0.25
V2 0.58 1.00 0.48 0.35 0.36 0.22 0.24 0.29 0.19
V3 0.59 0.48 1.00 0.32 0.35 0.23 0.29 0.20 0.17
V4 0.41 0.35 0.32 1.00 0.44 0.36 0.26 0.27 0.22
V5 0.44 0.36 0.35 0.44 1.00 0.32 0.24 0.23 0.20
V6 0.30 0.22 0.23 0.36 0.32 1.00 0.26 0.26 0.12
V7 0.40 0.24 0.29 0.26 0.24 0.26 1.00 0.38 0.25
V8 0.31 0.29 0.20 0.27 0.23 0.26 0.38 1.00 0.25
V9 0.25 0.19 0.17 0.22 0.20 0.12 0.25 0.25 1.00

> alpha(r9)

Reliability analysis
Call: alpha(x = r9)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
        0.8       0.8     0.8      0.31   4 0.013 0.017 0.63

 lower alpha upper     95% confidence boundaries
  0.77   0.8  0.83

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.75      0.75    0.75      0.28 3.1    0.017
V2      0.77      0.77    0.77      0.30 3.4    0.015
V3      0.77      0.77    0.77      0.30 3.4    0.015
V4      0.77      0.77    0.77      0.30 3.4    0.015
V5      0.78      0.78    0.77      0.30 3.5    0.015
V6      0.79      0.79    0.79      0.32 3.8    0.014
V7      0.78      0.78    0.78      0.31 3.6    0.015
V8      0.79      0.79    0.78      0.32 3.7    0.014
V9      0.80      0.80    0.80      0.34 4.0    0.013

 Item statistics
     n raw.r std.r r.cor r.drop    mean   sd
V1 500  0.77  0.76  0.76   0.67  0.0327 1.02
V2 500  0.67  0.66  0.62   0.55  0.1071 1.00
V3 500  0.65  0.65  0.61   0.53 -0.0462 1.01
V4 500  0.65  0.65  0.59   0.53 -0.0091 0.98
V5 500  0.64  0.64  0.58   0.52 -0.0012 1.03
V6 500  0.55  0.55  0.46   0.41 -0.0499 0.99
V7 500  0.60  0.60  0.52   0.46  0.0481 1.01
V8 500  0.58  0.57  0.49   0.43  0.0526 1.07
V9 500  0.47  0.48  0.36   0.32  0.0164 0.97

Some scales have items that need to be reversed before being scored. Rather than reversing the items in the raw data, it is more convenient to just specify which items need to be reverse scored. This may be done in alpha by specifying a keys vector of 1s and -1s. (This concept of a keys vector is more useful when scoring multiple scale inventories, see below.) As an example, consider scoring the 7 attitude items in the attitude data set. Assume a conceptual mistake in that items 2 and 6 (complaints and critical) are to be scored (incorrectly) negatively.

> alpha(attitude, keys=c("complaints","critical"))

Reliability analysis
Call: alpha(x = attitude, keys = c("complaints", "critical"))

  raw_alpha std.alpha G6(smc) average_r  S/N  ase mean  sd
       0.18      0.27    0.66      0.05 0.37 0.19   53 4.7

 lower alpha upper     95% confidence boundaries
 -0.18  0.18  0.55

 Reliability if an item is dropped:
            raw_alpha std.alpha G6(smc) average_r    S/N alpha se
rating         -0.023     0.090    0.52    0.0162  0.099    0.265
complaints-     0.689     0.666    0.76    0.2496  1.995    0.078
privileges     -0.133     0.021    0.58    0.0036  0.022    0.282
learning       -0.459    -0.251    0.41   -0.0346 -0.201    0.363
raises         -0.178    -0.062    0.47   -0.0098 -0.058    0.295
critical-       0.364     0.473    0.76    0.1299  0.896    0.139
advance        -0.137    -0.016    0.52   -0.0026 -0.016    0.258

 Item statistics
             n  raw.r  std.r r.cor r.drop mean   sd
rating      30  0.600  0.601  0.63   0.28   65 12.2
complaints- 30 -0.542 -0.559 -0.78  -0.75   50 13.3
privileges  30  0.679  0.663  0.57   0.39   53 12.2
learning    30  0.857  0.853  0.90   0.70   56 11.7
raises      30  0.719  0.730  0.77   0.50   65 10.4
critical-   30  0.015  0.036 -0.29  -0.27   42  9.9
advance     30  0.688  0.694  0.66   0.46   43 10.3

Note how the reliability of the 7 item scale with incorrectly reversed items is very poor, but if items 2 and 6 are dropped, then the reliability is improved substantially. This suggests that items 2 and 6 were incorrectly scored. Doing the analysis again with the items positively scored produces much more favorable results.

> alpha(attitude)

Reliability analysis
Call: alpha(x = attitude)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
       0.84      0.84    0.88      0.43 5.2 0.042   60 8.2

 lower alpha upper     95% confidence boundaries
  0.76  0.84  0.93

 Reliability if an item is dropped:
           raw_alpha std.alpha G6(smc) average_r S/N alpha se
rating          0.81      0.81    0.83      0.41 4.2    0.052
complaints      0.80      0.80    0.82      0.39 3.9    0.057
privileges      0.83      0.82    0.87      0.44 4.7    0.048
learning        0.80      0.80    0.84      0.40 4.0    0.054
raises          0.80      0.78    0.83      0.38 3.6    0.056
critical        0.86      0.86    0.89      0.51 6.3    0.038
advance         0.84      0.83    0.86      0.46 5.0    0.043

 Item statistics
            n raw.r std.r r.cor r.drop mean   sd
rating     30  0.78  0.76  0.75   0.67   65 12.2
complaints 30  0.84  0.81  0.82   0.74   67 13.3
privileges 30  0.70  0.68  0.60   0.56   53 12.2
learning   30  0.81  0.80  0.78   0.71   56 11.7
raises     30  0.85  0.86  0.85   0.79   65 10.4
critical   30  0.42  0.45  0.31   0.27   75  9.9
advance    30  0.60  0.62  0.56   0.46   43 10.3

It is useful when considering items for a potential scale to examine the item distribution. This is done in scoreItems as well as in alpha.

> items <- sim.congeneric(N=500, short=FALSE, low=-2, high=2, categorical=TRUE)  #500 responses to 4 discrete items
> alpha(items$observed)  #item response analysis of congeneric measures

Reliability analysis
Call: alpha(x = items$observed)

  raw_alpha std.alpha G6(smc) average_r S/N   ase  mean   sd
       0.72      0.72    0.67      0.39 2.6 0.021 0.056 0.73

 lower alpha upper     95% confidence boundaries
  0.68  0.72  0.76

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
V1      0.59      0.59    0.49      0.32 1.4    0.032
V2      0.65      0.65    0.56      0.38 1.9    0.027
V3      0.67      0.67    0.59      0.41 2.0    0.026
V4      0.72      0.72    0.64      0.46 2.5    0.022

 Item statistics
     n raw.r std.r r.cor r.drop  mean   sd
V1 500  0.81  0.81  0.74   0.62 0.058 0.97
V2 500  0.74  0.75  0.63   0.52 0.012 0.98
V3 500  0.72  0.72  0.57   0.48 0.056 0.99
V4 500  0.68  0.67  0.48   0.41 0.098 1.02

Non missing response frequency for each item
     -2   -1    0    1    2 miss
V1 0.04 0.24 0.40 0.25 0.07    0
V2 0.06 0.23 0.40 0.24 0.06    0
V3 0.05 0.25 0.37 0.26 0.07    0
V4 0.06 0.21 0.37 0.29 0.07    0


5.2 Using omega to find the reliability of a single scale

Two alternative estimates of reliability that take into account the hierarchical structure of the inventory are McDonald's ωh and ωt. These may be found using the omega function for an exploratory analysis (See Figure 29) or omegaSem for a confirmatory analysis using the sem package based upon the exploratory solution from omega.

McDonald has proposed coefficient omega (hierarchical) (ωh) as an estimate of the general factor saturation of a test. Zinbarg et al. (2005) http://personality-project.org/revelle/publications/zinbarg.revelle.pmet.05.pdf compare McDonald's ωh to Cronbach's α and Revelle's β. They conclude that ωh is the best estimate. (See also Zinbarg et al. (2006) and Revelle and Zinbarg (2009) http://personality-project.org/revelle/publications/revelle.zinbarg.08.pdf )

One way to find ωh is to do a factor analysis of the original data set, rotate the factors obliquely, factor that correlation matrix, do a Schmid-Leiman (schmid) transformation to find general factor loadings, and then find ωh. A sketch of these steps follows.
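A minimal sketch, assuming standardized items and reusing the simulated r9 data from above (these are illustrative hand calculations, not the internal code of omega):

library(psych)
R <- cor(r9)
sl <- schmid(R, nfactors=3)$sl  #oblique factors followed by a Schmid-Leiman transformation
g <- sl[, 1]                    #the general factor loadings
omega.h <- sum(g)^2 / sum(R)    #(1'cc'1)/Vx for standardized items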

ωh differs slightly as a function of how the factors are estimated. Four options are available: the default will do a minimum residual factor analysis, fm="pa" does a principal axes factor analysis (factor.pa), fm="mle" uses the factanal function, and fm="pc" does a principal components analysis (principal).

For ability items, it is typically the case that all items will have positive loadings on the general factor. However, for non-cognitive items it is frequently the case that some items are to be scored positively, and some negatively. Although it is probably better to specify which directions the items are to be scored by specifying a key vector, if flip = TRUE (the default), items will be reversed so that they have positive loadings on the general factor. The keys are reported so that scores can be found using the scoreItems function. Arbitrarily reversing items this way can overestimate the general factor. (See the example with a simulated circumplex.)

β, an alternative to ω, is defined as the worst split half reliability. It can be estimated by using iclust (Item Cluster analysis: a hierarchical clustering algorithm). For a very complimentary review of why the iclust algorithm is useful in scale construction, see Cooksey and Soutar (2006).

The omega function uses exploratory factor analysis to estimate the ωh coefficient. It is important to remember that "A recommendation that should be heeded, regardless of the method chosen to estimate ωh, is to always examine the pattern of the estimated general factor loadings prior to estimating ωh. Such an examination constitutes an informal test of the assumption that there is a latent variable common to all of the scale's indicators that can be conducted even in the context of EFA. If the loadings were salient for only a relatively small subset of the indicators, this would suggest that there is no true general factor underlying the covariance matrix. Just such an informal assumption test would have afforded a great deal of protection against the possibility of misinterpreting the misleading ωh estimates occasionally produced in the simulations reported here." (Zinbarg et al., 2006, p 137).

Although ωh is uniquely defined only for cases where 3 or more subfactors are extracted, it is sometimes desired to have a two factor solution. By default this is done by forcing the schmid extraction to treat the two subfactors as having equal loadings.

There are three possible options for this condition: setting the general factor loadings between the two lower order factors to be "equal" (which will be $\sqrt{r_{ab}}$, where $r_{ab}$ is the oblique correlation between the factors), or to "first" or "second", in which case the general factor is equated with either the first or second group factor. A message is issued suggesting that the model is not really well defined. This solution is discussed in Zinbarg et al. (2007). To do this in omega, add the option="first" or option="second" to the call, as in the sketch below.
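An illustrative call (a sketch; output not shown in this text):

om2 <- omega(r9, nfactors=2, option="first")  #equate g with the first group factor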

Although obviously not meaningful for a 1 factor solution, it is of course possible to find the sum of the loadings on the first (and only) factor, square them, and compare them to the overall matrix variance. This is done, with appropriate complaints.

In addition to ωh, another of McDonald's coefficients is ωt. This is an estimate of the total reliability of a test.

McDonald's ωt, which is similar to Guttman's λ6 (see guttman), uses the estimates of uniqueness $u^2$ from factor analysis to find $e_j^2$. This is based on a decomposition of the variance of a test score, $V_x$, into four parts: that due to a general factor, $\vec{g}$, that due to a set of group factors, $\vec{f}$ (factors common to some but not all of the items), specific factors, $\vec{s}$, unique to each item, and $\vec{e}$, random error. (Because specific variance can not be distinguished from random error unless the test is given at least twice, some combine these both into error.)

Letting

$$\vec{x} = \vec{c}g + \vec{A}\vec{f} + \vec{D}\vec{s} + \vec{e}$$

then the communality of item $j$, based upon general as well as group factors, is $h_j^2 = c_j^2 + \sum f_{ij}^2$ and the unique variance for the item, $u_j^2 = \sigma_j^2(1 - h_j^2)$, may be used to estimate the test reliability. That is, if $h_j^2$ is the communality of item $j$, based upon general as well as group factors, then for standardized items, $e_j^2 = 1 - h_j^2$ and

$$\omega_t = \frac{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}{V_x} = 1 - \frac{\sum(1 - h_j^2)}{V_x} = 1 - \frac{\sum u_j^2}{V_x}$$

Because $h_j^2 \geq r_{smc}^2$, $\omega_t \geq \lambda_6$.
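The ωt decomposition can likewise be computed by hand. The following minimal sketch again assumes standardized items (so that Vx is the sum of the correlation matrix) and reuses the simulated r9 data:

library(psych)
R <- cor(r9)
sl <- schmid(R, nfactors=3)$sl
h2 <- rowSums(sl[, 1:4]^2)           #communality from g plus the three group factors
omega.t <- 1 - sum(1 - h2) / sum(R)  #1 - sum(u2)/Vx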

It is important to distinguish here between the two ω coefficients of McDonald, 1978 and Equation 6.20a of McDonald, 1999, ωt and ωh. While the former is based upon the sum of squared loadings on all the factors, the latter is based upon the sum of the squared loadings on the general factor:

$$\omega_h = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{V_x}$$

Another estimate reported is the omega for an infinite length test with a structure similar to the observed test. This is found by

$$\omega_{inf} = \frac{\vec{1}\vec{c}\vec{c}'\vec{1}}{\vec{1}\vec{c}\vec{c}'\vec{1} + \vec{1}\vec{A}\vec{A}'\vec{1}'}$$

> om9 <- omega(r9, title="9 simulated variables")

[Figure 29 appears here: the omega (Schmid-Leiman) diagram for the 9 simulated variables, with three group factors (V1-V3, V4-V6, V7-V9) and a general factor g.]

Figure 29: A bifactor solution for 9 simulated variables with a hierarchical structure.

In the case of these simulated 9 variables, the amount of variance attributable to a general factor (ωh) is quite large, and the reliability of the set of 9 items is somewhat greater than that estimated by α or λ6.


> om9

9 simulated variables
Call: omega(m = r9, title = "9 simulated variables")
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.72  0.45             0.72 0.28 0.71
V2 0.58  0.37             0.47 0.53 0.71
V3 0.57  0.41             0.49 0.51 0.66
V4 0.57        0.41       0.50 0.50 0.66
V5 0.55        0.32       0.41 0.59 0.73
V6 0.43        0.27       0.28 0.72 0.66
V7 0.47              0.47 0.44 0.56 0.50
V8 0.43              0.42 0.36 0.64 0.51
V9 0.32              0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.001  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09
RMSEA index =  0.01  and the 90 % confidence intervals are  0.01 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19


5.3 Estimating ωh using Confirmatory Factor Analysis

The omegaSem function will do an exploratory analysis and then take the highest loading items on each factor and do a confirmatory factor analysis using the sem package. These results can produce slightly different estimates of ωh, primarily because cross loadings are modeled as part of the general factor.

> omegaSem(r9, n.obs=500, lavaan=FALSE)

Call: omegaSem(m = r9, n.obs = 500, lavaan = FALSE)
Omega
Call: omega(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip,
    digits = digits, title = title, sl = sl, labels = labels,
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option)
Alpha:                 0.8
G.6:                   0.8
Omega Hierarchical:    0.69
Omega H asymptotic:    0.82
Omega Total            0.83

Schmid Leiman Factor loadings greater than  0.2
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.72  0.45             0.72 0.28 0.71
V2 0.58  0.37             0.47 0.53 0.71
V3 0.57  0.41             0.49 0.51 0.66
V4 0.57        0.41       0.50 0.50 0.66
V5 0.55        0.32       0.41 0.59 0.73
V6 0.43        0.27       0.28 0.72 0.66
V7 0.47              0.47 0.44 0.56 0.50
V8 0.43              0.42 0.36 0.64 0.51
V9 0.32              0.23 0.16 0.84 0.63

With eigenvalues of:
   g  F1*  F2*  F3*
2.48 0.52 0.47 0.36

general/max  4.76   max/min =   1.46
mean percent general =  0.64    with sd =  0.08 and cv of  0.13
Explained Common Variance of the general factor =  0.65

The degrees of freedom are 12  and the fit is  0.03
The number of observations was  500  with Chi Square =  15.86  with prob <  0.2
The root mean square of the residuals is  0.02
The df corrected root mean square of the residuals is  0.03
RMSEA index =  0.001  and the 90 % confidence intervals are  NA 0.055
BIC =  -58.71

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 27  and the fit is  0.32
The number of observations was  500  with Chi Square =  159.82  with prob <  8.3e-21
The root mean square of the residuals is  0.08
The df corrected root mean square of the residuals is  0.09
RMSEA index =  0.01  and the 90 % confidence intervals are  0.01 0.114
BIC =  -7.97

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.84  0.59  0.61  0.54
Multiple R square of scores with factors      0.71  0.35  0.37  0.29
Minimum correlation of factor score estimates 0.43 -0.30 -0.25 -0.41

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.56 0.65
Omega general for total scores and subscales  0.69 0.55 0.30 0.46
Omega group for total scores and subscales    0.12 0.24 0.26 0.19

 The following analyses were done using the sem package

 Omega Hierarchical from a confirmatory model using sem =  0.72
 Omega Total from a confirmatory model using sem =  0.83
With loadings of
      g   F1*   F2*   F3*   h2   u2   p2
V1 0.74  0.40             0.71 0.29 0.77
V2 0.58  0.36             0.47 0.53 0.72
V3 0.56  0.42             0.50 0.50 0.63
V4 0.57        0.45       0.53 0.47 0.61
V5 0.58        0.25       0.40 0.60 0.84
V6 0.43        0.26       0.26 0.74 0.71
V7 0.49              0.38 0.38 0.62 0.63
V8 0.44              0.45 0.39 0.61 0.50
V9 0.34              0.24 0.17 0.83 0.68

With eigenvalues of:
   g  F1*  F2*  F3*
2.60 0.47 0.40 0.33

general/max  5.51   max/min =  1.41
mean percent general =  0.68    with sd =  0.1 and cv of  0.15
Explained Common Variance of the general factor =  0.68

Measures of factor score adequacy
                                                 g   F1*   F2*   F3*
Correlation of scores with factors            0.87  0.59  0.59  0.55
Multiple R square of scores with factors      0.75  0.35  0.34  0.30
Minimum correlation of factor score estimates 0.50 -0.31 -0.31 -0.39

 Total, General and Subset omega for each subset
                                                 g  F1*  F2*  F3*
Omega total for total scores and subscales    0.83 0.79 0.57 0.65
Omega general for total scores and subscales  0.72 0.57 0.33 0.48
Omega group for total scores and subscales    0.11 0.22 0.24 0.18

To get the standard sem fit statistics, ask for summary on the fitted object.

5.3.1 Other estimates of reliability

Other estimates of reliability are found by the splitHalf and guttman functions. These are described in more detail in Revelle and Zinbarg (2009) and in Revelle and Condon (2014). They include the 6 estimates from Guttman, four from TenBerge, and an estimate of the greatest lower bound.

> splitHalf(r9)

Split half reliabilities
Call: splitHalf(r = r9)

Maximum split half reliability (lambda 4) =  0.84
Guttman lambda 6                          =  0.8
Average split half reliability            =  0.79
Guttman lambda 3 (alpha)                  =  0.8
Minimum split half reliability (beta)     =  0.72

5.4 Reliability and correlations of multiple scales within an inventory

A typical research question in personality involves an inventory of multiple items purporting to measure multiple constructs. For example, the data set bfi includes 25 items thought to measure five dimensions of personality (Extraversion, Emotional Stability, Conscientiousness, Agreeableness and Openness). The data may either be the raw data or a correlation matrix (scoreItems), or just a correlation matrix of the items (cluster.cor and cluster.loadings). When finding reliabilities for multiple scales, item reliabilities can be estimated using the squared multiple correlation of an item with all other items, not just those that are keyed for a particular scale. This leads to an estimate of G6*.

5.4.1 Scoring from raw data

To score these five scales from the 25 items, use the scoreItems function and a list of items to be scored on each scale (a keys.list). Items may be listed by location (convenient, but dangerous) or name (probably safer).

Make a keys.list by specifying the items for each scale, preceding items to be negatively keyed with a - sign.

> #the newer way is probably preferred
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+   conscientious=c("C1","C2","C2","-C4","-C5"),
+   extraversion=c("-E1","-E2","E3","E4","E5"),
+   neuroticism=c("N1","N2","N3","N4","N5"),
+   openness = c("O1","-O2","O3","O4","-O5"))
> #this can also be done by location--
> keys.list <- list(Agree=c(-1,2:5), Conscientious=c(6:8,-9,-10),
+   Extraversion=c(-11,-12,13:15), Neuroticism=c(16:20),
+   Openness = c(21,-22,23,24,-25))
> #These two approaches can be mixed if desired
> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"), conscientious=c("C1","C2","C2","-C4","-C5"),
+   extraversion=c("-E1","-E2","E3","E4","E5"),
+   neuroticism=c(16:20), openness = c(21,-22,23,24,-25))
> keys.list

$agree
[1] "-A1" "A2"  "A3"  "A4"  "A5"

$conscientious
[1] "C1"  "C2"  "C2"  "-C4" "-C5"

$extraversion
[1] "-E1" "-E2" "E3"  "E4"  "E5"

$neuroticism
[1] 16 17 18 19 20

$openness
[1] 21 -22 23 24 -25

In the past (prior to version 1.6.9), the keys.list was then converted to a keys matrix using the helper function make.keys. This is no longer necessary. Logically, scales are merely the weighted composites of a set of items. The weights used are -1, 0, and 1: 0 implies do not use that item in the scale, 1 implies a positive weight (add the item to the total score), and -1 a negative weight (subtract the item from the total score, i.e., reverse score the item). Reverse scoring an item is equivalent to subtracting the item from the maximum + minimum possible value for that item. The minima and maxima can be estimated from all the items, or can be specified by the user.

There are two different ways that scale scores tend to be reported. Social psychologists and educational psychologists tend to report the scale score as the average item score, while many personality psychologists tend to report the total item score. The default option for scoreItems is to report item averages (which thus allows interpretation in the same metric as the items), but totals can be found as well. Personality researchers should be encouraged to report scores based upon item means and avoid using the total score, although some reviewers are adamant about following the tradition of total scores.

The printed output includes coefficients α and G6*, the average correlation of the items within the scale (corrected for item overlap and scale reliability), as well as the correlations between the scales (below the diagonal; the correlations above the diagonal are corrected for attenuation). As is the case for most of the psych functions, additional information is returned as part of the object.

First, create a keys matrix using the make.keys function. (The keys matrix could also be prepared externally using a spreadsheet and then copying it into R.) Although not normally necessary, show the keys to understand what is happening. There are two ways to make up the keys. You can specify the items by location (the old way) or by name (the newer and probably preferred way). To use the newer way, you must specify the file on which you will use the keys. The example below shows how to construct keys either way.

Note that the number of items to specify in the make.keys function is the total number of items in the inventory. This is done automatically in the new way of forming keys, but if using the older way, the number must be specified. That is, if scoring just 5 items from a 25 item inventory, make.keys should be told that there are 25 items. make.keys just changes a list of items on each scale to make up a scoring matrix. Because the bfi data set has 25 items as well as 3 demographic items, the number of variables is specified as 28.

Then use this keys list to score the items

> scores <- scoreItems(keys.list, bfi)
> scores

Call: scoreItems(keys = keys.list, items = bfi)

(Unstandardized) Alpha:
      agree conscientious extraversion neuroticism openness
alpha   0.7          0.69         0.76        0.81      0.6

Standard errors of unstandardized Alpha:
    agree conscientious extraversion neuroticism openness
ASE 0.014         0.016        0.013       0.011    0.017

Average item correlation:
          agree conscientious extraversion neuroticism openness
average.r  0.32          0.35         0.39        0.46     0.23

Guttman 6* reliability:
         agree conscientious extraversion neuroticism openness
Lambda.6   0.7          0.69         0.76        0.81      0.6

Signal/Noise based upon av.r:
             agree conscientious extraversion neuroticism openness
Signal/Noise   2.3           2.2          3.2         4.3      1.5

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, alpha on the diagonal
corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.70          0.36         0.63      -0.245     0.23
conscientious  0.25          0.69         0.37      -0.334     0.33
extraversion   0.46          0.27         0.76      -0.284     0.32
neuroticism   -0.18         -0.25        -0.22       0.812    -0.12
openness       0.15          0.21         0.22      -0.086     0.60

 In order to see the item by scale loadings and frequency counts of the data
 print with the short option = FALSE

To see the additional information (the raw correlations, the individual scores, etc.), they may be specified by name. Then, to visualize the correlations between the raw scores, use the pairs.panels function on the scores values of scores. (See Figure 30.)
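For example, a brief sketch (the element names follow the scoreItems output object used above):

scores$corrected        #scale intercorrelations corrected for attenuation
scores$item.corrected   #item by scale correlations corrected for item overlap
pairs.panels(scores$scores, pch='.')  #the plot of Figure 30, without the png device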

5.4.2 Forming scales from a correlation matrix

There are some situations when the raw data are not available, but the correlation matrix between the items is available. In this case, it is not possible to find individual scores, but it is possible to find the reliability and intercorrelations of the scales. This may be done using the cluster.cor function or the scoreItems function. The use of a keys matrix is the same as in the raw data case.

Consider the same bfi data set, but first find the correlations, and then use scoreItems.

> r.bfi <- cor(bfi, use="pairwise")
> scales <- scoreItems(keys.list, r.bfi)
> summary(scales)

Call: scoreItems(keys = keys.list, items = r.bfi)

Scale intercorrelations corrected for attenuation
raw correlations below the diagonal, (standardized) alpha on the diagonal
corrected correlations above the diagonal:
              agree conscientious extraversion neuroticism openness
agree          0.71          0.35         0.64      -0.242     0.25
conscientious  0.24          0.69         0.38      -0.314     0.33
extraversion   0.47          0.27         0.76      -0.278     0.35
neuroticism   -0.18         -0.24        -0.22       0.815    -0.11
openness       0.16          0.22         0.24      -0.074     0.61

To find the correlations of the items with each of the scales (the "structure" matrix) or the correlations of the items controlling for the other scales (the "pattern" matrix), use the


> png('scores.png')
> pairs.panels(scores$scores, pch='.', jiggle=TRUE)
> dev.off()

Figure 30: A graphic analysis of the Big Five scales found by using the scoreItems function. The pairwise plot allows us to see that some participants have reached the ceiling of the scale for these 5 item scales. Using the pch='.' option in pairs.panels is recommended when plotting many cases. The data points were "jittered" by setting jiggle=TRUE. Jiggling this way shows the density more clearly. To save space, the figure was done as a png. For a clearer figure, save as a pdf.


cluster.loadings function. To do both at once (e.g., the correlations of the scales as well as the item by scale correlations), it is also possible to just use scoreItems.

5.5 Scoring Multiple Choice Items

Some items (typically associated with ability tests) are not themselves mini-scales ranging from low to high levels of expression of the item of interest, but are rather multiple choice, where one response is the correct response. Two analyses are useful for this kind of item: examining the response patterns to all the alternatives (looking for good or bad distractors), and scoring the items as correct or incorrect. Both of these operations may be done using the score.multiple.choice function. Consider the 16 example items taken from an online ability test at the Personality Project: http://test.personality-project.org. This is part of the Synthetic Aperture Personality Assessment (SAPA) study discussed in Revelle et al. (2011, 2010).

> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> score.multiple.choice(iq.keys, iqitems)
Call: score.multiple.choice(key = iq.keys, data = iqitems)

(Unstandardized) Alpha:
[1] 0.84

Average item correlation:
[1] 0.25

item statistics
          key    0    1    2    3    4    5    6    7    8 miss    r    n mean   sd
reason.4    4 0.05 0.05 0.11 0.10 0.64 0.03 0.02 0.00 0.00    0 0.59 1523 0.64 0.48
reason.16   4 0.04 0.06 0.08 0.10 0.70 0.01 0.00 0.00 0.00    0 0.53 1524 0.70 0.46
reason.17   4 0.05 0.03 0.05 0.03 0.70 0.03 0.11 0.00 0.00    0 0.59 1523 0.70 0.46
reason.19   6 0.04 0.02 0.13 0.03 0.06 0.10 0.62 0.00 0.00    0 0.56 1523 0.62 0.49
letter.7    6 0.05 0.01 0.05 0.03 0.11 0.14 0.60 0.00 0.00    0 0.58 1524 0.60 0.49
letter.33   3 0.06 0.10 0.13 0.57 0.04 0.09 0.02 0.00 0.00    0 0.56 1523 0.57 0.50
letter.34   4 0.04 0.09 0.07 0.11 0.61 0.05 0.02 0.00 0.00    0 0.59 1523 0.61 0.49
letter.58   4 0.06 0.14 0.09 0.09 0.44 0.16 0.01 0.00 0.00    0 0.58 1525 0.44 0.50
matrix.45   5 0.04 0.01 0.06 0.14 0.18 0.53 0.04 0.00 0.00    0 0.51 1523 0.53 0.50
matrix.46   2 0.04 0.12 0.55 0.07 0.11 0.06 0.05 0.00 0.00    0 0.52 1524 0.55 0.50
matrix.47   2 0.04 0.05 0.61 0.07 0.11 0.06 0.06 0.00 0.00    0 0.55 1523 0.61 0.49
matrix.55   4 0.04 0.02 0.18 0.14 0.37 0.07 0.18 0.00 0.00    0 0.45 1524 0.37 0.48
rotate.3    3 0.04 0.03 0.04 0.19 0.22 0.15 0.05 0.12 0.15    0 0.51 1523 0.19 0.40
rotate.4    2 0.04 0.03 0.21 0.05 0.18 0.04 0.04 0.25 0.15    0 0.56 1523 0.21 0.41
rotate.6    6 0.04 0.22 0.02 0.05 0.14 0.05 0.30 0.04 0.14    0 0.55 1523 0.30 0.46
rotate.8    7 0.04 0.03 0.21 0.07 0.16 0.05 0.13 0.19 0.13    0 0.48 1524 0.19 0.39

> #just convert the items to true or false
> iq.tf <- score.multiple.choice(iq.keys, iqitems, score=FALSE)
> describe(iq.tf)  #compare to previous results

          vars    n mean   sd median trimmed mad min max range  skew kurtosis   se
reason.4     1 1523 0.64 0.48      1    0.68   0   0   1     1 -0.58    -1.66 0.01
reason.16    2 1524 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.17    3 1523 0.70 0.46      1    0.75   0   0   1     1 -0.86    -1.26 0.01
reason.19    4 1523 0.62 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
letter.7     5 1524 0.60 0.49      1    0.62   0   0   1     1 -0.41    -1.84 0.01
letter.33    6 1523 0.57 0.50      1    0.59   0   0   1     1 -0.29    -1.92 0.01
letter.34    7 1523 0.61 0.49      1    0.64   0   0   1     1 -0.46    -1.79 0.01
letter.58    8 1525 0.44 0.50      0    0.43   0   0   1     1  0.23    -1.95 0.01
matrix.45    9 1523 0.53 0.50      1    0.53   0   0   1     1 -0.10    -1.99 0.01
matrix.46   10 1524 0.55 0.50      1    0.56   0   0   1     1 -0.20    -1.96 0.01
matrix.47   11 1523 0.61 0.49      1    0.64   0   0   1     1 -0.47    -1.78 0.01
matrix.55   12 1524 0.37 0.48      0    0.34   0   0   1     1  0.52    -1.73 0.01
rotate.3    13 1523 0.19 0.40      0    0.12   0   0   1     1  1.55     0.40 0.01
rotate.4    14 1523 0.21 0.41      0    0.14   0   0   1     1  1.40    -0.03 0.01
rotate.6    15 1523 0.30 0.46      0    0.25   0   0   1     1  0.88    -1.24 0.01
rotate.8    16 1524 0.19 0.39      0    0.11   0   0   1     1  1.62     0.63 0.01

Once the items have been scored as true or false (assigned scores of 1 or 0), they may then be scored into multiple scales using the normal scoreItems function.
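For example, a minimal sketch; the scale names and item groupings below are illustrative assumptions, not scales defined in this text:

#hypothetical content scales for the 16 true/false ability items
iq.keys.list <- list(reasoning=c("reason.4","reason.16","reason.17","reason.19"),
   letters=c("letter.7","letter.33","letter.34","letter.58"),
   matrices=c("matrix.45","matrix.46","matrix.47","matrix.55"),
   rotation=c("rotate.3","rotate.4","rotate.6","rotate.8"))
iq.scales <- scoreItems(iq.keys.list, iq.tf)  #iq.tf was created above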


5.6 Item analysis

Basic item analysis starts with describing the data (describe), finding the number of dimensions using factor analysis (fa) and cluster analysis (iclust), perhaps using the Very Simple Structure criterion (vss), or perhaps parallel analysis (fa.parallel). Item whole correlations may then be found for scales scored on one dimension (alpha) or for many scales simultaneously (scoreItems). Scales can be modified by changing the keys matrix (i.e., dropping particular items, changing the scale on which an item is to be scored). This analysis can be done on the normal Pearson correlation matrix or by using polychoric correlations. Validities of the scales can be found using multiple correlation of the raw data or based upon correlation matrices using the setCor function. However, more powerful item analysis tools are now available by using Item Response Theory approaches. A sketch of such a pipeline follows.
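A minimal sketch of a basic pipeline applied to the bfi items (the particular functions and item subsets chosen here are illustrative):

describe(bfi[1:25])                   #item distributions
fa.parallel(bfi[1:25])                #suggest the number of factors
alpha(bfi[11:15], keys=c("E1","E2"))  #item-whole statistics for one scale,
                                      #reverse keying E1 and E2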

Although the response.frequencies output from score.multiple.choice is useful to examine in terms of the probability of various alternatives being endorsed, it is even better to examine the pattern of these responses as a function of the underlying latent trait or just the total score. This may be done by using irt.responses (Figure 31).

5.6.1 Exploring the item structure of scales

The Big Five scales found above can be understood in terms of the item - whole correlations, but it is also useful to think of the endorsement frequency of the items. The item.lookup function will sort items by their factor loading/item-whole correlation and then resort those above a certain threshold in terms of the item means. Item content is shown by using the dictionary developed for those items. This allows one to see the structure of each scale in terms of its endorsement range. This is a simple way of thinking of items that is also possible to do using the various IRT approaches discussed later.

> m <- colMeans(bfi, na.rm=TRUE)
> item.lookup(scales$item.corrected[,1:3], m, dictionary=bfi.dictionary[1:2])

     agree conscientious extraversion means ItemLabel
A1   -0.40         -0.06        -0.10  2.41     q_146
E1   -0.31         -0.07        -0.59  2.97     q_712
E2   -0.39         -0.26        -0.70  3.14     q_901
E3    0.45          0.21         0.61  4.00    q_1205
E4    0.51          0.24         0.68  4.42    q_1410
E5    0.35          0.40         0.56  4.42    q_1768
A5    0.62          0.22         0.55  4.56    q_1419
A3    0.70          0.22         0.48  4.60    q_1206
A4    0.49          0.30         0.30  4.70    q_1364
A2    0.67          0.21         0.40  4.80    q_1162
C4   -0.23         -0.67        -0.23  2.55     q_626
N4   -0.22         -0.31        -0.39  3.19    q_1479
C5   -0.26         -0.58        -0.29  3.30    q_1949
C3    0.21          0.56         0.15  4.30     q_619
C2    0.21          0.61         0.18  4.37     q_530
E5.1  0.35          0.40         0.56  4.42    q_1768
C1    0.13          0.54         0.20  4.50     q_124
E1.1 -0.31         -0.07        -0.59  2.97     q_712
E2.1 -0.39         -0.26        -0.70  3.14     q_901
N4.1 -0.22         -0.31        -0.39  3.19    q_1479
E3.1  0.45          0.21         0.61  4.00    q_1205
E4.1  0.51          0.24         0.68  4.42    q_1410


> data(iqitems)
> iq.keys <- c(4,4,4, 6,6,3,4,4, 5,2,2,4, 3,2,6,7)
> scores <- score.multiple.choice(iq.keys, iqitems, score=TRUE, short=FALSE)
> #note that for speed we can just do this on simple item counts rather than IRT based scores
> op <- par(mfrow=c(2,2))  #set this to see the output for multiple items
> irt.responses(scores$scores, iqitems[1:4], breaks=11)

[Figure 31 appears here: four panels showing P(response) as a function of theta for reason.4, reason.16, reason.17, and reason.19.]

Figure 31: The pattern of responses to multiple choice ability items can show that some items have poor distractors. This may be done by using the irt.responses function. A good distractor is one that is negatively related to ability.


E5.2  0.35          0.40         0.56  4.42    q_1768
O3    0.26          0.22         0.43  4.44     q_492
A5.1  0.62          0.22         0.55  4.56    q_1419
A3.1  0.70          0.22         0.48  4.60    q_1206
A2.1  0.67          0.21         0.40  4.80    q_1162
O1    0.17          0.21         0.33  4.82     q_128

                                           Item
A1    Am indifferent to the feelings of others.
E1                            Don't talk a lot.
E2         Find it difficult to approach others.
E3                 Know how to captivate people.
E4                          Make friends easily.
E5                                 Take charge.
A5                    Make people feel at ease.
A3                  Know how to comfort others.
A4                               Love children.
A2            Inquire about others' well-being.
C4              Do things in a half-way manner.
N4                              Often feel blue.
C5                                Waste my time.
C3                Do things according to a plan.
C2          Continue until everything is perfect.
E5.1                               Take charge.
C1                       Am exacting in my work.
E1.1                          Don't talk a lot.
E2.1       Find it difficult to approach others.
N4.1                            Often feel blue.
E3.1               Know how to captivate people.
E4.1                        Make friends easily.
E5.2                               Take charge.
O3     Carry the conversation to a higher level.
A5.1                  Make people feel at ease.
A3.1                Know how to comfort others.
A2.1          Inquire about others' well-being.
O1                             Am full of ideas.

5.6.2 Empirical scale construction

There are some situations where one wants to identify those items that most relate to a particular criterion. Although this will capitalize on chance and the results should be interpreted cautiously, it does give a feel for what is being measured. Consider the following example from the bfi data set. The items that best predicted gender, education, and age may be found using the bestScales function. This also shows the use of a dictionary that has the item content.

> data(bfi)
> bestScales(bfi, criteria=c("gender","education","age"), cut=.1, dictionary=bfi.dictionary[,1:3])

The items most correlated with the criteria yield r's of:
          correlation n.items
gender           0.32       9
education        0.14       1
age              0.24       9

The best items, their correlations and content are:

$gender
  Rownames     gender ItemLabel                                       Item     Giant3
1       N5  0.2106171    q_1505                              Panic easily.  Stability
2       A2  0.1820202    q_1162          Inquire about others' well-being.   Cohesion
3       A1 -0.1571373     q_146 Am indifferent to the feelings of others.   Cohesion
4       A3  0.1401320    q_1206                Know how to comfort others.   Cohesion
5       A4  0.1261747    q_1364                             Love children.   Cohesion
6       E1 -0.1261018     q_712                          Don't talk a lot. Plasticity
7       N3  0.1207718    q_1099                 Have frequent mood swings.  Stability
8       O1 -0.1031059     q_128                          Am full of ideas. Plasticity
9       A5  0.1007091    q_1419                  Make people feel at ease.   Cohesion

$education
  Rownames  education ItemLabel                                       Item   Giant3
1       A1 -0.1415734     q_146 Am indifferent to the feelings of others. Cohesion

$age
  Rownames        age ItemLabel                                       Item     Giant3
1       A1 -0.1609637     q_146 Am indifferent to the feelings of others.   Cohesion
2       C4 -0.1482659     q_626            Do things in a half-way manner.  Stability
3       A4  0.1442473    q_1364                             Love children.   Cohesion
4       A5  0.1290935    q_1419                  Make people feel at ease.   Cohesion
5       E5  0.1146922    q_1768                               Take charge. Plasticity
6       A2  0.1142767    q_1162          Inquire about others' well-being.   Cohesion
7       N3 -0.1108648    q_1099                 Have frequent mood swings.  Stability
8       E2 -0.1051612     q_901      Find it difficult to approach others. Plasticity
9       N5 -0.1043322    q_1505                              Panic easily.  Stability

6 Item Response Theory analysis

The use of Item Response Theory is said to be the "new psychometrics". The emphasis is upon item properties, particularly those of item difficulty or location and item discrimination. These two parameters are easily found from classic techniques when using factor analyses of correlation matrices formed by polychoric or tetrachoric correlations. The irt.fa function does this and then graphically displays item discrimination and item location, as well as item and test information (see Figure 32).


6.1 Factor analysis and Item Response Theory

If the correlations of all of the items reflect one underlying latent variable, then factor analysis of the matrix of tetrachoric correlations should allow for the identification of the regression slopes (α) of the items on the latent variable. These regressions are, of course, just the factor loadings. Item difficulty, δj, and item discrimination, αj, may be found from factor analysis of the tetrachoric correlations, where λj is just the factor loading on the first factor and τj is the normal threshold reported by the tetrachoric function:

$$\delta_j = \frac{D\tau_j}{\sqrt{1-\lambda_j^2}}, \qquad \alpha_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}} \tag{2}$$

where D is a scaling factor used when converting to the parameterization of the logistic model, and is 1.702 in that case and 1 in the case of the normal ogive model. Thus, in the case of the normal model, factor loadings (λj) and item thresholds (τj) are just

$$\lambda_j = \frac{\alpha_j}{\sqrt{1+\alpha_j^2}}, \qquad \tau_j = \frac{\delta_j}{\sqrt{1+\alpha_j^2}}.$$
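A numeric sketch of equation (2) for a single hypothetical item (the loading and threshold values here are made up for illustration):

lambda <- 0.6   #factor loading from the tetrachoric factor analysis
tau <- 0.5      #threshold reported by tetrachoric()
D <- 1.702      #logistic scaling constant; use D = 1 for the normal ogive
alpha <- lambda / sqrt(1 - lambda^2)   #discrimination
delta <- D * tau / sqrt(1 - lambda^2)  #difficulty
alpha / sqrt(1 + alpha^2)              #recovers lambda (normal metric)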

Consider 9 dichotomous items representing one factor, but differing in their levels of difficulty.

> set.seed(17)
> d9 <- sim.irt(9, 1000, -2.5, 2.5, mod="normal")  #dichotomous items
> test <- irt.fa(d9$items)
> test

Item Response Analysis using Factor Analysis

Call: irt.fa(x = d9$items)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3    -2    -1     0     1     2     3
V1          0.48  0.59  0.24  0.06  0.01  0.00  0.00
V2          0.30  0.68  0.45  0.12  0.02  0.00  0.00
V3          0.12  0.50  0.74  0.29  0.06  0.01  0.00
V4          0.05  0.26  0.74  0.55  0.14  0.03  0.00
V5          0.01  0.07  0.44  1.05  0.41  0.06  0.01
V6          0.00  0.03  0.15  0.60  0.74  0.24  0.04
V7          0.00  0.01  0.04  0.22  0.73  0.63  0.16
V8          0.00  0.00  0.02  0.12  0.45  0.69  0.31
V9          0.00  0.01  0.02  0.08  0.25  0.47  0.36
Test Info   0.98  2.14  2.85  3.09  2.81  2.14  0.89
SEM         1.01  0.68  0.59  0.57  0.60  0.68  1.06
Reliability -0.02 0.53  0.65  0.68  0.64  0.53 -0.12


Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 27  and the objective function was  1.2
The number of observations was  1000  with Chi Square =  1195.58  with prob <  7.3e-235

The root mean square of the residuals (RMSA) is  0.09
The df corrected root mean square of the residuals is  0.1

Tucker Lewis Index of factoring reliability =  0.699
RMSEA index =  0.043  and the 90 % confidence intervals are  0.043 0.218
BIC =  1009.07

Similar analyses can be done for polytomous items, such as those of the bfi extraversion scale:

> data(bfi)
> e.irt <- irt.fa(bfi[11:15])
> e.irt

Item Response Analysis using Factor Analysis
Call: irt.fa(x = bfi[11:15])
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3    -2    -1     0     1     2     3
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 5  and the objective function was  0.05
The number of observations was  2800  with Chi Square =  135.92  with prob <  1.3e-27

The root mean square of the residuals (RMSA) is  0.04
The df corrected root mean square of the residuals is  0.06

Tucker Lewis Index of factoring reliability =  0.932
RMSEA index =  0.009  and the 90 % confidence intervals are  0.009 0.111
BIC =  96.23

The item information functions show that not all of the items are equally good (Figure 33):

These procedures can be generalized to more than one factor by specifying the number of


> op <- par(mfrow=c(3,1))
> plot(test, type="ICC")
> plot(test, type="IIC")
> plot(test, type="test")
> op <- par(mfrow=c(1,1))

[Figure 32 appears here: three panels plotted against the latent trait (normal scale): the probability of response for items V1-V9, the item information curves, and the test information curve with reliability.]

Figure 32: A graphic analysis of 9 dichotomous (simulated) items. The top panel shows the probability of item endorsement as the value of the latent trait increases. Items differ in their location (difficulty) and discrimination (slope). The middle panel shows the information in each item as a function of latent trait level. An item is most informative when the probability of endorsement is 50%. The lower panel shows the total test information. These items form a test that is most informative (most accurate) at the middle range of the latent trait.


> e.info <- plot(e.irt, type="IIC")

[Figure 33 appears here: item information curves for E1-E5 plotted against the latent trait (normal scale).]

Figure 33: A graphic analysis of 5 extraversion items from the bfi. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print e.info to see the average information for each item.


factors in irt.fa. The plots can be limited to those items with discriminations greater than some value of cut. An invisible object is returned when plotting the output from irt.fa that includes the average information for each item that has loadings greater than cut.

> print(e.info, sort=TRUE)

Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3    -2    -1     0     1     2     3
E2          0.48  0.89  1.18  1.24  0.99  0.59  0.25
E4          0.63  0.98  1.11  0.96  0.67  0.37  0.16
E1          0.37  0.58  0.75  0.76  0.61  0.39  0.21
E3          0.34  0.49  0.58  0.57  0.49  0.36  0.23
E5          0.33  0.42  0.47  0.45  0.37  0.27  0.17
Test Info   2.15  3.36  4.09  3.98  3.13  1.97  1.02
SEM         0.68  0.55  0.49  0.50  0.56  0.71  0.99
Reliability 0.53  0.70  0.76  0.75  0.68  0.49  0.02

More extensive IRT packages include ltm and eRm, and these should be used for serious Item Response Theory analysis.

6.2 Speeding up analyses

Finding tetrachoric or polychoric correlations is very time consuming. Thus, to speed up the process of analysis, the original correlation matrix is saved as part of the output of both irt.fa and omega. Subsequent analyses may be done by using this correlation matrix. This is done by doing the analysis not on the original data, but rather on the output of the previous analysis.

In addition, recent releases of psych take advantage of the parallel package and use multiple cores. The default for Macs and Unix machines is to use two cores, but this can be increased using the options command. The biggest step up in improvement is from 1 to 2 cores, but for large problems using polychoric correlations, the more cores available, the better.

For example, taking the output from the 16 ability items from the SAPA project when scored for True/False using score.multiple.choice, we can first do a simple IRT analysis of one factor (Figure 34) and then use that correlation matrix to do an omega analysis to show the sub-structure of the ability items (Figure 36). We can also show the total test information (merely the sum of the item information; Figure 35). This shows that even with just 16 items, the test is very reliable for most of the range of ability. The irt.fa function saves the correlation matrix and item statistics so that they can be redrawn with other options.

detectCores()        #how many are available
options("mc.cores")  #how many have been set to be used
options(mc.cores=4)  #set to use 4 cores

> iq.irt

Item Response Analysis using Factor Analysis
Call: irt.fa(x = ability)
Item Response Analysis using Factor Analysis

 Summary information by factor and item
 Factor =  1
              -3    -2    -1     0     1     2     3
reason.4    0.05  0.24  0.64  0.53  0.16  0.03  0.01
reason.16   0.08  0.22  0.38  0.31  0.14  0.05  0.01
reason.17   0.08  0.33  0.69  0.42  0.11  0.02  0.00
reason.19   0.06  0.17  0.35  0.36  0.19  0.07  0.02
letter.7    0.05  0.18  0.41  0.44  0.20  0.06  0.02
letter.33   0.05  0.15  0.31  0.36  0.20  0.08  0.02
letter.34   0.05  0.19  0.45  0.46  0.20  0.06  0.01
letter.58   0.02  0.09  0.30  0.53  0.35  0.12  0.03
matrix.45   0.05  0.11  0.19  0.23  0.17  0.09  0.04
matrix.46   0.05  0.12  0.22  0.26  0.18  0.09  0.04
matrix.47   0.06  0.17  0.33  0.34  0.19  0.07  0.02
matrix.55   0.04  0.08  0.13  0.17  0.16  0.11  0.06
rotate.3    0.00  0.01  0.06  0.30  0.75  0.49  0.12
rotate.4    0.00  0.01  0.05  0.31  1.00  0.56  0.10
rotate.6    0.01  0.03  0.15  0.53  0.69  0.25  0.05
rotate.8    0.00  0.02  0.08  0.29  0.59  0.41  0.13
Test Info   0.67  2.11  4.73  5.83  5.28  2.55  0.69
SEM         1.22  0.69  0.46  0.41  0.44  0.63  1.20
Reliability -0.49 0.53  0.79  0.83  0.81  0.61 -0.45

Factor analysis with Call: fa(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate,
    fm = fm)

Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the model is 104  and the objective function was  1.91
The number of observations was  1525  with Chi Square =  2893.45  with prob <  0

The root mean square of the residuals (RMSA) is  0.08
The df corrected root mean square of the residuals is  0.09

Tucker Lewis Index of factoring reliability =  0.723
RMSEA index =  0.018  and the 90 % confidence intervals are  0.018 0.137
BIC =  2131.15

6.3 IRT based scoring

The primary advantage of IRT analyses is examining the item properties (both difficulty and discrimination). With complete data, the scores based upon simple total scores and based upon IRT are practically identical (this may be seen in the examples for scoreIrt). However, when working with data such as those found in the Synthetic Aperture Personality Assessment (SAPA) project, it is advantageous to use IRT based scoring. SAPA data might have 2-3 items/person sampled from scales with 10-20 items. Simply finding the


> iq.irt <- irt.fa(ability)

[Figure 34 appears here: item information curves for the 16 ability items plotted against the latent trait (normal scale).]

Figure 34: A graphic analysis of 16 ability items sampled from the SAPA project. The curves represent the amount of information in the item as a function of the latent score for an individual. That is, each item is maximally discriminating at a different part of the latent continuum. Print iq.irt to see the average information for each item. Partly because this is a power test (it is given on the web) and partly because the items have not been carefully chosen, the items are not very discriminating at the high end of the ability dimension.


> plot(iq.irt, type="test")

[Figure 35 appears here: the total test information curve, with reliability values (0.31, 0.66, 0.77, 0.83) marked along the latent trait.]

Figure 35: A graphic analysis of 16 ability items sampled from the SAPA project. The total test information at all levels of difficulty may be shown by specifying the type='test' option in the plot function.


> om <- omega(iq.irt$rho, 4)

[Figure 36 appears here: the omega (Schmid-Leiman) diagram for the 16 ability items, with a general factor g and four group factors corresponding to the rotation, letter, matrix, and reasoning items.]

Figure 36: An Omega analysis of 16 ability items sampled from the SAPA project. The items represent a general factor as well as four lower level factors. The analysis is done using the tetrachoric correlations found in the previous irt.fa analysis. The four matrix items have some serious problems, which may be seen later when we examine the item response functions.


average of the three (classical test theory) fails to consider that the items might differ in either discrimination or in difficulty. The scoreIrt function applies basic IRT to this problem.

Consider 1000 randomly generated subjects with scores on 9 true/false items differing in difficulty. Selectively drop the hardest items for the 1/3 lowest subjects, and the 4 easiest items for the 1/3 top subjects (this is a crude example of what tailored testing would do). Then score these subjects:

> v9 <- sim.irt(9, 1000, -2, 2, mod="normal")  #dichotomous items
> items <- v9$items
> test <- irt.fa(items)
> total <- rowSums(items)
> ord <- order(total)
> items <- items[ord,]
> #now delete some of the data - note that they are ordered by score
> items[1:333, 5:9] <- NA
> items[334:666, 3:7] <- NA
> items[667:1000, 1:4] <- NA
> scores <- scoreIrt(test, items)
> unitweighted <- scoreIrt(items=items, keys=rep(1,9))
> scores.df <- data.frame(true=v9$theta[ord], scores, unitweighted)
> colnames(scores.df) <- c("True theta", "irt theta", "total", "fit", "rasch", "total", "fit")

These results are seen in Figure 37

6.3.1 1 versus 2 parameter IRT scoring

In Item Response Theory, items can be assumed to be equally discriminating but to differ in their difficulty (the Rasch model), or to vary in their discriminability. Two functions (scoreIrt.1pl and scoreIrt.2pl) are meant to find multiple IRT based scales using the Rasch model or the 2 parameter model. Both allow for negatively keyed as well as positively keyed items. Consider the bfi data set with scoring keys keys.list and items listed as an item.list. (This is the same as the keys.list, but with the negative signs removed.)

> keys.list <- list(agree=c("-A1","A2","A3","A4","A5"),
+   conscientious=c("C1","C2","C3","-C4","-C5"),
+   extraversion=c("-E1","-E2","E3","E4","E5"),
+   neuroticism=c("N1","N2","N3","N4","N5"),
+   openness = c("O1","-O2","O3","O4","-O5"))
> item.list <- list(agree=c("A1","A2","A3","A4","A5"),
+   conscientious=c("C1","C2","C3","C4","C5"),
+   extraversion=c("E1","E2","E3","E4","E5"),
+   neuroticism=c("N1","N2","N3","N4","N5"),
+   openness = c("O1","O2","O3","O4","O5"))
> bfi.1pl <- scoreIrt.1pl(keys.list, bfi)  #the one parameter solution
> bfi.2pl <- scoreIrt.2pl(item.list, bfi)  #the two parameter solution
> bfi.ctt <- scoreFast(keys.list, bfi)     #fast scoring function


Figure 37: IRT based scoring and total test scores for 1000 simulated subjects. True theta values are reported and then the IRT and total scoring systems.

> pairs.panels(scores.df, pch='.', gap=0)
> title("Comparing true theta for IRT, Rasch and classically based scoring", line=3)

[Figure 37 appears here: a pairs.panels scatter plot matrix comparing True theta, irt theta, total, fit, rasch, total, and fit, titled "Comparing true theta for IRT, Rasch and classically based scoring".]


We can compare these three ways of doing the analysis using the cor2 function, which correlates two separate data frames. All three models produce very similar results for the case of almost complete data. It is when we have massively missing completely at random data (MMCAR) that the results show the superiority of the irt scoring.

> #compare the solutions using the cor2 function
> cor2(bfi.1pl, bfi.ctt)

              agree conscientious extraversion neuroticism openness
agree          0.93          0.27         0.44       -0.20     0.19
conscientious  0.25          0.97         0.25       -0.22     0.17
extraversion   0.43          0.25         0.94       -0.23     0.23
neuroticism   -0.19         -0.23        -0.22        1.00    -0.09
openness       0.12          0.19         0.19       -0.07     0.95

> cor2(bfi.2pl, bfi.ctt)

              agree conscientious extraversion neuroticism openness
agree          0.98          0.25         0.49       -0.17     0.15
conscientious  0.26          0.97         0.25       -0.23     0.18
extraversion   0.43          0.24         0.94       -0.25     0.21
neuroticism   -0.19         -0.22        -0.18        0.98    -0.08
openness       0.17          0.21         0.27       -0.11     0.98

> cor2(bfi.2pl, bfi.1pl)

              agree conscientious extraversion neuroticism openness
agree          0.88          0.25         0.45       -0.17     0.12
conscientious  0.26          1.00         0.24       -0.23     0.16
extraversion   0.43          0.23         0.99       -0.25     0.20
neuroticism   -0.21         -0.21        -0.19        0.98    -0.07
openness       0.20          0.19         0.28       -0.11     0.92

7 Multilevel modeling

Correlations between individuals who belong to different natural groups (based upon, e.g., ethnicity, age, gender, college major, or country) reflect an unknown mixture of the pooled correlation within each group as well as the correlation of the means of these groups. These two correlations are independent and do not allow inferences from one level (the group) to the other level (the individual). When examining data at two levels (e.g., the individual and by some grouping variable), it is useful to find basic descriptive statistics (means, sds, ns per group, within group correlations) as well as between group statistics (over all descriptive statistics and overall between group correlations). Of particular use is the ability to decompose a matrix of correlations at the individual level into correlations within group and correlations between groups.


7.1 Decomposing data into within and between level correlations using statsBy

There are at least two very powerful packages (nlme and multilevel) which allow for complex analysis of hierarchical (multilevel) data structures. statsBy is a much simpler function to give some of the basic descriptive statistics for two level models.

This follows the decomposition of an observed correlation into the pooled correlation within groups (rwg) and the weighted correlation of the means between groups, which is discussed by Pedhazur (1997) and by Bliese (2009) in the multilevel package:

$$r_{xy} = \eta_{x_{wg}} \cdot \eta_{y_{wg}} \cdot r_{xy_{wg}} + \eta_{x_{bg}} \cdot \eta_{y_{bg}} \cdot r_{xy_{bg}} \tag{3}$$

where $r_{xy}$ is the normal correlation, which may be decomposed into within group and between group correlations, $r_{xy_{wg}}$ and $r_{xy_{bg}}$, and $\eta$ (eta) is the correlation of the data with the within group values or the group means.

7.2 Generating and displaying multilevel data

withinBetween is an example data set of the mixture of within and between group correlations. The within group correlations between 9 variables are set to be 1, 0, and -1, while those between groups are also set to be 1, 0, and -1. These two sets of correlations are crossed such that V1, V4, and V7 have within group correlations of 1, as do V2, V5 and V8, and V3, V6 and V9. V1 has a within group correlation of 0 with V2, V5, and V8, and a -1 within group correlation with V3, V6 and V9. V1, V2, and V3 share a between group correlation of 1, as do V4, V5 and V6, and V7, V8 and V9. The first group has a 0 between group correlation with the second and a -1 with the third group. See the help file for withinBetween to display these data.

sim.multilevel will generate simulated data with a multilevel structure.

The statsBy.boot function will randomize the grouping variable ntrials times and find the statsBy output. This can take a long time and will produce a great deal of output. This output can then be summarized for relevant variables using the statsBy.boot.summary function, specifying the variable of interest.

Consider the case of the relationship between various tests of ability when the data are grouped by level of education (statsBy(sat.act)) or when affect data are analyzed within and between an affect manipulation (statsBy(affect)). A sketch of the first case follows.
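A brief sketch (the output element names are assumptions based on the statsBy documentation):

sb <- statsBy(sat.act, group="education", cors=TRUE)
round(sb$rwg, 2)  #pooled within group correlations
round(sb$rbg, 2)  #correlations of the group means
sb$etawg          #eta: correlation of the data with the within group values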


7.3 Factor analysis by groups

Confirmatory factor analysis comparing the structures in multiple groups can be done in the lavaan package. However, for exploratory analyses of the structure within each of multiple groups, the faBy function may be used in combination with the statsBy function. First run statsBy with the correlation option set to TRUE, and then run faBy on the resulting output.

sb <- statsBy(bfi[c(1:25,27)], group="education", cors=TRUE)
faBy(sb, nfactors=5)  #find the 5 factor solution for each education level

8 Set Correlation and Multiple Regression from the correlation matrix

An important generalization of multiple regression and multiple correlation is set correlation, developed by Cohen (1982) and discussed by Cohen et al. (2003). Set correlation is a multivariate generalization of multiple regression and estimates the amount of variance shared between two sets of variables. Set correlation also allows for examining the relationship between two sets when controlling for a third set. This is implemented in the setCor function. Set correlation is

$$R^2 = 1 - \prod_{i=1}^{n}(1 - \lambda_i)$$

where $\lambda_i$ is the ith eigenvalue of the eigenvalue decomposition of the matrix

$$R = R_{xx}^{-1}R_{xy}R_{yy}^{-1}R_{yx}.$$

Unfortunately, there are several cases where set correlation will give results that are much too high. This will happen if some variables from the first set are highly related to those in the second set, even though most are not. In this case, although the set correlation can be very high, the degree of relationship between the sets is not as high. In this case, an alternative statistic, based upon the average canonical correlation, might be more appropriate.

setCor has the additional feature that it will calculate multiple and partial correlations from the correlation or covariance matrix rather than the original data.

Consider the correlations of the 6 variables in the sat.act data set. First do the normal multiple regression, and then compare it with the results using setCor. Two things to notice: setCor works on the correlation or covariance or raw data matrix, and thus, if using the correlation matrix, will report standardized or raw β weights. Secondly, it is possible to do several multiple regressions simultaneously. If the number of observations is specified, or if the analysis is done on raw data, statistical tests of significance are applied.

For this example, the analysis is done on the correlation matrix rather than the raw data.

> C <- cov(sat.act, use="pairwise")

> model1 <- lm(ACT ~ gender + education + age, data=sat.act)

> summary(model1)

Call:
lm(formula = ACT ~ gender + education + age, data = sat.act)

Residuals:
     Min       1Q   Median       3Q      Max
-25.2458  -3.2133   0.7769   3.5921   9.2630

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.41706    0.82140  33.378  < 2e-16 ***
gender      -0.48606    0.37984  -1.280  0.20110
education    0.47890    0.15235   3.143  0.00174 **
age          0.01623    0.02278   0.712  0.47650
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.768 on 696 degrees of freedom
Multiple R-squared:  0.0272,  Adjusted R-squared:  0.02301
F-statistic: 6.487 on 3 and 696 DF,  p-value: 0.0002476

Compare this with the output from setCor:

> # compare with setCor

> setCor(c(4:6), c(1:3), C, n.obs=700)

Call: setCor(y = c(4:6), x = c(1:3), data = C, n.obs = 700)

Multiple Regression from matrix input

Beta weights
            ACT  SATV  SATQ
gender    -0.05 -0.03 -0.18
education  0.14  0.10  0.10
age        0.03 -0.10 -0.09

Multiple R
 ACT SATV SATQ
0.16 0.10 0.19

multiple R2
   ACT   SATV   SATQ
0.0272 0.0096 0.0359

Unweighted multiple R
 ACT SATV SATQ
0.15 0.05 0.11

Unweighted multiple R2
 ACT SATV SATQ
0.02 0.00 0.01

SE of Beta weights
           ACT SATV SATQ
gender    0.18 4.29 4.34
education 0.22 5.13 5.18
age       0.22 5.11 5.16

t of Beta Weights
            ACT  SATV  SATQ
gender    -0.27 -0.01 -0.04
education  0.65  0.02  0.02
age        0.15 -0.02 -0.02

Probability of t <
           ACT SATV SATQ
gender    0.79 0.99 0.97
education 0.51 0.98 0.98
age       0.88 0.98 0.99

Shrunken R2
   ACT   SATV   SATQ
0.0230 0.0054 0.0317

Standard Error of R2
   ACT   SATV   SATQ
0.0120 0.0073 0.0137

F
 ACT SATV SATQ
6.49 2.26 8.63

Probability of F <
     ACT     SATV     SATQ
2.48e-04 8.08e-02 1.24e-05

degrees of freedom of regression
[1]   3 696

Various estimates of between set correlations

Squared Canonical Correlations
[1] 0.050 0.033 0.008

Chisq of canonical correlations
[1] 35.8 23.1  5.6

Average squared canonical correlation = 0.03
Cohen's Set Correlation R2 = 0.09
Shrunken Set Correlation R2 = 0.08
F and df of Cohen's Set Correlation  7.26 9 1681.86
Unweighted correlation between the two sets = 0.01

Note that the setCor analysis also reports the amount of shared variance between the predictor set and the criterion (dependent) set. This set correlation is symmetric. That is, the R2 is the same independent of the direction of the relationship.
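The symmetry is easy to verify by exchanging the roles of the two sets, reusing the covariance matrix C formed above. The set correlation R2, although not the regression output, should be identical in the two calls:

setCor(y = c(4:6), x = c(1:3), data = C, n.obs = 700)   # set correlation R2 = .09
setCor(y = c(1:3), x = c(4:6), data = C, n.obs = 700)   # the same R2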


9 Simulation functions

It is particularly helpful, when trying to understand psychometric concepts, to be able to generate sample data sets that meet certain specifications. By knowing "truth" it is possible to see how well various algorithms can capture it. Several of the sim functions create artificial data sets with known structures.

A number of functions in the psych package will generate simulated data. These functions include sim for a factor simplex, sim.simplex for a data simplex, sim.circ for a circumplex structure, sim.congeneric for a one factor congeneric model, sim.dichot to simulate dichotomous items, sim.hierarchical to create a hierarchical factor model, sim.item for a more general item simulation, sim.minor to simulate major and minor factors, sim.omega to test various examples of omega, sim.parallel to compare the efficiency of various ways of determining the number of factors, sim.rasch to create simulated Rasch data, sim.irt to create general 1 to 4 parameter IRT data by calling sim.npl (1 to 4 parameter logistic IRT) or sim.npn (1 to 4 parameter normal IRT), sim.structural for a general simulation of structural models, sim.anova for ANOVA and lm simulations, and sim.VSS. Some of these functions are separately documented and are listed here for ease of the help function. See each function for more detailed help.

sim The default version is to generate a four factor simplex structure over three occasions, although more general models are possible.

sim.simple Create major and minor factors. The default is for 12 variables with 3 major factors and 6 minor factors.

sim.structure To combine a measurement and structural model into one data matrix. Useful for understanding structural equation models.

sim.hierarchical To create data with a hierarchical (bifactor) structure.

sim.congeneric To create congeneric items/tests for demonstrating classical test theory. This is just a special case of sim.structure.

sim.circ To create data with a circumplex structure.

sim.item To create items that either have a simple structure or a circumplex structure.

sim.dichot Create dichotomous item data with a simple or circumplex structure.

sim.rasch Simulate a 1 parameter logistic (Rasch) model.

sim.irt Simulate a 2 parameter logistic (2PL) or 2 parameter Normal model. Will also do 3 and 4 PL and PN models.

sim.multilevel Simulate data with different within group and between group correlational structures.

Some of these functions are described in more detail in the companion vignette, psych for sem.

The default for sim.structure is to generate a 4 factor, 12 variable data set with a simplex structure between the factors.

Two data structures that are particular challenges to exploratory factor analysis are the simplex structure and the presence of minor factors. Simplex structures (sim.simplex) will typically occur in developmental or learning contexts and have a correlation structure of r between adjacent variables and r^n for variables n apart. Although just one latent variable (r) needs to be estimated, the structure will have nvar-1 factors.

Many simulations of factor structures assume that, except for the major factors, all residuals are normally distributed around 0. An alternative, and perhaps more realistic, situation is that there are a few major (big) factors and many minor (small) factors. The challenge is thus to identify the major factors. sim.minor generates such structures. The structures generated can be thought of as having a major factor structure with some small correlated residuals.
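As a sketch of this challenge (argument names follow the sim.minor help page, and the observed element is assumed to hold the simulated raw data when n is specified):

set.seed(17)                                   # for a reproducible example
sm <- sim.minor(nvar=12, nfact=3, n=500)       # 3 major and 6 minor factors
fa.parallel(sm$observed)                       # does parallel analysis suggest 3 factors?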

Although coefficient ωh is a very useful indicator of the general factor saturation of a unifactorial test (one with perhaps several sub factors), it has problems with the case of multiple independent factors. In this situation, one of the factors is labelled as "general" and the omega estimate is too large. This situation may be explored using the sim.omega function.

The four IRT simulations, sim.rasch, sim.irt, sim.npl and sim.npn, simulate dichotomous items following the Item Response model. sim.irt just calls either sim.npl (for logistic models) or sim.npn (for normal models) depending upon the specification of the model.

The logistic model is

P(x|θᵢ, δⱼ, γⱼ, ζⱼ) = γⱼ + (ζⱼ − γⱼ) / (1 + e^{αⱼ(δⱼ−θᵢ)})     (4)

where γ is the lower asymptote or guessing parameter, ζ is the upper asymptote (normally 1), αⱼ is item discrimination, and δⱼ is item difficulty. For the 1 Parameter Logistic (Rasch) model, gamma=0, zeta=1, alpha=1, and item difficulty is the only free parameter to specify.
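Equation 4 is simple enough to compute directly. The following is an illustrative helper function (not a psych function) that traces the Rasch special case:

# 4 parameter logistic of equation (4): gamma = lower asymptote (guessing),
# zeta = upper asymptote, alpha = discrimination, delta = difficulty
p4pl <- function(theta, delta, alpha = 1, gamma = 0, zeta = 1)
  gamma + (zeta - gamma) / (1 + exp(alpha * (delta - theta)))
p4pl(theta = seq(-3, 3, 1), delta = 0)   # Rasch case: gamma=0, zeta=1, alpha=1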

(Graphics of these may be seen in the demonstrations for the logistic function.)

The normal model (irt.npn) calculates the probability using pnorm instead of the logistic function used in irt.npl, but the meaning of the parameters is otherwise the same.


With the a = α parameter = 1.702 in the logistic model, the two models are practically identical.

10 Graphical Displays

Many of the functions in the psych package include graphic output, and examples have been shown in the previous figures. After running fa, iclust, omega, or irt.fa, plotting the resulting object is done by the plot.psych function as well as specific diagram functions, e.g. (but not shown):

f3 <- fa(Thurstone,3)
plot(f3)
fa.diagram(f3)
c <- iclust(Thurstone)
plot(c)             # a pretty boring plot
iclust.diagram(c)   # a better diagram
c3 <- iclust(Thurstone,3)
plot(c3)            # a more interesting plot
data(bfi)
e.irt <- irt.fa(bfi[11:15])
plot(e.irt)
ot <- omega(Thurstone)
plot(ot)
omega.diagram(ot)

The ability to show path diagrams to represent factor analytic and structural models is discussed in somewhat more detail in the accompanying vignette, psych for sem. Basic routines to draw path diagrams are included in dia.rect and accompanying functions. These are used by the fa.diagram, structure.diagram and iclust.diagram functions.

11 Converting output to APA style tables using LaTeX

Although for most purposes using the Sweave or KnitR packages produces clean output, some prefer output pre formatted for APA style tables. This can be done using the xtable package for almost anything, but there are a few simple functions in psych for the most common tables. fa2latex will convert a factor analysis or components analysis output to a LaTeX table, cor2latex will take a correlation matrix and show the lower (or upper) diagonal, irt2latex converts the item statistics from the irt.fa function to more convenient LaTeX output, and finally, df2latex converts a generic data frame to LaTeX.

An example of converting the output from fa to LaTeX appears in Table 11.
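A minimal sketch of the sort of call that produces Table 11 (defaults assumed; see ?fa2latex for the caption, heading, and file options):

f3 <- fa(Thurstone, 3)   # the 3 factor solution shown in Table 11
fa2latex(f3)             # writes the LaTeX code for the table to the console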


> xlim=c(0,10)
> ylim=c(0,10)
> plot(NA,xlim=xlim,ylim=ylim,main="Demonstration of dia functions",axes=FALSE,xlab="",ylab="")
> ul <- dia.rect(1,9,labels="upper left",xlim=xlim,ylim=ylim)
> ll <- dia.rect(1,3,labels="lower left",xlim=xlim,ylim=ylim)
> lr <- dia.ellipse(9,3,"lower right",xlim=xlim,ylim=ylim)
> ur <- dia.ellipse(9,9,"upper right",xlim=xlim,ylim=ylim)
> ml <- dia.ellipse(3,6,"middle left",xlim=xlim,ylim=ylim)
> mr <- dia.ellipse(7,6,"middle right",xlim=xlim,ylim=ylim)
> bl <- dia.ellipse(1,1,"bottom left",xlim=xlim,ylim=ylim)
> br <- dia.rect(9,1,"bottom right",xlim=xlim,ylim=ylim)
> dia.arrow(from=lr,to=ul,labels="right to left")
> dia.arrow(from=ul,to=ur,labels="left to right")
> dia.curved.arrow(from=lr,to=ll$right,labels="right to left")
> dia.curved.arrow(to=ur,from=ul$right,labels="left to right")
> dia.curve(ll$top,ul$bottom,"double",-1)  # for rectangles, specify where to point
> dia.curved.arrow(mr,ur,"up")             # but for ellipses, just point to it
> dia.curve(ml,mr,"across")
> dia.arrow(ur,lr,"top down")
> dia.curved.arrow(br$top,lr$bottom,"up")
> dia.curved.arrow(bl,br,"left to right")
> dia.arrow(bl,ll$bottom)
> dia.curved.arrow(ml,ll$top,scale=-1)
> dia.curved.arrow(mr,lr$top)


Figure 38: The basic graphic capabilities of the dia functions are shown in this figure.

Table 11: fa2latex. A factor analysis table from the psych package in R.

Variable          MR1    MR2    MR3    h2    u2   com
Sentences        0.91  -0.04   0.04  0.82  0.18  1.01
Vocabulary       0.89   0.06  -0.03  0.84  0.16  1.01
Sent.Completion  0.83   0.04   0.00  0.73  0.27  1.00
First.Letters    0.00   0.86   0.00  0.73  0.27  1.00
4.Letter.Words  -0.01   0.74   0.10  0.63  0.37  1.04
Suffixes         0.18   0.63  -0.08  0.50  0.50  1.20
Letter.Series    0.03  -0.01   0.84  0.72  0.28  1.00
Pedigrees        0.37  -0.05   0.47  0.50  0.50  1.93
Letter.Group    -0.06   0.21   0.64  0.53  0.47  1.23

SS loadings      2.64   1.86   1.50

Factor correlations
      MR1   MR2   MR3
MR1  1.00  0.59  0.54
MR2  0.59  1.00  0.52
MR3  0.54  0.52  1.00

12 Miscellaneous functions

A number of functions have been developed for some very specific problems that don't fit into any other category. The following is an incomplete list. Look at the Index for psych for a list of all of the functions.

block.random Creates a block randomized structure for n independent variables. Useful for teaching block randomization for experimental design.

df2latex is useful for taking tabular output (such as a correlation matrix or that of describe) and converting it to a LaTeX table. May be used when Sweave is not convenient.

cor2latex Will format a correlation matrix in APA style in a LaTeX table. See also fa2latex and irt2latex.

cosinor One of several functions for doing circular statistics. This is important when studying mood effects over the day, which show a diurnal pattern. See also circadian.mean, circadian.cor and circadian.linear.cor for finding circular means, circular correlations, and correlations of circular with linear data.

fisherz Convert a correlation to the corresponding Fisher z score.


geometric.mean and harmonic.mean find the appropriate mean for working with different kinds of data.

ICC and cohen.kappa are typically used to find the reliability for raters.

headtail combines the head and tail functions to show the first and last lines of a data set or output.

topBottom Same as headtail: combines the head and tail functions to show the first and last lines of a data set or output, but does not add an ellipsis between them.

mardia calculates univariate or multivariate (Mardia's test) skew and kurtosis for a vector, matrix, or data.frame.

p.rep finds the probability of replication for an F, t, or r, and estimates effect size.

partial.r partials a y set of variables out of an x set and finds the resulting partial correlations. (See also set.cor.)

rangeCorrection will correct correlations for restriction of range.

reverse.code will reverse code specified items. Done more conveniently in most psych functions, but supplied here as a helper function when using other packages.

superMatrix Takes two or more matrices, e.g., A and B, and combines them into a "Super matrix" with A on the top left, B on the lower right, and 0s for the other two quadrants. A useful trick when forming complex keys, or when forming example problems. A brief sketch appears below.
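In the following sketch, the scoring keys are made up for illustration:

keys.A <- make.keys(3, list(A = c(1, 2, 3)))   # key for a 3 item test
keys.B <- make.keys(2, list(B = c(1, 2)))      # key for a 2 item test
superMatrix(keys.A, keys.B)   # 5 x 2 matrix: A top left, B lower right, 0s elsewhere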

13 Data sets

A number of data sets for demonstrating psychometric techniques are included in the psych package. These include six data sets showing a hierarchical factor structure (five cognitive examples, Thurstone, Thurstone.33, Holzinger, Bechtoldt.1, Bechtoldt.2, and one from health psychology, Reise). One of these (Thurstone) is used as an example in the sem package as well as in McDonald (1999). The original data are from Thurstone and Thurstone (1941) and were reanalyzed by Bechtoldt (1961). Personality item data represent five personality factors on 25 items (bfi), 13 personality inventory scores (epi.bfi), and 14 multiple choice iq items (iqitems). The vegetables example has paired comparison preferences for 9 vegetables. This is an example of Thurstonian scaling used by Guilford (1954) and Nunnally (1967). Other data sets include cubits, peas, and heights from Galton.


Thurstone Holzinger and Swineford (1937) introduced the bifactor model of a general factor and uncorrelated group factors. The Holzinger correlation matrix is a 14 x 14 matrix from their paper. The Thurstone correlation matrix is a 9 x 9 matrix of correlations of ability items. The Reise data set is a 16 x 16 correlation matrix of mental health items. The Bechtoldt data sets are both 17 x 17 correlation matrices of ability tests.

bfi 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org) were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 2800 subjects are included here as a demonstration set for scale construction, factor analysis, and Item Response Theory analyses.

sat.act Self reported scores on the SAT Verbal, SAT Quantitative and ACT were collected as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. Age, gender, and education are also reported. The data from 700 subjects are included here as a demonstration set for correlation and analysis.

epi.bfi A small data set of 5 scales from the Eysenck Personality Inventory, 5 from a Big 5 inventory, a Beck Depression Inventory, and State and Trait Anxiety measures. Used for demonstrations of correlations, regressions, and graphic displays.

iqitems 14 multiple choice ability items were included as part of the Synthetic Aperture Personality Assessment (SAPA) web based personality assessment project. The data from 1000 subjects are included here as a demonstration set for scoring multiple choice inventories and doing basic item statistics.

galton Two of the earliest examples of the correlation coefficient were Francis Galton's data sets on the relationship between mid parent and child height and the similarity of parent generation peas with child peas. galton is the data set for the Galton height. peas is the data set Francis Galton used to introduce the correlation coefficient with an analysis of the similarities of the parent and child generation of 700 sweet peas.

Dwyer Dwyer (1937) introduced a method for factor extension (see fa.extension) that finds loadings on factors from an original data set for additional (extended) variables. This data set includes his example.

miscellaneous cities is a matrix of airline distances between 11 US cities and may be used for demonstrating multiple dimensional scaling. vegetables is a classic data set for demonstrating Thurstonian scaling and is the preference matrix of 9 vegetables from Guilford (1954). Used by Guilford (1954), Nunnally (1967), and Nunnally and Bernstein (1984), this data set allows for examples of basic scaling techniques.


14 Development version and a user's guide

The most recent development version is available as a source file at the repository maintained at http://personality-project.org/r. That version will have removed the most recently discovered bugs (but perhaps introduced other, yet to be discovered ones). To download that version, go to the repository http://personality-project.org/r/src/contrib and wander around. For a Mac, this version can be installed directly using the "other repository" option in the package installer. For a PC, the zip file for the most recent release has been created using the win-builder facility at CRAN. The development release for the Mac is usually several weeks ahead of the PC development version.

Although the individual help pages for the psych package are available as part of R and may be accessed directly (e.g., ?psych), the full manual for the psych package is also available as a pdf at http://personality-project.org/r/psych_manual.pdf.

News and a history of changes are available in the NEWS and CHANGES files in the source files. To view the most recent news:

> news(Version > "1.5.0", package="psych")

15 Psychometric Theory

The psych package has been developed to help psychologists do basic research. Many of the functions were developed to supplement a book (http://personality-project.org/r/book), An introduction to Psychometric Theory with Applications in R (Revelle, in prep). More information about the use of some of the functions may be found in the book.

For more extensive discussion of the use of psych in particular, and R in general, consult http://personality-project.org/r/r.guide.html, A short guide to R.

16 SessionInfo

This document was prepared using the following settings:

> sessionInfo()

R Under development (unstable) (2017-01-04 r71889)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GPArotation_2014.11-1 psych_1.6.12

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4      lattice_0.20-34  matrixcalc_1.0-3 MASS_7.3-45
 [5] grid_3.4.0       arm_1.8-6        nlme_3.1-128     stats4_3.4.0
 [9] coda_0.18-1      minqa_1.2.4      mi_1.0           nloptr_1.0.4
[13] Matrix_1.2-7.1   boot_1.3-18      splines_3.4.0    lme4_1.1-12
[17] sem_3.1-7        tools_3.4.0      foreign_0.8-67   abind_1.4-3
[21] parallel_3.4.0   compiler_3.4.0   mnormt_1.5-4


References

Bechtoldt, H. (1961). An empirical study of the factor analysis stability hypothesis. Psychometrika, 26(4):405–432.

Blashfield, R. K. (1980). The growth of cluster analysis: Tryon, Ward, and Johnson. Multivariate Behavioral Research, 15(4):439–458.

Blashfield, R. K. and Aldenderfer, M. S. (1988). The methods and problems of cluster analysis. In Nesselroade, J. R. and Cattell, R. B., editors, Handbook of multivariate experimental psychology (2nd ed.), pages 447–473. Plenum Press, New York, NY.

Bliese, P. D. (2009). Multilevel modeling in R (2.3): a brief introduction to R, the multilevel package and the nlme package.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2):245–276.

Cattell, R. B. (1978). The scientific use of factor analysis. Plenum Press, New York.

Cohen, J. (1982). Set correlation as a general multivariate data-analytic method. Multivariate Behavioral Research, 17(3).

Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. L. Erlbaum Associates, Mahwah, NJ, 3rd ed. edition.

Cooksey, R. and Soutar, G. (2006). Coefficient beta and hierarchical item clustering - an analytical procedure for establishing and displaying the dimensionality and homogeneity of summated scales. Organizational Research Methods, 9:78–98.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16:297–334.

Dwyer, P. S. (1937). The determination of the factor loadings of a given test from the known factor loadings of other tests. Psychometrika, 2(3):173–178.

Everitt, B. (1974). Cluster analysis. John Wiley & Sons, Oxford, England.

Fox, J., Nie, Z., and Byrnes, J. (2012). sem: Structural Equation Models.

Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4):430–450.

Guilford, J. P. (1954). Psychometric Methods. McGraw-Hill, New York, 2nd edition.

Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4):255–282.

Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA.

Henry, D. B., Tolan, P. H., and Gorman-Smith, D. (2005). Cluster analysis in family psychology research. Journal of Family Psychology, 19(1):121–132.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.

Holzinger, K. and Swineford, F. (1937). The bi-factor method. Psychometrika, 2(1):41–54.

Horn, J. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.

Horn, J. L. and Engstrom, R. (1979). Cattell's scree test in relation to Bartlett's chi-square test and other observations on the number of factors problem. Multivariate Behavioral Research, 14(3):283–300.

Jennrich, R. and Bentler, P. (2011). Exploratory bi-factor analysis. Psychometrika, pages 1–13. 10.1007/s11336-011-9218-4.

Jensen, A. R. and Weng, L.-J. (1994). What is a good g? Intelligence, 18(3):231–258.

Loevinger, J., Gleser, G., and DuBois, P. (1953). Maximizing the discriminating power of a multiple-score test. Psychometrika, 18(4):309–317.

MacCallum, R. C., Browne, M. W., and Cai, L. (2007). Factor analysis models as approximations. In Cudeck, R. and MacCallum, R. C., editors, Factor analysis at 100: Historical developments and future directions, pages 153–175. Lawrence Erlbaum Associates Publishers, Mahwah, NJ.

Martinent, G. and Ferrand, C. (2007). A cluster analysis of precompetitive anxiety: Relationship with perfectionism and trait anxiety. Personality and Individual Differences, 43(7):1676–1686.

McDonald, R. P. (1999). Test theory: A unified treatment. L. Erlbaum Associates, Mahwah, NJ.

Mun, E. Y., von Eye, A., Bates, M. E., and Vaschillo, E. G. (2008). Finding groups using model-based cluster analysis: Heterogeneous emotional self-regulatory processes and heavy alcohol use risk. Developmental Psychology, 44(2):481–495.

Nunnally, J. C. (1967). Psychometric theory. McGraw-Hill, New York.

Nunnally, J. C. and Bernstein, I. H. (1984). Psychometric theory. McGraw-Hill, New York, 3rd edition.

Pedhazur, E. (1997). Multiple regression in behavioral research: explanation and prediction. Harcourt Brace College Publishers.

Preacher, K. J. and Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4):717–731.

Revelle, W. (1979). Hierarchical cluster-analysis and the internal structure of tests. Multivariate Behavioral Research, 14(1):57–74.

Revelle, W. (2015). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston. R package version 1.5.8.

Revelle, W. (in prep). An introduction to psychometric theory with applications in R. Springer.

Revelle, W. and Condon, D. M. (2014). Reliability. In Irwing, P., Booth, T., and Hughes, D., editors, Wiley-Blackwell Handbook of Psychometric Testing. Wiley-Blackwell (in press).

Revelle, W., Condon, D., and Wilt, J. (2011). Methodological advances in differential psychology. In Chamorro-Premuzic, T., Furnham, A., and von Stumm, S., editors, Handbook of Individual Differences, chapter 2, pages 39–73. Wiley-Blackwell.

Revelle, W. and Rocklin, T. (1979). Very Simple Structure - alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14(4):403–414.

Revelle, W., Wilt, J., and Rosenthal, A. (2010). Personality and cognition: The personality-cognition link. In Gruszka, A., Matthews, G., and Szymura, B., editors, Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, chapter 2, pages 27–49. Springer.

Revelle, W. and Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika, 74(1):145–154.

Schmid, J. J. and Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1):83–90.

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2):420–428.

Smillie, L. D., Cooper, A., Wilt, J., and Revelle, W. (2012). Do extraverts get more bang for the buck? Refining the affective-reactivity hypothesis of extraversion. Journal of Personality and Social Psychology, 103(2):306–326.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical taxonomy: the principles and practice of numerical classification. A Series of books in biology. W. H. Freeman, San Francisco.

Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. A Series of books in biology. W. H. Freeman, San Francisco.

Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.

Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2):245–251.

Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27:345–353.

Thurstone, L. L. and Thurstone, T. G. (1941). Factorial studies of intelligence. The University of Chicago Press, Chicago, Ill.

Tryon, R. C. (1935). A theory of psychological components–an alternative to "mathematical factors". Psychological Review, 42(5):425–454.

Tryon, R. C. (1939). Cluster analysis. Edwards Brothers, Ann Arbor, Michigan.

Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41(3):321–327.

Zinbarg, R. E., Revelle, W., Yovel, I., and Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1):123–133.

Zinbarg, R. E., Yovel, I., Revelle, W., and McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30(2):121–144.


