
Article

Applied Psychological Measurement
2014, Vol. 38(6) 486–498
© The Author(s) 2014
Reprints and permissions: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0146621614536770
apm.sagepub.com

A Review of Commercial Software Packages for Multidimensional IRT Modeling

Kyung (Chris) T. Han¹ and Insu Paek²

Abstract

In this study, the authors evaluate several commercially available multidimensional item response theory (MIRT) software packages, including IRTPRO 2.1, Mplus 7.1, FlexMIRT, and EQSIRT, as well as their built-in estimation algorithms, and compare them for their performance in MIRT model estimation. The study examines the performance of model parameter recovery via a series of simulations based on four approaches for latent structuring: within-item MIRT, between-item MIRT, a mixture of within- and between-item MIRT, and a bifactor model. The simulation studies focused on realistic conditions and models that researchers and practitioners are likely to encounter in practice. The results showed that the studied software packages recovered the item parameters reasonably well but differed greatly in terms of the types of data and models they could handle and also the run time required for estimation completion.

Keywords

MIRT, computer program, simulation

¹Graduate Management Admission Council, Reston, VA, USA
²Florida State University, Tallahassee, FL, USA

Corresponding Author:
Kyung (Chris) T. Han, Graduate Management Admission Council, 11921 Freedom Dr., Suite 300, Reston, VA 20190, USA.
Email: [email protected]

Downloaded from apm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on December 7, 2016

The emergence of item response theory (IRT), originally introduced in the 1960s and based mainly on a unidimensional latent structure (Lord & Novick, 1968), completely changed the paradigm of psychological and educational measurement. It made possible many things, computerized adaptive testing for example, that were not as effective (if not nearly impossible) using classical test theory. One of the main reasons for IRT’s rapid gain in popularity in the field was the availability of computer software tools for estimating various IRT models. This development enabled and invited a wide range of IRT research and applications.

With the introduction of early unidimensional IRT models to the field, models generalized for multidimensional latent structures soon followed. The concept of multidimensional item response theory (MIRT), the mathematical equivalent to existing factor analysis approaches, had circulated in the field for decades but had yet to reach critical mass in terms of the number of related research studies and applications. This was largely due to the unavailability of computer software tools for MIRT modeling and insufficient PC computing power. Typical response data in psychological or educational measurement are on a discrete scale that requires mathematical integration processes, causing heavy computational loads, especially when dealing with three or more dimensions. The last 5 years, however, have witnessed great progress in the development of both computer hardware and software. It is now common for PCs to sport multiple CPU cores, and equally common for these newer systems to implement multiple parallel threads with a virtualization technique for computation-heavy tasks. Most important is the development of several new software tools for MIRT modeling designed to take advantage of the latest PCs that are now readily available to the public. Because these MIRT software tools differ significantly in terms of their performance and user experience, researchers and practitioners could benefit from a comprehensive comparison of these tools to make more informed decisions in choosing the right MIRT software tool for their specific needs.

This study presents an evaluation and comparison of the most recent commercially available MIRT software packages and their estimation algorithms. These packages include IRTPRO 2.1 (Cai, Thissen, & du Toit, 2011), Mplus 7.1 (Muthén & Muthén, 1998–2012), flexMIRT 2 (Cai, 2013), and EQSIRT (Wu & Bentler, 2013).

Features, Capabilities, and Algorithms

Mplus 7.1

Mplus supports a variety of statistical analysis methods, including regression and path analysis, exploratory factor analysis (EFA), confirmatory factor analysis (CFA), structural equation modeling (SEM), and mixture modeling. It also supports multigroup and multilevel data. Mplus handles IRT and MIRT models as a special case of CFA and can estimate threshold and slope parameters but not a lower asymptote (i.e., the pseudo-guessing parameter). In other words, it does not support three-parameter logistic (3PL) IRT/MIRT models. Mplus can handle dichotomous and polytomous items and also supports mixture and multilevel IRT analyses.

For MIRT modeling, Mplus offers great flexibility in specifying latent structures and constraints. Users can impose constraints on the covariance matrix of the latent variables and on item parameters. Mplus can handle data with missing responses. It can handle a maximum of 500 variables (i.e., items), which might be too few for some large-scale tests with large item banks.

For MIRT estimation, the available estimators include the weighted least squares mean- and variance-adjusted (WLSMV) estimator, maximum likelihood (ML), and Bayes. For the ML estimator, which is the most widely used, Mplus features several different sets of algorithms, including Fisher scoring, Newton–Raphson, quasi-Newton, and expectation–maximization (EM). By default, Mplus automatically selects and implements different algorithms for ML during the iterations. Users can choose one of three options for numerical integration: (a) rectangular integration, (b) Gauss–Hermite integration, and (c) Monte Carlo (MC) integration. The developers recommend rectangular integration whenever possible, but Monte Carlo integration is sometimes the only feasible option when the estimation involves more than three dimensions, because of what is known as the "curse of dimensionality." For estimating person scores (i.e., factor scores), Mplus uses the expected a posteriori (EAP) estimation method. Monte Carlo simulations and multiple imputations can be done within the software. Mplus supports multithreading for the ML estimator, which can reduce the time required to complete estimation by a large factor when the computer has multiple CPU cores available.
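To make the "curse of dimensionality" concrete, the following is a minimal sketch, not Mplus code. It counts the nodes of a full product quadrature grid and compares a product Gauss–Hermite rule with plain Monte Carlo integration for the marginal probability of a correct response to one item. The choice of 15 points per dimension, the item slopes, the threshold, and the slope-times-theta-minus-threshold parameterization are illustrative assumptions rather than values or settings taken from the study.

```python
import itertools
import numpy as np


def grid_size(points_per_dim: int, n_dims: int) -> int:
    """Number of nodes in a full product quadrature grid."""
    return points_per_dim ** n_dims


def gauss_hermite_mean(f, n_dims: int, points_per_dim: int = 15) -> float:
    """Approximate E[f(theta)], theta ~ N(0, I), on a product Gauss-Hermite grid."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(points_per_dim)
    weights = weights / np.sqrt(2.0 * np.pi)  # rescale so the weights sum to 1
    total = 0.0
    for idx in itertools.product(range(points_per_dim), repeat=n_dims):
        idx = list(idx)
        total += np.prod(weights[idx]) * f(nodes[idx])
    return float(total)


def monte_carlo_mean(f, n_dims: int, n_draws: int = 20_000, seed: int = 0) -> float:
    """Approximate the same expectation with random draws from N(0, I)."""
    rng = np.random.default_rng(seed)
    draws = rng.standard_normal((n_draws, n_dims))
    return float(np.mean([f(theta) for theta in draws]))


# Hypothetical three-dimensional item (made-up slopes and threshold).
slopes = np.array([1.2, 0.8, 0.5])
threshold = 0.3


def p_correct(theta):
    """Compensatory 2PL response probability (assumed parameterization)."""
    return 1.0 / (1.0 + np.exp(-(slopes @ theta - threshold)))


gh = gauss_hermite_mean(p_correct, n_dims=3)  # visits 15**3 = 3,375 nodes
mc = monte_carlo_mean(p_correct, n_dims=3)    # uses 20,000 draws regardless of dimension

print(grid_size(15, 3), grid_size(15, 6))     # 3375 11390625
print(round(gh, 3), round(mc, 3))
```

The grid grows geometrically with the number of dimensions, while the number of Monte Carlo draws is chosen independently of it, which is one way to read the developers' advice quoted above.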


IRTPRO 2.1

IRTPRO 2.1, although still offering some features for EFA, mainly focuses on IRT/MIRT modeling. IRTPRO is based on a highly generalized IRT model that embraces multiple groups, multiple response categories, and multiple dimensions. As a result, it can cover a range of models, from a unidimensional 3PL model with a pseudo-guessing parameter to multidimensional IRT models for multiple groups with polytomous responses based on the graded response model (GRM), the generalized partial credit model (GPCM), or the nominal response model (NRM).

Users can impose constraints or prior values on any of the item parameters with IRTPRO, and the software can handle data with missing responses. There is no maximum limit on the number of items that IRTPRO can handle; however, it is essentially limited by the system’s usable memory and by IRTPRO’s maximum manageable memory with a 32-bit architecture (theoretically up to 4 GB).

For item parameter estimation, IRTPRO offers three different methods: the Bock–Aitkin approach with the expectation–maximization algorithm (BAEM; Bock & Aitkin, 1981), the adaptive quadrature approach (ADQ; Schilling & Bock, 2005) with three different options for numerical integration (Gauss–Hermite, Monte Carlo, and Latin hypercube), and the Metropolis–Hastings Robbins–Monro method (MHRM; Cai, 2010a, 2010b). MHRM is an optimization method that produces marginal maximum likelihood (MML) and modal Bayes solutions for item factor models and multilevel item factor models. It eschews numerical integration and combines elements of Markov chain Monte Carlo (MCMC), using the Metropolis–Hastings method, with stochastic approximation (the Robbins–Monro method) to achieve a pointwise convergent algorithm. It produces standard errors as a by-product. Generally, the only tuning parameter for the MHRM algorithm is the proposal dispersion constant for the MH sampler. For score estimation, IRTPRO supports three methods: EAP, summed score expected a posteriori (SSEAP), and maximum a posteriori (MAP). It also provides differential item functioning (DIF) analysis tools. IRTPRO supports multithreading, which boosts its computation speed on computers that have multiple CPU cores.
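As an illustration of what EAP scoring computes, the sketch below evaluates the posterior mean and standard deviation of a single latent trait on a rectangular grid. It is a simplified, unidimensional 2PL example under a standard normal prior, not the routine used by any of the reviewed packages; the function name, the grid, and the item parameters are hypothetical.

```python
import numpy as np


def eap_score(responses, slopes, intercepts, n_points=61):
    """EAP estimate and posterior SD of theta under a 2PL model with an N(0, 1) prior."""
    theta = np.linspace(-6.0, 6.0, n_points)        # equally spaced quadrature grid
    prior = np.exp(-0.5 * theta ** 2)               # unnormalized N(0, 1) density
    logits = np.outer(theta, slopes) + intercepts   # shape: (grid points, items)
    p = 1.0 / (1.0 + np.exp(-logits))
    u = np.asarray(responses)
    likelihood = np.prod(np.where(u == 1, p, 1.0 - p), axis=1)
    posterior = likelihood * prior
    posterior /= posterior.sum()                    # normalize over the grid
    mean = float(np.sum(theta * posterior))
    sd = float(np.sqrt(np.sum((theta - mean) ** 2 * posterior)))
    return mean, sd


# Hypothetical five-item test with made-up parameters.
item_slopes = np.array([1.0, 1.3, 0.8, 1.1, 0.9])
item_intercepts = np.zeros(5)

print(eap_score([1, 1, 1, 1, 1], item_slopes, item_intercepts))
print(eap_score([0, 0, 0, 0, 0], item_slopes, item_intercepts))
```

Because the prior pulls estimates toward zero, an all-correct response pattern yields a finite EAP score, whereas a pure ML score would be unbounded for that pattern.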

flexMIRT 2.0

flexMIRT supports the same unidimensional and multidimensional IRT models that IRTPRO does and is also capable of handling multilevel structures; the latest version (2.0) also supports extended cognitive diagnostic models (CDM). There are two editions of flexMIRT for different Windows systems, one based on a 32-bit architecture and another based on a 64-bit architecture. For the 64-bit version, there is practically no upper limit on the amount of computer memory that flexMIRT can use (Windows 8 Pro supports up to 512 GB of RAM).

For item calibration, users can choose either the BAEM or the MHRM method, and for score estimation they have the option of using ML, EAP, SSEAP, weighted SSEAP, or MAP. flexMIRT can also estimate the item parameters and the theta distribution simultaneously by using an empirical histogram (currently only for single-level bifactor or testlet response models). Tools for DIF analyses, multiple imputation, and simulations are also available within flexMIRT. A unique feature of flexMIRT is "fixed effects calibration," in which item parameters are estimated given fixed values for individual thetas. This feature can be very useful for practitioners who need to calibrate pretest items using previously calibrated individual theta values. flexMIRT also supports multithreading, with two possible options for efficiency tuning depending on the number of dimensions and the number of items.
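One way to picture calibration with fixed thetas: once each person's theta is treated as known, estimating a 2PL item's slope and intercept reduces to a logistic regression of the item responses on theta. The sketch below fits that regression by Newton–Raphson. It is an illustration of the general idea only and makes no claim about how flexMIRT implements the feature; the function name and the simulated values are hypothetical.

```python
import numpy as np


def calibrate_item_fixed_theta(theta, u, n_iter=25):
    """ML estimates of [slope(s), intercept] for one 2PL item, with theta treated as known."""
    X = np.column_stack([theta, np.ones(len(u))])    # design matrix: theta column(s), constant
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (u - p)                     # score vector of the log-likelihood
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hessian, gradient)   # Newton-Raphson step
    return beta


# Simulated pretest item with made-up generating values (slope 1.2, intercept -0.5).
rng = np.random.default_rng(1)
known_theta = rng.standard_normal(5000)
p_true = 1.0 / (1.0 + np.exp(-(1.2 * known_theta - 0.5)))
item_responses = (rng.random(5000) < p_true).astype(float)

slope_hat, intercept_hat = calibrate_item_fixed_theta(known_theta, item_responses)
print(round(slope_hat, 2), round(intercept_hat, 2))
```

Passing a two-dimensional theta array (persons by dimensions) gives one slope per dimension, which is the multidimensional analogue of the same regression.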


EQSIRT 1.0

EQSIRT supports various unidimensional and multidimensional IRT models with both dichotomous and polytomous response types, as well as latent class analysis (LCA) and Mokken scale analysis (MSA). EQSIRT offers three estimation algorithms for item calibration: (a) the MML method, (b) the Monte Carlo expectation–maximization (MCEM) method, and (c) the MCMC method. Not all estimation methods work with all IRT models, however. For example, MCEM and MCMC do not work for estimating the 3PL model in EQSIRT. For theta estimation, the software offers ML, EAP, and MAP methods.

EQSIRT offers a wide variety of tools for IRT-related research, including DIF analysis, test score equating, and simulation. EQSIRT cannot handle a response matrix with missing data and does not support multithreaded computing. It can handle a maximum of 200 items for MIRT calibration.

Performance

A series of simulation studies were conducted to evaluate and compare the performance of MIRT parameter recovery across the studied software packages.

Simulations

Response matrix data with 30 dichotomous items loading on three or four factors (i.e., latent variables) were simulated based on the multidimensional compensatory 2PL IRT model under four different latent structures. The diagrams of the four models are represented in Figure 1.

Model 1 was a so-called "between-item" structure, in which each item loaded on a single factor. There were three factors, and each factor was exclusively associated with 10 items. The true item parameter values were borrowed from a real, existing item bank for a testing program in higher education. The mean and the standard deviation (SD) of the true slope parameter values of the 30 items were 1.22 and 0.48, respectively. To prevent potential scaling issues, all true threshold parameter values were rescaled to follow a standard normal distribution.

Model 2 was a so-called "within-item" structure in this study, where some items loaded on more than a single factor. In Model 2, the first 10 items loaded only on the first factor, F1, and Items 11 to 20 loaded on both F1 and F2. The final 10 items (Items 21–30) loaded on all three factors (F1, F2, and F3). The mean and the SD of the true a-parameter values across factors were 0.88 and 0.47, respectively, and the true b-parameter values were the same as those seen with Model 1.

Model 3 was a mixture of "between" and "within" structures. Items 1 to 10 loaded only on F1, Items 11 to 15 only on F2, and Items 21 to 25 only on F3. The remaining items loaded on multiple factors.

Model 4 was a bifactor model in which there was one primary factor (F1) on which all 30 items loaded. Each item also loaded on one of three nuisance factors (F2, F3, and F4).

Under the four studied models, the response data for 3,000 simulees were generated. For Models 1, 2, and 3, the true theta values with three dimensions were borrowed from real score data derived from a higher education testing program with three different subjects. The true theta values of each dimension were rescaled to follow a standard normal distribution, and as a result, the true covariance matrix values were identical to the correlation matrix values. The covariance values were 0.19 for F1 and F2, 0.46 for F1 and F3, and 0.35 for F2 and F3. For Model 4, the true theta values for four factors (one main + three nuisances) were randomly generated from a standard normal distribution, and in the covariance matrix, all diagonal values were 1, and all off-diagonal values were 0.

In addition to the aforementioned study conditions, Models 1, 2, and 4 were also replicated with missing response data. For the missing data conditions, the data for the same 3,000 simulees were used. Each simulee was administered 30 items randomly selected from a total of 90 items. The simulation was performed using WinGen software (Han, 2007).
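The data-generating design for the between-item structure (Model 1) can be sketched as follows. This is not WinGen and does not reproduce the study's data: the slopes and thresholds are randomly drawn stand-ins for the item bank values, the thetas are sampled from a multivariate normal distribution rather than taken from real scores, and the slope-times-theta-minus-threshold form of the compensatory 2PL model is an assumed parameterization. Only the sample size, the item-to-factor pattern, and the latent covariances come from the text.

```python
import numpy as np

rng = np.random.default_rng(2014)
n_persons, n_items, n_dims = 3000, 30, 3

# Between-item pattern: Items 1-10 load on F1, 11-20 on F2, 21-30 on F3.
pattern = np.zeros((n_items, n_dims), dtype=bool)
for f in range(n_dims):
    pattern[f * 10:(f + 1) * 10, f] = True

# Hypothetical generating parameters (stand-ins for the real item bank values).
slopes = np.where(pattern, rng.lognormal(0.1, 0.35, (n_items, n_dims)), 0.0)
thresholds = rng.standard_normal(n_items)

# Latent covariances reported in the text; unit variances.
cov = np.array([[1.00, 0.19, 0.46],
                [0.19, 1.00, 0.35],
                [0.46, 0.35, 1.00]])
theta = rng.multivariate_normal(np.zeros(n_dims), cov, size=n_persons)

# Compensatory 2PL: the logit is a weighted sum of the thetas minus the threshold.
logits = theta @ slopes.T - thresholds
prob = 1.0 / (1.0 + np.exp(-logits))
responses = (rng.random((n_persons, n_items)) < prob).astype(int)

print(responses.shape, responses.mean().round(3))
```

Changing `pattern` is enough to express the other structures, for example setting the first column to `True` for every item and adding a fourth dimension for a bifactor layout.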

Calibration

A majority of the available estimation methods were used with the Mplus, IRTPRO, flexMIRT, and EQSIRT software. For Mplus, the ML and MC methods were used. With IRTPRO, the BAEM, ADQ, and MHRM methods were used, and the BAEM and MHRM methods were employed with flexMIRT. The MML, MCEM, and MCMC options were used with EQSIRT. All item parameters (slope and threshold) that loaded on corresponding factors according to each model design (Figure 1) were set to be freely estimated. The slope parameters of items that did not load on a factor were fixed at zero. To avoid the scale indeterminacy issue, the mean values of the latent trait distributions were set to 0, and the variance values of the latent trait distributions were fixed at 1. For Models 1, 2, and 3, all off-diagonal values of the covariance matrix for the latent traits were set to be freely estimated. For Model 2, the calibration was run a second time under an additional condition in which all off-diagonal values of the covariance matrix were fixed to the true values (referred to as Model 2B). For Model 4, all off-diagonal values of the covariance matrix were fixed to zero during calibration, and for IRTPRO and flexMIRT, the commands for bifactor modeling were used, where dimension reduction techniques were employed.

Figure 1. Factor structure and items for simulations.
Note. IRT = item response theory.

The calibration performance of the studied software programs was evaluated based on Pearson correlation coefficients between the true values and estimates, estimation bias, and mean absolute error (MAE) statistics. For models where the covariance matrix was freely estimated, the differences between the true values and estimates were also compared. All calibrations were run on a computer with a quad-core Intel i7-2760QM CPU and 8 GB of physical RAM. The operating system was Microsoft Windows 7 Professional 64-bit edition. Except for EQSIRT, which did not support multithreaded computing, all studied programs were set to use up to two CPU cores during calibration. The actual elapsed time (not CPU time) for each calibration run was recorded. The study focused on evaluating item parameter recovery and did not evaluate the person estimates.
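The three recovery statistics can be computed as in the short sketch below. The text names the statistics but does not spell out their formulas, so the definitions used here (bias as the mean of estimate minus true value, MAE as the mean of the absolute differences) are assumed conventional ones, and the numbers in the example are made up.

```python
import numpy as np


def recovery_stats(true_values, estimates):
    """Pearson correlation, bias, and mean absolute error for a set of parameters."""
    true_values = np.asarray(true_values, dtype=float)
    estimates = np.asarray(estimates, dtype=float)
    error = estimates - true_values
    return {
        "correlation": float(np.corrcoef(true_values, estimates)[0, 1]),
        "bias": float(error.mean()),          # signed: positive means overestimation
        "mae": float(np.abs(error).mean()),   # unsigned average error size
    }


# Made-up slopes: every estimate is 0.1 too high.
stats = recovery_stats([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 3.1, 4.1])
print(stats)
```

The example shows why the statistics are reported together: a constant shift leaves the correlation at 1 while bias and MAE both register the error.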

Results and Performance Comparison

Conditions with 30 items without missing responses were evaluated first. Except for the MCEM and MCMC methods with EQSIRT, all studied software programs with different estimation methods showed an extremely high level of parameter recovery performance for both slope and threshold parameters under Models 1, 2B, 3, and 4. The correlation between the true and estimated parameter values was higher than 0.96 for slope and higher than 0.99 for threshold. EQSIRT with the MCEM and MCMC methods failed to finish calibration runs for all studied models.¹ As shown in Table 1, the estimation bias and errors (measured using MAE) were very small for Models 1, 2B, and 3. The off-diagonal values of the covariance matrix were reasonably well recovered for Models 1 and 3.

Model 2 (the within-item design) was technically a hierarchical factor model with three layers, and the latent covariances were expected to be unidentifiable under the studied conditions. Contrary to this expectation, however, all studied software tools finished their calibration runs without encountering any errors when the latent covariance matrix was freely estimated (Model 2A). As reported in Table 1, the estimated latent covariance values from all calibration runs were incorrect because of rotational indeterminacy. In these circumstances, the software tools should have either stopped running without producing output containing wrong estimates for unidentifiable parameters or, at the very least, provided error or warning messages so that users could avoid being misled by the meaningless estimates. None of the studied programs, however, provided such a feature or guidance. The accuracy of the slope parameter estimates under Model 2A across the studied programs was moderately degraded from that observed for Model 1. This was due mainly to increased estimation bias resulting from the incorrect covariance estimates. The threshold parameter estimates under Model 2A were still very close to the true values for all studied programs. Under Model 2B, where all covariance matrix values were fixed to the true values, all programs except EQSIRT showed an extremely high level of parameter recovery performance for both the slope and the threshold. EQSIRT offers a syntax command to fix latent covariance matrix values, but it did not work; the program recognized the command but ignored it. As a result, the latent covariance matrix values for Model 2B were not fixed but freely estimated with EQSIRT, as in Model 2A.²


Table 1. Item Parameter Recovery Without Missing Data (30 Items/3,000 Simulees).

                   Correlation           Bias                  MAE                   Covariance matrix
                   Slope      Threshold  Slope      Threshold  Slope      Threshold  Cov(a1,a2) Cov(a1,a3) Cov(a2,a3)
                                                                                     True=0.19  True=0.46  True=0.35
Model 1 (between)
Mplus (ML)         .986       .998       0.041      0.016      0.070      0.054      0.217      0.456      0.344
Mplus (MC)         .986       .998       0.048      0.017      0.073      0.054      0.227      0.473      0.351
EQSIRT (ML)        .986       .998       0.041      0.016      0.070      0.054      0.217      0.457      0.344
IRTPRO (BAEM)      .986       .998       0.041      0.016      0.070      0.054      0.217      0.457      0.344
IRTPRO (ADQ)       .986       .998       0.041      0.014      0.070      0.053      0.217      0.457      0.344
IRTPRO (MHRM)      .986       .998       0.039      0.021      0.070      0.054      0.221      0.461      0.356
flexMIRT (BAEM)    .986       .998       0.041      0.016      0.070      0.054      0.265      0.476      0.367
flexMIRT (MHRM)    .986       .998       0.038      0.017      0.069      0.055      0.213      0.432      0.330

Model 2Aᵃ (within)
Mplus (ML)         .913       .998       0.157      0.041      0.196      0.067      -0.023     0.160      -0.225
Mplus (MC)         .960       .998       0.082      0.037      0.127      0.065      0.155      0.276      -0.125
EQSIRT (ML)        .930       .998       0.123      0.041      0.159      0.068      0.064      0.523      -0.340
IRTPRO (BAEM)      .888       .998       0.192      0.041      0.231      0.067      -0.072     0.085      -0.351
IRTPRO (ADQ)       .822       .998       0.263      0.038      0.302      0.066      -0.228     0.006      -0.281
IRTPRO (MHRM)      .905       .998       0.174      0.005      0.211      0.056      -0.025     0.058      -0.336
flexMIRT (BAEM)    .888       .998       0.192      0.041      0.231      0.068      0.240      -0.160     -0.232
flexMIRT (MHRM)    .903       .998       0.163      0.013      0.206      0.059      0.077      0.011      0.003

Model 2Bᵇ (within)
Mplus (ML)         .980       .998       0.036      0.041      0.086      0.068      —          —          —
Mplus (MC)         .976       .998       0.037      0.037      0.090      0.065      —          —          —
EQSIRT (ML)ᶜ       .930       .998       0.123      0.041      0.159      0.068      0.064ᶜ     0.523ᶜ     -0.340ᶜ
IRTPRO (BAEM)      .980       .998       0.036      0.041      0.086      0.067      —          —          —
IRTPRO (ADQ)       .980       .998       0.037      0.037      0.086      0.066      —          —          —
IRTPRO (MHRM)      .977       .998       0.042      0.005      0.088      0.056      —          —          —
flexMIRT (BAEM)    .980       .998       0.036      0.041      0.086      0.067      —          —          —
flexMIRT (MHRM)    .978       .998       0.041      0.009      0.084      0.058      —          —          —

Model 3 (between + within)
Mplus (ML)         .992       .999       0.051      0.037      0.066      0.054      0.187      0.439      0.287
Mplus (MC)         .991       .998       0.052      0.040      0.070      0.056      0.215      0.492      0.305
EQSIRT (ML)        .992       .999       0.051      0.037      0.066      0.054      0.188      0.439      0.287
IRTPRO (BAEM)      .992       .999       0.051      0.037      0.066      0.054      0.187      0.439      0.287
IRTPRO (ADQ)       .992       .999       0.051      0.035      0.066      0.053      0.187      0.430      0.282
IRTPRO (MHRM)      .992       .999       0.051      0.033      0.067      0.052      0.190      0.413      0.288
flexMIRT (BAEM)    .992       .999       0.051      0.037      0.066      0.054      0.187      0.438      0.286
flexMIRT (MHRM)    .992       .999       0.058      0.033      0.072      0.053      0.181      0.341      0.215

Model 4 (bifactor)
Mplus (ML)         .980       .998       0.104      0.033      0.121      0.063
Mplus (MC)         .974       .998       0.102      0.028      0.125      0.062
EQSIRT (ML)        .983       .999       -0.005     -0.006     0.072      0.043
IRTPRO (BAEM)      .983       .999       -0.005     -0.007     0.072      0.042
IRTPRO (ADQ)       .983       .999       -0.004     -0.009     0.072      0.042
IRTPRO (MHRM)      .964       .999       0.036      -0.010     0.086      0.043
flexMIRT (BAEM)    .983       .999       -0.005     -0.007     0.072      0.042
flexMIRT (MHRM)    .972       .999       0.034      -0.009     0.081      0.042

Note. MAE = mean absolute error; ML = maximum likelihood; BAEM = Bock–Aitkin expectation–maximization; ADQ = adaptive quadrature approach; MHRM = Metropolis–Hastings Robbins–Monro method.
ᵃ In Model 2A, the covariance matrix was freely estimated.
ᵇ In Model 2B, the covariance matrix was not estimated but fixed to the true values.
ᶜ Did not work as intended. Set to fix the covariance matrix values, but EQSIRT did not fix them.

Item calibration performance in the presence of missing data was also evaluated, and the results are reported in Table 2. Only Models 1, 2B, and 4 were studied, using a total of 90 items, although each simulee was administered only 30 items chosen at random. Therefore, with about 1,000 observed responses per item, the sparseness of the response matrix (i.e., the percentage of missing responses) was 66.6%. Although it showed options for handling missing data in its graphical interface, EQSIRT could not handle missing data for IRT calibration; all other programs ran successfully, however. As shown in Table 2, all studied programs except EQSIRT showed correlations between the true and estimated values that were higher than 0.99 for threshold and 0.93 for slope, which were slightly lower than those under the non-missing data conditions shown in Table 1. There was practically no bias in estimating the slopes and thresholds when the data had missing responses, and the estimation errors (MAE) were slightly larger than those observed in Table 1. This was not unexpected given the decrease in the number of responses available for each item (3,000 responses under the non-missing data conditions, compared with about 1,000 responses under the missing data conditions).
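The arithmetic behind the missing data design can be checked with a short sketch. It assigns each of 3,000 simulees a random 30 of 90 items, then computes the share of unobserved cells and the number of observed responses per item. The random seed and variable names are arbitrary, and the assignment routine is an assumed stand-in for whatever procedure the simulation software used.

```python
import numpy as np

rng = np.random.default_rng(7)
n_persons, bank_size, test_length = 3000, 90, 30

# administered[i, j] is True when simulee i was given item j.
administered = np.zeros((n_persons, bank_size), dtype=bool)
for i in range(n_persons):
    chosen = rng.choice(bank_size, size=test_length, replace=False)
    administered[i, chosen] = True

sparseness = 1.0 - administered.mean()           # share of missing cells: 60/90, about two thirds
responses_per_item = administered.sum(axis=0)    # observed responses for each item

print(round(float(sparseness), 4))
print(float(responses_per_item.mean()), int(responses_per_item.min()), int(responses_per_item.max()))
```

The average count per item is exactly 3,000 × 30 / 90 = 1,000, while individual items fall somewhat above or below that figure, which matches the "about 1,000" wording in the text.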

The elapsed time for item calibration varied greatly across the studied programs and estimation methods, as shown in Table 3. Each program has different default settings and tunings for each of the studied estimation methods (e.g., different maximum numbers of cycles for iterative calibration processes, different convergence criteria, different numbers of quadrature points). For that reason, direct comparisons of the elapsed times under the program default convergence settings may not be a critically meaningful measure of program performance, but they still offer helpful information. Among the compared programs, EQSIRT with the MML estimation method took the longest time across all studied conditions, likely because EQSIRT was the only one of the studied programs that did not support multithreaded processing. The BAEM method implemented within IRTPRO and flexMIRT tended to take more time than did the similar ML method implemented in Mplus. The MHRM method with IRTPRO achieved the shortest elapsed time (less than 4 min) in all studied conditions/models. The same MHRM method implemented within flexMIRT tended to take slightly more time than it did in IRTPRO. The MC method of Mplus was also very fast (less than 10 min) in all studied conditions. The ADQ method within IRTPRO was the fastest (under 19 min) among the non-MC-based estimation methods.

User Interface (UI) and Documentation

Although Mplus features a basic syntax generator via dialog boxes, the main UI of Mplus is not much more than a text editor, in which users directly write syntax commands and edit input files. Mplus supports various text-based data formats and can be run easily in batch mode. Most outputs (except for the factor scores, that is, theta estimates) are saved in a single file, which sometimes makes it difficult to extract important results such as item parameter estimates. The error messages of Mplus are usually sufficiently detailed to enable users to locate problems. Mplus also provides useful warning notes in the output when users may need to use caution in interpreting the results.

IRTPRO offers a complete point-and-click UI as well as a text editor for editing syntax commands. IRTPRO supports only its own proprietary data file format (*.ssig). Although IRTPRO provides features for importing data from other file formats, its inability to work directly with other file formats can limit the usability of the software, especially when it is set to run in batch mode for massive data analyses. For outputs, IRTPRO allows users to save several key results in separate files. The main output is presented in a well-organized .html format.


Table 2. Item Parameter Recovery With Missing Data (30 Items/3,000 Simulees/About 1,000 Responses per Item).

| Model | Program (method) | Correlation: Slope | Correlation: Threshold | Bias: Slope | Bias: Threshold | MAE: Slope | MAE: Threshold | Cov(a1,a2), True = 0.19 | Cov(a1,a3), True = 0.46 | Cov(a2,a3), True = 0.35 |
|---|---|---|---|---|---|---|---|---|---|---|
| Model 1 (between) | Mplus (ML) | .948 | .996 | 0.042 | 0.004 | 0.110 | 0.068 | 0.221 | 0.448 | 0.349 |
| | Mplus (MC) | .948 | .996 | 0.046 | 0.005 | 0.111 | 0.068 | 0.233 | 0.464 | 0.357 |
| | IRTPRO (BAEM) | .948 | .996 | 0.042 | 0.004 | 0.110 | 0.068 | 0.220 | 0.450 | 0.350 |
| | IRTPRO (ADQ) | .948 | .996 | 0.042 | 0.003 | 0.110 | 0.068 | 0.221 | 0.448 | 0.349 |
| | IRTPRO (MHRM) | .947 | .996 | 0.042 | 0.003 | 0.111 | 0.068 | 0.222 | 0.445 | 0.351 |
| | FlexMIRT (BAEM) | .948 | .996 | 0.042 | 0.004 | 0.110 | 0.068 | 0.221 | 0.448 | 0.349 |
| | FlexMIRT (MHRM) | .949 | .996 | 0.040 | 0.004 | 0.108 | 0.067 | 0.214 | 0.422 | 0.332 |
| Model 2B (within B) | Mplus (ML) | .932 | .997 | 0.033 | 0.006 | 0.125 | 0.069 | — | — | — |
| | Mplus (MC) | .933 | .997 | 0.028 | 0.001 | 0.124 | 0.070 | — | — | — |
| | IRTPRO (BAEM) | .932 | .997 | 0.033 | 0.007 | 0.125 | 0.069 | — | — | — |
| | IRTPRO (ADQ) | .932 | .997 | 0.033 | 0.004 | 0.125 | 0.069 | — | — | — |
| | IRTPRO (MHRM) | .931 | .997 | 0.035 | -0.002 | 0.125 | 0.069 | — | — | — |
| | FlexMIRT (BAEM) | .932 | .997 | 0.033 | 0.007 | 0.125 | 0.069 | — | — | — |
| | FlexMIRT (MHRM) | .932 | .997 | 0.033 | 0.000 | 0.126 | 0.069 | — | — | — |
| Model 4 (bifactor) | Mplus (ML) | .933 | .997 | 0.028 | 0.000 | 0.121 | 0.067 | — | — | — |
| | Mplus (MC) | .934 | .997 | 0.025 | 0.001 | 0.122 | 0.066 | — | — | — |
| | IRTPRO (BAEM) | .933 | .997 | 0.028 | 0.000 | 0.121 | 0.067 | — | — | — |
| | IRTPRO (ADQ) | .933 | .997 | 0.028 | -0.002 | 0.121 | 0.067 | — | — | — |
| | IRTPRO (MHRM) | .930 | .997 | 0.029 | 0.001 | 0.122 | 0.067 | — | — | — |
| | FlexMIRT (BAEM) | .932 | .997 | 0.028 | -0.001 | 0.121 | 0.067 | — | — | — |
| | FlexMIRT (MHRM) | .931 | .997 | 0.029 | -0.003 | 0.122 | 0.067 | — | — | — |

Note. MAE = mean absolute error; ML = maximum likelihood; BAEM = Bock–Aitkin expectation–maximization; ADQ = adaptive quadrature approach; MHRM = Metropolis–Hastings Robbins–Monro Method.
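
The three recovery statistics in Table 2 can be illustrated with a short Python sketch. It assumes that bias is the mean of (estimate minus true value), that MAE is the mean of the absolute differences, and that the correlation is the Pearson correlation between true and estimated values; the exact formulas used in the study are those given in its method section, and the parameter values below are hypothetical, not taken from the simulation.

```python
from math import sqrt
from statistics import mean


def recovery_stats(true_vals, est_vals):
    """Return (correlation, bias, MAE) for paired true/estimated parameters."""
    if len(true_vals) != len(est_vals) or len(true_vals) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    errors = [est - true for true, est in zip(true_vals, est_vals)]
    bias = mean(errors)                       # signed average error
    mae = mean([abs(err) for err in errors])  # average error magnitude
    # Pearson correlation from deviations about each mean
    # (undefined if either sequence is constant).
    mean_true, mean_est = mean(true_vals), mean(est_vals)
    dev_true = [t - mean_true for t in true_vals]
    dev_est = [e - mean_est for e in est_vals]
    cross = sum(dt * de for dt, de in zip(dev_true, dev_est))
    denom = sqrt(sum(dt * dt for dt in dev_true) * sum(de * de for de in dev_est))
    return cross / denom, bias, mae


# Hypothetical slope parameters for four items (illustration only).
true_slopes = [1.0, 1.5, 0.8, 1.2]
estimated_slopes = [1.1, 1.4, 0.9, 1.3]

corr, bias, mae = recovery_stats(true_slopes, estimated_slopes)
print(f"correlation={corr:.3f}, bias={bias:.3f}, MAE={mae:.3f}")
```

A bias near zero together with a larger MAE, as in most rows of Table 2, indicates that individual estimates deviate from the true values but that the deviations largely cancel on average.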

Table 3. Elapsed Time for Item Calibration (Hour:Minute).

| Data condition | Model | Mplus (ML) | Mplus (MC) | EQSIRT (MML) | IRTPRO (BAEM) | IRTPRO (ADQ) | IRTPRO (MHRM) | FlexMIRT (BAEM) | FlexMIRT (MHRM) |
|---|---|---|---|---|---|---|---|---|---|
| 3,000 simulees, 30 items, no missing data | Model 1 | 0:26 | 0:01 | 3:04 | 2:08 | 0:02 | 0:01 | 0:44 | 0:02 |
| | Model 2A | 0:14 | 0:06 | 9:38 | 3:48 | 0:04 | 0:03 | 1:05 | 0:04 |
| | Model 2B | 0:14 | 0:05 | 10:23 (a) | 2:53 | 0:05 | 0:01 | 2:00 | 0:12 |
| | Model 3 | 0:07 | 0:01 | 3:16 | 2:30 | 0:03 | 0:01 | 1:26 | 0:03 |
| | Model 4 | 0:05 | 0:05 | 9:39 | 0:03 | 0:15 | 0:03 | 0:02 | 0:07 |
| 3,000 simulees, 90 items, 67% missing data | Model 1 | 0:25 | 0:02 | Not supported | 3:56 | 0:02 | 0:02 | 0:56 | 0:03 |
| | Model 2B | 0:37 | 0:09 | Not supported | 12:48 | 0:05 | 0:03 | 1:16 | 0:14 |
| | Model 4 | 0:13 | 0:05 | Not supported | 0:15 | 0:18 | 0:03 | 0:01 | 0:09 |

Note. ML = maximum likelihood; MML = marginal maximum likelihood; BAEM = Bock–Aitkin expectation–maximization; ADQ = adaptive quadrature approach; MHRM = Metropolis–Hastings Robbins–Monro Method.
(a) Not worked as intended. Set to fix the covariance matrix values but EQSIRT did not fix them.

Like IRTPRO, EQSIRT has a complete point-and-click solution for user interfacing. The main interface of EQSIRT is well organized and highly flexible so that users can easily work on multiple projects simultaneously. The syntax editor is very user-friendly, with automatic color coding of the syntax commands. EQSIRT has its own proprietary format for data files, but users can still directly access fixed-format data without converting them. The main output is provided in both text format and .html format, with .xml format also available. Error messages provided in dialog boxes did not seem to be helpful in identifying or fixing problems in syntax and data. During this study, the program often stopped running without displaying any error messages.

Similar to Mplus, flexMIRT mostly has a syntax-based interface with no real graphical UI elements. flexMIRT has no feature to generate syntax commands; users must write syntax commands on their own. The program can natively handle data in space-, comma-, or tab-delimited formats but does not support fixed format. Users have options to save key results in separate output files. The program provides simple error messages that were useful for identifying and fixing problems with syntax and/or data.
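
Because flexMIRT reads delimited text but not fixed-format files, fixed-format data would need to be converted before calibration. A minimal Python sketch of such a conversion follows; the column positions, the sample records, and the function name `fixed_to_delimited` are hypothetical and would have to be adapted to the actual layout of a data file.

```python
import csv
import io


def fixed_to_delimited(lines, colspecs, delimiter=","):
    """Convert fixed-width records to delimited text.

    colspecs is a list of (start, end) character positions, 0-based,
    with the end position exclusive.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for line in lines:
        record = line.rstrip("\n")
        if not record:
            continue  # skip blank lines
        writer.writerow([record[start:end].strip() for start, end in colspecs])
    return out.getvalue()


# Hypothetical layout: a 4-character examinee ID, one blank column,
# then five single-character item responses.
COLSPECS = [(0, 4), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)]
sample = ["0001 10110", "0002 01101"]

print(fixed_to_delimited(sample, COLSPECS))
```

Passing `delimiter="\t"` or `delimiter=" "` to the same function would produce tab- or space-delimited output instead.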

All studied programs come with user manuals and output examples. Mplus offers the most comprehensive manual, including hundreds of examples for most of the models and analyses that the program is designed to handle, but its coverage of IRT modeling is very limited. Mplus has a large user forum where thousands of examples, discussions, and Q&A can be found. The software’s technical support is also strong; most questions are generally answered within 24 hr. The other programs (IRTPRO, EQSIRT, and flexMIRT) have much shorter histories in the field and fewer resources beyond the examples provided in the manual and technical support.

Availability and Price

Mplus and EQSIRT are available for Windows, Mac OS X, and Linux platforms; IRTPRO and flexMIRT are available only for Windows systems. IRTPRO must be run with administrator privileges every time; otherwise, it does not start. This can be a serious limitation for users who need to work on company-owned and company-maintained computers because most users in such work environments do not normally have administrator privileges. Versions built for 64-bit Windows systems are available for Mplus and flexMIRT.

Program prices vary significantly. For academic users, a single-user license costs US$595 for Mplus, US$495 for IRTPRO (for a single installation), and US$595 for EQSIRT. Student versions of Mplus and EQSIRT are available for less than US$200. flexMIRT does not offer a permanent license option; it is instead a subscription-based product costing US$125 per year (up to three installations) for academic users. All prices are subject to change. All programs also offer a wide range of pricing options for rental and commercial use, as well as trial versions with limited time use or limited features.
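
To put the subscription price in context, one can compute how many years of the flexMIRT academic subscription it takes for the cumulative cost to exceed each permanent academic license price quoted above. The sketch below is simple arithmetic in Python under strong assumptions: prices stay fixed at the figures reported here, and upgrade fees, installation limits, and student or commercial pricing are ignored.

```python
# Academic single-user prices (US$) as quoted in the text.
PERMANENT_LICENSE_USD = {"Mplus": 595, "IRTPRO": 495, "EQSIRT": 595}
FLEXMIRT_PER_YEAR_USD = 125


def years_until_exceeds(permanent_price: int, yearly_price: int) -> int:
    """Smallest whole number of subscription years whose total cost
    is strictly greater than the permanent license price."""
    return permanent_price // yearly_price + 1


for program, price in PERMANENT_LICENSE_USD.items():
    years = years_until_exceeds(price, FLEXMIRT_PER_YEAR_USD)
    total = years * FLEXMIRT_PER_YEAR_USD
    print(f"{program}: subscription total passes US${price} in year {years} (US${total})")
```

Under these assumptions, the subscription total first exceeds the IRTPRO price in year 4 and the Mplus and EQSIRT prices in year 5.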

Conclusion

All MIRT software packages that were evaluated and compared in this study are capable of estimating MIRT models. The results from the parameter recovery simulation study suggest that they all worked very well, with a high level of estimation accuracy (except for Model 2A, where the covariance matrix for latent variables was not identifiable unless additional constraints were imposed). Each program, however, has its own unique favorable features and shortcomings. Regarding limitations of the programs evaluated here, Mplus cannot calibrate models with lower asymptote parameters (i.e., pseudo-guessing). IRTPRO cannot natively handle text-based data formats, EQSIRT cannot handle data with missing responses, and flexMIRT has a syntax-only UI, which may not be easy for new learners. Therefore, it is important for users to understand the different advantages and limitations across the programs before they choose a MIRT software package for their purposes.

Authors’ Note

The views and opinions expressed in this article are those of the authors and do not necessarily reflect those of the Graduate Management Admission Council®.

Acknowledgments

The authors are grateful to Paula Bruggeman of Graduate Management Admission Council® for editorial review.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

1. At the time of this writing, the developers of EQSIRT are aware of the issue and are reportedly working to fix the problems.

2. This bug was reported to the software developers.

References

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika, 46, 443-459.

Cai, L. (2010a). High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins-Monro algorithm. Psychometrika, 75, 33-57.

Cai, L. (2010b). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35, 307-335.

Cai, L. (2013). flexMIRT: A numerical engine for flexible multilevel multidimensional item analysis and test scoring (Version 2.0) [Computer software]. Chapel Hill, NC: Vector Psychometric Group.

Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.

Han, K. T. (2007). WinGen: Windows software that generates IRT parameters and item responses. Applied Psychological Measurement, 31, 457-459.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Muthén, L. K., & Muthén, B. O. (1998-2012). Mplus user’s guide (7th ed.). Los Angeles, CA: Author.

Schilling, S., & Bock, R. D. (2005). High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature. Psychometrika, 70, 533-555.

Wu, E. J. C., & Bentler, P. M. (2013). EQSIRT: A comprehensive item response theory program [Computer software]. Encino, CA: Multivariate Software.

Information for Readers

Applied Psychological Measurement 2014, Vol. 38(8) NP2

© The Author(s) 2014

Reprints and permissions: sagepub.com/journalsPermissions.nav

DOI: 10.1177/0146621614554510

apm.sagepub.com

Corrigendum

Han, K. T., & Paek, I. (2014). A Review of Commercial Software Packages for Multidimensional IRT Modeling. Applied Psychological Measurement, 38(6), 486-498. Original DOI: 10.1177/0146621614536770.

1. On page 487, the statement “Mplus can handle a maximum of 500 variables (i.e., items), which might be too small for some large-scale tests with large item banks” was incorrect; it was based on older versions of Mplus. Recent versions (6.11 and up) no longer have this limitation.

2. On pages 492, 493, and 495, in Tables 1 and 2, the header of the last column should be “Cov(a2, a3), True = 0.35” and not “Cov(a2,a2), True = 0.35.”