POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS

By

OU ZHANG

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN EDUCATION

UNIVERSITY OF FLORIDA

2010

© 2010 Ou Zhang

To my Dad, who has supported me, believed in me, and encouraged me to start this long journey. He is my hero!

ACKNOWLEDGMENTS

I would like to express my sincere appreciation to Dr. M. David Miller, my

committee chair, for providing valuable guidance and continuous support. I would also

like to thank Dr. James J. Algina, my committee member, for sharing his ideas and

corrections on this project.

My deepest gratitude goes to my parents and my wife, Bei Li, for their constant

support and love. Thanks to my summer internship mentor, Dr. Feiming Li and Vice

President, Dr. Linjun Shen for giving me such a valuable opportunity to enter the

educational measurement industry. Thanks to my friend Yan Cao for her patience and

help over the years. Last, thanks go out to Dr. Andrich for his comments and suggestions.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
ABSTRACT

CHAPTER

1 INTRODUCTION
    1.1 Model Selection
    1.2 Survey of the Testlet Size in Applications of Testlet
    1.3 Purpose of the Study

2 LITERATURE REVIEW
    2.1 Item Response Theory
        2.1.1 IRT Assumptions
        2.1.2 One-Parameter Logistic Model (1-PL Model or Rasch Model)
        2.1.3 Polytomous Item Response Theory (IRT) Model-Partial Credit Model
        2.1.4 Testlet Model-Rasch Testlet Model
        2.1.5 Local Item Dependence
    2.2 Reliability
    2.3 Survey in Application of Testlet

3 METHOD
    3.1 Model Used to Generate Data
    3.2 Population Parameters
    3.3 Condition Manipulated
    3.4 Data Generation
    3.5 Parameter Estimation
    3.6 Ability Estimation
    3.7 Analysis
        3.7.1 Bias
        3.7.2 Root Mean Square Error (RMSE)
        3.7.3 Reliability

4 RESULTS
    4.1 MLE Non-convergence Issue
    4.2 Test Reliability
    4.3 Standard Error of Measurement
    4.4 Bias and RMSE
    4.5 An Empirical Case

5 DISCUSSION
    5.1 General Discussion
    5.2 Limitations and Suggestions for Future Research
    5.3 Conclusion

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1 Testlet size in the article reviews
2-1 The number of testlets in the dataset
2-2 Test length in the reviewed articles
2-3 Sample sizes in the reviewed articles
2-4 Fit indices in reviewed articles
2-5 Estimation method in reviewed articles
2-7 The number of simulation replication applied in the reviewed articles
3-1 Study design condition with 3 factors
4-1 MLE nonconvergence case and rate per condition-testlet size 3
4-2 MLE nonconvergence case and rate per condition-testlet size 5
4-3 Test reliability-testlet size 3 conditions
4-4 Test reliability-testlet size 5 conditions
4-5 Testlet size 3 the results of the Spearman-Brown prophecy
4-6 Testlet size 5 the results of the Spearman-Brown prophecy
4-7 Mean standard error of measurement for each condition (testlet size 3)
4-8 Mean standard error of measurement for each condition (testlet size 5)
4-9 Testlet size 3 Bias and RMSE of ability estimate recovery (EAP)
4-10 Testlet size 5 Bias and RMSE of ability estimate recovery (EAP)
4-11 Rasch testlet model (testlet size 3) bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-12 Partial credit model (testlet size 3) bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-13 Standard Rasch model (testlet size 3) bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-14 Rasch testlet model (testlet size 5) bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-15 Partial credit model (testlet size 5) bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-16 Standard Rasch model (testlet size 5) bias of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-17 Rasch testlet model (testlet size 3) RMSE of ability ($\theta$) estimate recovery
4-18 Partial credit model (testlet size 3) RMSE of ability ($\theta$) estimate recovery
4-19 Standard Rasch model (testlet size 3) RMSE of ability ($\theta$) estimate recovery
4-20 Rasch testlet model (testlet size 5) RMSE of ability ($\theta$) estimate recovery
4-21 Partial credit model (testlet size 5) RMSE of ability ($\theta$) estimate recovery (EAP) with 6 different ability intervals
4-22 Standard Rasch model (testlet size 5) RMSE of ability ($\theta$) estimate recovery with 6 different ability intervals
4-23 NBOME LEVEL-2 Block 1 Item WMSE
4-24 COMLEX-Level 2 2008 block-1 local item dependence detection results

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Master of Arts in Education

POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS

By

Ou Zhang

December 2010

Chair: M. David Miller

Major: Research and Evaluation Methodology

This study investigated the ability parameter recovery of three models in order to

detect the influence of local item dependence across testlet items when testlet sizes

are small. A simulation study compared three Rasch-type models: the standard Rasch

model, the partial credit model, and the Rasch testlet model. The results revealed that

both the partial credit model and the Rasch testlet model performed better than the

standard Rasch model when local item dependence existed within testlets. The results

also indicated that as the sample size increases, the discrepancies between the model

estimates and the real data set increase. The study concluded that using the polytomous

IRT model for testlet item analyses is still efficient for non-adaptive tests with small

testlet sizes. Moreover, for small testlet sizes, polytomous IRT models are more stable

than the Rasch testlet model when a test includes a large number of testlets. In sum,

the polytomous IRT model and the Rasch testlet model offer an advantage over the

standard Rasch model because they avoid underestimating the standard error of

measurement and yield better ability parameter estimates in small testlet size

situations.

CHAPTER 1

INTRODUCTION

Item response theory (IRT) models are commonly used in educational and

psychological testing. Employing item response theory allows for assessing latent

human characteristics and quantifying underlying traits. IRT rests on a major assumption:

local item independence. Local item independence (LID) assumes that items in the test are

unrelated to each other after controlling for the underlying trait. However, the LID

assumption is commonly violated in real-world applications. In fact, many real

world tasks require solving related problems or solving a single problem in stepwise

fashion. In accordance with such circumstances, the exam includes items within a

subset sharing a single content stimulus. The items sharing the same stimuli are

grouped as a unit, termed as an item bundle (Rosenbaum, 1988) or testlet (Wainer &

Kiely, 1987).

An item bundle or testlet, henceforth referred to as a testlet, is a scoring unit

within a test that is smaller than a test (Wainer & Kiely, 1987). Items within testlets are

locally dependent because they are associated with the same stimulus. Moreover, local

item dependence introduces unintended dimensions into the test at the expense of the

construct of interest (Wainer & Thissen, 1996). Thus, the challenge for the test developer

is not to eliminate the item dependencies, but rather to find a proper solution so that

such local item dependence does not impact the test reliability and the validity of

inferences from the test. More specifically, violating the assumption of local item

independence may lead to an underestimate of the standard errors and could result in

(a) bias in item difficulty estimates, (b) inflated item discrimination estimates, (c)

overestimation of the precision of examinee scores, and (d) overestimation of test

reliability and test information. This last result can lead to inaccurate inferences that

may result in a greater chance of misclassification when making decisions regarding

examinee ability categorization (Sireci, Thissen, & Wainer, 1991; Yen, 1993).

Therefore, several models have been proposed as solutions to the violation of the local

item independence assumption. One method is to treat the items in a testlet as a

single super polytomous item in the analysis (Sireci, Thissen, & Wainer, 1991; Thissen,

Steinberg, & Mooney, 1989; Wainer, 1995). This method leans heavily on Rosenbaum's

theorem of item bundles (Rosenbaum, 1988), using a polytomous IRT model to score

the locally independent testlets. The key idea is that the items that form each testlet

may have excessive local dependence, but that once the entire testlet is considered as

a single unit and scored polytomously these local dependencies may disappear. The

item scores are summed within each testlet. When the total scores in a testlet are

identical, they will be assigned to the same category. This method allows researchers to

score testlets polytomously. Once the summed item scores are obtained, testlet type

item responses are calibrated by applying polytomous item response models, such as

the Graded Response Model (Samejima, 1969), the Partial Credit Model (Masters,

1982), the Rating Scale Model (Andrich, 1978), or the Nominal Response Model (Bock,

1972). In using a polytomous IRT model to score testlets, the data can be analyzed

while maintaining local independence across different testlets. This approach avoids

overestimating the test reliability and information, so the statistics of the

polytomous IRT model consistently perform better than those of the standard Rasch model in

such circumstances.
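
As a minimal sketch of the summed-score step described above, the following Python fragment collapses dichotomous item scores into one polytomous score per testlet. The function name, the testlet labels, and the example response vector are hypothetical illustrations and are not taken from the data used in this study.

```python
def score_testlets(responses, testlet_of_item):
    """Collapse dichotomous item scores into one summed score per testlet.

    responses: list of 0/1 item scores for one examinee.
    testlet_of_item: testlet label for each item, or None for a stand-alone item.
    Returns a dict mapping each testlet label to its summed (polytomous) score.
    """
    totals = {}
    for score, testlet in zip(responses, testlet_of_item):
        if testlet is None:
            continue  # independent items keep their dichotomous scores
        totals[testlet] = totals.get(testlet, 0) + score
    return totals


# Hypothetical examinee: two 3-item testlets ("A", "B") and one independent item.
responses = [1, 0, 1, 1, 1, 1, 0]
layout = ["A", "A", "A", "B", "B", "B", None]
testlet_scores = score_testlets(responses, layout)
```

Under this scoring, the within-testlet patterns (1, 0, 1) and (0, 1, 1) both receive a testlet score of 2, so they fall into the same category of the polytomous item.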

However, this approach has some weaknesses when it is applied to testlet-type

data sets. Some major shortcomings of polytomous IRT models have been

discussed (Thissen, Billeaud, McLeod, & Nelson, 1997; Yen, 1993; Wainer & Wang,

2000). First, when polytomous IRT models are applied, some test information, namely the

precise pattern of responses the examinee generates, is lost. Second, some

parameters are dropped from the polytomous model compared to individual

dichotomous item scoring. Third, the approach is inappropriate if the test is administered adaptively.

Finally, the test reliability might be underestimated (Yen, 1993). Wainer

(1995) claimed that using a polytomous IRT model to manage testlets might be

appropriate when the local dependence between items within a testlet is moderate and

the testlet-type items only take a small proportion of the entire test.

The other method, the testlet model (Wainer & Kiely, 1987), is explicitly introduced

as an alternative to the polytomous IRT model and attempts to solve the same problem.

IRT testlet models have been proposed in which a random effect parameter is added to

model the local dependence among items within the same testlet (Bradlow, Wainer, &

Wang, 1999; Wainer, Bradlow, & Du, 2000; Wang, Bradlow, & Wainer, 2002). As one

random effect parameter is added to the model, an additional latent trait is also added to

the testlet model. Thus, the testlet model proposed by Wainer and Wang (2000) is a

special case of a multidimensional IRT model (MIRT).

Proposed by Wang and Wilson (2005b), the Rasch testlet model is a special case

of the testlet model (Wainer & Wang, 2000) that combines features of the Rasch

model and the testlet model. In so doing, it makes use of several desirable measurement

and psychometric properties of the Rasch model (Wang & Wilson, 2005). First, the

Rasch model has observable sufficient statistics for the model parameters and a

relatively small sample size requirement for parameter estimation. Second, no

distributional assumption on the item parameters is necessary in Rasch models since

the items are treated as fixed effects. Therefore, the Rasch model is widely applied in

testing and scoring. Because of such advantages of the Rasch model, Wang and

Wilson (2005a, 2005b) showed that it is possible to model locally dependent items in

relation to testlets by using a Rasch testlet model so that more precise and adequate

estimates are obtained. The Rasch testlet model is thus a special case of the testlet model

(Wainer & Wang, 2000).

Before the testlet model was proposed, polytomous IRT models were the primary

method to analyze testlets. Currently, both approaches are widely used for testlet

analyses, and both approaches undoubtedly have pros and cons. The theoretical

reason to choose the testlet model (Wainer & Wang, 2000) over the polytomous IRT model

in testlet analysis might seem obvious. However, some potential caveats of the testlet

model should also be considered. First, the testlet model is more complex than both the

standard IRT model and the polytomous IRT model because it adds more “testlet

parameters”. Second, when one testlet parameter is added in the model, an additional

latent trait is also added in the model so that multidimensionality occurs and results in

increased complexity of analysis. Thus, the model analysis process is extremely

prolonged and some potential issues emerge (e.g. sometimes the model calibration fails

to converge). Therefore, the benefits of using the testlet model (Wainer & Kiely, 1987)

should be weighed against the added complexity in data analysis.

1.1 Model Selection

Although Wainer and Wang (2000) addressed advantages of the testlet model

over the polytomous model in applied testlet analyses, it remains important to compare

these two models under various conditions. Pitt, Kim, and Myung (2003) indicated that

the goal of model selection was not just to find the model that provides the maximum fit

to a given data set, but to identify a model from a set of competing models, that best

captures the characteristics or trends underlying the cognitive process of interest.

Briefly, the best model is the model that matches the purpose of the study and can

explain all of the important features of the actual data without adding unnecessary

complexity.

The model's analysis efficiency is the other issue that must be considered. Two

realistic circumstances should be noted in model selection for analyzing testlets. First is

that of testlet size. In previous testlet research, in order to obtain illustrative results to

support hypotheses, testlet sizes were usually set from 5 to 10 or more items (e.g.,

Adams, Wilson, & Wang, 1997; Wang & Wilson, 2005; Brandt, 2008; Wainer & Wang,

2000; Wainer & Lewis, 1990). Small and medium testlet sizes (2-4 items) were rarely

applied (Ip, Smits, & De Boeck, 2009; Tokar, Fischer, Snell, & Harik-Williams, 1999;

DeMars, 2006). This is potentially problematic because in some exams, such as the National

Board of Osteopathic Medical Examiners' (NBOME) Comprehensive Osteopathic

Medical Licensing Examination (COMLEX-USA), testlet sizes are often small.

The second issue to consider is that of non-adaptive tests. Non-adaptive tests are still

widely used in the educational and psychological measurement field. Because of the

small testlet sizes and non-adaptive features, the loss of response pattern information is

not that serious for this kind of test. Accordingly, the comparison between the two

models matters most when the aforementioned shortcomings of applying the

polytomous IRT model in testlet analysis are minimal. In addition, the local

dependence effect ($\gamma_{d(i)}$) within testlets varies. Since the local dependence effect ($\gamma_{d(i)}$)

is not modeled explicitly when polytomous IRT models are applied, the extent to which the local

dependence effect ($\gamma_{d(i)}$) influences the fit of the polytomous model in testlet analysis

should also draw attention.

1.2 Survey of the Testlet Size in Applications of Testlet

Very little research has focused on model comparison between the polytomous

IRT model and the testlet model initially proposed by Wainer and Wang (2000)

regarding model fit, ability parameter recovery, and test reliability as testlet conditions

change, especially when testlet size and local dependence effect ($\gamma_{d(i)}$) are at a

medium level. A review of the literature to identify the application of testlets was

conducted in the EBSCOhost and PsycINFO databases for the keywords “testlet” and

“testlets” to identify studies that included testlet characteristics in the test between 1989

and 2009. A total of fifty-five articles relevant to testlets were found and reviewed

(see the reference list in Appendix B). Among all fifty-five testlet-related articles,

forty-five articles have specific descriptions regarding the factors that could influence the

testlet analysis in testlet research designs (i.e. testlet size, the number of testlets within

a test, sample size, etc.). The remaining ten articles, which include two book reviews,

conceptually describe testlet theory and application.

Issues of testlet size within the testlet have been well documented in the literature.

In these forty-five testlet-relevant articles, only four articles solely applied small testlet

size designs (i.e. testlet size smaller than five). The other forty-one articles have a

mixture of testlet size designs, although most of the articles included moderate and

large testlet size designs (testlet size larger than 5) in their research. Of these forty-one,

there were twelve articles that considered the small testlet size designs. Over thirty-five

articles included the testlet sizes between 5 and 10 and twelve articles included large

testlet size conditions, larger than 10. Overall, 16 articles (35.6%) investigated small

testlets and only 12 compared the small and medium testlet sizes. The detailed results

are shown in Table 1-1.

In sum, this study adds to this literature by investigating the results of three

different models of testlet-type data under the small and medium testlet size

circumstances (i.e., testlet size smaller than or equal to 5). Testlet size, local dependence

effect, sample size, and the ratio of testlet/independent items are factors in this study.

We examine model fit, test reliability, and the ability parameter recovery of the three

different models (i.e., Rasch model, Partial Credit model, and Rasch testlet model)

employed in a testlet-type data analysis.

1.3 Purpose of the Study

In accordance with previous testlet research, one of the research purposes

inherent to this study is exploring the consequences of variation in testlet size and local

dependence effects on test reliability, standard error of measurement, and ability

parameter recovery of the standard Rasch model, the Partial Credit model, and the

Rasch testlet Model. By looking for the trend of how changes in testlet factors (i.e.

testlet size, local dependence effect, sample size, testlet/independent item ratio) affect

different models' estimates and the test reliability corresponding to the models, a

guide for model selection is expected to emerge.

The other essential goal of this study is to determine which model performs the

best at person ability parameter recovery by considering the trade-off of the test

reliability and analysis complexity. Answers to these questions will provide

evidence that researchers interested in applying IRT models can use as a reference to

analyze tests appropriately. Furthermore, since we use data from the NBOME

COMLEX-USA examination, it will provide guidance for future improvements in the

estimation of this exam.

Table 1-1. Testlet size in the article reviews

Testlet size (m)   m <5     5<m<10   11<m<15   16<m<20   21<m<25   m>25
Articles           16       35       6         2         3         1
Proportion         35.56%   77.78%   13.33%    4.44%     6.67%     2.22%

Note: m is the number of items in the testlet
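
The proportions in Table 1-1 can be checked by dividing each count by the forty-five articles that described their testlet designs. The short Python check below assumes that denominator; the variable names are illustrative only. Because a single article may use several testlet sizes, the counts add up to more than forty-five and the proportions to more than 100%.

```python
TOTAL_ARTICLES = 45  # articles with specific testlet design descriptions

# Counts per testlet size category, as reported in Table 1-1.
counts = {
    "m <5": 16,
    "5<m<10": 35,
    "11<m<15": 6,
    "16<m<20": 2,
    "21<m<25": 3,
    "m>25": 1,
}

# Percentage of the 45 articles that include each testlet size category.
proportions = {label: round(100 * n / TOTAL_ARTICLES, 2) for label, n in counts.items()}
```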

CHAPTER 2

LITERATURE REVIEW

In this section, the theoretical framework of this research is given. Several

important parts are included: IRT, IRT assumptions, the IRT models used in this

research, local item dependence, and test reliability.

2.1 Item Response Theory

Item Response Theory (IRT), proposed by Lord (1952), is a family of statistical

models for analyzing item responses in a population of individuals. It depicts the

relationship between examinees and items through mathematical models (Wainer &

Mislevy, 2000). Many mathematical models can be developed within the IRT

framework. There are two general types of IRT models, dichotomous IRT models and

polytomous IRT models. Dichotomous IRT models are used to model items scored only as

correct or incorrect. The One-Parameter Logistic (1PL), Two-Parameter

Logistic (2PL), and Three-Parameter Logistic (3PL) IRT models are three common

dichotomous IRT models.

Items with more than two response options can be modeled with polytomous IRT

models. Examples of previously proposed

polytomous IRT models include the Graded Response model (GRM; Samejima, 1969),

the Rating Scale model (RSM; Andrich, 1978), the Partial Credit model (PCM;

Masters, 1982), the generalized Partial Credit model (GPCM; Muraki, 1992), and the

Nominal Response model (NRM; Bock, 1972). A notable advantage of IRT over

classical test theory is the invariance of item and ability parameters

(Hambleton, Swaminathan, & Rogers, 1991). According to this invariance feature, item

parameters (e.g., difficulty, discrimination and guessing) are not dependent on the

ability distribution of any particular group of examinees and the examinee ability

parameters (θs) are not dependent on a specific set of test items.

2.1.1 IRT Assumptions

Two essential a priori assumptions are held by Item Response Theory. The first

assumption of IRT is local item independence: the probability of a correct response to

one item is independent of the responses to other items. Local item independence means that the item

responses are independent for a given value of the latent trait $\theta$. The joint probability of a

response pattern for all items in the test is the product of the probabilities of correct

responses to the items for a given latent trait $\theta$:

$$P(\mathbf{X} = \mathbf{1} \mid \theta) = \prod_{i=1}^{N} P(X_i = 1 \mid \theta) \tag{2-1}$$

where N is the total number of items.

The second assumption of most general IRT models (e.g., the 1PL, 2PL, and 3PL

models) is unidimensionality. Early notions of IRT require that the same construct

should be measured by all test items (Loevinger, 1947). As such, all items in the test

only measure a single latent trait (Hambleton & Murray, 1983; Lord, 1980).

2.1.2 One-Parameter Logistic Model (1-PL Model or Rasch Model)

The Rasch model (Rasch, 1960) is the simplest of unidimensional models. The

Rasch model predicts the probability of success for person j on item i and can be given

by the formula:

$$P(y_{ij} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)} \tag{2-2}$$

where

$y_{ij}$ is examinee $j$'s response to item $i$;

$\theta_j$ is examinee $j$'s proficiency level;

$b_i$ is the difficulty parameter of item $i$, which indicates the point on the ability

continuum at which an examinee has a 50% probability of answering item $i$ correctly; and

$P(y_{ij} = 1 \mid \theta_j, b_i)$ is the probability that examinee $j$ answers item $i$ correctly, given

proficiency level $\theta_j$.

An assumption that is implicit in the model is that all items have the same

discrimination value.
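
To make Equations 2-1 and 2-2 concrete, the sketch below evaluates the Rasch response probability and, under the local independence assumption, the joint probability of a response pattern as a product over items. The function names and the numeric values are illustrative assumptions; the pattern function also handles incorrect responses by using 1 minus the success probability, which goes slightly beyond the all-correct case written in Equation 2-1.

```python
import math


def rasch_probability(theta, b):
    """P(y = 1 | theta, b) under the Rasch model (Equation 2-2)."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))


def pattern_probability(theta, difficulties, responses):
    """Joint probability of a 0/1 response pattern, assuming local independence."""
    prob = 1.0
    for b, y in zip(difficulties, responses):
        p = rasch_probability(theta, b)
        prob *= p if y == 1 else 1.0 - p  # multiply item-level probabilities
    return prob


# Illustrative values: an examinee whose ability equals the item difficulty
# has a 50% chance of a correct response.
p_equal = rasch_probability(0.5, 0.5)
p_both_correct = pattern_probability(0.0, [0.0, 0.0], [1, 1])
```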

2.1.3 Polytomous Item Response Theory (IRT) Model-Partial Credit Model

In this study, for comparison with the standard Rasch model, the Partial Credit

model was selected as the polytomous IRT model. The Partial Credit model (PCM;

Masters, 1982) was originally developed for analyzing test items that require multiple

steps and for which it is important to assign partial credit for completing several steps in

the solution process. This model was designed to be used when partial credit can be

awarded for degrees of success. The PCM is a divide-by-total or “direct” IRT model.

The Partial Credit model can be considered as an extension of the Rasch Model and

has all the standard Rasch model features.

The equation for the partial credit model is shown below,

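$$P_{jix}(\theta) = \frac{\exp\left[\sum_{k=0}^{x} (\theta_j - \delta_{ik})\right]}{\sum_{r=0}^{m_i} \exp\left[\sum_{k=0}^{r} (\theta_j - \delta_{ik})\right]} \tag{2-3}$$

where

item $i$ is scored $x = 0, \ldots, m_i$ for an item with $K_i = m_i + 1$ response categories;

$\delta_{ik}$ ($k = 1, \ldots, m_i$) is called the item step difficulty; it is associated with a category

score of $k$, and $x$ is the response category of interest;

$\theta_j$ is examinee $j$'s proficiency level;

$P_{jix}(\theta)$ is the probability that examinee $j$ responds to item $i$ in category $x$,

given proficiency level $\theta_j$; and

$$\sum_{k=0}^{0} (\theta_j - \delta_{ik}) \equiv 0 \tag{2-4}$$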
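
The divide-by-total structure of Equations 2-3 and 2-4 can be sketched in a few lines of Python. This is an illustration under the standard partial credit model convention that the sum for category 0 is zero; the function name and the step difficulty values are hypothetical.

```python
import math


def pcm_probabilities(theta, step_difficulties):
    """Category probabilities for one item under the partial credit model.

    step_difficulties: [delta_i1, ..., delta_im] for an item scored 0..m.
    Returns a list of m + 1 probabilities, one per category x = 0..m.
    """
    cumulative = [0.0]  # category 0: the sum is defined as 0 (Equation 2-4)
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    numerators = [math.exp(c) for c in cumulative]
    total = sum(numerators)  # denominator of Equation 2-3
    return [n / total for n in numerators]


# Hypothetical 3-item testlet treated as one item scored 0..3.
probs = pcm_probabilities(0.0, [-0.5, 0.0, 0.5])
```

With a single step difficulty the function reduces to the Rasch model, which is consistent with the description of the PCM as an extension of the Rasch model.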

2.1.4 Testlet Model-Rasch Testlet Model

To model examinees‟ responses to testlet items, IRT testlet models have been

proposed in which a random effect parameter is added to model the local dependence

among items within the same testlet (Bradlow, Wainer, & Wang, 1999; Wainer, Bradlow,

& Du, 2000; Wang, Bradlow, & Wainer, 2002). Following this general approach, a

simplified testlet model was generated by Wang and Wilson (2005) and it can be written

as

$$P_{ji1} = \frac{\exp(\theta_j - b_i + \gamma_{jd(i)})}{1 + \exp(\theta_j - b_i + \gamma_{jd(i)})} \tag{2-5}$$

where $P_{ji1}$ is the probability that examinee $j$ answers item $i$ correctly (scoring 1);

$\theta_j \sim N(0, 1)$ is the ability of examinee $j$;

$b_i \sim N(\mu_b, \sigma_b^2)$ is the difficulty of item $i$; and

$\gamma_{jd(i)} \sim N(0, \sigma^2_{\gamma_{d(i)}})$ is a random effect that represents the interaction of person $j$

with testlet $d(i)$ (i.e., testlet $d$ that contains item $i$).

2.1.5 Local Item Dependence

As we mentioned before, local item independence (LID) is the first a priori

assumption of IRT models. It means that the item responses are conditionally

independent given the latent trait. Therefore, there should not be any correlation

between two items after controlling for the underlying trait. The items should only be

correlated through the latent trait that the test is measuring (Lord & Novick, 1968).

However, this LID assumption is nearly always violated in real applications. Sometimes,

significant correlation among items remains after controlling for the effect of the latent

trait. Because of these significant correlations, the items are locally dependent or there

is a subsidiary dimension in the measurement that is not accounted for by the

overarching dimension trait. Locally dependent items are always the cause of

information loss for IRT models (Chen & Thissen, 1997).

Several indices have been proposed to detect local item dependence for

dichotomous item response models. Yen (1984, 1993) introduced the $Q_3$ statistic by

comparing it with other traditional measures: $Q_1$ (Yen, 1981), $Q_2$ (Van den Wollenberg,

1982), and signed $Q_2$ (Van den Wollenberg, 1982). The $Q_3$ statistic is the inter-item

correlation between item pairs once the effect of the latent trait is removed. Although the

$Q_3$ statistic has been commonly used for several years, it has two major deficiencies in

applied settings. First, the $Q_3$ statistic requires a latent trait computation prior to

calculating the item pair residual correlation. Second, the entire set of test data must be

applied to compute the $Q_3$ statistic. Therefore, Chen and Thissen (1997) proposed four

innovative LID indices to compute the expected frequency from IRT models. The

24

calculation of these four local dependence indices uses a subset of items without using

the estimates. These four LID indices are Pearson‟s 2X , Likelihood ratio2G ,

Standardized coefficient difference, and Standardized log-odds ratio difference .

These four indices are defined for a pair of items.
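The $Q_3$ idea described above (a correlation between item residuals after the latent trait is removed) can be sketched as follows. This is an illustration under stated assumptions, not the computation used in the thesis: it assumes a Rasch response function and that ability estimates and item difficulties are already available, and all function names are hypothetical.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((c - my) ** 2 for c in y)
    return sxy / math.sqrt(sxx * syy)


def q3(responses_i, responses_k, thetas, b_i, b_k):
    """Correlation of residuals (observed minus model-expected) for an item pair."""
    resid_i = [x - rasch_prob(t, b_i) for x, t in zip(responses_i, thetas)]
    resid_k = [x - rasch_prob(t, b_k) for x, t in zip(responses_k, thetas)]
    return pearson(resid_i, resid_k)
```

The sketch makes the first deficiency noted above visible: `q3` cannot be evaluated until a trait estimate exists for every examinee.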

Ponocny (2001) proposed a general family of conditional nonparametric tests to detect differences between groups and items for Rasch models, including tests of the items' local stochastic independence (e.g., $T_1$). By creating a two-by-two table for two items, the comparison can be subjected to the standard $\chi^2$-test or Fisher's exact test (Fischer, 1974). This test is able to detect differences in covariance between item pairs. The test of the local independence assumption can be conducted via a suitable contingency table (Ponocny, 2001). The general family of conditional nonparametric tests is embedded in the following extension of the Rasch model:

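$$L(A \mid \theta, \beta) = \frac{\exp\left(\sum_{v=1}^{n} r_v \theta_v - \sum_{i=1}^{k} s_i \beta_i + \delta\, T(A)\right)}{\prod_{v=1}^{n} \prod_{i=1}^{k} \left(1 + \exp(\theta_v - \beta_i)\right)} \qquad (2\text{-}6)$$

where $\theta_v$ is the examinee's ability parameter;

$r_v$ is the examinee's raw score;

$s_i$ is the item marginal sum for the item difficulty parameter $\beta_i$ ($i = 1, \dots, k$).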

The random variable $T$ is a sufficient statistic for the parameter $\delta$, which expresses a certain violation of the Rasch model (Ponocny, 2001). Based on the conditional nonparametric tests from Ponocny (2001), the local item dependence is demonstrated by the inter-item correlation between item $I_i$ and item $I_j$ ($i \neq j$; $i, j \le K$). The inter-item correlation is based on the ($2 \times 2$)-table by calculating the cases with equal responses on both items (Ponocny, 2001). The statistic $T_1(A)$ is applied for the local item dependence detection as below:

$$T_1(A) = \sum_{v} \delta_{x_{vi} x_{vj}} \qquad (2\text{-}7)$$

where $\delta_{x_{vi} x_{vj}}$ indicates the Kronecker symbol with $\delta_{x_{vi} x_{vj}} = 1$ for $x_{vi} = x_{vj}$ and $\delta_{x_{vi} x_{vj}} = 0$ otherwise. Then, a goodness-of-fit test is conducted to compare the model-implied estimates with the observed value from the matrix for two specific items. The sum of the $T_1$'s over the item pairs serves as a test statistic when two or more item pairs are investigated simultaneously (Ponocny, 2001). In addition, an overall test statistic ($T_{11}$) for the local dependence of the test is given by summing the absolute deviations of all inter-item correlations $r_{ij}$ in the test from their expected values $\rho_{ij}$. The test statistic $T_{11}$ is shown below (Ponocny, 2001):

$$T_{11}(A) = \sum_{ij} \left| r_{ij} - \rho_{ij} \right| \qquad (2\text{-}8)$$
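The observed value of the $T_1$ statistic for one item pair is just a count of examinees who give the same response to both items. The sketch below shows only that count; the reference distribution in Ponocny's procedure (obtained from response matrices with the same margins) is not implemented here, and the function name is hypothetical.

```python
def t1_statistic(matrix, i, j):
    """Number of examinees (rows) with equal responses on items i and j."""
    return sum(1 for row in matrix if row[i] == row[j])


# Rows are examinees, columns are dichotomously scored items.
example = [
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
t1_first_pair = t1_statistic(example, 0, 1)
```

An unusually large count relative to what the Rasch model implies for the pair would point toward local dependence between the two items.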

2.2 Reliability

In educational measurement, reliability is a statistical index to quantify and

evaluate the consistency of test scores. If the local item independence assumption is

violated, the measurement errors are underestimated so as to give an inflated reliability

estimate. The circumstances where the local item independence assumption is violated

commonly occur in testlets. The test construct is subject to the impact of measurement

errors that are not related to the latent traits the test construct intends to measure. Thus,

these measurement errors determine how reliably the test measures the construct. Test

reliability has been consistently mentioned in previous testlet research. A concern about

test reliability was expressed regarding the creation of super polytomous items to

manipulate testlets (Keller, Swaminathan, & Sireci, 2003). The approach that treats

these testlets as polytomous items may lose the information contained in the response

pattern so that the measurement errors may increase and reduce the overall test

reliability (Keller et al., 2003). In addition, compared to the original dichotomous items,

some parameters are dropped when the polytomous items are formed so the test

reliability may decrease (Zenisky, Hambleton & Sireci, 2002). Yen (1993) also claimed

that, when items are combined into testlet scores and some of the items within a testlet

are locally dependent, the reliability will be underestimated. Thus, the comparison of the

test reliabilities among the three models is especially necessary for model selection.

2.3 Survey in Application of Testlet

The review of the applied testlet literature in the EBSCO Host and PsycINFO databases also identified other possible factors that affect the application of testlet models. The testlet/independent item ratio within a test, in terms of the number of testlets, is another important factor in testlet research. Among the forty-one articles in which testlet numbers are specified, the general mean of the testlet numbers, including sub-conditions within each article, was 7.9 and the standard deviation was 6.70. The largest testlet number design was fifty, which occurred in Wainer and Wang's (2000) article. One other study contained a large number of testlets in its research design: Tokar, Fischer, Snell, and Harik-Williams (1991) included twenty testlets. Except for these two large testlet number designs, all the other articles (39) contained three to fifteen testlets (e.g., Wainer & Lewis, 1990; Thissen, Steinberg & Mooney, 1989; Wang, Cheng & Wilson, 2005; Wainer, 1995; Yang & Gao, 2008). This range gave clear guidance for this study's research designs. The detailed information on testlet numbers used in previous testlet studies is presented in Table 2-1.

Based on the same literature review, forty-three out of fifty-five studies identified

test lengths in their research designs. Of all available forty-three articles, the distribution

of test length ranged from 13 to 899. The mean test length (64.74) was obtained by first

removing the largest test length (i.e. 899) from Wainer and Wang's article (2000);

summing the remaining test lengths and dividing by the number of articles in which the

test length was included in the design (Table 2-2).

The research sample size is the third factor that can influence the analysis of testlets. In the testlet application literature, thirty-seven articles identified the sample size, with a mean of 2047.22 and a standard deviation of 2105.86. Since some studies include extremely large sample sizes (i.e. 8912 in Brandt's 2008 article and 8494 in Zenisky, Hambleton & Sireci's 2002 study), the median sample size of 681 may be more illustrative. The range of the sample sizes provides a guideline for our research design. First, in twelve out of the thirty-seven articles reviewed, researchers included sample sizes smaller than 500 (e.g., Adams, Wilson & Wang, 1997; Wang, 2005; Schmitt, 2002). Second, eighteen articles included sample sizes between 500 and 1000 (e.g., Adams, Wilson & Wang, 1997; Stark, Chernyshenko & Drasgow, 2004). Finally, twenty studies included sample sizes larger than 1000 (e.g., Brandt, 2008; Wainer & Wang, 2000; Thissen, Steinberg & Mooney, 1989). Table 2-3 details the information on sample sizes used in previous studies.

Seventeen studies used the RMSE and loglikelihood ratio coefficient as the fit criteria (e.g., Stark, Chernyshenko & Drasgow, 2004; DeMars, 2006; Armstrong, 2004). The next most commonly used criteria were the reliability coefficient and bias (used by nine and five papers, respectively) (e.g., Stark, Chernyshenko & Drasgow, 2004; DeMars, 2006; Armstrong, 2004; Davis, 2003; Schmitt, 2002). Various other indices (e.g. AIC, WMSE, RMSEA, NNFI, CFI, GFI, Q3, RMS, etc.) were used in twenty studies (e.g., Gessaroli & Folske, 2002; Schmitt, 2002; Adams, Wilson & Wang, 1997). Clearly, most researchers relied on the RMSE and loglikelihood ratio coefficient to compare the model fit and parameter estimates. Table 2-4 gives detailed information on the fit criteria used.

Finally, the estimation methods were specified in twenty-nine studies.

Twenty-four of these articles applied the Marginal Maximum Likelihood (MML) method

(e.g., Lee, 2006; Wang & Wilson, 2005; Wainer, 1995). Only five articles used the

Markov Chain Monte Carlo (MCMC) method (e.g., Li, 2005; Li, 2006; Wang, 2002;

Wainer & Wang, 2000). The data analysis iterations were only acknowledged in eleven

articles (e.g., Lee, 2000; Ip, Smits & De Boeck, 2009; Stark, Chernyshenko & Drasgow,

2004). Among these eleven articles, five of them applied 100 iterations (e.g. Stark,

Chernyshenko & Drasgow, 2004; DeMars, 2006) and only two articles applied even

more (200 and 600) iterations (Li, 2006; Zwick, 2002). Tables 2-5 and 2-6 include

detailed information on the estimation method and iteration times used for all the studies

reviewed.

Table 2-1. The number of testlets in the dataset

Article   Testlet numbers    Testlet number mean/article
1         2, 3               2.5
2         4, 8               6.0
3         5                  5
4         6                  6
5         4                  4
6         15                 15
7         4                  4
8         5                  5
9         5                  5
10        5                  5
11        5, 10              7.5
12        7                  7
13        4, 8, 10           7.3
14        6                  6
15        6, 10              8.0
16        4                  4.0
17        2                  2.0
18        9, 16, 14, 11      12.5
19        8                  8.0
20        16                 16.0
21        8, 9               8.5
22        5, 10              7.5
23        6, 3, 2            3.7
24        4, 5, 7, 8         6.0
25        4                  4.0
26        11                 11.0
27        5, 7, 8            6.7
28        5, 7, 8            6.7
29        50, 36             43.0
30        7                  7.0
31        4, 5, 7, 8         6.0
32        3, 6               4.5
33        20                 20.0
34        4, 5, 9            6.0
35        5                  5.0
36        2, 5, 6, 10        5.8
37        4                  4.0
38        4                  4.0
39        10                 10.0
40        10                 10.0
41        10                 10.0

Testlet number general mean: 7.9
SD of mean: 6.70

Table 2-2. Test length in the reviewed articles

Article   Test length                                    Mean length
1         64                                             64
2         194                                            194
3         76                                             76
4         120                                            120
5         22                                             22
6         50                                             50
7         60                                             60
8         125                                            125
9         30, 33, 24, 40, 47, 36, 40, 42, 50, 51, 54     40.64
10        25, 50, 15                                     30
11        150                                            150
12        20                                             20
13        13, 17, 18                                     17.50
14        30, 50                                         40
15        60                                             60
16        64                                             64
17        60, 90                                         75
18        35, 41, 35, 35                                 36.50
19        55                                             55
20        101                                            101
21        55, 63                                         59
22        150                                            150
23        30                                             30
24        38, 26, 46, 56, 40                             41.20
25        40                                             40
26        50                                             50
27        49, 33, 36                                     39.33
28        49, 36, 43, 33                                 40.25
29        690, 290
30        42                                             42
31        38, 46, 26, 30, 43, 43                         37.67
32        60                                             60
33        60                                             60
34        44, 57, 26, 33, 44                             40.80
35        60                                             60
36        20                                             20
37        60, 125                                        92.50
38        30                                             30
39        75                                             75
40        75                                             75
41        20                                             20
42        120                                            120
43        137                                            137

General mean: 64.77


Table 2-3. Sample sizes in the reviewed articles Article

No. sample size in paper sample size mean/paper

sample size <500

500 < sample size <1000

sample size >1000

set 1 set 2 set 3 set 4 set 5 set 6

1 700 300

500.00 1 1 2 200 500

350.00 1 1

3 8912

8912

1 4 3866

3866

1

5 1210 589 352

717.00 1 1 1 6 4000

4000

1

7 500 1000 2000 5000

2125.00 1 1 1 8 1000

1000

1

9 500 2000 8000

3500.00 1 1 10 2000 5000

3500.00

1

11 2000

2000

1 12 5000

5000

1

13 300 500

400.00 1 1 14 1000 1392

1196.00

1 1

15 2000

2000

1 16 570 499 522 495

521.50 1 1

17 1000

1000

1 18 8026 8494

8260.00

1 1

19 3000 1000 5000

3000.00 1 1 20 1000 266

633.00 1 1

21 1996

1996

1 22 466

466 1

23 663 632 537 680 653 561 621.00

1 24 663 632 537 680 653 561 621.00

1

25 544

544

1 26 1000

1000

1

27 985 629 914 682 666 1000 812.67

1 28 1000

1000

1

29 485

485 1 30 3000

3000

1

31 100

100 1 32 1040

1040

1

33 4028

4028

1 34 4028

4028

1

35 500

500

1 36 10 15 25 50

25.00 1


37 3000

3000

1 General mean

2047.22 12 18 20

Standard Deviation

2105.86 32.43% 48.65% 54.05%

General median 681

Table 2-4. Fit indices in reviewed articles

Articles     Bias     RMSE     Reliability    Loglikelihood   WMSE    AIC     Other
                               coefficient    ratio test                      index
42           5        17       9              17              2       1       20
Percentage   11.90%   40.48%   21.43%         40.48%          4.76%   2.38%   47.62%

Table 2-5. Estimation method in reviewed articles

Estimation method   MML      MCMC     Total
Articles            24       5        29
Percentage          82.76%   17.24%

Table 2-7. The number of simulation replications applied in the reviewed articles

Replication number   10      100      200      600      1000    Total
Frequency            1       5        2        2        1       11
Percentage           9.09%   45.45%   18.18%   18.18%   9.09%

CHAPTER 3 METHOD

A comprehensive review of the testlet research from 1989 to 2009 provides us a

systematic framework for exploring the performance of three different IRT models to

analyze testlets. These three models will be a part of two studies presented in this

paper. The first is a series of simulation studies designed to investigate the extent to which fluctuations in testlet conditions (testlet size, local dependence effects, etc.) influence the model fitting results. Simulations are conducted to evaluate the model fit, test reliability, and parameter recovery of the three IRT models. Next, a real data analysis of the COMLEX-USA exam dataset is presented, in which the different models are fitted as an empirical case. The three one-parameter IRT models adopted in the study are the Rasch model, the Partial Credit Model, and the Rasch testlet model.

3.1 Model Used to Generate Data

The current study evaluates the effect of changes in the local effect of testlets on the model fit, ability parameter recovery, and test reliability of three different IRT models. In order to quantify the extent of the local effect, the Rasch testlet model is appropriate for simulating the research data. The Rasch testlet model (Wang & Wilson, 2005) includes a testlet parameter ($\gamma_{d(i)j}$), which is the random effect capturing the interaction of person $j$ with testlet $d(i)$ when the overarching latent trait is held constant. According to the definition of the testlet, the sum of testlet parameters ($\gamma_{d(i)j}$) over examinees within any testlet is zero ($\gamma_{d(i)j} \sim N(0, \sigma^2_{\gamma_{d(i)}})$). Thus, the local effects of testlets in the Rasch testlet model are simulated from the normal distribution with a mean of zero and a standard deviation equal to the square root of the given local effect value ($\sigma^2_{\gamma_{d(i)}}$). The following prior model constraints are used to simulate the responses.

With $v = 1, \dots, V$ and $V$ the total number of examinees,

$$\sum_{d=1}^{D} \gamma_{d(i)v} = 0 \quad \text{for all } v = 1, \dots, V \qquad (3\text{-}1)$$

$$\rho(\theta_v, \gamma_{d(i)v}) = 0 \quad \text{for all } d = 1, \dots, D \qquad (3\text{-}2)$$

$$\rho(\gamma_{d(i)v}, \gamma_{d(j)v}) = 0 \quad \text{for all } d = 1, \dots, D \qquad (3\text{-}3)$$

3.2 Population Parameters

Population item parameters for the Rasch testlet model and population ability parameters are simulated from the normal distribution with a mean of zero and a standard deviation of one. The ability parameters ($\theta_j \sim N(0, 1)$) lie within a range from negative three to positive three ($[-3, 3]$). For each condition, the population item difficulty parameters are likewise generated with a mean of zero and a standard deviation of one ($b_i \sim N(\mu_b = 0, \sigma_b^2 = 1)$), with a range of $[-3, 3]$. For simplicity, all simulated population parameters are rounded to three decimal places. The population item parameters and population ability parameters are randomly drawn from these two normal distributions ahead of each condition.

3.3 Condition Manipulated

In this study, we examine whether fluctuations of testlet size, local dependence effects, and item difficulty within testlets affect the reliabilities and the model fit of three different IRT models. Our study is a four-factor completely crossed design: 2 (testlet sizes) × 4 (levels of local dependence effect) × 3 (ratios of testlet items to general items in the test) × 3 (sample sizes). Table 3-1 lists all 72 conditions formed by crossing these four factors.

1. The first factor is the testlet size. The testlet sizes chosen for this study are based on the purpose of the study and on the sizes less often discussed in the applied literature. Thus, two testlet sizes, small and medium, are used in this study: (3, 5).

2. The second factor is the local dependence effect. Local dependence effects from the ten reviewed studies are within the range of zero to one (Wainer & Wang, 2000; Wang, 1999; Wang, 2002; Wang, 2005; Habing & Roussos, 2003; Adams, Wilson & Wang, 1997; Wang & Wilson, 2005; DeMars, 2006; Li, 2005; Zenisky, Hambleton & Sireci, 2002). Therefore, four levels of local dependence effect will be examined: $\sigma^2_{\gamma}$ = (0.25, 0.5, 0.75, 1).

3. The third factor is the ratio of testlet items to general items in the test. Among all 60 items in the test, the ratio of testlet items to general items will be (1:3, 1:1, 3:1).

4. The fourth factor is the sample size of the examinees. Of seventy-four study groups in forty-five different articles from the applied literature, the distribution of sample size ranged from 10 to 8912, with two sample sizes greater than 8000 and four sample sizes smaller than 50. By dividing the remaining sixty-eight sample sizes into three groups according to their size ranking and taking the approximate mean value of the sample sizes in each group, we selected three sample sizes for use in this study: (250, 500, 1000). These quantities represent rounded approximations of the most common sample sizes found in the applied literature.

5. Test length is another issue that must be considered ahead of the research design. The test length in this simulation is set to sixty (60 items per test), the approximate general mean of the test lengths in the reviewed testlet literature.

6. For each condition, the number of replications is set to the value that occurs most often in the applied literature. Thus, one hundred replications are run for each condition.
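The crossed design described in the list above can be enumerated mechanically. The sketch below is illustrative only: the dictionary keys and function name are hypothetical, and the loop order is chosen so that the numbering agrees with Table 3-1 (testlet size 5 first, then sample size, then number of testlets, then local effect). The three ratios are expressed as the number of testlet items out of 60 (45, 30, 15).

```python
from itertools import product

TESTLET_SIZES = (5, 3)
SAMPLE_SIZES = (1000, 500, 250)
TESTLET_ITEM_COUNTS = (45, 30, 15)  # 3:1, 1:1, 1:3 testlet-to-general ratios in a 60-item test
LOCAL_EFFECT_VARIANCES = (0.25, 0.5, 0.75, 1.0)


def design_conditions():
    """Return the fully crossed 2 x 3 x 3 x 4 design as a list of dicts."""
    conditions = []
    grid = product(TESTLET_SIZES, SAMPLE_SIZES, TESTLET_ITEM_COUNTS, LOCAL_EFFECT_VARIANCES)
    for number, (size, n, testlet_items, variance) in enumerate(grid, start=1):
        conditions.append({
            "condition": number,
            "testlet_size": size,
            "sample_size": n,
            "n_testlets": testlet_items // size,
            "local_effect_variance": variance,
        })
    return conditions
```

Calling `design_conditions()` yields 72 entries, each of which would then be replicated one hundred times in the simulation.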

3.4 Data Generation

The Rasch testlet model response data are generated using the statistical software R 2.10. Response data were generated for 100 samples from a set of population item parameters (60 items) and population ability parameters (1000 trait values $\theta_j$) for each condition. Local effects were assigned to each testlet accordingly. Each simulee was assigned a known trait value $\theta_j$ from the randomly selected population ability parameters. By comparing the co-effect of the local effects within testlets plus the randomly selected population item parameters with the known trait value $\theta_j$ of each simulee, the probability of observing the response matrix $X = (x_1, \dots, x_N)$ from a sample of $N$ independently responding examinees can be represented as

$$P(X \mid \theta, b, \gamma) = \prod_{j} P(x_j \mid \theta_j, b, \gamma_{d(i)j}) = \prod_{j} \prod_{i} P(x_{ji} \mid \theta_j, b_i, \gamma_{d(i)j}) \qquad (3\text{-}4)$$

where $\theta = (\theta_1, \dots, \theta_N)$, $b = (b_1, \dots, b_J)$, and $\gamma_{d(i)j}$ are all considered unknown, fixed parameters.

Thus, a response matrix of logical indicators was generated for each replication within every condition. Then, a series of random numbers was drawn from a uniform distribution ranging from 0 to 1 to match the logical response matrix accordingly. If the known trait value $\theta_j$ is less than the co-effect of the item and testlet, the logical indicator is false, which sets the simulee's response to 0. On the other hand, if the known trait value $\theta_j$ is larger than the co-effect of the item and testlet, the logical indicator is true, which sets the simulee's response to 1. This process repeats for every item and every simulee in each of the 100 samples. Thus, 100 simulated response data sets are generated for each condition.
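One common way to turn Rasch testlet model probabilities into dichotomous responses is to compare each model probability with a uniform random draw. The sketch below follows that convention, which is an assumption here: the thesis generated its data in R and describes the procedure only verbally, so this Python version is not a reproduction of the original script. The function name, the argument names, and the use of `None` to mark independent items are all illustrative.

```python
import math
import random


def simulate_responses(thetas, difficulties, testlet_of_item, testlet_variance, rng):
    """Simulate a 0/1 response matrix under a Rasch testlet model.

    testlet_of_item[i] is the testlet index of item i, or None for an
    independent item. Each examinee draws one effect per testlet from
    N(0, testlet_variance). rng is a random.Random instance.
    """
    n_testlets = 1 + max((d for d in testlet_of_item if d is not None), default=-1)
    sd = math.sqrt(testlet_variance)
    data = []
    for theta in thetas:
        gammas = [rng.gauss(0.0, sd) for _ in range(n_testlets)]
        row = []
        for b, d in zip(difficulties, testlet_of_item):
            gamma = 0.0 if d is None else gammas[d]
            p = 1.0 / (1.0 + math.exp(-(theta - b + gamma)))
            # Score 1 when the uniform draw falls below the model probability.
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data
```

Passing a seeded `random.Random` instance makes each replication reproducible, which is convenient when the same data set has to be refitted under several models.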

3.5 Parameter Estimation

In the study, the parameters of the three models (PCM, standard Rasch model, and Rasch testlet model) are estimated using Marginal Maximum Likelihood (MML) methods with ConQuest Version 2.0. The most frequently used

approaches to item parameter estimation for unknown trait levels are Joint Maximum

Likelihood (JML), Conditional Maximum Likelihood (CML), and Marginal Maximum

Likelihood (MML). Holland (1990) compared the different sampling theory foundations of

these three ML methods.

CML is possible only for the 1PL model and is so computationally intensive as to be impractical in many situations. JML was used extensively in early IRT programs. However, JML estimation also has some drawbacks for estimating IRT models. First, the JML item parameter estimates are biased and inconsistent for fixed-length tests. Second, the JML standard errors are probably too small, given how the unknown person trait levels are handled (Holland, 1990).

The most commonly used method for estimating the parameters of IRT models is Marginal Maximum Likelihood (MML). In MML estimation, unknown trait levels are handled by expressing the response pattern probabilities as expectations from a population distribution. MML has several advantages over the other two ML methods. First, MML is applicable to all types of IRT models. Second, MML is efficient for tests of different lengths. Third, the MML estimates of item standard errors may be justified as good approximations of the expected sampling variance of the estimates. Fourth, estimates are available for perfect scores. In the reviewed literature, the MML method is applied in 82.76% of the articles. Therefore, MML was chosen for the parameter estimation in this study.

The simplified mechanism of MML is shown below. The knowledge about the examinee distribution ($p(\theta)$) is treated as a prior, and the item difficulty parameter is denoted by $\beta$. That is, MML estimates of the item difficulty parameter ($\beta$) maximize

$$L(\beta \mid X) = \prod_{i} \int p(x_i \mid \theta, \beta)\, p(\theta)\, d\theta \qquad (3\text{-}5)$$

Therefore, a posterior distribution ($p(\beta \mid X)$) is obtained for item parameters by multiplying $L(\beta \mid X)$ by $p(\beta)$ (Mislevy, 1986):

$$p(\beta \mid X) \propto L(\beta \mid X)\, p(\beta) \qquad (3\text{-}6)$$

3.6 Ability Estimation

In the study, the simulees' performance on a test is scored based on their responses to items and the IRT models. The estimation of the simulees' abilities is performed by two different approaches in this study. These two approaches are Maximum Likelihood Estimation (MLE; Lord, 1980) and Expected a Posteriori Estimation (EAP; Bock & Mislevy, 1982).

Maximum likelihood estimation (MLE) is the most commonly used procedure for estimating examinees' abilities. Based on the examinee's responses on the test, MLE finds the value of the latent trait that maximizes the likelihood of an item response pattern, under the assumption that the item parameter values are known. The likelihood of the latent trait $\theta$, given an item response pattern $(x_1, x_2, \dots, x_I)$, is denoted as

$$L(x_1, x_2, \dots, x_I \mid \theta) = \prod_{i=1}^{I} P_i(\theta) \qquad (3\text{-}7)$$

where $P_i(\theta)$ represents the probability of a given response to item $i$ and $I$ is the number of items in the test. Although MLE is the most common approach for ability estimation, some drawbacks of MLE must be addressed. First, MLE is not available for all-endorsed or all-not-endorsed item response patterns. If either of these two patterns occurs, the MLE goes to infinity. Second, MLE may not converge when some response patterns are abnormal (Bock & Mislevy, 1982).

Expected a posteriori (EAP) estimation is an efficient approach for estimating an examinee's trait level. EAP is a Bayesian estimator with a non-iterative process. Unlike MLE, EAP provides a finite estimate for all-endorsed or all-not-endorsed item response patterns. In fact, the EAP estimate is the mean of the posterior distribution. For any test, a set of quadrature nodes ($Q_r$) is defined at a fixed number of specified trait ($\theta$) values. There is a probability density $W(Q_r)$ corresponding to each quadrature node. The EAP trait estimate is derived by

$$\hat{\theta} = \frac{\sum_{r=1}^{N} Q_r\, L(Q_r)\, W(Q_r)}{\sum_{r=1}^{N} L(Q_r)\, W(Q_r)} \qquad (3\text{-}8)$$

where $L(Q_r)$ represents the exponential of the log-likelihood function evaluated at each of the $N$ quadrature nodes. However, some shortcomings of EAP should be mentioned. First, there is a tendency for Bayesian estimates to regress toward the mean of the prior distribution (Kim & Nicewander, 1993; Weiss, 1982). Since ConQuest provides EAP estimates both with and without regression, the EAP estimates without regression were applied. The other shortcoming of EAP is that its estimation accuracy is reduced by an improper prior distribution (Bock & Mislevy, 1982). Since both MLE and EAP ability estimation approaches have their pros and cons, the estimation of the simulees' abilities in this study is performed by both methods simultaneously.
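The EAP computation in Equation 3-8 amounts to a weighted average over quadrature nodes. The sketch below is a simplified illustration and not ConQuest's implementation: it assumes a standard Rasch likelihood, equally spaced nodes on [-4, 4], and a standard normal prior whose density (up to a constant that cancels in the ratio) supplies the weights. All names and default values are hypothetical.

```python
import math


def rasch_likelihood(theta, responses, difficulties):
    """Likelihood of a 0/1 response pattern at a given theta (Rasch model)."""
    like = 1.0
    for x, b in zip(responses, difficulties):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        like *= p if x == 1 else 1.0 - p
    return like


def eap_estimate(responses, difficulties, n_nodes=41, lo=-4.0, hi=4.0):
    """EAP trait estimate: posterior mean over equally spaced quadrature nodes."""
    step = (hi - lo) / (n_nodes - 1)
    nodes = [lo + r * step for r in range(n_nodes)]
    # Unnormalized N(0, 1) density; the normalizing constant cancels in the ratio.
    weights = [math.exp(-0.5 * q * q) for q in nodes]
    likes = [rasch_likelihood(q, responses, difficulties) for q in nodes]
    numerator = sum(q * l * w for q, l, w in zip(nodes, likes, weights))
    denominator = sum(l * w for l, w in zip(likes, weights))
    return numerator / denominator
```

Because the prior keeps the posterior proper, `eap_estimate([1, 1, 1], [0.0, 0.0, 0.0])` returns a finite value, whereas the likelihood alone has no finite maximum for such an all-endorsed pattern.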

3.7 Analysis

In this study, each simulated data set was analyzed using ConQuest. Because the polytomous response patterns from the partial credit model differ from the dichotomous response patterns, the observed data differ between these two types of models (i.e. polytomous IRT model, dichotomous IRT model). Thus, using the loglikelihood ratio test and Akaike's information criterion (AIC) as measures of the goodness of fit of the models is inappropriate. Therefore, the accuracy of ability parameter estimation for the three models was quantified via bias and root mean square error (RMSE) across all replications. The local item dependences for the real data were examined by the conditional nonparametric tests ($T_1$; Ponocny, 2001). The test reliability coefficients were also calculated for the simulated data and the real data.

3.7.1 Bias

Bias is defined as the average difference between the estimated and true parameters across all people or items. An estimate of bias is calculated for each replication in each condition, and these estimates are averaged to give the bias for each condition in the simulation. Bias is mathematically defined as:

$$\text{bias} = \frac{\sum_{j=1}^{n} (\hat{\xi}_j - \xi_j)}{n} \qquad (3\text{-}9)$$

where $\xi_j$ is the true value of an item or person parameter;

$\hat{\xi}_j$ is the estimated value of that parameter;

$n$ is the total number of instances of that type of parameter within a replication (i.e. the sample size for ability $\theta$).

3.7.2 Root Mean Square Error (RMSE)

RMSE is a measure of absolute accuracy in parameter estimation. RMSE is calculated for each parameter type in a replication, and the replication values are averaged within each condition. RMSE is the square root of the average squared difference between estimated and true parameters, and is mathematically defined as:

$$\text{RMSE} = \sqrt{\frac{\sum_{j=1}^{n} (\hat{\xi}_j - \xi_j)^2}{n}} \qquad (3\text{-}10)$$

where terms in the equation are defined as they are with bias.
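The two recovery criteria can be computed per replication with a few lines of code. This is a minimal sketch with hypothetical function names; it assumes the sign convention "estimate minus true value" for bias, so a positive bias means overestimation.

```python
import math


def bias(estimates, true_values):
    """Mean signed difference between estimated and true parameters."""
    return sum(e - t for e, t in zip(estimates, true_values)) / len(true_values)


def rmse(estimates, true_values):
    """Square root of the mean squared difference between estimates and truth."""
    squared = sum((e - t) ** 2 for e, t in zip(estimates, true_values))
    return math.sqrt(squared / len(true_values))
```

Averaging the per-replication values over the 100 replications of a condition gives the condition-level figures reported in the results tables. Note that RMSE is non-negative by construction, while bias can take either sign.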

3.7.3 Reliability

In this study, test reliability coefficients were computed for item responses scored dichotomously for both the Rasch testlet model and the standard Rasch model, as well as for item responses scored polytomously for the Partial Credit Model. As we use MML estimation in ConQuest, the test reliability can be calculated as

$$\text{Test Reliability} = \frac{\mathrm{Var}(T)}{\mathrm{Var}(\hat{\theta}_{EAP})} = \frac{S^2(\hat{\theta}) - s.e.^2(\hat{\theta})}{S^2(\hat{\theta})} \qquad (3\text{-}11)$$

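Table 3-1. Study design condition with 3 factors

            Testlet size 5                       |             Testlet size 3
Condition   Sample   Testlet   Local    | Condition   Sample   Testlet   Local
            size     number    effect   |             size     number    effect
1           1000     9         0.25     | 37          1000     15        0.25
2           1000     9         0.5      | 38          1000     15        0.5
3           1000     9         0.75     | 39          1000     15        0.75
4           1000     9         1        | 40          1000     15        1
5           1000     6         0.25     | 41          1000     10        0.25
6           1000     6         0.5      | 42          1000     10        0.5
7           1000     6         0.75     | 43          1000     10        0.75
8           1000     6         1        | 44          1000     10        1
9           1000     3         0.25     | 45          1000     5         0.25
10          1000     3         0.5      | 46          1000     5         0.5
11          1000     3         0.75     | 47          1000     5         0.75
12          1000     3         1        | 48          1000     5         1
13          500      9         0.25     | 49          500      15        0.25
14          500      9         0.5      | 50          500      15        0.5
15          500      9         0.75     | 51          500      15        0.75
16          500      9         1        | 52          500      15        1
17          500      6         0.25     | 53          500      10        0.25
18          500      6         0.5      | 54          500      10        0.5
19          500      6         0.75     | 55          500      10        0.75
20          500      6         1        | 56          500      10        1
21          500      3         0.25     | 57          500      5         0.25
22          500      3         0.5      | 58          500      5         0.5
23          500      3         0.75     | 59          500      5         0.75
24          500      3         1        | 60          500      5         1
25          250      9         0.25     | 61          250      15        0.25
26          250      9         0.5      | 62          250      15        0.5
27          250      9         0.75     | 63          250      15        0.75
28          250      9         1        | 64          250      15        1
29          250      6         0.25     | 65          250      10        0.25
30          250      6         0.5      | 66          250      10        0.5
31          250      6         0.75     | 67          250      10        0.75
32          250      6         1        | 68          250      10        1
33          250      3         0.25     | 69          250      5         0.25
34          250      3         0.5      | 70          250      5         0.5
35          250      3         0.75     | 71          250      5         0.75
36          250      3         1        | 72          250      5         1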

CHAPTER 4 RESULTS

4.1 MLE Non-convergence Issue

In using the MLE and EAP estimation methods to estimate the simulees' abilities, a large number of non-convergence cases arose in the Rasch testlet model results via MLE estimation in the 1000 sample size condition. The same non-convergence pattern also occurred across the other sample sizes. The number and percentage of non-convergence cases are displayed in Tables 4-1 and 4-2. After checking the response patterns of the non-convergence cases, neither all-not-endorsed nor all-endorsed response patterns were found. Therefore, this phenomenon may occur because of the complexity of the multidimensionality of the Rasch testlet model. So, for precision purposes, only EAP estimate results are used in this study.

4.2 Test Reliability

A summary of the test reliability analyses is presented in Tables 4-3 and 4-4. Three columns of estimates, one for each model, are provided in each condition. For most of the conditions, the reliability estimates from the standard Rasch model are higher than the reliability estimates from both the Partial Credit model and the Rasch testlet model. The associations between test reliability and other factors are described below.

First, the difference in test reliability estimates between the standard Rasch model and the other two models indicates a strong association between the ratio of independent items to testlet items within a test and test reliability overestimation. In general, the test reliability from the standard Rasch model is higher than its corresponding coefficient from the other two models (by 0.01 to 0.08). As the ratio of independent to testlet items within a test decreases (i.e. a greater proportion of testlet items are included in a test), the extent of test reliability overestimation increases due to ignoring the local item dependence. This phenomenon occurs because the standard Rasch model assumes that items within a test are locally independent, so the presence of locally dependent testlet items results in overestimated test reliability. Second, the difference in test reliability estimates between the standard Rasch model and the other two models across different sample sizes indicates a strong association between the sample size and test reliability overestimation. As the sample size increases, the extent of test reliability overestimation increases as well. No evident patterns were found in the association of test reliability with local effect or testlet size.

As for the test reliability comparison between Partial Credit model and the Rasch

testlet model, no obvious differences were observed between the test reliability

estimates computed from these two models when the testlet size was held to three

items. Moreover, as the testlet size was set to five, under most of the circumstances,

the test reliability estimates from the Partial Credit model are slightly smaller than their

corresponding estimates from the Rasch testlet model, but the differences are generally

smaller than 0.01. This is because more parameters are dropped from the polytomous

model compared to the individual item-scoring models as the testlet size increases,

thereby decreasing the effective test length and decreasing the estimation of the test

reliability (Zenisky, Hambleton & Sireci, 2002). In addition, any association between the

magnitude of local item effect and the variation of the test reliability estimates was not

obvious in the results from this study.


The results of the Spearman-Brown prophecy formula are listed in Tables 4-5 and 4-6. The
Spearman-Brown values for the three models also reflect the overestimation of reliability
that results from using the standard Rasch model. If a test administrator claims that a
testlet-based test satisfies a required reliability level on the basis of the
overestimated reliability coefficient from the standard Rasch model, these results
estimate the factor by which the testlet-based test would need to be lengthened so that
its reliability under the PCM or the Rasch testlet model reaches that overestimated
value. For the sample size 1000 conditions, the test would need to be lengthened by a
factor of roughly 4 to 5 (from 3.984 to 5.138) to achieve the (overestimated) test
reliability indicated by the standard Rasch model. When the sample size is halved (500),
the required lengthening factor is roughly halved as well. When the sample size is reduced
to a quarter (250), the required lengthening factor is smallest (close to 1), although
some lengthening is generally still needed.
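
As a rough illustration of how the entries in Tables 4-5 and 4-6 can be obtained, the sketch below applies the Spearman-Brown prophecy formula to the reliabilities reported for condition 1 in Table 4-3. The function name `lengthening_factor` is hypothetical, and the sketch assumes that the standard Rasch reliability is treated as the target reliability.

```python
def lengthening_factor(current_rel: float, target_rel: float) -> float:
    """Spearman-Brown prophecy: factor N by which a test with reliability
    `current_rel` must be lengthened to reach reliability `target_rel`."""
    if not (0.0 < current_rel < 1.0 and 0.0 < target_rel < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return target_rel * (1.0 - current_rel) / (current_rel * (1.0 - target_rel))


# Table 4-3, condition 1 (sample size 1000, 15 testlets, local effect 0.25)
rasch = 0.97843           # standard Rasch model, used here as the target
testlet = 0.90631         # Rasch testlet model
partial_credit = 0.90382  # Partial Credit model

n_testlet = lengthening_factor(testlet, rasch)
n_pcm = lengthening_factor(partial_credit, rasch)

# The two factors round to the 4.689 and 4.827 listed for condition 1 in Table 4-5.
print(f"{n_testlet:.3f} {n_pcm:.3f}")
```

A factor of 1 means no lengthening is needed, which is why the values for the sample size 250 conditions, where the three reliability estimates are close together, fall just above 1.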

4.3 Standard Error of Measurement

The standard error of measurement (SEM) under the three models was also used for model
comparison in this study. Tables 4-7 and 4-8 list the mean SEM for all 72 conditions over
100 replications. In general, the mean SEM values obtained from the standard Rasch model
were smaller than those obtained from the other two models (i.e. Partial Credit model,
Rasch testlet model), but the differences were generally smaller than 0.02. This occurs
because ignoring the local dependence within a testlet leads to an underestimate of the
standard errors.


The magnitude of the SEM underestimation appears to depend on the testlet size. As Tables
4-7 and 4-8 show, even at the same ratio of independent to testlet items, a larger testlet
size (i.e. testlet size 5) led, on average, to a greater underestimation of the SEM than a
smaller testlet size (i.e. testlet size 3). The differences in SEM attributable to the
testlet size were generally around 0.01. No obvious associations were observed between
the SEM estimates and the local effect variations across conditions.

4.4 Bias and RMSE

Tables 4-9 and 4-10 list the mean bias estimates and the mean RMSE estimates for all 72
conditions over 100 replications. Each table contains three sets of estimates, one for
each model (i.e. standard Rasch model, partial credit model, Rasch testlet model), for the
testlet size three and testlet size five conditions, respectively. The means of RMSE
(within a range from -0.01 to 0.01) over all conditions for testlet sizes 3 and 5 are
small for all three models, especially when compared with the ability range of -3 to 3.
The magnitude of the RMSE estimates is therefore fairly satisfactory for all three models.

However, over the entire ability interval (i.e. [-3.0, 3.0]), the magnitude of the bias is
relatively high for all three models in some conditions. In general, no obvious
association was found between the testlet size and the magnitude of the bias estimates,
nor between the ratio of independent to testlet items within a test (i.e. the number of
testlets) and the bias estimates.

In order to reveal how bias and RMSE change as a function of ability, the ability range
was split into 6 intervals and the bias and RMSE estimates were calculated for each
interval. Tables 4-11 to 4-16 display the mean bias of the ability ($\theta$) estimates
(i.e. EAP estimates) in the 6 ability intervals for the three models over all 72
conditions. According to the results listed in the tables, a relatively high positive bias
was observed in the lowest ability interval ($\theta < -2.0$) for all three models across
nearly all conditions. Likewise, a relatively high negative bias was found in the highest
ability interval ($\theta > 2.0$) for all three models (i.e. standard Rasch model, partial
credit model, Rasch testlet model) across nearly all conditions. Because EAP estimation
pulls the ability estimates toward the mean of the ability distribution, the use of EAP
estimates is a possible cause of this high bias at both ends of the ability range. Apart
from the high bias at both ends of the ability range, no obvious patterns or associations
between the mean bias and the major factors in this study were found for any of the three
models.
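
The shrinkage attributed to EAP estimation above can be illustrated with a minimal sketch. It is not the estimation routine used in this study: it assumes a dichotomous Rasch model, a standard normal prior, a simple equally spaced quadrature grid, and a hypothetical 21-item test, and the function name `eap_rasch` is illustrative only.

```python
import math


def eap_rasch(responses, difficulties, n_points=81, lo=-6.0, hi=6.0):
    """EAP ability estimate under the Rasch model with a N(0, 1) prior,
    approximated on an equally spaced quadrature grid."""
    step = (hi - lo) / (n_points - 1)
    num = den = 0.0
    for k in range(n_points):
        theta = lo + k * step
        weight = math.exp(-0.5 * theta * theta)  # unnormalised N(0, 1) density
        like = 1.0
        for x, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(theta - b)))
            like *= p if x == 1 else 1.0 - p
        num += theta * like * weight
        den += like * weight
    return num / den  # posterior mean of theta


# Hypothetical test: 21 items with difficulties from -2.0 to 2.0
difficulties = [-2.0 + 0.2 * i for i in range(21)]
all_wrong = eap_rasch([0] * 21, difficulties)
all_right = eap_rasch([1] * 21, difficulties)
print(all_wrong, all_right)
```

Even for an all-incorrect or all-correct response pattern, the EAP estimate stays finite and sits closer to the prior mean of 0 than the most extreme true abilities do. Examinees with very low true ability therefore tend to be overestimated and those with very high true ability underestimated, which matches the direction of the bias reported for the two extreme intervals.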

In addition, Tables 4-17 to 4-22 display the RMSE of the ability ($\theta$) estimates in
the 6 ability intervals for the three models (i.e. standard Rasch model, partial credit
model, Rasch testlet model) over all 72 conditions. As with the bias estimates, apart from
the relatively high RMSE at both ends of the ability range, no obvious patterns or
associations between the RMSE estimates and the major factors in this study were found
for any of the three models.
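
The interval-wise summaries described above can be sketched as follows. This is a minimal illustration using the conventional definitions (bias as the mean of estimate minus true value, RMSE as the square root of the mean squared difference). The exact aggregation over replications used for Tables 4-11 to 4-22 may differ, and the function name, the left-closed interval boundaries, and the toy data are assumptions.

```python
import math
from bisect import bisect_right


def bias_rmse_by_interval(true_theta, est_theta, cuts=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Return one (n, bias, rmse) tuple per ability interval.

    Interval 0 holds theta < cuts[0]; interval k holds
    cuts[k-1] <= theta < cuts[k]; the last interval holds theta >= cuts[-1].
    """
    sums = [[0, 0.0, 0.0] for _ in range(len(cuts) + 1)]
    for t, e in zip(true_theta, est_theta):
        k = bisect_right(cuts, t)  # index of the interval containing t
        err = e - t
        sums[k][0] += 1
        sums[k][1] += err
        sums[k][2] += err * err
    out = []
    for n, s, ss in sums:
        if n:
            out.append((n, s / n, math.sqrt(ss / n)))
        else:
            out.append((0, float("nan"), float("nan")))  # empty interval
    return out


# Toy data (hypothetical): one examinee per interval
true_theta = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
est_theta = [-2.0, -1.4, -0.5, 0.4, 1.3, 2.0]
result = bias_rmse_by_interval(true_theta, est_theta)
for k, (n, bias, rmse) in enumerate(result, start=1):
    print(k, n, round(bias, 3), round(rmse, 3))
```

In this toy example the lowest interval has a positive bias and the highest interval a negative bias, the same qualitative pattern reported for the EAP estimates.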

In sum, all three models (i.e. standard Rasch model, partial credit model, Rasch testlet
model) recovered the ability estimates fairly well, as indicated by the relatively low
magnitude of the bias and RMSE estimates.


4.5 An Empirical Case

The National Board of Osteopathic Medical Examiners (NBOME) offers the computer-based
COMLEX-USA exams online. This exam series is designed to assess the osteopathic medical
knowledge and clinical skills considered essential for osteopathic generalist physicians
to practice medicine without supervision. COMLEX-USA exam responses have been analyzed
with the standard Rasch IRT model. The 2008 NBOME COMLEX-USA Level-2 exam data are used
as an empirical case for this study. The COMLEX-USA Level-2 exam consists of 350 items in
7 blocks, including 141 independent items and 209 testlet items grouped in 95 testlets
(all testlet sizes are within 2-4 items). The type of each item is identified (i.e. A:
single item; D: single item with graph; B: matching item; S: testlet item; F: testlet item
with graph). The B, S, and F type items are categorized as testlet items. Among the 95
testlets, there are 4 testlets with matching items and 9 testlets with a graph. A total of
450 examinees were included in the examinee population, and there are no missing data.
The data from the first block of this exam (Block-1) are used for this study. Block-1
contains 50 items: 27 independent items and 23 testlet items within 10 testlets.

The data set was analyzed using the standard Rasch model, the Partial Credit model, and
the Rasch testlet model, separately. Table 4-23 lists the weighted mean square errors
(WMSE) for the 50 items in the three models. In the output of the Rasch testlet model, the
WMSE ranges from 0.86 to 1.19 (M = 1.04, SD = 0.06). The two items with the most extreme
WMSE in the Rasch testlet model are item 47 (WMSE = 0.86) and item 5 (WMSE = 1.19), both
with a non-significant p-value (i.e. an item with WMSE > 1.96 is treated as an item with a
bad fit). That is, the item fit for all 50 items in this block is acceptable. In the
output of the Partial Credit model, the WMSE ranges from 0.97 to 1.07 (M = 1.01,
SD = 0.02). The two items with the most extreme WMSE in the Partial Credit model are item
40 (WMSE = 0.97) and item 35 (WMSE = 1.07). In the output of the standard Rasch model, the
WMSE ranges from 0.97 to 1.18 (M = 1.01, SD = 0.03). The two items with the most extreme
WMSE in the standard Rasch model are item 8 (WMSE = 0.97) and item 37 (WMSE = 1.07). The
item WMSE estimates from all three models indicate that these 50 items fit acceptably.
Therefore, all of them should be kept in the test.
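
For readers unfamiliar with weighted mean square fit statistics, the sketch below shows the common information-weighted (infit-style) form for a single dichotomous Rasch item. It is an assumption that the WMSE values in Table 4-23 follow this form, since the estimation software may compute the statistic differently, and the function name and toy inputs are hypothetical.

```python
import math


def weighted_mean_square(responses, abilities, difficulty):
    """Information-weighted mean square for one dichotomous Rasch item:
    sum of squared residuals divided by the sum of response variances."""
    num = den = 0.0
    for x, theta in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))  # P(correct)
        num += (x - p) ** 2
        den += p * (1.0 - p)
    return num / den


# Responses that behave as the model expects give a value near 1 ...
expected_fit = weighted_mean_square([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0], 0.0)
# ... while an unexpected success by a low-ability examinee pushes it above 1.
misfit = weighted_mean_square([1, 1], [2.0, -2.0], 0.0)
print(expected_fit, misfit)
```

Values close to 1, such as the ranges reported above, indicate that the observed response variation is about what the model predicts.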

The estimates of test reliability for the overarching latent trait are 0.899 for the Rasch
testlet model, 0.909 for the Partial Credit model, and 0.936 for the standard Rasch model.
Thus, the standard Rasch model appears to overestimate the test reliability because it
ignores the local item dependence within testlets.

The Spearman-Brown prophecy formula,
$N = \rho^{*}_{xx'}(1 - \rho_{xx'}) / [\rho_{xx'}(1 - \rho^{*}_{xx'})]$, where
$\rho_{xx'}$ is the reliability of the current test and $\rho^{*}_{xx'}$ is the target
reliability, is used to compute how much the test length would have to increase for the
Rasch testlet model and the Partial Credit model to achieve the standard Rasch model's
overestimated test reliability (0.936). For the Rasch testlet model, the test length would
have to be increased by approximately 63.65% (32 items) to achieve the overestimated test
reliability. For the Partial Credit model, the test length would have to be increased by
approximately 47.22% (24 items) to achieve the overestimated test reliability.
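
As a rough check, substituting the rounded reliabilities reported above (0.899, 0.909, and the target 0.936) into the Spearman-Brown prophecy formula gives the following lengthening factors. The subscripts are labels introduced here for illustration.

```latex
N_{\text{testlet}} = \frac{0.936\,(1 - 0.899)}{0.899\,(1 - 0.936)} \approx 1.64,
\qquad
N_{\text{PCM}} = \frac{0.936\,(1 - 0.909)}{0.909\,(1 - 0.936)} \approx 1.46
```

These correspond to increases of about 64% and 46%, in line with the reported 63.65% and 47.22%. The small differences presumably arise because the reported percentages were computed from unrounded reliability estimates.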

The NBOME COMLEX-USA exam has previously been analyzed for local item dependence using the
Q3 statistic (Shen & Yen, 1997). In this study, local item dependence was instead detected
with the 'NPtest' function in the 'eRm' package in R (Mair & Hatzinger, 2007). 'NPtest'
performs the nonparametric Rasch model tests proposed by Ponocny (2001). The method used
here is "T1", which checks for local dependence via increased inter-item correlations: for
each item pair, the cases with equal responses on both items are counted. A significance
level of 0.01 was applied for the 'NPtest' T1 test in this study. In this 50-item test
block, there are 1,225 possible item pairs. Among these, 27 item pairs were detected as
having significant local item dependence. Results are provided in Table 4-24. For items
within testlets, the local item dependence is evident (e.g., item pairs 28-30, 28-31,
47-48, etc.). Fourteen of the twenty-seven item pairs showing local item dependence
(51.85%) consist of items within testlets.
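
The counts quoted above can be reproduced with a short sketch. It also computes the observed T1 count (equal responses on both items) for every item pair in a small, hypothetical 0/1 data matrix. It does not reproduce the p-values from 'NPtest', which rest on sampling data matrices with the same margins, and the function name `t1_counts` is illustrative only.

```python
from itertools import combinations
from math import comb


def t1_counts(data):
    """For each item pair (i, j), count the persons whose responses to the
    two items are equal. `data` is a persons x items list of 0/1 rows."""
    n_items = len(data[0])
    return {
        (i, j): sum(1 for row in data if row[i] == row[j])
        for i, j in combinations(range(n_items), 2)
    }


n_pairs = comb(50, 2)      # number of item pairs in a 50-item block
share_within = 14 / 27     # share of flagged pairs that lie within testlets

# Hypothetical toy data: 4 persons x 3 items
toy = [[1, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [1, 1, 1]]
counts = t1_counts(toy)
print(n_pairs, f"{share_within:.2%}", counts)
```

An unusually large count for a pair, relative to what the Rasch model predicts, is what the T1 test flags as local dependence.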

An overall test statistic ($T_{11}$) for the local dependence of the test is provided by
the nonparametric Rasch model tests (Ponocny, 2001). The global test of local dependence
for the entire test block was conducted via the option "T11", which sums the absolute
deviations of all inter-item correlations $r_{ij}$ in the test from their expected values
$\rho_{ij}$. The one-sided p-value for this 50-item block is 0.371 (significance level
0.05), which indicates that the global test of local dependence is non-significant and
that this test (i.e. NBOME COMLEX-USA Level-2 exam block 1) satisfies the local
independence assumption for the entire test block.

In sum, the partial credit model and the Rasch testlet model are the better model choices
for analyzing NBOME COMLEX exams. The data from NBOME COMLEX-USA Level-2 exam block 1 are
better modeled using the PCM and the Rasch testlet model than the standard Rasch model. In
addition, the test reliability discrepancy between the PCM and the Rasch testlet model is
about 0.01 (0.909 vs. 0.899), whereas the discrepancy between the standard Rasch model and
the other two models is approximately 0.03 to 0.04 (0.936 vs. 0.909 and 0.899). This
result also supports the conclusion that the PCM and the Rasch testlet model are the
better model choices for analyzing NBOME COMLEX exams.

Table 4-1. MLE nonconvergence cases and rate per condition (testlet size 3)

Condition  Sample size  Testlet No.  Local effect  Nonconvergence cases  Percentage
37         1000         15           0.25          8469                  8.47%
38         1000         15           0.5           9701                  9.70%
39         1000         15           0.75          8010                  8.01%
40         1000         15           1             9955                  9.96%
41         1000         10           0.25          1216                  1.22%
42         1000         10           0.5           1007                  1.01%
43         1000         10           0.75          526                   0.53%
44         1000         10           1             680                   0.68%
45         1000         5            0.25          45                    0.05%
46         1000         5            0.5           131                   0.13%
47         1000         5            0.75          29                    0.03%
48         1000         5            1             99                    0.10%

Table 4-2. MLE nonconvergence cases and rate per condition (testlet size 5)

Condition  Sample size  Testlet No.  Local effect  Nonconvergence cases  Percentage
1          1000         9            0.25          3002                  3.00%
2          1000         9            0.5           3658                  3.66%
3          1000         9            0.75          2406                  2.41%
4          1000         9            1             3016                  3.02%
5          1000         6            0.25          289                   0.29%
6          1000         6            0.5           551                   0.55%
7          1000         6            0.75          262                   0.26%
8          1000         6            1             376                   0.38%
9          1000         3            0.25          26                    0.03%
10         1000         3            0.5           38                    0.04%
11         1000         3            0.75          40                    0.04%
12         1000         3            1             42                    0.04%

Table 4-3. Test reliability (testlet size 3 conditions)

Condition  Sample size  Testlet No.  Local effect  Testlet model  Partial Credit  Standard Rasch
1          1000         15           0.25          0.90631        0.90382         0.97843
2          1000         15           0.5           0.90521        0.90474         0.97785
3          1000         15           0.75          0.89974        0.90513         0.97649
4          1000         15           1             0.90663        0.90466         0.97828
5          1000         10           0.25          0.90406        0.90360         0.97821
6          1000         10           0.5           0.90201        0.90441         0.97772
7          1000         10           0.75          0.89167        0.90630         0.97463
8          1000         10           1             0.90059        0.90464         0.97725
9          1000         5            0.25          0.90274        0.90359         0.97809
10         1000         5            0.5           0.89987        0.90364         0.97740
11         1000         5            0.75          0.90053        0.90388         0.97756
12         1000         5            1             0.89820        0.90425         0.97692
13         500          15           0.25          0.89851        0.90440         0.95188
14         500          15           0.5           0.91206        0.90303         0.95908
15         500          15           0.75          0.89450        0.90663         0.94967
16         500          15           1             0.89666        0.90578         0.95095
17         500          10           0.25          0.90822        0.90235         0.95873
18         500          10           0.5           0.90155        0.90453         0.95481
19         500          10           0.75          0.89423        0.90583         0.95092
20         500          10           1             0.89932        0.90444         0.95408
21         500          5            0.25          0.90015        0.90488         0.95395
22         500          5            0.5           0.88961        0.90540         0.94962
23         500          5            0.75          0.90373        0.90395         0.95599
24         500          5            1             0.89208        0.90526         0.95064
25         250          15           0.25          0.91113        0.90304         0.91925
26         250          15           0.5           0.89811        0.90557         0.91223
27         250          15           0.75          0.90328        0.90645         0.90647
28         250          15           1             0.89212        0.90576         0.90664
29         250          10           0.25          0.89754        0.90427         0.90634
30         250          10           0.5           0.89718        0.90453         0.90629
31         250          10           0.75          0.89223        0.90594         0.90972
32         250          10           1             0.90038        0.90551         0.90789
33         250          5            0.25          0.90464        0.90451         0.91279
34         250          5            0.5           0.89704        0.90466         0.90720
35         250          5            0.75          0.89793        0.90554         0.90664
36         250          5            1             0.90469        0.90347         0.91438

Table 4-4. Test reliability (testlet size 5 conditions)

Condition  Sample size  Testlet No.  Local effect  Testlet model  Partial Credit  Standard Rasch
1          1000         9            0.25          0.90040        0.89809         0.97711
2          1000         9            0.5           0.90295        0.89787         0.97770
3          1000         9            0.75          0.90271        0.89796         0.97708
4          1000         9            1             0.90099        0.89850         0.97669
5          1000         6            0.25          0.89223        0.89873         0.97540
6          1000         6            0.5           0.90099        0.89812         0.97761
7          1000         6            0.75          0.88920        0.90030         0.97430
8          1000         6            1             0.90349        0.89769         0.97830
9          1000         3            0.25          0.90076        0.89734         0.97780
10         1000         3            0.5           0.89936        0.89840         0.97718
11         1000         3            0.75          0.89962        0.89817         0.97735
12         1000         3            1             0.89177        0.89918         0.97513
13         500          9            0.25          0.89885        0.89767         0.95250
14         500          9            0.5           0.89788        0.89884         0.95185
15         500          9            0.75          0.89535        0.89899         0.95086
16         500          9            1             0.90691        0.89752         0.95719
17         500          6            0.25          0.90195        0.89698         0.95603
18         500          6            0.5           0.89905        0.89893         0.95362
19         500          6            0.75          0.90251        0.89752         0.95585
20         500          6            1             0.90697        0.89763         0.95755
21         500          3            0.25          0.89304        0.89841         0.95112
22         500          3            0.5           0.90059        0.89817         0.95487
23         500          3            0.75          0.89500        0.89832         0.95236
24         500          3            1             0.90214        0.89692         0.95632
25         250          9            0.25          0.90941        0.89648         0.91701
26         250          9            0.5           0.89676        0.89995         0.90275
27         250          9            0.75          0.90618        0.89721         0.91367
28         250          9            1             0.90682        0.89774         0.91279
29         250          6            0.25          0.91078        0.89575         0.92096
30         250          6            0.5           0.90108        0.89767         0.91065
31         250          6            0.75          0.90156        0.89722         0.91182
32         250          6            1             0.88919        0.89977         0.90674
33         250          3            0.25          0.90417        0.89557         0.92314
34         250          3            0.5           0.89564        0.89826         0.90485
35         250          3            0.75          0.88549        0.90013         0.90393
36         250          3            1             0.89959        0.89755         0.90990

Table 4-5. Results of the Spearman-Brown prophecy (testlet size 3)

Condition  Sample size  Testlet No.  Local effect  Spearman-Brown (Testlet)  Spearman-Brown (Partial Credit)
1          1000         15           0.25          4.689                     4.827
2          1000         15           0.5           4.623                     4.648
3          1000         15           0.75          4.628                     4.353
4          1000         15           1             4.639                     4.747
5          1000         10           0.25          4.764                     4.789
6          1000         10           0.5           4.767                     4.638
7          1000         10           0.75          4.667                     3.972
8          1000         10           1             4.742                     4.528
9          1000         5            0.25          4.810                     4.763
10         1000         5            0.5           4.812                     4.612
11         1000         5            0.75          4.812                     4.633
12         1000         5            1             4.797                     4.482
13         500          15           0.25          2.234                     2.091
14         500          15           0.5           2.260                     2.517
15         500          15           0.75          2.225                     1.943
16         500          15           1             2.234                     2.017
17         500          10           0.25          2.348                     2.514
18         500          10           0.5           2.307                     2.230
19         500          10           0.75          2.292                     2.014
20         500          10           1             2.326                     2.195
21         500          5            0.25          2.298                     2.178
22         500          5            0.5           2.339                     1.969
23         500          5            0.75          2.314                     2.308
24         500          5            1             2.330                     2.016
25         250          15           0.25          1.110                     1.222
26         250          15           0.5           1.179                     1.084
27         250          15           0.75          1.038                     1.000
28         250          15           1             1.174                     1.010
29         250          10           0.25          1.105                     1.024
30         250          10           0.5           1.108                     1.021
31         250          10           0.75          1.217                     1.046
32         250          10           1             1.091                     1.029
33         250          5            0.25          1.103                     1.105
34         250          5            0.5           1.122                     1.030
35         250          5            0.75          1.104                     1.013
36         250          5            1             1.125                     1.141

Table 4-6. Results of the Spearman-Brown prophecy (testlet size 5)

Condition  Sample size  Testlet No.  Local effect  Spearman-Brown (Testlet)  Spearman-Brown (Partial Credit)
1          1000         9            0.25          4.722                     4.844
2          1000         9            0.5           4.712                     4.987
3          1000         9            0.75          4.594                     4.844
4          1000         9            1             4.604                     4.733
5          1000         6            0.25          4.789                     4.468
6          1000         6            0.5           4.798                     4.953
7          1000         6            0.75          4.724                     4.198
8          1000         6            1             4.816                     5.138
9          1000         3            0.25          4.853                     5.039
10         1000         3            0.5           4.792                     4.843
11         1000         3            0.75          4.815                     4.892
12         1000         3            1             4.759                     4.396
13         500          9            0.25          2.257                     2.286
14         500          9            0.5           2.248                     2.225
15         500          9            0.75          2.262                     2.174
16         500          9            1             2.295                     2.553
17         500          6            0.25          2.364                     2.497
18         500          6            0.5           2.309                     2.312
19         500          6            0.75          2.339                     2.472
20         500          6            1             2.314                     2.573
21         500          3            0.25          2.331                     2.200
22         500          3            0.5           2.336                     2.399
23         500          3            0.75          2.345                     2.263
24         500          3            1             2.375                     2.516
25         250          9            0.25          1.101                     1.276
26         250          9            0.5           1.069                     1.032
27         250          9            0.75          1.096                     1.213
28         250          9            1             1.075                     1.192
29         250          6            0.25          1.141                     1.356
30         250          6            0.5           1.119                     1.162
31         250          6            0.75          1.129                     1.185
32         250          6            1             1.212                     1.083
33         250          3            0.25          1.273                     1.401
34         250          3            0.5           1.108                     1.077
35         250          3            0.75          1.217                     1.044
36         250          3            1             1.127                     1.153

Table 4-7. Mean standard error of measurement for each condition (testlet size 3)

Condition  Sample size  Testlet No.  Local effect  Testlet model  Partial Credit  Standard Rasch
37         1000         15           0.25          0.300381       0.305033        0.289589
38         1000         15           0.5           0.296793       0.303545        0.288309
39         1000         15           0.75          0.295281       0.302921        0.287677
40         1000         15           1             0.291839       0.303677        0.288298
41         1000         10           0.25          0.301639       0.305362        0.289899
42         1000         10           0.5           0.299473       0.304075        0.288841
43         1000         10           0.75          0.298098       0.301055        0.285724
44         1000         10           1             0.296793       0.303715        0.288325
45         1000         5            0.25          0.303638       0.305379        0.290094
46         1000         5            0.5           0.303610       0.305306        0.289880
47         1000         5            0.75          0.302271       0.304918        0.289561
48         1000         5            1             0.301956       0.304322        0.288957
49         500          15           0.25          0.301526       0.304080        0.288464
50         500          15           0.5           0.297063       0.306254        0.290488
51         500          15           0.75          0.293822       0.300528        0.285145
52         500          15           1             0.293095       0.301892        0.286377
53         500          10           0.25          0.302328       0.307330        0.291530
54         500          10           0.5           0.299370       0.303877        0.288227
55         500          10           0.75          0.297841       0.301809        0.286284
56         500          10           1             0.297281       0.304030        0.288398
57         500          5            0.25          0.302365       0.303326        0.287746
58         500          5            0.5           0.303011       0.302495        0.286941
59         500          5            0.75          0.301661       0.304797        0.289131
60         500          5            1             0.301875       0.302717        0.287157
61         250          15           0.25          0.301321       0.306252        0.291301
62         250          15           0.5           0.297011       0.302224        0.286611
63         250          15           0.75          0.293531       0.300802        0.285815
64         250          15           1             0.294556       0.301925        0.286726
65         250          10           0.25          0.302382       0.304287        0.288809
66         250          10           0.5           0.300390       0.303883        0.288478
67         250          10           0.75          0.298503       0.301642        0.286330
68         250          10           1             0.295590       0.302319        0.286844
69         250          5            0.25          0.301868       0.303920        0.288364
70         250          5            0.5           0.302962       0.303675        0.288801
71         250          5            0.75          0.300649       0.302262        0.287081
72         250          5            1             0.301612       0.305566        0.290187

Table 4-8. Mean standard error of measurement for each condition (testlet size 5)

Condition  Sample size  Testlet No.  Local effect  Testlet model  Partial Credit  Standard Rasch
1          1000         9            0.25          0.299785       0.304136        0.275109
2          1000         9            0.5           0.295323       0.304440        0.275416
3          1000         9            0.75          0.292322       0.304307        0.275406
4          1000         9            1             0.289530       0.303489        0.274878
5          1000         6            0.25          0.302390       0.303154        0.274489
6          1000         6            0.5           0.298983       0.304078        0.275321
7          1000         6            0.75          0.297526       0.300803        0.272350
8          1000         6            1             0.296710       0.304706        0.275876
9          1000         3            0.25          0.303190       0.305241        0.276391
10         1000         3            0.5           0.301505       0.303647        0.274911
11         1000         3            0.75          0.301083       0.303993        0.275250
12         1000         3            1             0.300625       0.302478        0.273795
13         500          9            0.25          0.300613       0.304737        0.275479
14         500          9            0.5           0.295684       0.303001        0.273933
15         500          9            0.75          0.292964       0.302767        0.273778
16         500          9            1             0.288456       0.304967        0.275759
17         500          6            0.25          0.302246       0.305767        0.276519
18         500          6            0.5           0.298544       0.302856        0.273895
19         500          6            0.75          0.297919       0.304972        0.275769
20         500          6            1             0.294441       0.304801        0.275576
21         500          3            0.25          0.303535       0.303637        0.274653
22         500          3            0.5           0.300839       0.303986        0.274908
23         500          3            0.75          0.301813       0.303762        0.274750
24         500          3            1             0.300681       0.305853        0.276625
25         250          9            0.25          0.299585       0.306513        0.277211
26         250          9            0.5           0.294056       0.301339        0.272416
27         250          9            0.75          0.292320       0.305417        0.276170
28         250          9            1             0.288759       0.304640        0.275400
29         250          6            0.25          0.302050       0.307584        0.278153
30         250          6            0.5           0.299601       0.304732        0.275591
31         250          6            0.75          0.297984       0.305409        0.276203
32         250          6            1             0.296080       0.301589        0.272642
33         250          3            0.25          0.302384       0.307852        0.278436
34         250          3            0.5           0.301966       0.303866        0.274839
35         250          3            0.75          0.301620       0.301048        0.272269
36         250          3            1             0.300667       0.304903        0.275727

Table 4-9. Bias and RMSE of ability estimate recovery (EAP), testlet size 3

                                                   Testlet model         Partial Credit model  Standard Rasch model
Condition  Sample size  Testlet No.  Local effect  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE
37         1000         15           0.25            0.16783    0.01002    0.09522    0.00897    0.14925    0.01012
38         1000         15           0.5             0.04346    0.00711    0.03334    0.00690    0.02660    0.00687
39         1000         15           0.75            0.09725    0.00713    0.05011    0.00631    0.02341    0.00665
40         1000         15           1              -0.09176    0.00853   -0.17260    0.00970   -0.13382    0.00917
41         1000         10           0.25            0.06458    0.00752    0.15846    0.00704    0.13656    0.00695
42         1000         10           0.5             0.10475    0.00796    0.10288    0.00792    0.11166    0.00792
43         1000         10           0.75           -0.14755    0.00626   -0.00742    0.00691    0.08861    0.00867
44         1000         10           1              -0.00417    0.01003    0.02655    0.01034   -0.01398    0.01000
45         1000         5            0.25            0.11195    0.00713    0.11097    0.00700    0.10368    0.00724
46         1000         5            0.5             0.11073    0.00726    0.06216    0.00687    0.03382    0.00674
47         1000         5            0.75           -0.07001    0.00769   -0.10917    0.00841   -0.15884    0.00922
48         1000         5            1              -0.09311    0.00689   -0.13323    0.00728   -0.18511    0.00840
49         500          15           0.25            0.11591    0.01240    0.07243    0.01268    0.14060    0.01139
50         500          15           0.5             0.19790    0.01497    0.06774    0.01321    0.16854    0.01560
51         500          15           0.75           -0.07199    0.01209   -0.01459    0.01071   -0.07251    0.01139
52         500          15           1              -0.13480    0.01271   -0.07473    0.01080   -0.05624    0.01054
53         500          10           0.25            0.14430    0.01326    0.10452    0.01271    0.17322    0.01378
54         500          10           0.5             0.01525    0.01222   -0.05002    0.01292    0.10895    0.00991
55         500          10           0.75           -0.04809    0.00981   -0.01907    0.00944   -0.02956    0.00953
56         500          10           1               0.03209    0.01093   -0.03091    0.01121   -0.15904    0.01326
57         500          5            0.25           -0.16260    0.01096   -0.19737    0.01107   -0.22554    0.01164
58         500          5            0.5             0.18784    0.01273    0.22612    0.01345    0.24754    0.01406
59         500          5            0.75            0.13105    0.01311    0.09802    0.01204    0.24434    0.01617
60         500          5            1               0.11728    0.00977    0.03991    0.00996    0.14934    0.00980
61         250          15           0.25            0.03807    0.01502   -0.04629    0.01401    0.03043    0.01421
62         250          15           0.5             0.29854    0.02358    0.24066    0.02088    0.10983    0.01594
63         250          15           0.75           -0.07253    0.01541   -0.05782    0.01348   -0.14003    0.01505
64         250          15           1              -0.03604    0.01275   -0.05762    0.01021    0.04707    0.01062
65         250          10           0.25            0.07542    0.01475    0.00293    0.01605    0.00552    0.01547
66         250          10           0.5            -0.24154    0.02061   -0.13173    0.01512   -0.20653    0.01843
67         250          10           0.75           -0.05263    0.01521    0.02047    0.01530    0.05522    0.01517
68         250          10           1              -0.03395    0.01608   -0.18840    0.02246   -0.25207    0.02576
69         250          5            0.25            0.14591    0.01773    0.15485    0.01743    0.22109    0.01918
70         250          5            0.5             0.25183    0.01889    0.20715    0.01614    0.20095    0.01647
71         250          5            0.75            0.05471    0.01360   -0.02376    0.01360    0.04982    0.01329

Table 4-9. Continued (testlet size 3)

                                                   Testlet model         Partial Credit model  Standard Rasch model
Condition  Sample size  Testlet No.  Local effect  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE
72         250          5            1                     …          …          …    0.02027    0.23503    0.02058
MAX                                                  0.29854    0.02358    0.24066    0.02246    0.24754    0.02576
Overall mean                                         0.02846    0.01181    0.01140    0.01148    0.02249    0.01197
Standard deviation                                   0.12285    0.00420    0.11664    0.00415    0.14342    0.00440

Table 4-10. Bias and RMSE of ability estimate recovery (EAP), testlet size 5

                                                   Testlet model         Partial Credit model  Standard Rasch model
Condition  Sample size  Testlet No.  Local effect  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE
1          1000         9            0.25           -0.19164    0.00867   -0.19664    0.00843   -0.15635    0.00793
2          1000         9            0.5             0.15941    0.00980    0.11543    0.00955    0.07367    0.00901
3          1000         9            0.75           -0.11687    0.01159   -0.03946    0.00972    0.09474    0.00840
4          1000         9            1               0.18283    0.00951    0.16801    0.00965    0.13127    0.00905
5          1000         6            0.25            0.00801    0.00638   -0.06287    0.00583    0.03093    0.00670
6          1000         6            0.5             0.50256    0.02672    0.32653    0.01931    0.12354    0.01294
7          1000         6            0.75            0.10712    0.00810    0.15731    0.00737    0.04488    0.00689
8          1000         6            1              -0.00526    0.00877   -0.08249    0.00795   -0.12016    0.00827
9          1000         3            0.25            0.06007    0.00712    0.00125    0.00679    0.00938    0.00665
10         1000         3            0.5            -0.04465    0.00653   -0.05095    0.00658   -0.19904    0.00868
11         1000         3            0.75            0.26770    0.01191    0.20402    0.01028    0.25305    0.01157
12         1000         3            1               0.25933    0.00972    0.14790    0.00791    0.15701    0.00782
13         500          9            0.25           -0.13080    0.01029   -0.22012    0.01225   -0.10559    0.00996
14         500          9            0.5            -0.08486    0.01197    0.01553    0.01106   -0.17483    0.01453
15         500          9            0.75            0.01278    0.01215   -0.06039    0.01137   -0.01830    0.01170
16         500          9            1              -0.23682    0.01767   -0.30371    0.02006   -0.31090    0.02059
17         500          6            0.25            0.05779    0.00887    0.15835    0.01077    0.06598    0.00909
18         500          6            0.5             0.03734    0.01203    0.08598    0.01175   -0.00903    0.01127
19         500          6            0.75           -0.14255    0.01127   -0.17055    0.01082   -0.24755    0.01176
20         500          6            1              -0.23726    0.01445   -0.13707    0.01232   -0.00799    0.01118
21         500          3            0.25           -0.23119    0.01149   -0.14785    0.00943   -0.16435    0.00959
22         500          3            0.5            -0.01187    0.00864   -0.07000    0.00769   -0.15143    0.00828
23         500          3            0.75           -0.18909    0.01173   -0.15529    0.01066   -0.19302    0.01106
24         500          3            1              -0.17958    0.01117   -0.09252    0.00928   -0.07032    0.00904
25         250          9            0.25           -0.03497    0.01264    0.08511    0.01317   -0.13961    0.01437
26         250          9            0.5             0.09268    0.01438    0.04789    0.01380   -0.10150    0.01634
27         250          9            0.75           -0.01476    0.01506   -0.03546    0.01554    0.10886    0.01443
28         250          9            1              -0.38487    0.02121   -0.28872    0.01855   -0.28307    0.01810
29         250          6            0.25            0.05265    0.01544    0.13213    0.01716    0.03681    0.01452
30         250          6            0.5             0.10775    0.01580    0.09121    0.01380   -0.03764    0.01291
31         250          6            0.75           -0.10867    0.01352   -0.19622    0.01602   -0.31813    0.02217
32         250          6            1               0.07402    0.01357    0.10687    0.01220    0.03626    0.01079
33         250          3            0.25            0.28391    0.01869    0.11141    0.01456    0.11499    0.01452
34         250          3            0.5             0.24272    0.10725    0.13467    0.11244   -0.02008    0.12209

Table 4-10. Continued (testlet size 5)

                                                   Testlet model         Partial Credit model  Standard Rasch model
Condition  Sample size  Testlet No.  Local effect  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE
35         250          3            0.75           -0.11239    0.01639   -0.04690    0.01527    0.08978    0.01543
36         250          3            1              -0.04974    0.01230    0.03783    0.01222   -0.02647    0.01219
MIN                                                 -0.38487    0.00638   -0.30371    0.00583   -0.31813    0.00665
MAX                                                  0.50256    0.10725    0.32653    0.11244    0.25305    0.12209
Overall mean                                         0.00144    0.01516   -0.00765    0.01455   -0.04165    0.01479
Standard deviation                                   0.18094    0.01634    0.14899    0.01718    0.14163    0.01878

Table 4-11. Rasch testlet model (testlet size 3): mean bias of ability (θ) estimate recovery (EAP) in 6 ability intervals

Condition  Sample size  Testlet No.  Local effect      θ < -2.0  -2.0 to -1.0   -1.0 to 0.0    0.0 to 1.0    1.0 to 2.0       θ > 2.0
37         1000         15           0.25              0.473167      0.248143      0.183455      0.149797      0.091889     -0.111914
38         1000         15           0.5               0.354534      0.114570      0.052191      0.027920     -0.028258     -0.190016
39         1000         15           0.75              0.370821      0.154818      0.096164      0.085521      0.046400     -0.086360
40         1000         15           1                 0.172154     -0.039595     -0.095668     -0.104551     -0.138436     -0.289969
41         1000         10           0.25              0.275666      0.152291      0.094051      0.046317     -0.037948     -0.209459
42         1000         10           0.5               0.321742      0.195670      0.127418      0.087044      0.014702     -0.146453
43         1000         10           0.75              0.005361     -0.072230     -0.123801     -0.164971     -0.230953     -0.376841
44         1000         10           1                 0.175261      0.077036      0.011355     -0.021059     -0.070775     -0.163715
45         1000         5            0.25              0.435356      0.205839      0.137949      0.095536      0.012910     -0.141757
46         1000         5            0.5               0.395263      0.187794      0.127652      0.095096      0.028981     -0.206536
47         1000         5            0.75              0.219189      0.015793     -0.053097     -0.089053     -0.145936     -0.299091
48         1000         5            1                 0.242773     -0.016552     -0.076447     -0.111252     -0.172157     -0.361526
49         500          15           0.25              0.384715      0.183385      0.137153      0.108710      0.044345     -0.187429
50         500          15           0.5               0.477372      0.267986      0.211695      0.193633      0.131396     -0.150944
51         500          15           0.75              0.237362     -0.021731     -0.067369     -0.082894     -0.132175     -0.317696
52         500          15           1                 0.097377     -0.097434     -0.134401     -0.139968     -0.177940     -0.290971
53         500          10           0.25              0.356236      0.228351      0.168639      0.126469      0.055644     -0.135595
54         500          10           0.5               0.212429      0.102587      0.040898     -0.002519     -0.078594     -0.239629
55         500          10           0.75              0.166690      0.043431     -0.030665     -0.070784     -0.135619     -0.289627
56         500          10           1                 0.211727      0.102454      0.045510      0.011864     -0.035988     -0.183253
57         500          5            0.25              0.110223     -0.060440     -0.136362     -0.186604     -0.257486     -0.462181
58         500          5            0.5               0.462159      0.251401      0.213998      0.171382      0.075050     -0.151453
59         500          5            0.75              0.358289      0.211521      0.164480      0.120848      0.032428     -0.219213
60         500          5            1                 0.444407      0.190858      0.133521      0.100451      0.028398     -0.132833
61         250          15           0.25              0.425444      0.088707      0.038860      0.014501     -0.043279     -0.203645
62         250          15           0.5               0.539062      0.356469      0.319489      0.290043      0.250876      0.020486
63         250          15           0.75              0.216494     -0.002026     -0.075173     -0.093850     -0.143783     -0.231325
64         250          15           1                 0.174028     -0.016443     -0.042212     -0.030380     -0.074964     -0.255932
65         250          10           0.25              0.303218      0.178595      0.097837      0.056333     -0.012798     -0.177952
66         250          10           0.5              -0.043971     -0.156407     -0.226762     -0.259869     -0.310529     -0.473968
67         250          10           0.75              0.142809      0.024524     -0.040944     -0.074493     -0.129313     -0.256615
68         250          10           1                 0.141291      0.054279     -0.019997     -0.056531     -0.113450     -0.224637
69         250          5            0.25              0.368075      0.246382      0.178723      0.124299      0.040473     -0.265577
70         250          5            0.5               0.555768      0.318019      0.273727      0.238187      0.132023      0.025779
71         250          5            0.75              0.306779      0.136065      0.077814      0.035790     -0.039109     -0.258084

Table 4-11. Continued

Condition  Sample size  Testlet No.  Local effect      θ < -2.0  -2.0 to -1.0   -1.0 to 0.0    0.0 to 1.0    1.0 to 2.0       θ > 2.0
72         250          5            1                 0.352490      0.160919      0.116425      0.086989      0.011101     -0.285605
Overall mean                                           0.290049      0.111529      0.053503      0.021610     -0.042024     -0.220320


Table 4-12. Partial credit model (testlet size 3): mean bias of ability (θ) estimate recovery (EAP) in 6 ability intervals

Condition  Sample size  Testlet No.  Local effect      θ < -2.0  -2.0 to -1.0   -1.0 to 0.0    0.0 to 1.0    1.0 to 2.0       θ > 2.0
37         1000         15           0.25              0.336622      0.203823      0.118733      0.070286      0.009691     -0.214690
38         1000         15           0.5               0.315843      0.153878      0.058782      0.002898     -0.070801     -0.259818
39         1000         15           0.75              0.324340      0.176789      0.072556      0.012976     -0.055217     -0.232093
40         1000         15           1                 0.079975     -0.054760     -0.153193     -0.210201     -0.273728     -0.474976
41         1000         10           0.25              0.370687      0.259955      0.189054      0.139130      0.056038     -0.192075
42         1000         10           0.5               0.355115      0.221533      0.130672      0.078644      0.000630     -0.233303
43         1000         10           0.75              0.214934      0.116662      0.026076     -0.037418     -0.125863     -0.358937
44         1000         10           1                 0.247635      0.149033      0.052803     -0.000008     -0.069264     -0.220998
45         1000         5            0.25              0.381780      0.223662      0.135451      0.088816      0.016893     -0.128480
46         1000         5            0.5               0.313348      0.168042      0.079470      0.038035     -0.023125     -0.252448
47         1000         5            0.75              0.155233      0.008629     -0.087356     -0.136890     -0.195870     -0.349476
48         1000         5            1                 0.174679     -0.018619     -0.113778     -0.163059     -0.225781     -0.413138
49         500          15           0.25              0.292574      0.178334      0.107760      0.053391     -0.024299     -0.321069
50         500          15           0.5               0.263480      0.169531      0.095938      0.049898     -0.016077     -0.315868
51         500          15           0.75              0.306953      0.110805      0.016032     -0.053743     -0.136533     -0.383503
52         500          15           1                 0.166832      0.041945     -0.043852     -0.108448     -0.185566     -0.335328
53         500          10           0.25              0.290882      0.192752      0.127494      0.087437      0.025865     -0.211482
54         500          10           0.5               0.156712      0.059820     -0.019303     -0.072568     -0.150827     -0.372046
55         500          10           0.75              0.229094      0.111338      0.010573     -0.053234     -0.139269     -0.376540
56         500          10           1                 0.181234      0.075932     -0.007133     -0.062984     -0.129794     -0.355068
57         500          5            0.25              0.035895     -0.075835     -0.171228     -0.228269     -0.291249     -0.486819
58         500          5            0.5               0.478965      0.325959      0.257307      0.199297      0.099948     -0.137633
59         500          5            0.75              0.280798      0.197895      0.130120      0.078434      0.003367     -0.224228
60         500          5            1                 0.324746      0.154661      0.061032      0.009318     -0.064787     -0.239826
61         250          15           0.25              0.282271      0.032899     -0.039813     -0.077479     -0.134475     -0.290433
62         250          15           0.5               0.472681      0.359737      0.289173      0.218702      0.150106     -0.141069
63         250          15           0.75              0.217801      0.073962     -0.043076     -0.106858     -0.176700     -0.288534
64         250          15           1                 0.189103      0.057994     -0.030791     -0.088302     -0.178435     -0.440313
65         250          10           0.25              0.248906      0.128714      0.029408     -0.018945     -0.093579     -0.359625
66         250          10           0.5               0.095283     -0.016983     -0.111733     -0.154756     -0.216678     -0.491812
67         250          10           0.75              0.280561      0.145899      0.043611     -0.017089     -0.092554     -0.342466
68         250          10           1                 0.027517     -0.066781     -0.164813     -0.221067     -0.295618     -0.509271
69         250          5            0.25              0.343087      0.267476      0.181118      0.129846      0.061416     -0.212631
70         250          5            0.5               0.474667      0.308877      0.230810      0.184408      0.079458     -0.034207
71         250          5            0.75              0.208783      0.092116      0.002041     -0.052994     -0.126155     -0.350564
72         250          5            1                 0.446835      0.315588      0.241987      0.204057      0.137092     -0.138479

Table 4-12. Continued

Condition  Sample size  Testlet No.  Local effect      θ < -2.0  -2.0 to -1.0   -1.0 to 0.0    0.0 to 1.0    1.0 to 2.0       θ > 2.0
Overall mean                                           0.265718      0.134757      0.047276     -0.006076     -0.079215     -0.296923
Standard deviation                                     0.114902      0.112429      0.119329      0.120615      0.117579      0.114145

Table 4-13. Standard Rasch model (testlet size 3): mean bias of ability (θ) estimate recovery (EAP) in 6 ability intervals

Condition  Sample size  Testlet No.  Local effect      θ < -2.0  -2.0 to -1.0   -1.0 to 0.0    0.0 to 1.0    1.0 to 2.0       θ > 2.0
37         1000         15           0.25              0.377349      0.256007      0.174512      0.123467      0.066063     -0.158090
38         1000         15           0.5               0.288100      0.142818      0.053952     -0.003419     -0.074942     -0.265710
39         1000         15           0.75              0.278403      0.145130      0.047520     -0.012507     -0.080783     -0.257701
40         1000         15           1                 0.101816     -0.018816     -0.113572     -0.170834     -0.232747     -0.427665
41         1000         10           0.25              0.335641      0.235327      0.168284      0.116694      0.036510     -0.205762
42         1000         10           0.5               0.344568      0.225106      0.140802      0.087759      0.012315     -0.217320
43         1000         10           0.75              0.300918      0.212029      0.123561      0.057606     -0.030350     -0.256467
44         1000         10           1                 0.193358      0.105935      0.014537     -0.041603     -0.108161     -0.261854
45         1000         5            0.25              0.358078      0.212635      0.130214      0.080768      0.012205     -0.136275
46         1000         5            0.5               0.270038      0.136912      0.052790      0.009639     -0.051126     -0.274681
47         1000         5            0.75              0.088620     -0.045237     -0.134873     -0.186832     -0.242925     -0.400196
48         1000         5            1                 0.106223     -0.073708     -0.163902     -0.214979     -0.276585     -0.462143
49         500          15           0.25              0.367581      0.248949      0.175592      0.120463      0.044052     -0.251761
50         500          15           0.5               0.374032      0.274207      0.197322      0.149253      0.082147     -0.219028
51         500          15           0.75              0.244773      0.050650     -0.042050     -0.110614     -0.193649     -0.444443
52         500          15           1                 0.188429      0.061434     -0.025102     -0.090149     -0.168285     -0.320278
53         500          10           0.25              0.364199      0.263265      0.196480      0.155625      0.092914     -0.146044
54         500          10           0.5               0.322251      0.221667      0.140289      0.085397      0.005789     -0.217023
55         500          10           0.75              0.226423      0.103118      0.000191     -0.064436     -0.151322     -0.390154
56         500          10           1                 0.056398     -0.050612     -0.134877     -0.191533     -0.259697     -0.487368
57         500          5            0.25              0.010390     -0.103156     -0.199253     -0.256550     -0.320193     -0.517764
58         500          5            0.5               0.503030      0.348689      0.279222      0.220309      0.119510     -0.119608
59         500          5            0.75              0.434021      0.346219      0.276517      0.224106      0.148551     -0.080009
60         500          5            1                 0.439148      0.265474      0.170702      0.118328      0.043324     -0.133062
61         250          15           0.25              0.333248      0.103105      0.038560      0.001722     -0.053561     -0.214062
62         250          15           0.5               0.334339      0.229383      0.161915      0.085442      0.019557     -0.273484
63         250          15           0.75              0.115012     -0.012460     -0.124758     -0.185956     -0.258118     -0.375915
64         250          15           1                 0.276731      0.158015      0.074835      0.017737     -0.073405     -0.329492
65         250          10           0.25              0.237986      0.128453      0.034612     -0.017241     -0.092265     -0.345118
66         250          10           0.5               0.012850     -0.090193     -0.183208     -0.231655     -0.295516     -0.564457
67         250          10           0.75              0.303445      0.177303      0.079462      0.018418     -0.060361     -0.294325
68         250          10           1                -0.045749     -0.130230     -0.226555     -0.285692     -0.361091     -0.571902
69         250          5            0.25              0.401088      0.333323      0.249247      0.194643      0.126869     -0.137617
70         250          5            0.5               0.446812      0.297214      0.225627      0.179150      0.076715     -0.035870
71         250          5            0.75              0.266258      0.162389      0.077149      0.020855     -0.051654     -0.275809

69


condition Sample

size Testlet

No. Local effect 0.2 0.10.2 0.00.1 0.10.0 0.20.1 0.2


72

1 0.447285 0.327455 0.257638 0.216147 0.151092 -0.120725 Overall mean

0.269530 0.145772 0.060927 0.006098 -0.066642 -0.283033


70

Table 4-14. Rasch testlet model (testlet size 5) bias of ability (θ) estimate recovery (EAP) with 6 different ability intervals

Cond.  Sample  No. of    Local      θ < -2.0    -2.0 to    -1.0 to     0.0 to     1.0 to    θ > 2.0
No.    size    testlets  effect                    -1.0        0.0        1.0        2.0
1      1000    9         0.25       0.094350  -0.095255  -0.174461  -0.213346  -0.281807  -0.428692
2      1000    9         0.5        0.436489   0.237122   0.179649   0.147421   0.082498  -0.098116
3      1000    9         0.75       0.122463  -0.050085  -0.104181  -0.127667  -0.196337  -0.321018
4      1000    9         1          0.442925   0.250876   0.195557   0.171773   0.105662  -0.004987
5      1000    6         0.25       0.379519   0.091610   0.030039  -0.010922  -0.073238  -0.322554
6      1000    6         0.5        0.806265   0.587031   0.530736   0.488181   0.419653   0.115971
7      1000    6         0.75       0.421201   0.186316   0.125033   0.086932   0.021690  -0.247222
8      1000    6         1          0.357342   0.082943   0.013761  -0.023014  -0.081773  -0.347509
9      1000    3         0.25       0.362618   0.155299   0.084661   0.030937  -0.039237  -0.213096
10     1000    3         0.5        0.312980   0.065283  -0.015409  -0.071428  -0.131333  -0.272589
11     1000    3         0.75       0.560732   0.350580   0.290498   0.234278   0.171695   0.001140
12     1000    3         1          0.551760   0.339561   0.285196   0.233684   0.176530  -0.038358
13     500     9         0.25       0.145720  -0.033425  -0.114898  -0.143125  -0.207488  -0.400981
14     500     9         0.5        0.195187  -0.009050  -0.066950  -0.102276  -0.156287  -0.310182
15     500     9         0.75       0.315093   0.077165   0.023492   0.001618  -0.062240  -0.183802
16     500     9         1          0.084424  -0.159926  -0.233283  -0.259885  -0.321462  -0.424321
17     500     6         0.25       0.456119   0.138420   0.074700   0.034685  -0.039098  -0.245486
18     500     6         0.5        0.290231   0.128894   0.062876   0.020589  -0.044900  -0.247926
19     500     6         0.75       0.171234  -0.057490  -0.132013  -0.164068  -0.217203  -0.436617
20     500     6         1          0.150614  -0.149303  -0.226614  -0.265776  -0.326516  -0.478286
21     500     3         0.25       0.174309  -0.132377  -0.207516  -0.258158  -0.313038  -0.455341
22     500     3         0.5        0.255474   0.088922   0.017856  -0.042086  -0.097710  -0.226965
23     500     3         0.75       0.094036  -0.091653  -0.173193  -0.223266  -0.274488  -0.389311
24     500     3         1          0.148404  -0.087851  -0.155808  -0.202989  -0.257186  -0.429154
25     250     9         0.25       0.223756   0.047142  -0.020527  -0.060633  -0.128812  -0.269732
26     250     9         0.5        0.428441   0.184559   0.111482   0.065017  -0.005950  -0.094928
27     250     9         0.75       0.237617   0.041268   0.002304  -0.029471  -0.083137  -0.206376
28     250     9         1         -0.121960  -0.317932  -0.390940  -0.413389  -0.458683  -0.459374
29     250     6         0.25       0.309336   0.135764   0.074997   0.035037  -0.040391  -0.297998
30     250     6         0.5        0.380690   0.168035   0.126960   0.096969   0.033264  -0.191007
31     250     6         0.75       0.253362  -0.020784  -0.083982  -0.115322  -0.160088  -0.540590
32     250     6         1          0.314140   0.133842   0.093918   0.053564   0.002582  -0.247434
33     250     3         0.25       0.576460   0.384467   0.313977   0.263214   0.192252   0.044460
34     250     3         0.5        3.193850   1.675706   0.751623  -0.237276  -1.192136  -1.719127
35     250     3         0.75       0.117816  -0.015918  -0.074050  -0.135772  -0.192788  -0.400226
36     250     3         1          0.198452   0.029064  -0.021486  -0.078501  -0.140366  -0.262858
Overall mean                        0.373374   0.121078   0.033167  -0.033735  -0.119941  -0.306961

Table 4-15. Partial credit model (testlet size 5) bias of ability (θ) estimate recovery (EAP) with 6 different ability intervals

Cond.  Sample  No. of    Local      θ < -2.0    -2.0 to    -1.0 to     0.0 to     1.0 to    θ > 2.0
No.    size    testlets  effect                    -1.0        0.0        1.0        2.0
1      1000    9         0.25       0.057694  -0.074844  -0.173165  -0.226777  -0.293046  -0.493063
2      1000    9         0.5        0.357073   0.227225   0.146618   0.092851   0.026046  -0.211425
3      1000    9         0.75       0.183253   0.067466  -0.015145  -0.064679  -0.142554  -0.330352
4      1000    9         1          0.414649   0.288070   0.197109   0.140516   0.057407  -0.106933
5      1000    6         0.25       0.254284   0.056994  -0.037817  -0.089405  -0.166245  -0.386558
6      1000    6         0.5        0.566313   0.442467   0.356654   0.305465   0.225716  -0.009284
7      1000    6         0.75       0.445097   0.286063   0.183949   0.122392   0.034249  -0.192302
8      1000    6         1          0.179819   0.039463  -0.057599  -0.109310  -0.178169  -0.357742
9      1000    3         0.25       0.255866   0.110652   0.021129  -0.021659  -0.100628  -0.311067
10     1000    3         0.5        0.255503   0.078655  -0.022398  -0.072849  -0.147754  -0.331184
11     1000    3         0.75       0.465425   0.307731   0.224380   0.172734   0.095564  -0.136830
12     1000    3         1          0.424246   0.261036   0.176704   0.121079   0.041691  -0.259527
13     500     9         0.25       0.005684  -0.103587  -0.196281  -0.239268  -0.298955  -0.544116
14     500     9         0.5        0.257783   0.125218   0.046717  -0.013190  -0.070077  -0.279283
15     500     9         0.75       0.214159   0.048767  -0.033883  -0.088606  -0.162287  -0.352665
16     500     9         1         -0.052872  -0.202811  -0.289531  -0.336450  -0.396747  -0.559555
17     500     6         0.25       0.435278   0.257183   0.176986   0.133877   0.062489  -0.090830
18     500     6         0.5        0.290807   0.206287   0.116569   0.060745  -0.011217  -0.173629
19     500     6         0.75       0.059018  -0.062851  -0.155885  -0.200646  -0.256321  -0.387814
20     500     6         1          0.118874  -0.030462  -0.119601  -0.169003  -0.232470  -0.328117
21     500     3         0.25       0.198857  -0.028118  -0.121680  -0.173547  -0.245166  -0.437178
22     500     3         0.5        0.133467   0.044319  -0.042115  -0.095718  -0.158739  -0.316826
23     500     3         0.75       0.089688  -0.035384  -0.139598  -0.189674  -0.257481  -0.409490
24     500     3         1          0.162975   0.012675  -0.068295  -0.111421  -0.172786  -0.376298
25     250     9         0.25       0.283943   0.171075   0.102312   0.057526   0.003210  -0.172672
26     250     9         0.5        0.322473   0.171561   0.077414   0.008952  -0.061644  -0.198336
27     250     9         0.75       0.165109   0.050177  -0.006297  -0.061494  -0.109914  -0.271395
28     250     9         1         -0.058034  -0.199634  -0.286395  -0.329072  -0.374657  -0.446141
29     250     6         0.25       0.291949   0.217848   0.150383   0.116097   0.052439  -0.114647
30     250     6         0.5        0.293698   0.177147   0.115804   0.074011   0.006296  -0.170862
31     250     6         0.75       0.045159  -0.091565  -0.168066  -0.213752  -0.266624  -0.512562
32     250     6         1          0.321524   0.210881   0.136044   0.071290   0.000412  -0.199886
33     250     3         0.25       0.303234   0.202762   0.132833   0.100714   0.039509  -0.107170
34     250     3         0.5        3.077542   1.569886   0.643957  -0.344942  -1.302150  -1.829526
35     250     3         0.75       0.149291   0.083300   0.000826  -0.071642  -0.152472  -0.447450
36     250     3         1          0.244549   0.134084   0.064922   0.012723  -0.058589  -0.223248
Overall mean                        0.311483   0.139437   0.031599  -0.045337  -0.138102  -0.335443
Standard deviation                  0.496357   0.285833   0.182890   0.157178   0.245830   0.291249

Table 4-16. Standard Rasch model (testlet size 5) bias of ability (θ) estimate recovery (EAP) with 6 different ability intervals

Cond.  Sample  No. of    Local      θ < -2.0    -2.0 to    -1.0 to     0.0 to     1.0 to    θ > 2.0
No.    size    testlets  effect                    -1.0        0.0        1.0        2.0
1      1000    9         0.25       0.099971  -0.034221  -0.133403  -0.186551  -0.252010  -0.452877
2      1000    9         0.5        0.318107   0.186509   0.104965   0.050271  -0.015544  -0.253090
3      1000    9         0.75       0.310948   0.202532   0.120125   0.068132  -0.008049  -0.195075
4      1000    9         1          0.360180   0.246459   0.160534   0.104280   0.025703  -0.138183
5      1000    6         0.25       0.333569   0.147097   0.057703   0.004335  -0.071557  -0.289829
6      1000    6         0.5        0.348352   0.236751   0.154771   0.100999   0.026762  -0.203307
7      1000    6         0.75       0.319864   0.171062   0.072551   0.009943  -0.077491  -0.297234
8      1000    6         1          0.126029  -0.002200  -0.093594  -0.147499  -0.213030  -0.391729
9      1000    3         0.25       0.248496   0.116344   0.031513  -0.013947  -0.091865  -0.298587
10     1000    3         0.5        0.091450  -0.073315  -0.168065  -0.222511  -0.293491  -0.471152
11     1000    3         0.75       0.502321   0.354719   0.274381   0.222110   0.145063  -0.079361
12     1000    3         1          0.418671   0.266539   0.186816   0.130330   0.051680  -0.236261
13     500     9         0.25       0.135963   0.020694  -0.078640  -0.127076  -0.193024  -0.445134
14     500     9         0.5        0.080448  -0.057799  -0.140910  -0.206017  -0.267658  -0.482120
15     500     9         0.75       0.266265   0.095822   0.009320  -0.048039  -0.124395  -0.318598
16     500     9         1         -0.049814  -0.205177  -0.294881  -0.344937  -0.410871  -0.582785
17     500     6         0.25       0.344499   0.166983   0.086702   0.040895  -0.034596  -0.191486
18     500     6         0.5        0.200928   0.113445   0.022554  -0.034379  -0.109346  -0.276489
19     500     6         0.75      -0.012288  -0.136464  -0.230731  -0.278090  -0.339425  -0.480841
20     500     6         1          0.260057   0.105714   0.011334  -0.042154  -0.110330  -0.210707
21     500     3         0.25       0.184735  -0.044013  -0.138392  -0.190226  -0.261472  -0.453125
22     500     3         0.5        0.056858  -0.034257  -0.122437  -0.177865  -0.243289  -0.404036
23     500     3         0.75       0.054648  -0.072215  -0.177272  -0.227534  -0.296165  -0.450183
24     500     3         1          0.186734   0.036123  -0.045174  -0.089354  -0.152590  -0.358227
25     250     9         0.25       0.074001  -0.047596  -0.123366  -0.169803  -0.222756  -0.401669
26     250     9         0.5        0.190891   0.030451  -0.069547  -0.142738  -0.219874  -0.367389
27     250     9         0.75       0.317890   0.197999   0.141188   0.083001   0.026529  -0.145503
28     250     9         1         -0.038382  -0.186940  -0.277641  -0.325517  -0.383315  -0.475526
29     250     6         0.25       0.206191   0.127465   0.055987   0.019072  -0.046796  -0.218904
30     250     6         0.5        0.173974   0.052087  -0.012475  -0.056456  -0.124616  -0.303180
31     250     6         0.75      -0.064167  -0.206921  -0.289422  -0.337861  -0.390709  -0.638718
32     250     6         1          0.265819   0.147722   0.067491  -0.002090  -0.077987  -0.285482
33     250     3         0.25       0.313663   0.208113   0.136700   0.103557   0.041626  -0.105400
34     250     3         0.5        2.922879   1.415147   0.489106  -0.499631  -1.456934  -1.983828
35     250     3         0.75       0.294057   0.222541   0.138180   0.064553  -0.017974  -0.315192
36     250     3         1          0.185198   0.072606   0.002331  -0.052120  -0.127573  -0.298190
Overall mean                        0.278583   0.106661  -0.001992  -0.081137  -0.175483  -0.374983

Table 4-17. Rasch testlet model (testlet size 3) RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals

Cond.  Sample  No. of    Local      θ < -2.0    -2.0 to    -1.0 to     0.0 to     1.0 to    θ > 2.0
No.    size    testlets  effect                    -1.0        0.0        1.0        2.0
37     1000    15        0.25       0.078337   0.026778   0.013562   0.013996   0.022898   0.048603
38     1000    15        0.5        0.058554   0.023512   0.013291   0.011807   0.020247   0.043903
39     1000    15        0.75       0.109081   0.025007   0.015425   0.011851   0.024913   0.015921
40     1000    15        1          0.042240   0.021694   0.016620   0.015158   0.024771   0.045446
41     1000    10        0.25       0.078190   0.019787   0.011747   0.011594   0.021015   0.003677
42     1000    10        0.5        0.093387   0.022832   0.013078   0.013353   0.021377   0.005111
43     1000    10        0.75       0.073321   0.016861   0.013579   0.011982   0.026210   0.057016
44     1000    10        1          0.070156   0.027408   0.014366   0.014792   0.020166   0.083354
45     1000    5         0.25       0.084415   0.019277   0.012643   0.012258   0.018854   0.027661
46     1000    5         0.5        0.084047   0.023888   0.012558   0.012447   0.020792   0.084284
47     1000    5         0.75       0.059131   0.019724   0.011052   0.012973   0.024332   0.122048
48     1000    5         1          0.066213   0.018199   0.012070   0.011800   0.021689   0.081700
49     500     15        0.25       0.192334   0.033274   0.019610   0.017669   0.023263   0.023885
50     500     15        0.5        0.147215   0.039461   0.020473   0.022246   0.027377   0.028104
51     500     15        0.75       0.088903   0.030005   0.018717   0.020319   0.038214   0.321320
52     500     15        1          0.086655   0.032271   0.022941   0.021482   0.034426   0.075319
53     500     10        0.25       0.119339   0.029762   0.020720   0.024285   0.024943   0.150809
54     500     10        0.5        0.080430   0.033565   0.018110   0.014816   0.041293   0.006453
55     500     10        0.75       0.082724   0.025981   0.017898   0.015908   0.038934   0.015708
56     500     10        1          0.098487   0.033379   0.020264   0.018472   0.030859   0.023337
57     500     5         0.25       0.086273   0.028639   0.018110   0.020713   0.035449   0.026388
58     500     5         0.5        0.143987   0.037974   0.019936   0.021455   0.032242   0.071056
59     500     5         0.75       0.156287   0.036786   0.022283   0.019618   0.028308   0.131062
60     500     5         1          0.184270   0.038350   0.019849   0.016660   0.029845   0.013516
61     250     15        0.25       0.292779   0.039214   0.027430   0.023809   0.034594   0.087950
62     250     15        0.5        0.218673   0.084971   0.042775   0.036884   0.055996   0.086146
63     250     15        0.75       0.144418   0.049953   0.025300   0.027950   0.047892   0.264669
64     250     15        1          0.140580   0.042917   0.021739   0.025272   0.049209   0.330666
65     250     10        0.25       0.178133   0.057100   0.027217   0.021679   0.040607   0.114075
66     250     10        0.5        0.120084   0.049634   0.029591   0.037874   0.055365   0.092475
67     250     10        0.75       0.113360   0.051918   0.024542   0.025663   0.053502   0.196613
68     250     10        1          0.155949   0.036788   0.026793   0.027741   0.042362   0.004325
69     250     5         0.25       0.134323   0.057182   0.030049   0.026274   0.042960   0.265263
70     250     5         0.5        0.182210   0.067420   0.030899   0.032589   0.045741   0.266494
71     250     5         0.75       0.119217   0.047717   0.022420   0.024159   0.041848   0.088415
72     250     5         1          0.108693   0.045403   0.029070   0.027282   0.041949   0.020853
Overall mean                        0.118678   0.035962   0.020465   0.020134   0.033457   0.092323

Table 4-18. Partial credit model (testlet size 3) RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals

Cond.  Sample  No. of    Local      θ < -2.0    -2.0 to    -1.0 to     0.0 to     1.0 to    θ > 2.0
No.    size    testlets  effect                    -1.0        0.0        1.0        2.0
37     1000    15        0.25       0.065677   0.023965   0.011388   0.011902   0.023170   0.067682
38     1000    15        0.5        0.060728   0.023976   0.013179   0.011456   0.020410   0.043749
39     1000    15        0.75       0.099028   0.024728   0.013634   0.010496   0.023721   0.010887
40     1000    15        1          0.048958   0.019700   0.017045   0.017225   0.031831   0.086938
41     1000    10        0.25       0.094635   0.027001   0.014954   0.013781   0.019667   0.012397
42     1000    10        0.5        0.090130   0.024572   0.013120   0.013289   0.021730   0.009448
43     1000    10        0.75       0.083380   0.018589   0.011610   0.010228   0.022013   0.024883
44     1000    10        1          0.078642   0.028244   0.014527   0.013482   0.020061   0.096752
45     1000    5         0.25       0.082899   0.019888   0.012796   0.012045   0.019864   0.030363
46     1000    5         0.5        0.080121   0.023337   0.011771   0.011788   0.020182   0.079555
47     1000    5         0.75       0.064722   0.019505   0.011721   0.014189   0.025794   0.124530
48     1000    5         1          0.067890   0.017737   0.012473   0.012460   0.024051   0.097091
49     500     15        0.25       0.154701   0.035439   0.017829   0.016028   0.022127   0.020737
50     500     15        0.5        0.105886   0.034800   0.017357   0.017181   0.026016   0.025455
51     500     15        0.75       0.097656   0.027177   0.016228   0.018008   0.034151   0.270534
52     500     15        1          0.086244   0.028811   0.018696   0.018263   0.034193   0.068944
53     500     10        0.25       0.111764   0.029271   0.018249   0.023287   0.025120   0.139872
54     500     10        0.5        0.078844   0.030477   0.018031   0.014985   0.043723   0.016498
55     500     10        0.75       0.090798   0.028041   0.017230   0.015321   0.039667   0.031949
56     500     10        1          0.098187   0.032181   0.018453   0.018955   0.032947   0.000127
57     500     5         0.25       0.091645   0.029466   0.018295   0.022806   0.038532   0.037234
58     500     5         0.5        0.147778   0.045405   0.021106   0.022665   0.033154   0.080828
59     500     5         0.75       0.137374   0.036145   0.020466   0.019162   0.028722   0.122420
60     500     5         1          0.154455   0.036054   0.016647   0.016980   0.029782   0.033431
61     250     15        0.25       0.228860   0.042316   0.025572   0.026715   0.039057   0.039881
62     250     15        0.5        0.190427   0.086004   0.037864   0.029954   0.044274   0.067882
63     250     15        0.75       0.132743   0.045510   0.021947   0.024455   0.049834   0.315181
64     250     15        1          0.139310   0.033277   0.017402   0.021319   0.052945   0.171415
65     250     10        0.25       0.169489   0.053985   0.023519   0.021089   0.044170   0.138256
66     250     10        0.5        0.121261   0.048928   0.024269   0.027797   0.046095   0.047430
67     250     10        0.75       0.120247   0.054696   0.024694   0.022111   0.050509   0.222708
68     250     10        1          0.136154   0.038438   0.026148   0.033798   0.059179   0.047885
69     250     5         0.25       0.138147   0.061613   0.029549   0.026912   0.045897   0.267080
70     250     5         0.5        0.175470   0.065379   0.026202   0.027847   0.042611   0.178891
71     250     5         0.75       0.110521   0.045314   0.022411   0.023566   0.046565   0.070210
72     250     5         1          0.147110   0.059226   0.037350   0.033784   0.045651   0.060984
Overall mean                        0.113386   0.036089   0.019270   0.019315   0.034095   0.087781

Table 4-19. Standard Rasch model (testlet size 3) RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals

Cond.  Sample  No. of    Local      θ < -2.0    -2.0 to    -1.0 to     0.0 to     1.0 to    θ > 2.0
No.    size    testlets  effect                    -1.0        0.0        1.0        2.0
37     1000    15        0.25       0.071970   0.027048   0.012636   0.012719   0.023061   0.060829
38     1000    15        0.5        0.060926   0.023909   0.012678   0.011401   0.020533   0.045204
39     1000    15        0.75       0.090384   0.023542   0.013484   0.011052   0.022812   0.019934
40     1000    15        1          0.050341   0.019446   0.016044   0.016289   0.029408   0.083912
41     1000    10        0.25       0.086566   0.025718   0.014281   0.012950   0.019438   0.008700
42     1000    10        0.5        0.086802   0.025173   0.012795   0.013299   0.021531   0.011279
43     1000    10        0.75       0.094025   0.023338   0.011782   0.010458   0.020305   0.004227
44     1000    10        1          0.073421   0.027330   0.014314   0.014277   0.020211   0.100202
45     1000    5         0.25       0.080224   0.019463   0.012707   0.012450   0.019174   0.027102
46     1000    5         0.5        0.075284   0.022845   0.011400   0.011556   0.020183   0.071554
47     1000    5         0.75       0.062711   0.020036   0.012618   0.015555   0.026669   0.131456
48     1000    5         1          0.063622   0.018998   0.013537   0.014393   0.026115   0.105073
49     500     15        0.25       0.174828   0.040648   0.019626   0.017030   0.022111   0.000227
50     500     15        0.5        0.127802   0.041100   0.018820   0.018756   0.026461   0.005880
51     500     15        0.75       0.093012   0.025979   0.016592   0.019151   0.037344   0.303639
52     500     15        1          0.089175   0.028986   0.018560   0.017811   0.033151   0.060292
53     500     10        0.25       0.123712   0.033044   0.020933   0.025251   0.024102   0.154511
54     500     10        0.5        0.098728   0.040028   0.018633   0.015801   0.039165   0.021575
55     500     10        0.75       0.089436   0.027861   0.017481   0.015466   0.040531   0.034278
56     500     10        1          0.088494   0.027750   0.020219   0.022417   0.041304   0.037713
57     500     5         0.25       0.089638   0.029695   0.019246   0.024006   0.040744   0.044998
58     500     5         0.5        0.153260   0.047723   0.022317   0.023691   0.034507   0.088560
59     500     5         0.75       0.180469   0.048970   0.027495   0.022214   0.031881   0.162449
60     500     5         1          0.186296   0.042333   0.019634   0.016705   0.029095   0.007079
61     250     15        0.25       0.244394   0.045048   0.025937   0.023603   0.034617   0.039916
62     250     15        0.5        0.147001   0.065525   0.028917   0.025894   0.037003   0.022912
63     250     15        0.75       0.113811   0.045457   0.024075   0.027296   0.053748   0.376276
64     250     15        1          0.157073   0.034502   0.018100   0.019511   0.046642   0.240287
65     250     10        0.25       0.167921   0.055283   0.023304   0.021270   0.042570   0.130323
66     250     10        0.5        0.123050   0.048287   0.027207   0.033872   0.052401   0.093415
67     250     10        0.75       0.125161   0.056347   0.024480   0.022407   0.048606   0.187861
68     250     10        1          0.134140   0.041803   0.028784   0.038777   0.067885   0.084995
69     250     5         0.25       0.154649   0.069085   0.032520   0.030000   0.047157   0.292877
70     250     5         0.5        0.168907   0.064514   0.025804   0.028420   0.042436   0.180015
71     250     5         0.75       0.121488   0.053159   0.021906   0.023698   0.041709   0.099905
72     250     5         1          0.149801   0.060850   0.037591   0.034294   0.045378   0.056767
Overall mean                        0.116626   0.037523   0.019902   0.020104   0.034166   0.094340

Table 4-20. Rasch testlet model (testlet size 5) RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals

Cond.  Sample  No. of    Local      θ < -2.0    -2.0 to    -1.0 to     0.0 to     1.0 to    θ > 2.0
No.    size    testlets  effect                    -1.0        0.0        1.0        2.0
1      1000    9         0.25       0.048535   0.020547   0.013542   0.014905   0.025987   0.132773
2      1000    9         0.5        0.104665   0.027705   0.016009   0.016919   0.019188   0.098599
3      1000    9         0.75       0.048658   0.022476   0.013048   0.016071   0.030985   0.114531
4      1000    9         1          0.081613   0.028248   0.015910   0.014095   0.021880   0.069132
5      1000    6         0.25       0.070898   0.022451   0.010947   0.011000   0.024396   0.062862
6      1000    6         0.5        0.172489   0.050252   0.031796   0.026443   0.036537   0.004791
7      1000    6         0.75       0.093437   0.024567   0.012562   0.013693   0.022836   0.080497
8      1000    6         1          0.119667   0.021661   0.015078   0.010693   0.021798   0.100457
9      1000    3         0.25       0.087055   0.020796   0.012494   0.011860   0.019987   0.034308
10     1000    3         0.5        0.088673   0.022802   0.011603   0.010394   0.022048   0.026025
11     1000    3         0.75       0.114683   0.029492   0.020703   0.015341   0.022141   0.175543
12     1000    3         1          0.119646   0.030957   0.016737   0.016704   0.028018   0.003056
13     500     9         0.25       0.073423   0.028199   0.017699   0.019940   0.030755   0.112060
14     500     9         0.5        0.104189   0.034109   0.018287   0.020583   0.034834   0.062598
15     500     9         0.75       0.110667   0.032705   0.017843   0.016505   0.026520   0.141207
16     500     9         1          0.072086   0.029727   0.024652   0.026943   0.048265   0.173372
17     500     6         0.25       0.093659   0.037446   0.020469   0.014655   0.026255   0.077015
18     500     6         0.5        0.088630   0.032820   0.018278   0.020875   0.030910   0.080459
19     500     6         0.75       0.098950   0.028540   0.021342   0.022284   0.033999   0.126608
20     500     6         1          0.071756   0.040521   0.024634   0.027753   0.049917   0.147327
21     500     3         0.25       0.146400   0.027316   0.019713   0.022029   0.054734   0.101774
22     500     3         0.5        0.094459   0.034377   0.014855   0.018803   0.030531   0.068490
23     500     3         0.75       0.083904   0.028614   0.019628   0.022143   0.038706   0.030053
24     500     3         1          0.072267   0.031387   0.019253   0.018611   0.040400   0.066370
25     250     9         0.25       0.104921   0.036662   0.020843   0.030739   0.053081   0.013633
26     250     9         0.5        0.137340   0.059429   0.025746   0.024807   0.045477   0.180335
27     250     9         0.75       0.136443   0.040754   0.027155   0.025382   0.051492   0.075550
28     250     9         1          0.107111   0.047915   0.050709   0.045143   0.084078   0.391390
29     250     6         0.25       0.144626   0.045561   0.025586   0.027354   0.038179   0.026828
30     250     6         0.5        0.138057   0.055004   0.026784   0.026009   0.041708   0.362921
31     250     6         0.75       0.199151   0.049567   0.024088   0.024056   0.045059   0.011575
32     250     6         1          0.127239   0.046378   0.022620   0.025210   0.046614   0.234785
33     250     3         0.25       0.183628   0.067105   0.045098   0.031144   0.051958   0.060797
34     250     3         0.5        1.788592   0.244154   0.092046   0.186136   0.065807   0.192168
35     250     3         0.75       0.156430   0.046850   0.026329   0.027011   0.046847   0.061260
36     250     3         1          0.105701   0.042695   0.021221   0.021594   0.048607   0.138145
Overall mean                        0.155268   0.041383   0.023203   0.025662   0.037793   0.106647
Standard deviation                  0.282199   0.036636   0.014413   0.028410   0.014431   0.087627

Table 4-21. Partial credit model (testlet size 5) RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals

Cond.  Sample  No. of    Local      θ < -2.0    -2.0 to    -1.0 to     0.0 to     1.0 to    θ > 2.0
No.    size    testlets  effect                    -1.0        0.0        1.0        2.0
1      1000    9         0.25       0.055078   0.019749   0.012548   0.014505   0.026118   0.128978
2      1000    9         0.5        0.093575   0.027011   0.014290   0.015383   0.020417   0.085914
3      1000    9         0.75       0.062688   0.019097   0.011141   0.013724   0.025965   0.079033
4      1000    9         1          0.090919   0.029425   0.016153   0.012023   0.019412   0.075747
5      1000    6         0.25       0.071424   0.022914   0.010005   0.011628   0.028386   0.072207
6      1000    6         0.5        0.124677   0.038863   0.023997   0.018060   0.025961   0.021936
7      1000    6         0.75       0.105384   0.028632   0.013467   0.012453   0.024295   0.116020
8      1000    6         1          0.069542   0.019988   0.013672   0.012453   0.024686   0.120254
9      1000    3         0.25       0.071793   0.019053   0.011908   0.011416   0.021645   0.052494
10     1000    3         0.5        0.080230   0.022653   0.011691   0.010857   0.022459   0.044464
11     1000    3         0.75       0.100348   0.027188   0.017871   0.013633   0.020760   0.158864
12     1000    3         1          0.100129   0.025238   0.013624   0.013346   0.022882   0.052848
13     500     9         0.25       0.087658   0.030786   0.021063   0.021593   0.037533   0.131595
14     500     9         0.5        0.120225   0.035837   0.017506   0.018423   0.032209   0.072691
15     500     9         0.75       0.108335   0.030604   0.014933   0.016627   0.026840   0.091080
16     500     9         1          0.075110   0.030694   0.026972   0.032415   0.054812   0.181353
17     500     6         0.25       0.119522   0.049433   0.021642   0.017807   0.026338   0.109247
18     500     6         0.5        0.104570   0.037916   0.018646   0.020389   0.030663   0.073772
19     500     6         0.75       0.094708   0.027392   0.021657   0.023170   0.036143   0.138768
20     500     6         1          0.091668   0.037460   0.020997   0.021922   0.042133   0.108095
21     500     3         0.25       0.153340   0.032996   0.016169   0.017654   0.047098   0.080737
22     500     3         0.5        0.094481   0.032526   0.013223   0.019982   0.033307   0.096095
23     500     3         0.75       0.096467   0.026004   0.019063   0.019189   0.036804   0.054503
24     500     3         1          0.087235   0.031068   0.017224   0.015474   0.034156   0.001936
25     250     9         0.25       0.133246   0.043864   0.021702   0.027811   0.039083   0.049585
26     250     9         0.5        0.145823   0.058810   0.024323   0.023814   0.043880   0.212430
27     250     9         0.75       0.117474   0.038922   0.023324   0.026192   0.049893   0.067516
28     250     9         1          0.117828   0.041894   0.037692   0.036645   0.068336   0.253170
29     250     6         0.25       0.111735   0.051914   0.028435   0.029285   0.036509   0.048822
30     250     6         0.5        0.135587   0.055161   0.023390   0.024883   0.040852   0.313178
31     250     6         0.75       0.256954   0.051364   0.026250   0.028499   0.056166   0.027842
32     250     6         1          0.153590   0.046054   0.020337   0.025486   0.048274   0.178885
33     250     3         0.25       0.117107   0.047913   0.032525   0.024273   0.046733   0.025307
34     250     3         0.5        1.727757   0.225916   0.080002   0.195150   0.050771   0.121486
35     250     3         0.75       0.170151   0.046210   0.023578   0.025170   0.042063   0.056915
36     250     3         1          0.127922   0.043790   0.021082   0.021712   0.044274   0.156180
Overall mean                        0.154841   0.040398   0.021169   0.024807   0.035774   0.101665
Standard deviation                  0.272132   0.033626   0.011842   0.029903   0.011965   0.066253

Table 4-22. Standard Rasch model (testlet size 5) RMSE of ability (θ) estimate recovery with 6 different ability intervals

Cond.  Sample  No. of    Local      θ < -2.0    -2.0 to    -1.0 to     0.0 to     1.0 to    θ > 2.0
No.    size    testlets  effect                    -1.0        0.0        1.0        2.0
1      1000    9         0.25       0.054505   0.019157   0.011612   0.013639   0.024171   0.123000
2      1000    9         0.5        0.086199   0.025481   0.013019   0.015002   0.020782   0.077019
3      1000    9         0.75       0.079972   0.019311   0.012218   0.013808   0.022459   0.055386
4      1000    9         1          0.082276   0.027552   0.015155   0.011565   0.018216   0.072618
5      1000    6         0.25       0.080914   0.025310   0.011499   0.010704   0.024551   0.058512
6      1000    6         0.5        0.083499   0.026217   0.015746   0.011670   0.020758   0.055891
7      1000    6         0.75       0.086500   0.023057   0.011474   0.011640   0.026323   0.098057
8      1000    6         1          0.062397   0.020569   0.014227   0.013574   0.026134   0.123626
9      1000    3         0.25       0.070911   0.019228   0.011657   0.011668   0.021136   0.050452
10     1000    3         0.5        0.064980   0.022725   0.015423   0.013908   0.028543   0.072700
11     1000    3         0.75       0.107659   0.030363   0.020102   0.015079   0.020511   0.172489
12     1000    3         1          0.098714   0.025765   0.013472   0.013395   0.022484   0.045555
13     500     9         0.25       0.086103   0.027197   0.017137   0.019369   0.030920   0.104854
14     500     9         0.5        0.092530   0.034024   0.017266   0.022220   0.042295   0.115485
15     500     9         0.75       0.114248   0.031499   0.015170   0.015623   0.025589   0.100740
16     500     9         1          0.074132   0.030469   0.027548   0.032850   0.056242   0.195890
17     500     6         0.25       0.102211   0.040934   0.020133   0.015024   0.026044   0.075459
18     500     6         0.5        0.092445   0.032911   0.017773   0.019562   0.030586   0.092825
19     500     6         0.75       0.090080   0.029781   0.025252   0.026233   0.043281   0.153303
20     500     6         1          0.101604   0.039027   0.019060   0.018473   0.035121   0.066013
21     500     3         0.25       0.150032   0.032273   0.016446   0.018240   0.048618   0.085607
22     500     3         0.5        0.088842   0.031174   0.014248   0.023129   0.038932   0.118433
23     500     3         0.75       0.096743   0.026986   0.020221   0.020743   0.040151   0.041003
24     500     3         1          0.088962   0.030766   0.016913   0.015060   0.033199   0.003464
25     250     9         0.25       0.108200   0.037172   0.023692   0.035562   0.060802   0.055020
26     250     9         0.5        0.118448   0.050578   0.023559   0.028191   0.052439   0.293215
27     250     9         0.75       0.152811   0.044221   0.025725   0.024325   0.041710   0.035497
28     250     9         1          0.112522   0.040882   0.037117   0.035945   0.068377   0.269680
29     250     6         0.25       0.097628   0.045396   0.024069   0.027508   0.038001   0.009918
30     250     6         0.5        0.115448   0.046363   0.021887   0.025772   0.045353   0.245964
31     250     6         0.75       0.261414   0.057456   0.033796   0.039445   0.070626   0.037632
32     250     6         1          0.131436   0.040350   0.017983   0.023282   0.051087   0.221120
33     250     3         0.25       0.118245   0.047826   0.032667   0.024198   0.046625   0.028095
34     250     3         0.5        1.664506   0.200187   0.062951   0.211889   0.040309   0.053299
35     250     3         0.75       0.210820   0.057664   0.026479   0.025437   0.038700   0.123125
36     250     3         1          0.118271   0.041488   0.021033   0.021938   0.047101   0.131175
Overall mean                        0.148506   0.038371   0.020659   0.025713   0.036894   0.101726

Table 4-23. NBOME LEVEL-2 Block 1 Item WMSE

Item   Testlet model   Rasch model   Partial credit model
1      1.03            1.01          1.00
2      1.03            1.03          1.02
3      1.01            1.01          1.00
4      1.19            1.04          1.06
5      1.06            1.03          1.02
6      0.99            0.99          0.99
7      0.99            0.98          0.97
8      1.08            1.00          1.00
9      0.97            0.98          0.98
10     1.04            1.02          1.01
11     1.03            1.00          0.99
12     1.03            1.01          1.00
13     1.04            1.01          1.00
14     1.04            1.04          1.02
15     1.09            1.01          0.99
16     1.03            1.03          1.00
17     1.05            1.03          1.00
18     1.10            1.04          1.01
19     1.04            1.01          0.99
20     1.00            1.01          1.00
21     1.00            0.98          0.98
22     1.02            1.04          1.01
23     1.06            1.00          0.99
24     1.04            1.03          1.01
25     1.07            1.02          1.02
26     1.05            1.02          1.02
27     1.02            1.02          1.00
28     1.14            1.02          1.03
29     1.03            0.99          1.03
30     0.97            1.00          1.01
31     1.06            1.00          1.01
32     1.01            0.99          0.97
33     1.09            0.99          0.99
34     1.13            1.18          1.03
35     1.04            1.00          1.02
36     1.00            1.04          1.07
37     0.93            0.99          1.02
38     1.04            1.01
39     1.05            0.97
40     1.07            0.98
41     1.03            1.00
42     1.01            0.97
43     0.96            1.00
44     0.96            1.03
45     1.01            0.99
46     0.86            0.99
47     1.06            1.02
48     1.16            1.02
49     1.02            1.01
50     1.08            1.02

MAX    1.19            1.18          1.07
MIN    0.86            0.97          0.97
MEAN   1.0362          1.012         1.007027
SD     0.055729        0.0309        0.021064

Table 4-24. COMLEX-Level 2 2008 block-1 local item dependence detection results

Sequence   Significant item pair   P-value
1          item3, item23           0.002
2          item5, item42           0.009
3          item6, item19           0.000
4          item6, item21           0.000
5          item7, item49           0.007
6          item9, item39           0.000
7          item9, item40           0.004
8          item10, item36          0.009
9          item11, item15          0.002
10         item11, item35          0.009
11         item12, item43          0.009
12         item19, item21          0.004
13         item20, item30          0.002
14         item28, item31          0.000
15         item29, item30          0.000
16         item29, item39          0.002
17         item30, item31          0.009
18         item32, item33          0.000
19         item35, item42          0.000
20         item37, item38          0.004
21         item37, item40          0.000
22         item39, item40          0.000
23         item41, item42          0.000
24         item43, item44          0.000
25         item45, item46          0.000
26         item47, item48          0.000

Note: Number of sampled matrices: 450; number of item pairs tested: 1225; item pairs listed are those with one-sided p < 0.01.
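
As a quick check on the note above, the number of item pairs tested matches the number of distinct pairs among the 50 items listed in Table 4-23. The chance-level count shown alongside it is only a rough reference and rests on an assumption not made in the study: that every pair is locally independent and the p-values are uniformly distributed.

```latex
\[
\binom{50}{2} = \frac{50 \times 49}{2} = 1225,
\qquad
1225 \times 0.01 = 12.25 .
\]
```

Under that assumption about 12 pairs would be flagged by chance alone, compared with the 26 pairs listed in the table.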


CHAPTER 5 DISCUSSION

5.1 General Discussion

In accordance with the simulation results and the empirical case results, several findings related to testlet modeling emerged in this study. First, our results suggest that the Partial Credit model and the Rasch testlet model performed better than the standard Rasch model under small and medium testlet sizes. There is insufficient evidence to indicate whether the Partial Credit model or the Rasch testlet model performs better. The results also show that sample size has a significant effect on the analysis results for the three models. As the sample size increases, the discrepancies between the model estimates and the real data set increase. The degree to which the standard Rasch model overestimates test reliability also increases with sample size. In addition, when the testlet size is kept at a medium level (i.e., testlet size <5), the ratio of independent items to testlet items within a test, in terms of the number of testlets, plays a major role in how efficiently the models recover examinees' ability parameters.

Second, the findings show no obvious difference in the test reliability estimates between the Partial Credit model and the Rasch testlet model. This “no difference” finding indicates that, with small testlets, applying a polytomous IRT model to testlets does not reduce test reliability. Previous concerns about a reduction in test reliability when polytomous IRT models are applied to testlets (Keller, Swaminathan, & Sireci, 2003) are not as severe as we expected in small and medium testlet size situations. We believe that the small number of parameters dropped when the polytomous items are formed does not drastically hurt the reliability estimates for the entire test when the testlet size is small.

Third, the standard error of measurement results from the ability parameter estimation suggest that the standard Rasch model underestimates the standard error of measurement compared with the other two models.

Fourth, the bias and RMSE results from the ability parameter recovery reveal no evident pattern linking variations in the design factors (i.e., testlet size, sample size, and the number of testlets within a test) to changes in bias or RMSE. The magnitude of the local item effects does not have an evident impact on the accuracy of ability estimation. However, this study investigates only a small range of local dependence effects (i.e., [0, 1]); a broader range is worthy of further investigation. Using EAP estimates has a major effect on bias and RMSE at both tails of the ability distribution. In sum, because all three models are Rasch-type models, the precision of ability parameter recovery is relatively good for all of them. All three Rasch-type models show some robustness when faced with violations of the local item independence assumption.
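
The interval-wise recovery statistics discussed above can be sketched as follows. This is a minimal illustration using the textbook definitions of bias and RMSE; the function names, the cut points, and the treatment of interval boundaries are assumptions made here, and the thesis's exact aggregation across replications is not specified in this chapter, so the tabled values need not match this definition.

```python
import math

# Assumed interval boundaries on the theta scale (six intervals in total).
CUTS = [-2.0, -1.0, 0.0, 1.0, 2.0]


def interval_index(theta):
    """Return 0..5, the index of the ability interval containing theta.

    Boundaries are treated as left-closed; this is an assumption.
    """
    return sum(theta >= c for c in CUTS)


def bias_rmse_by_interval(true_thetas, est_thetas):
    """Return one (bias, rmse, n) tuple per interval.

    Examinees are grouped by their true theta; empty intervals
    yield (None, None, 0).
    """
    errors = [[] for _ in range(len(CUTS) + 1)]
    for t, e in zip(true_thetas, est_thetas):
        errors[interval_index(t)].append(e - t)
    results = []
    for errs in errors:
        if not errs:
            results.append((None, None, 0))
            continue
        bias = sum(errs) / len(errs)
        rmse = math.sqrt(sum(d * d for d in errs) / len(errs))
        results.append((bias, rmse, len(errs)))
    return results


if __name__ == "__main__":
    # Hypothetical data: estimates shrunk toward the mean, as EAP tends to do.
    true = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
    est = [0.8 * t for t in true]
    for k, (bias, rmse, n) in enumerate(bias_rmse_by_interval(true, est)):
        print(k, bias, rmse, n)
```

With estimates shrunk toward the mean, the sketch gives positive bias in the lowest interval and negative bias in the highest, which is the direction of the tail effect attributed to EAP estimation above.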

5.2 Limitations and Suggestions for Future Research

Although there is no obvious discrepancy in the test reliability estimates between the Rasch testlet model and the Partial Credit model, some parameters are dropped when the polytomous IRT model is applied in place of the dichotomous IRT model. Because of this parameter dropping, a decrease in reliability is still expected (Sireci et al., 1991). The question is whether the decrease in test reliability is due to the change in test format (i.e., from dichotomous items to a polytomous item within a testlet) or to the local item dependence within testlets. To address this question, the “fake testlet” was proposed by Yen (1993) and applied by Zenisky et al. (2002). A “fake testlet” is formed randomly from the independent dichotomous items in a test. Comparing the test reliability estimates between a test with “real testlets” and a test with “fake testlets” answers this question and identifies the true cause of the reduction in test reliability. Because of limited time, we did not generate “fake testlets” in this study to compare the test reliability under these two situations. Future research should investigate reliability discrepancies between the “fake testlet” and “real testlet” situations.
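
A minimal sketch of how “fake testlets” might be formed is given below. It randomly groups independent dichotomous items and scores each group as one polytomous item by summing the 0/1 responses. The function names, the fixed seed, and the handling of leftover items are assumptions for illustration; the procedures of Yen (1993) and Zenisky et al. (2002) may differ in detail.

```python
import random


def form_fake_testlets(item_ids, testlet_size, seed=0):
    """Randomly partition independent items into groups of `testlet_size`.

    Returns (testlets, leftover); leftover items stay dichotomous.
    """
    rng = random.Random(seed)
    shuffled = list(item_ids)
    rng.shuffle(shuffled)
    n_full = len(shuffled) // testlet_size
    testlets = [
        shuffled[i * testlet_size:(i + 1) * testlet_size]
        for i in range(n_full)
    ]
    leftover = shuffled[n_full * testlet_size:]
    return testlets, leftover


def polytomous_scores(responses, testlets):
    """Sum one examinee's 0/1 responses within each testlet.

    `responses` maps item id -> 0 or 1; each score ranges from 0 to the
    testlet size.
    """
    return [sum(responses[i] for i in group) for group in testlets]


if __name__ == "__main__":
    testlets, leftover = form_fake_testlets(range(1, 11), testlet_size=3)
    all_correct = {i: 1 for i in range(1, 11)}
    print(testlets, leftover, polytomous_scores(all_correct, testlets))
```

Fitting the same polytomous model to real and fake testlets of equal size would then hold the test format constant, so any remaining reliability difference could be attributed to local item dependence.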

5.3 Conclusion

This study compares the performance of three models in small and medium testlet size situations across changes in sample size, variations in the ratio of independent items to testlet items within a test, and changes in the local item effects. The findings indicate that using the polytomous IRT model for testlet item analysis is still efficient for non-adaptive tests with small testlets. Under small testlet sizes, the Rasch testlet model and the Partial Credit model both perform better than the standard Rasch model. However, a large proportion of testlet items in a test makes the Rasch testlet model unstable, as shown by the large number of MLE non-convergence occurrences when the model was applied. For small testlet sizes, polytomous IRT models are more stable than the Rasch testlet model when a test includes a large number of testlets. This instability may be caused by the multidimensionality of the Rasch testlet model. The relationship between the model's instability and its multidimensionality is worthy of further investigation.

Furthermore, computational efficiency should also be considered when selecting a model for testlet analysis. The simulations were run on a personal computer with a 2.83-GHz Intel Xeon processor. Completing the data simulation and analysis for 12 conditions took 1459771.40 seconds (approximately 405.5 hours) with the Rasch testlet model, but only 48659.05 seconds (approximately 13.5 hours) with the Partial Credit model for the corresponding conditions. Typically, a single calibration in ConQuest took approximately 45 to 70 minutes for the Rasch testlet model, but only approximately 5 to 8 minutes for the Partial Credit model.
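The reported run times can be checked with a few lines of arithmetic. The short Python sketch below uses only the two totals stated above; the variable names are illustrative, and the ratio is a derived figure rather than one reported in the study.

```python
SECONDS_PER_HOUR = 3600.0

# Total run times reported for the 12 simulation conditions.
rasch_testlet_seconds = 1459771.40   # Rasch testlet model
partial_credit_seconds = 48659.05    # Partial Credit model

rasch_testlet_hours = rasch_testlet_seconds / SECONDS_PER_HOUR
partial_credit_hours = partial_credit_seconds / SECONDS_PER_HOUR
time_ratio = rasch_testlet_seconds / partial_credit_seconds

print(f"Rasch testlet model:  {rasch_testlet_hours:.1f} hours")
print(f"Partial Credit model: {partial_credit_hours:.1f} hours")
print(f"Ratio of total run times: {time_ratio:.1f}")
```

On these totals, the Rasch testlet model required roughly 30 times as much computing time as the Partial Credit model.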

This investigation of models for analyzing testlet items under small testlet sizes provides guidance for model selection in future analyses of testlet-type data. The polytomous IRT model and the Rasch testlet model offer an advantage over the standard Rasch model: in small testlet size situations, they avoid underestimating the standard error of measurement and yield better ability parameter estimates.

LIST OF REFERENCES

Adams, R. J., Wilson, M., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1–23.

Armstrong, R. D. (2004). Computerized Adaptive Testing With Multiple-Form Structures. Applied Psychological Measurement, 28(3), 147-164.

Andrich, D. (1978). Application of a psychometric model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.

Ariel, A., Veldkamp, B. P., & Breithaupt, K. (2006). Optimal Testlet Pool Assembly for Multistage Testing Designs. Applied Psychological Measurement, 30(3), 204-215.

Baldwin, S.G. (2007). A review of Testlet response theory and its applications. Journal of Educational and Behavioral Statistics, 32(3), 333-336.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.

Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431-444.

Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.

Brandt, S. (2008). Estimation of a Rasch Model Including Subdimensions. IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 1, 51-70.

Breithaupt, K., Ariel, A., & Veldkamp, B. P. (2005). Automated Simultaneous Assembly for Multistage Testing. International Journal of Testing, 5(3), 319-330.

Breithaupt, K. (2007). Automated Simultaneous Assembly of Multistage Testlets for a High-Stakes Licensing Examination. Educational and Psychological Measurement, 67(1), 5-20.

Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289.

Davis, L. L. (2003). Item Exposure Constraints for Testlets in the Verbal Reasoning Section of the MCAT. Applied Psychological Measurement, 27(5), 335-356.

DeMars, C. E. (2006). Application of the Bi-Factor Multidimensional Item Response Theory Model to Testlet-Based Tests. Journal of Educational Measurement, 43(2), 145-168.

Feldt, L. S. (2002). Estimating the internal consistency reliability of tests composed of testlets varying in length. Applied Measurement in Education, 15(1), 33-48.

Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests [Introduction to mental test theory]. Berne: Huber.

Gessaroli, M. E., & Folske, J. C. (2002). Generalizing the Reliability of Tests Comprised of Testlets. International Journal of Testing, 2(3-4), 277-295.

Habing, B., & Roussos, L. A. (2003). On the need for negative local item dependence. Psychometrika, 68(3), 435-451.

Haertel, E.H. (2006). Reliability. In R.L. Brennan (Ed.). Educational Measurement (4th ed., 65-110). Westport, CT: American Council on Education and Praeger.

Hambleton, R.K. & Murray, L.N. (1983). Some goodness of fit investigations for item response models. In R. K. Hambleton (Ed.), Applications of item response theory (pp.71-94). Vancouver BC: Educational Research Institute of British Columbia.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Norwell, MA: Kluwer Academic Publishers.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. (Vol. 2). Newbury Park, CA: Sage Publications.

Hendrickson, A. (2007). An NCME instructional module on multistage testing. Educational Measurement: Issues and Practice, 26(2), 44-52.

Holland, P. W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55, 577-602.

Ip, E. H., Smits, D. J. M., & De Boeck, P. (2009). Locally dependent linear logistic test model with person covariates. Applied Psychological Measurement, 33(7), 555-569.

Jang, E. E., & Roussos, L. (2007). An investigation into the dimensionality of TOEFL using conditional covariance-based nonparametric approach. Journal of Educational Measurement, 44(1), 1-21.

Keller, L. A., Swaminathan, H., & Sireci, S. G. (2003). Evaluating Scoring Procedures for Context-Dependent Item Sets. Applied Measurement in Education, 16(3), 207-222.

Kim, J.K. & Nicewander, W. A. (1993). Ability estimation for conventional tests. Psychometrika, 58, 587-599.

Lee, G. M., & Frisbie, D. A. (1999). Estimating reliability under a generalizability theory model for test scores composed of testlets. Applied Measurement in Education, 12(3), 237-255.

Lee, G. M. (2000). A comparison of methods of estimating conditional standard errors of measurement for testlet-based test scores using simulation techniques. Journal of Educational Measurement, 37(2), 91-112.

Lee, G.M. (2000). Estimating conditional standard errors of measurement for tests composed of testlets. Applied Measurement in Education, 13(2), 161-180.

Lee, G. M. (2001). Comparison of dichotomous and polytomous item response models in equating scores from tests composed of testlets. Applied Psychological Measurement, 25(4), 357-372.

Lee, G. M., Dunbar, S. B., & Frisbie, D. A. (2001). The relative appropriateness of eight measurement models for analyzing scores from tests composed of testlets. Educational and Psychological Measurement, 61(6), 958-975.

Li, Y. M. (2005). A Test Characteristic Curve Linking Method for the Testlet Model. Applied Psychological Measurement, 29(5), 340-356.

Li, Y. M. (2006). A Comparison of Alternative Models for Testlets. Applied Psychological Measurement, 30(1), 3-21.

Loevinger, J. (1947). A systematic approach to the construction and evaluation of tests of ability. (Psychological Monographs 61, No.4). Richmond, VA: Psychometric Society.

Lord, F. M. (1952). A theory of test scores. Psychometric Monograph, No. 7.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale NJ: Erlbaum.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Luecht, R., Brumfield, T., Breithaupt, K. (2006). A Testlet Assembly Design for Adaptive Multistage Tests. Applied Measurement in Education, 19(3), 189-202.

Mair, P., Hatzinger, R. (2007). Extended Rasch Modeling: The eRm Package for the Application of IRT Models in R. Journal of Statistical Software, 20(9), 1-20.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Meijer, R. R. (2004). Using Patterns of Summed Scores in Paper-and-Pencil Tests and Computer-Adaptive Tests to Detect Misfitting Item Score Patterns. Journal of Educational Measurement, 41(2), 119-136.

Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.

Pitt, M. A., Kim, W., & Myung, I. J. (2003). Flexibility versus Generalizability in Model Selection. Psychonomic Bulletin & Review, 10, 29-44.

Pomplun, M., & Ritchie, T. (2004). An Investigation of Context Effects for Item Randomization within Testlets. Journal of Educational Computing Research, 30(3), 243-254.

Ponocny, I. (2001). Nonparametric goodness-of-fit tests for the Rasch model. Psychometrika, 66(3), 437-460.

Puhan, G., Moses, T. P., Grant, M. C., & McHale, F. (2009). Small-sample equating using a single-group nearly equivalent test (SiGNET) design. Journal of Educational Measurement, 46(3), 344-362.

R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org

Rae, G. (2008). A note on using alpha and stratified alpha to estimate the reliability of a test composed of item parcels. British Journal of Mathematical and Statistical Psychology, 61(2), 515-525.

Rivera, C., & Stansfield, C. W. (2003). The effect of linguistic simplification of science test items on score comparability. Educational Assessment, 9(3-4), 79-105.

Rosenbaum, P. R. (1988). Item bundles. Psychometrika, 53, 349-359.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 17, 1-100.

Schmitt, N. (2002). Do reactions to tests produce changes in the construct measured? Multivariate Behavioral Research, 37(1), 105-126.

Sheehan, K. M., Lewis, C. (1992). Computerized mastery testing with nonequivalent testlets. Applied Psychological Measurement, 16(1), 65-76.

Shen, L., Yen, J. (1997). Item dependency in medical licensing examinations. Academic Medicine.72, S, S19-S21

Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237-247.

Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Investigating the effects of local dependence on the accuracy of IRT ability estimation. Technical Report Series two. American Institute of Certified Public Accountants.

Steinberg, L., & Thissen, D. (1996). Uses of item response theory and the testlet concept in the measurement of psychopathology. Psychological Methods, 1(1), 81-97.

Thissen, D., Billeaud, K., McLeod, L., & Nelson, L. (1997). A brief introduction to item response theory for items scored in more than two categories. Paper presented at the National Assessment Governing Board Achievement Levels Workshop, Boulder, CO.

Thissen, D., Steinberg, L., & Mooney, J. (1989). Trace lines for testlets: A use of multiple-categorical response models. Journal of Educational Measurement, 26, 247-260.

Thissen, D. (2008). Review of 'Testlet response theory and its applications.'. Journal of Educational Measurement, 45(3), 305-308.

Tokar, D. M.; Fischer, A.R., Snell, A.F., Harik-Williams, N. (1999). Efficient assessment of the five-factor model of personality: Structural validity analyses of the NEO Five-Factor.

Tong, Y., Kolen, M.J. (2007). Comparisons of methodologies and results in vertical scaling for educational achievement tests. Applied Measurement in Education, 20(2), 227-253.

Van den Wollenberg, A. L. (1982). Two new test statistics for the Rasch model. Psychometrika, 47, 123-140.

Vitacco, M. J. (2005). A Comparison of Factor Models on the PCL-R with Mentally Disordered Offenders: The Development of a Four-Factor Model. Criminal Justice and Behavior, 32(5), 526-545.

Wainer, H., & Lewis, C. (1990). Toward a Psychometrics for Testlets. Journal of Educational Measurement, 27(1), 1-14.

Wainer, H., Lewis, C., Kaplan, B., & Braswell, J. (1991). Building algebra testlets: A comparison of hierarchical and linear structures. Journal of Educational Measurement, 28(4), 311-323.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201.

Wainer, H., & Mislevy, R. J. (2000). Item response theory, item calibration, and proficiency estimation. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection. Journal of Educational Measurement, 28(3), 197-219.

Wainer, H. (1995). Precision and differential item functioning on a testlet-based test: The 1991 Law School Admissions Test as an example. Applied Measurement in Education, 8, 157-186.

Wainer, H., & Wang, X. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37, 203-220.

Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245-269). London: Kluwer.

Wang, W.-C., & Wilson, M. (2005a). Exploring local item dependence using a random-effects facet model. Applied Psychological Measurement, 29(4), 296–318.

Wang, W.-C., & Wilson, M. (2005b). The Rasch testlet model. Applied Psychological Measurement, 29(2), 126–149.

Wang, W. C. (2005). Assessment of Differential Item Functioning in Testlet-Based Items Using the Rasch Testlet Model. Educational and Psychological Measurement, 65(4), 549-579.

Wang, X., Bradlow, E. T., & Wainer, H. (2002). A general Bayesian model for testlets: Theory and applications. Applied Psychological Measurement, 26, 109-128.

Wang, X.H., (2002). A general Bayesian model for testlets: Theory and applications. Applied psychological measurement, 26(1), 109-128.

Weaver, C.M., Meyer, R.G., Van Nort, J.J., Tristan, L. (2006). Two-, Three-, and Four-Factor PCL-R Models in Applied Sex Offender Risk Assessments. Assessment, 13(2), 208-216.

Wilson, M., Adams, R.J. (1995). Rasch Models for Item Bundles. Psychometrika, 60(2), 181-198.

Wu, M. L., Adams, R. J., & Wilson, M. R. (1998). ACER ConQuest: Generalized item response modeling software. Melbourne, VIC: Australian Council for Educational Research.

Yang, W. L., & Gao, R. (2008). Invariance of Score Linkings Across Gender Groups for Forms of a Testlet-Based College-Level Examination Program Examination. Applied Psychological Measurement, 32, 45

Yen, W. (1993). Scaling performance assessment: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-213.

Zenisky, A. L., Hambleton, R. K., & Sireci, S. G. (2002). Identification and Evaluation of Local Item Dependencies in the Medical College Admissions Test. Journal of Educational Measurement, 39(4), 291-309.

Zwick, R. (2002). Application of an empirical Bayes enhancement of Mantel-Haenszel differential item functioning analysis to a computerized adaptive test. Applied Psychological Measurement, 26(1), 57-77.

BIOGRAPHICAL SKETCH

Ou Zhang was born in Chengdu, China. He received his Bachelor of Science in computer science from Chengdu University of Technology in 2001 and his Master of Education in Educational Research, Measurement, and Evaluation from Boston College in 2007. He received his Master of Arts in Education degree from the Research and Evaluation Methodology program at the University of Florida in the fall of 2010. He is currently enrolled in the Ph.D. program in Research and Evaluation Methodology at the University of Florida.
