POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS
By
OU ZHANG
A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS IN EDUCATION
UNIVERSITY OF FLORIDA
2010
© 2010 Ou Zhang
To my Dad, who has supported me, believed in me, and encouraged me to start this long way. He is my hero!
ACKNOWLEDGMENTS
I would like to express my sincere appreciation to Dr. M. David Miller, my
committee chair, for providing valuable guidance and continuous support. I would also
like to thank Dr. James J. Algina, my committee member, for sharing his ideas and
corrections on this project.
My deepest gratitude goes to my parents and my wife, Bei Li, for their constant
support and love. Thanks to my summer internship mentor, Dr. Feiming Li, and Vice
President Dr. Linjun Shen for giving me such a valuable opportunity to enter the
educational measurement industry. Thanks to my friend Yan Cao for her patience and
help over the years. Last, thanks go out to Dr. Andrich for his comments and suggestions.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
ABSTRACT
CHAPTER
1 INTRODUCTION
1.1 Model Selection
1.2 Survey of the Testlet Size in Applications of Testlet
1.3 Purpose of the Study
2 LITERATURE REVIEW
2.1 Item Response Theory
2.1.1 IRT Assumptions
2.1.2 One-Parameter Logistic Model (1-PL Model or Rasch Model)
2.1.3 Polytomous Item Response Theory (IRT) Model-Partial Credit Model
2.1.4 Testlet Model-Rasch Testlet Model
2.1.5 Local Item Dependence
2.2 Reliability
2.3 Survey in Application of Testlet
3 METHOD
3.1 Model Used to Generate Data
3.2 Population Parameters
3.3 Condition Manipulated
3.4 Data Generation
3.5 Parameter Estimation
3.6 Ability Estimation
3.7 Analysis
3.7.1 Bias
3.7.2 Root Mean Square Error (RMSE)
3.7.3 Reliability
4 RESULTS
4.1 MLE Non-convergence Issue
4.2 Test Reliability
4.3 Standard Error of Measurement
4.4 Bias and RMSE
4.5 An Empirical Case
5 DISCUSSION
5.1 General Discussion
5.2 Limitations and Suggestions for Future Research
5.3 Conclusion
LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
1-1 Testlet size in the article reviews
2-1 The number of testlets in the dataset
2-2 Test length in the reviewed articles
2-3 Sample sizes in the reviewed articles
2-4 Fit indices in reviewed articles
2-5 Estimation method in reviewed articles
2-7 The number of simulation replication applied in the reviewed articles
3-1 Study design condition with 3 factors
4-1 MLE nonconvergence case and rate per condition-testlet size 3
4-2 MLE nonconvergence case and rate per condition-testlet size 5
4-3 Test reliability-testlet size 3 conditions
4-4 Test reliability-testlet size 5 conditions
4-5 Testlet size 3 the results of the Spearman-Brown prophecy
4-6 Testlet size 5 the results of the Spearman-Brown prophecy
4-7 Mean standard error of measurement for each condition (testlet size 3)
4-8 Mean standard error of measurement for each condition (testlet size 5)
4-9 Testlet size 3 Bias and RMSE of ability estimate recovery (EAP)
4-10 Testlet size 5 Bias and RMSE of ability estimate recovery (EAP)
4-11 Rasch testlet model (testlet size 3) bias of ability (θ) estimate recovery (EAP) with 6 different ability intervals
4-12 Partial credit model (testlet size 3) bias of ability (θ) estimate recovery (EAP) with 6 different ability intervals
4-13 Standard Rasch model (testlet size 3) bias of ability (θ) estimate recovery (EAP) with 6 different ability intervals
4-14 Rasch testlet model (testlet size 5) bias of ability (θ) estimate recovery (EAP) with 6 different ability intervals
4-15 Partial credit model (testlet size 5) bias of ability (θ) estimate recovery (EAP) with 6 different ability intervals
4-16 Standard Rasch model (testlet size 5) bias of ability (θ) estimate recovery (EAP) with 6 different ability intervals
4-17 Rasch testlet model (testlet size 3) RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals
4-18 Partial credit model (testlet size 3) RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals
4-19 Standard Rasch model (testlet size 3) RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals
4-20 Rasch testlet model (testlet size 5) RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals
4-21 Partial credit model (testlet size 5) RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals
4-22 Standard Rasch model (testlet size 5) RMSE of ability (θ) estimate recovery with 6 different ability intervals
4-23 NBOME LEVEL-2 Block 1 Item WMSE
4-24 COMLEX-Level 2 2008 block-1 local item dependence detection results
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Arts in Education
POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS
By
Ou Zhang
December 2010
Chair: M. David Miller
Major: Research and Evaluation Methodology
This study investigated how effectively three models recover ability parameters,
in order to detect the influence of local item dependence across testlet items in
small testlet size situations. A simulation study was used to compare three Rasch-type
models: the standard Rasch model, the partial credit model, and the
Rasch testlet model. The results revealed that both the partial credit model and the Rasch
testlet model performed better than the standard Rasch model when local
item dependence existed within testlets. The results also indicated that as the sample size
increases, the discrepancies between model estimates and the real data set increase.
The study concluded that using the polytomous IRT model for testlet item analyses is
still efficient for small testlet sizes and non-adaptive tests. Moreover, for small
testlet sizes, polytomous IRT models are more stable than the Rasch testlet model
when a test includes a large number of testlets. In sum, the polytomous
IRT model and the Rasch testlet model offer an advantage over the standard Rasch model
because they avoid underestimating the standard error of measurement and yield better
ability parameter estimates in small testlet size situations.
CHAPTER 1 INTRODUCTION
Item response theory (IRT) models are commonly used in educational and
psychological testing. Employing item response theory allows for assessing latent
human characteristics and quantifying underlying traits. IRT rests on a major assumption:
local item independence. Local item independence (LID) assumes that items in the test are
unrelated to each other after controlling for the underlying trait. However, the LID
assumption is commonly violated in real-world applications. In fact, many real-world
tasks require solving related problems or solving a single problem in stepwise
fashion. To reflect such circumstances, exams often include subsets of items that
share a single content stimulus. The items sharing the same stimulus are
grouped as a unit, termed an item bundle (Rosenbaum, 1988) or testlet (Wainer &
Kiely, 1987).
An item bundle or testlet, henceforth referred to as a testlet, is a scoring unit
within a test that is smaller than a test (Wainer & Kiely, 1987). Items within testlets are
locally dependent because they are associated with the same stimulus. Moreover, local
item dependence introduces unintended dimensions into the test at the expense of the
construct of interest (Wainer & Thissen, 1996). Thus, the challenge for the test developer
is not to eliminate the item dependencies, but rather to find a proper solution so that
such local item dependence does not impact the test reliability and the validity of
inferences from the test. More specifically, the violation of the assumption of local item
independence may lead to an underestimate of the standard errors and could result in
(a) bias in item difficulty estimates, (b) inflated item discrimination estimates, (c)
overestimation of the precision of examinee scores, and (d) overestimation of test
reliability and test information. This last result can lead to inaccurate inferences that
may result in a greater chance of misclassification when making decisions regarding
examinee ability categorization (Sireci, Thissen, & Wainer, 1991; Yen, 1993).
Several models have therefore been proposed as solutions to the violation of the local
item independence assumption. One method is to treat the items of a testlet as a
single polytomous super item in the analysis (Sireci, Thissen, & Wainer, 1991; Thissen,
Steinberg, & Mooney, 1989; Wainer, 1995). This method leans heavily on Rosenbaum's
theorem of item bundles (Rosenbaum, 1988), using a polytomous IRT model to score
the locally independent testlets. The key idea is that the items that form each testlet
may have excessive local dependence, but that once the entire testlet is considered as
a single unit and scored polytomously these local dependencies may disappear. The
item scores are summed within each testlet. Response patterns with the same total
score in a testlet are assigned to the same category. This method allows researchers to
score testlets polytomously. Once the summed item scores are obtained, testlet-type
item responses are calibrated by applying polytomous item response models, such as
the Graded Response Model (Samejima, 1969), the Partial Credit Model (Masters,
1982), the Rating Scale Model (Andrich, 1978), or the Nominal Response Model (Bock,
1972). In using a polytomous IRT model to score testlets, the data can be analyzed
while maintaining local independence across different testlets. This approach avoids the
overestimation of the test reliability and information, so that the statistics of the
polytomous IRT model consistently perform better than those of the standard Rasch model in
such circumstances.
However, this approach has some weaknesses when it is applied to
testlet-type data sets. Some major shortcomings of polytomous IRT models have been
discussed (Thissen, Billeaud, McLeod, & Nelson, 1997; Yen, 1993; Wainer & Wang,
2000). First, when polytomous IRT models are applied, some test information, namely the
precise pattern of responses the examinee generates, is lost. Second, some
parameters are dropped from the polytomous model compared to individual
dichotomous item scoring. Third, the approach is inappropriate if the test is administered adaptively.
Last but not least, the test reliability might be underestimated (Yen, 1993). Wainer
(1995) claimed that using a polytomous IRT model to manage testlets might be
appropriate when the local dependence between items within a testlet is moderate and
the testlet-type items make up only a small proportion of the entire test.
The other method, the testlet model (Wainer & Kiely, 1987), is explicitly introduced
as an alternative to the polytomous IRT model and attempts to solve the same problem.
IRT testlet models have been proposed in which a random effect parameter is added to
model the local dependence among items within the same testlet (Bradlow, Wainer, &
Wang, 1999; Wainer, Bradlow, & Du, 2000; Wang, Bradlow, & Wainer, 2002). As one
random effect parameter is added to the model, an additional latent trait is also added to
the testlet model. Thus, the testlet model proposed by Wainer and Wang (2000) is a
special case of a multidimensional IRT (MIRT) model.
Proposed by Wang and Wilson (2005b), the Rasch testlet model is a special case
of the testlet model (Wainer & Wang, 2000) that combines features of the Rasch
model and the testlet model. In so doing, it makes use of several desirable measurement
and psychometric properties of the Rasch model (Wang & Wilson, 2005). First, the
Rasch model has observable sufficient statistics for the model parameters and a
relatively small sample size requirement for parameter estimation. Second, no
distributional assumption on the item parameters is necessary in Rasch models since
the items are treated as fixed effects. Therefore, the Rasch model is widely applied in
testing and scoring. Because of such advantages of the Rasch model, Wang and
Wilson (2005a, 2005b) showed that it is possible to model locally dependent items in
relation to testlets by using a Rasch testlet model so that more precise and adequate
estimates are obtained. The Rasch testlet model is thus a special case of the testlet model
(Wainer & Wang, 2000).
Before the testlet model was proposed, polytomous IRT models were the primary
method to analyze testlets. Currently, both approaches are widely used for testlet
analyses, and both have pros and cons. The theoretical
reason to choose the testlet model (Wainer & Wang, 2000) over the polytomous IRT model
in testlet analysis might seem obvious. However, some potential caveats of the testlet
model should also be considered. First, the testlet model is more complex than both the
standard IRT model and the polytomous IRT model because it adds “testlet
parameters”. Second, when a testlet parameter is added to the model, an additional
latent trait is also added, so that multidimensionality occurs and
increases the complexity of the analysis. Thus, the model analysis process is greatly
prolonged and some potential issues emerge (e.g., sometimes the model calibration fails
to converge). Therefore, the benefits of using the testlet model (Wainer & Kiely, 1987)
should be weighed against the added complexity in data analysis.
1.1 Model Selection
Although Wainer and Wang (2000) addressed advantages of the testlet model
over the polytomous model in applied testlet analyses, it remains important to compare
these two models under various conditions. Pitt, Kim, and Myung (2003) indicated that
the goal of model selection was not just to find the model that provides the maximum fit
to a given data set, but to identify a model, from a set of competing models, that best
captures the characteristics or trends underlying the cognitive process of interest.
Briefly, the best model is the model that matches the purpose of the study and can
explain all of the important features of the actual data without adding unnecessary
complexity.
The model's analysis efficiency is the other issue that must be considered. Two
realistic circumstances should be noted in model selection for analyzing testlets. The first is
testlet size. In previous testlet research, in order to obtain illustrative results to
support hypotheses, testlet sizes were usually set from 5 to 10 or more items (e.g.,
Adams, Wilson, & Wang, 1997; Wang & Wilson, 2005; Brandt, 2008; Wainer & Wang,
2000; Wainer & Lewis, 1990). Small and medium testlet sizes (2-4 items) were rarely
applied (Ip, Smits, & De Boeck, 2009; Tokar, Fischer, Snell, & Harik-Williams, 1999;
DeMars, 2006). This is potentially problematic because in some exams, such as the National
Board of Osteopathic Medical Examiners' (NBOME) Comprehensive Osteopathic
Medical Licensing Examination (COMLEX-USA), testlet sizes are often small.
The second circumstance is non-adaptive testing. Non-adaptive tests are still
widely used in the educational and psychological measurement field. Because of the
small testlet sizes and non-adaptive features, the loss of response pattern information is
not that serious for this kind of test. The comparison between the two models is therefore
of particular interest when the aforementioned shortcomings of applying the
polytomous IRT model in testlet analysis are minimal. In addition, the local
dependence effect ($\gamma_{d(i)}$) within testlets varies. Since the local dependence effect ($\gamma_{d(i)}$)
is avoided when polytomous IRT models are applied, the extent to which the local
dependence effect ($\gamma_{d(i)}$) influences the fit of the polytomous model in testlet analysis
should also draw attention.
1.2 Survey of the Testlet Size in Applications of Testlet
Very little research has focused on model comparison between the polytomous
IRT model and the testlet model initially proposed by Wainer and Wang (2000)
regarding model fit, ability parameter recovery, and test reliability as testlet conditions
change, especially when testlet size and local dependence effect ($\gamma_{d(i)}$) are at a
medium level. A review of the literature to identify applications of testlets was
conducted in the EBSCO Host and PsycINFO databases for the keywords “testlet” and
“testlets” to identify studies published between 1989 and 2009 that included testlet
characteristics in the test. A total of fifty-five articles relevant to testlets were found and reviewed
(see the reference list in Appendix B). Among all fifty-five testlet-related articles,
forty-five articles have specific descriptions regarding the factors that could influence the
testlet analysis in testlet research designs (i.e. testlet size, the number of testlets within
a test, sample size, etc.). The remaining ten articles, which include two book reviews,
conceptually describe testlet theory and application.
Issues of testlet size have been well documented in the literature.
Of these forty-five testlet-relevant articles, only four applied small testlet
size designs exclusively (i.e., testlet size smaller than five). The other forty-one articles have a
mixture of testlet size designs, although most of the articles included moderate and
large testlet size designs (testlet size larger than 5) in their research. Of these forty-one,
twelve articles considered small testlet size designs. Thirty-five
articles included testlet sizes between 5 and 10, and twelve articles included large
testlet size conditions (larger than 10). Overall, 16 articles (35.6%) investigated small
testlets and only 12 compared the small and medium testlet sizes. The detailed results
are shown in Table 1-1.
In sum, this study adds to this literature by investigating the results of three
different models of testlet-type data under small and medium testlet size
circumstances (i.e., testlet size smaller than or equal to 5). Testlet size, local dependence
effect, sample size, and the ratio of testlet/independent items are factors in this study.
We examine model fit, test reliability, and the ability parameter recovery of the three
different models (i.e., Rasch model, Partial Credit model, and Rasch testlet model)
employed in a testlet-type data analysis.
1.3 Purpose of the Study
In accordance with previous testlet research, one of the research purposes
inherent to this study is exploring the consequences of variation in testlet size and local
dependence effects on test reliability, standard error of measurement, and ability
parameter recovery of the standard Rasch model, the Partial Credit model, and the
Rasch testlet Model. By looking for the trend of how changes in testlet factors (i.e.
testlet size, local dependence effect, sample size, testlet/independent item ratio) affect
different models' estimations and the test reliability corresponding to the models, a
guide for model selection is expected to emerge.
The other essential goal of this study is to determine which model performs the
best at person ability parameter recovery by considering the trade-off of the test
reliability and analysis complexity. An answer to these questions will be useful to
provide evidence as a reference for researchers interested in applying IRT models to
measure tests appropriately. Furthermore, since we use data from the NBOME
COMLEX-USA examination, it will provide guidance for future improvements in the
estimation of this exam.
Table 1-1. Testlet size in the article reviews
Testlet size (m)   m<5      5<m<10   11<m<15   16<m<20   21<m<25   m>25
Articles           16       35       6         2         3         1
Proportion         35.56%   77.78%   13.33%    4.44%     6.67%     2.22%
Note: m is the number of items in the testlet.
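The proportions in Table 1-1 appear to be computed against the forty-five articles that describe their design factors, which is why the article counts sum to more than forty-five: an article that used several testlet sizes is counted in each band. The short check below reproduces the table row under that assumption; the dictionary keys simply copy the column headers.

```python
# Article counts per testlet-size band, copied from Table 1-1.
counts = {
    "m<5": 16,
    "5<m<10": 35,
    "11<m<15": 6,
    "16<m<20": 2,
    "21<m<25": 3,
    "m>25": 1,
}
n_articles = 45  # assumed denominator: articles with explicit design descriptions

# Percentage of the 45 articles that include each testlet-size band.
proportions = {band: round(100 * n / n_articles, 2) for band, n in counts.items()}
for band, pct in proportions.items():
    print(f"{band}: {pct:.2f}%")
```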
CHAPTER 2 LITERATURE REVIEW
In this section, the theoretical framework of this research is given. Several
important parts are included: IRT theory, IRT assumption, IRT models used in this
research, local item dependence, and test reliability.
2.1 Item Response Theory
Item Response Theory (IRT), proposed by Lord (1952), is a family of statistical
models for analyzing item responses in a population of individuals. It depicts the
relationship between examinees and items through mathematical models (Wainer &
Mislevy, 2000). Many mathematical models can be developed within the IRT
framework. There are two general types of IRT models, dichotomous IRT models and
polytomous IRT models. Dichotomous IRT models are used to model items with only
correct or incorrect response options. The One-Parameter Logistic (1PL), Two-Parameter
Logistic (2PL), and Three-Parameter Logistic (3PL) IRT models are three common
dichotomous IRT models.
Items with more than two response options can be modeled with polytomous IRT
models. Examples of
polytomous IRT models include the Graded Response model (GRM; Samejima, 1969),
the Rating Scale model (RSM; Andrich, 1978), the Partial Credit model (PCM;
Masters, 1982), the generalized Partial Credit model (GPCM; Muraki, 1992), and the
Nominal Response model (NRM; Bock, 1972). A notable advantage of IRT over
classical test theory is the invariance of item and ability parameters
(Hambleton, Swaminathan, & Rogers, 1991). According to this invariance feature, item
parameters (e.g., difficulty, discrimination, and guessing) are not dependent on the
ability distribution of any particular group of examinees, and the examinee ability
parameters (θs) are not dependent on a specific set of test items.
2.1.1 IRT Assumptions
Two essential a priori assumptions are held by Item Response Theory. The first
assumption of IRT is local item independence: the probability of a correct response to
one item is independent from other items. Local item independence means that the item
responses are independent for a given value of the latent trait $\theta$. The joint probability of a
response pattern for all items in the test is the product of the probabilities of correct
responses to the items for a given latent trait $\theta$:
$$P(\mathbf{X} = \mathbf{1} \mid \theta) = \prod_{i=1}^{N} P(X_i = 1 \mid \theta) \tag{2-1}$$
where $N$ is the total number of items.
The second assumption of most general IRT models (e.g. 1PL-, 2PL-, 3PL-
models) is unidimensionality. Early notions of IRT require that the same construct
should be measured by all test items (Loevinger, 1947). As such, all items in the test
only measure a single latent trait (Hambleton & Murray, 1983; Lord, 1980).
2.1.2 One-Parameter Logistic Model (1-PL Model or Rasch Model)
The Rasch model (Rasch, 1960) is the simplest of unidimensional models. The
Rasch model predicts the probability of success for person j on item i and can be given
by the formula:
$$P(y_{ij} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)} \tag{2-2}$$
where
$y_{ij}$ is examinee $j$'s response category to item $i$;
$\theta_j$ is examinee $j$'s proficiency level;
$b_i$ is the difficulty parameter of item $i$, which indicates the point on the ability
continuum where an examinee has a 50% probability of answering item $i$ correctly;
$P(y_{ij} = 1 \mid \theta_j, b_i)$ is the probability that examinee $j$ answers item $i$ correctly, given
proficiency level $\theta_j$.
An assumption that is implicit in the model is that all items have the same
discrimination value.
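As a small numerical illustration of the Rasch response function, the sketch below evaluates the probability of a correct response for given ability and difficulty values. The function name `rasch_probability` and the example values are hypothetical.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch model.

    Uses 1 / (1 + exp(-(theta - b))), which is algebraically the same as
    exp(theta - b) / (1 + exp(theta - b)).
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# When ability equals item difficulty the probability is exactly one half.
print(rasch_probability(0.5, 0.5))
# An examinee one logit above the item difficulty.
print(rasch_probability(1.0, 0.0))
```

Because every item shares the same discrimination, the probability depends only on the difference between ability and difficulty.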
2.1.3 Polytomous Item Response Theory (IRT) Model-Partial Credit Model
In this study, for comparison with the standard Rasch model, the Partial Credit
model was selected as the polytomous IRT model. The Partial Credit model (PCM;
Masters, 1982) was originally developed for analyzing test items that require multiple
steps and for which it is important to assign partial credit for completing several steps in
the solution process. This model was designed to be used when partial credit can be
awarded for degrees of success. The PCM is a divide-by-total or “direct” IRT model.
The Partial Credit model can be considered as an extension of the Rasch Model and
has all the standard Rasch model features.
The equation for the partial credit model is shown below,
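$$P_{ix}(\theta_j) = \frac{\exp\left[\sum_{k=0}^{x} (\theta_j - \delta_{ik})\right]}{\sum_{r=0}^{m_i} \exp\left[\sum_{k=0}^{r} (\theta_j - \delta_{ik})\right]} \tag{2-3}$$
where
item $i$ is scored $x = 0, \ldots, m_i$ for an item with $K_i = m_i + 1$ response categories;
$\delta_{ik}$ ($k = 1, \ldots, m_i$) is called the item step difficulty; it is associated with a category
score of $k$, and $x$ is the response category of interest;
$\theta_j$ is examinee $j$'s proficiency level;
$P_{ix}(\theta_j)$ is the probability that examinee $j$ responds in category $x$ of item $i$,
given proficiency level $\theta_j$; and
$$\sum_{k=0}^{x} (\theta_j - \delta_{ik}) \equiv 0 \quad \text{for } x = 0 \tag{2-4}$$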
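The category probabilities of the partial credit model can be computed directly from the step difficulties. The sketch below is a minimal illustration under the usual convention that the sum for category 0 is zero; the function name `pcm_probabilities` and the step values are hypothetical.

```python
import math


def pcm_probabilities(theta, deltas):
    """Category probabilities for scores 0..m under the partial credit model.

    theta  : examinee proficiency
    deltas : step difficulties [delta_1, ..., delta_m] of one item
    """
    # Running sums of (theta - delta_k); the sum for category 0 is fixed at 0.
    cumulative = [0.0]
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    numerators = [math.exp(c) for c in cumulative]
    total = sum(numerators)  # the "divide-by-total" denominator
    return [n / total for n in numerators]


# Hypothetical testlet of size 3 scored 0..3, with three step difficulties.
probs = pcm_probabilities(0.0, [-1.0, 0.0, 1.0])
print(probs, sum(probs))
```

With a single step difficulty the function reduces to the Rasch model, which reflects the statement that the PCM is an extension of the Rasch model.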
2.1.4 Testlet Model-Rasch Testlet Model
To model examinees' responses to testlet items, IRT testlet models have been
proposed in which a random effect parameter is added to model the local dependence
among items within the same testlet (Bradlow, Wainer, & Wang, 1999; Wainer, Bradlow,
& Du, 2000; Wang, Bradlow, & Wainer, 2002). Following this general approach, a
simplified testlet model was developed by Wang and Wilson (2005), and it can be written
as
$$P_{ji1} = \frac{\exp(\theta_j - b_i + \gamma_{jd(i)})}{1 + \exp(\theta_j - b_i + \gamma_{jd(i)})} \tag{2-5}$$
where $P_{ji1}$ is the probability that examinee $j$ answers item $i$ correctly (scoring 1);
$\theta_j \sim N(0, 1)$ is the ability of examinee $j$;
$b_i \sim N(\mu_b, \sigma_b^2)$ is the difficulty of item $i$; and
$\gamma_{jd(i)} \sim N(0, \sigma^2_{\gamma_{d(i)}})$ is a random effect that represents the interaction of person $j$
with testlet $d(i)$ (i.e., testlet $d$ that contains item $i$).
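To make the role of the random testlet effect concrete, the sketch below simulates dichotomous responses with a person-by-testlet effect added to the Rasch logit. It is an illustration only, not the data-generation procedure of this study: the function name, the argument `gamma_sd` (a mapping from testlet label to the standard deviation of the testlet effect), and all numeric values are assumptions, and the logit is taken as ability minus difficulty plus the testlet effect.

```python
import math
import random


def simulate_rasch_testlet(n_persons, difficulties, testlet_ids, gamma_sd, seed=0):
    """Simulate 0/1 responses with a random person-by-testlet effect.

    difficulties : item difficulties b_i
    testlet_ids  : testlet label d(i) for each item
    gamma_sd     : dict mapping testlet label to the SD of its random effect
    """
    rng = random.Random(seed)
    labels = sorted(set(testlet_ids))
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # ability drawn from N(0, 1)
        # One testlet effect per person and testlet, shared by its items.
        gamma = {d: rng.gauss(0.0, gamma_sd[d]) for d in labels}
        row = []
        for b, d in zip(difficulties, testlet_ids):
            logit = theta - b + gamma[d]
            p = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


# Hypothetical design: two testlets of size 3 with different dependence levels.
data = simulate_rasch_testlet(
    n_persons=200,
    difficulties=[-1.0, 0.0, 1.0, -0.5, 0.0, 0.5],
    testlet_ids=[0, 0, 0, 1, 1, 1],
    gamma_sd={0: 0.5, 1: 1.0},
)
print(len(data), data[0])
```

Setting a testlet's standard deviation to zero removes its effect, so the items of that testlet follow the standard Rasch model.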
2.1.5 Local Item Dependence
As we mentioned before, local item independence (LID) is the first a priori
assumption of IRT models. It means that the item responses are conditionally
independent given the latent trait. Therefore, there should not be any correlation
between two items after controlling for the underlying trait. The items should only be
correlated through the latent trait that the test is measuring (Lord and Novick, 1968).
However, this LID assumption is nearly always violated in real applications. Sometimes,
significant correlation among items remains after controlling for the effect of the latent
trait. Because of these significant correlations, the items are locally dependent or there
is a subsidiary dimension in the measurement that is not accounted for by the
overarching dimension trait. Locally dependent items are always the cause of
information loss for IRT models (Chen & Thissen, 1997).
Several indices have been proposed to detect local item dependence for dichotomous item response models. Yen (1984, 1993) introduced the $Q_3$ statistic and compared it with other traditional measures: $Q_1$ (Yen, 1981), $Q_2$ (Van den Wollenberg, 1982), and signed $Q_2$ (Van den Wollenberg, 1982). The $Q_3$ statistic is the correlation between the residuals of an item pair once the effect of the latent trait is removed. Although the $Q_3$ statistic has been commonly used for several years, it has two major deficiencies in applied settings. First, it requires a latent trait estimate to be computed before the item-pair residual correlation can be calculated. Second, the entire set of test data must be used to compute it. Therefore, Chen and Thissen (1997) proposed four LID indices based on the expected frequencies computed from IRT models. The calculation of these four local dependence indices uses a subset of items without using the $\theta$ estimates. The four indices are Pearson's $X^2$, the likelihood ratio $G^2$, the standardized $\phi$ coefficient difference, and the standardized log-odds ratio difference $\tau$. All four indices are defined for a pair of items.
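To make the residual-correlation idea behind Yen's statistic concrete, the sketch below computes it for one item pair under a Rasch model. It assumes that ability estimates and item difficulties are already available; the function names and the toy data are hypothetical and serve only as an illustration.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((c - my) ** 2 for c in y)
    return sxy / math.sqrt(sxx * syy)


def q3(responses_i, responses_j, thetas, b_i, b_j):
    """Correlation of the residuals (observed minus expected) of two items."""
    res_i = [x - rasch_prob(t, b_i) for x, t in zip(responses_i, thetas)]
    res_j = [x - rasch_prob(t, b_j) for x, t in zip(responses_j, thetas)]
    return pearson(res_i, res_j)


# Toy data: six examinees with assumed ability estimates.
thetas = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
item_a = [0, 1, 0, 1, 1, 1]
item_b = [0, 1, 0, 1, 1, 1]  # identical to item_a: an extreme case of dependence
q3_same = q3(item_a, item_b, thetas, b_i=0.0, b_j=0.0)
```

Two items with identical responses and identical difficulty have identical residuals, so the statistic reaches its upper bound of one; locally independent items would be expected to give values near zero.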
Ponocny (2001) proposed a general family of conditional nonparametric tests to detect differences between groups and items for Rasch models, including tests of the items' local stochastic independence (e.g., $T_1$). By creating a two-by-two table for two items, the comparison can be subjected to the standard $\chi^2$-test or Fisher's exact test (Fischer, 1974). This test is able to detect differences in covariance between item pairs. The test of the local independence assumption can be conducted via a suitable contingency table (Ponocny, 2001). The general family of conditional nonparametric tests is embedded in the following extension of the Rasch model:
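$$L(A \mid \theta, \beta) = \frac{\exp\left(\delta T(A) + \sum_{v=1}^{n} r_v \theta_v - \sum_{i=1}^{k} s_i \beta_i\right)}{\prod_{v=1}^{n} \prod_{i=1}^{k} \left(1 + \exp(\theta_v - \beta_i)\right)} \tag{2-6}$$
where $\theta_v$ is the examinee's ability parameter;
$r_v$ is the examinee's raw score;
$s_i$ is the item marginal sum for the item difficulty parameter $\beta_i$ ($i = 1, \ldots, k$).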
The random variable $T$ is a sufficient statistic for the parameter $\delta$, which expresses a certain violation of the Rasch model (Ponocny, 2001). Based on the conditional nonparametric tests from Ponocny (2001), local item dependence is assessed through the inter-item correlation between item $I_i$ and item $I_j$ ($i, j \le K$, $i \ne j$). The inter-item correlation is based on the ($2 \times 2$)-table, by counting the cases with equal responses on both items (Ponocny, 2001). The statistic $T_1(A)$ is applied for local item dependence detection as below:
$$T_1(A) = \sum_{v} \delta_{x_{vi} x_{vj}} \tag{2-7}$$
where $\delta_{x_{vi} x_{vj}}$ indicates the Kronecker symbol, with $\delta_{x_{vi} x_{vj}} = 1$ for $x_{vi} = x_{vj}$ and $\delta_{x_{vi} x_{vj}} = 0$ otherwise. Then, a goodness-of-fit test is conducted to compare the model-implied proportions with the observed value from the matrix for the two specific items. The sum of the $T_1$'s over the item pairs serves as a test statistic when two or more item pairs are investigated simultaneously (Ponocny, 2001). In addition, an overall test statistic ($T_{11}$) for the local dependence of the test is given by summing the absolute deviations of all inter-item correlations $r_{ij}$ in the test from their expected values $\rho_{ij}$. The test statistic $T_{11}$ is shown below (Ponocny, 2001):
$$T_{11}(A) = \sum_{ij} \left| r_{ij} - \rho_{ij} \right| \tag{2-8}$$
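The counting step behind the $T_1$ statistic can be sketched in a few lines of Python. The sketch only computes the observed count of equal responses for one item pair; the reference distribution that Ponocny (2001) obtains by sampling matrices with fixed margins is not reproduced here. The function name and the small response matrix are hypothetical.

```python
def t1_statistic(matrix, i, j):
    """Number of examinees whose responses to items i and j are equal."""
    return sum(1 for row in matrix if row[i] == row[j])


# Rows are examinees, columns are items (toy data).
data = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
]
t1_items_0_1 = t1_statistic(data, 0, 1)  # examinees 1, 2 and 4 agree
t1_items_0_2 = t1_statistic(data, 0, 2)  # examinees 3 and 4 agree
```

A count that is unusually large relative to what the Rasch model implies for matrices with the same margins would point to positive local dependence between the two items.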
2.2 Reliability
In educational measurement, reliability is a statistical index that quantifies the consistency of test scores. If the local item independence assumption is violated, the measurement errors are underestimated, which yields an inflated reliability estimate. Violations of the local item independence assumption commonly occur in testlets. The test construct is subject to the impact of measurement errors that are not related to the latent traits the test intends to measure. Thus, these measurement errors determine how reliably the test measures the construct. Test reliability has been consistently mentioned in previous testlet research. A concern about test reliability was expressed regarding the creation of super polytomous items to handle testlets (Keller, Swaminathan, & Sireci, 2003). The approach that treats testlets as polytomous items may lose the information contained in the response pattern, so that the measurement errors may increase and reduce the overall test reliability (Keller et al., 2003). In addition, compared to the original dichotomous items, some parameters are dropped when the polytomous items are formed, so the test reliability may decrease (Zenisky, Hambleton, & Sireci, 2002). Yen (1993) also claimed that, when items are combined into testlet scores and some of the items within a testlet are locally dependent, the reliability will be underestimated. Thus, a comparison of the test reliabilities among the three models is especially necessary for model selection.
2.3 Survey in Application of Testlet
The review of the applied testlet literature in the EBSCOhost and PsycINFO databases also identified other factors that may affect the application of testlet models. The testlet/independent item ratio within a test, in terms of the number of testlets, is another important factor in testlet research. Among the forty-one articles in which the number of testlets is specified, the general mean of the number of testlets, including sub-conditions within each article, was 7.9 and the standard deviation was 6.70. The largest design used fifty testlets, in Wainer and Wang's (2000) article. One other study used a large number of testlets: Tokar, Fischer, Snell, and Harik-Williams (1991) included twenty testlets in their research. Except for these two designs, all the other articles (39) contained three to fifteen testlets (e.g., Wainer & Lewis, 1990; Thissen, Steinberg, & Mooney, 1989; Wang, Cheng, & Wilson, 2005; Wainer, 1995; Yang & Gao, 2008). This range gave clear guidance for this study's research designs. Detailed information on the number of testlets used in previous testlet studies is presented in Table 2-1.
Based on the same literature review, forty-three out of fifty-five studies identified test lengths in their research designs. Across these forty-three articles, test length ranged from 13 to 899. The mean test length (64.74) was obtained by first removing the largest test length (i.e., 899) from Wainer and Wang's (2000) article, then summing the remaining test lengths and dividing by the number of articles in which the test length was included in the design (Table 2-2).
Sample size is the third factor that can influence the analysis of testlets. In the testlet application literature, thirty-seven articles identified the sample size, with a mean of 2047.22 and a standard deviation of 2105.86. Since some studies used extremely large samples (i.e., 8912 in Brandt's 2008 article and 8494 in Zenisky, Hambleton, and Sireci's 2002 study), the median sample size of 681 may be more illustrative. The range of the sample sizes provides a guideline for our research design. First, in twelve of the thirty-seven articles reviewed, researchers included sample sizes smaller than 500 (e.g., Adams, Wilson, & Wang, 1997; Wang, 2005; Schmitt, 2002). Second, eighteen articles included sample sizes between 500 and 1000 (e.g., Adams, Wilson, & Wang, 1997; Stark, Chernyshenko, & Drasgow, 2004). Finally, twenty studies included sample sizes larger than 1000 (e.g., Brandt, 2008; Wainer & Wang, 2000; Thissen, Steinberg, & Mooney, 1989). Table 2-3 details the sample sizes used in previous studies.
Seventeen studies used the RMSE and the loglikelihood ratio test as evaluation criteria (e.g., Stark, Chernyshenko, & Drasgow, 2004; DeMars, 2006; Armstrong, 2004). The next most commonly used criteria were the reliability coefficient and bias (used by nine and five papers, respectively) (e.g., Stark, Chernyshenko, & Drasgow, 2004; DeMars, 2006; Armstrong, 2004; Davis, 2003; Schmitt, 2002). Various other indices (e.g., AIC, WMSE, RMSEA, NNFI, CFI, GFI, Q3, RMS) were used in twenty studies (e.g., Gessaroli & Folske, 2002; Schmitt, 2002; Adams, Wilson, & Wang, 1997). The RMSE and the loglikelihood ratio test were thus the criteria on which researchers relied most often to compare model fit and parameter estimates. Table 2-4 presents detailed information on the fit criteria used.
Finally, estimation methods were reported in twenty-nine studies. Twenty-four of these articles applied the Marginal Maximum Likelihood (MML) method (e.g., Lee, 2006; Wang & Wilson, 2005; Wainer, 1995). Only five articles used the Markov Chain Monte Carlo (MCMC) method (e.g., Li, 2005; Li, 2006; Wang, 2002; Wainer & Wang, 2000). The number of data analysis iterations was reported in only eleven articles (e.g., Lee, 2000; Ip, Smits, & De Boeck, 2009; Stark, Chernyshenko, & Drasgow, 2004). Among these eleven articles, five applied 100 iterations (e.g., Stark, Chernyshenko, & Drasgow, 2004; DeMars, 2006) and only two articles applied even more (200 and 600) iterations (Li, 2006; Zwick, 2002). Tables 2-5 and 2-6 include detailed information on the estimation methods and numbers of iterations used in the studies reviewed.
Table 2-1. The number of testlets in the dataset
Article   Testlet numbers    Mean testlet number per article
1         2, 3               2.5
2         4, 8               6.0
3         5                  5
4         6                  6
5         4                  4
6         15                 15
7         4                  4
8         5                  5
9         5                  5
10        5                  5
11        5, 10              7.5
12        7                  7
13        4, 8, 10           7.3
14        6                  6
15        6, 10              8.0
16        4                  4.0
17        2                  2.0
18        9, 16, 14, 11      12.5
19        8                  8.0
20        16                 16.0
21        8, 9               8.5
22        5, 10              7.5
23        6, 3, 2            3.7
24        4, 5, 7, 8         6.0
25        4                  4.0
26        11                 11.0
27        5, 7, 8            6.7
28        5, 7, 8            6.7
29        50, 36             43.0
30        7                  7.0
31        4, 5, 7, 8         6.0
32        3, 6               4.5
33        20                 20.0
34        4, 5, 9            6.0
35        5                  5.0
36        2, 5, 6, 10        5.8
37        4                  4.0
38        4                  4.0
39        10                 10.0
40        10                 10.0
41        10                 10.0
Testlet number general mean: 7.9
SD of mean: 6.70
Table 2-2. Test length in the reviewed articles
Article   Test lengths                                   Mean length
1         64                                             64
2         194                                            194
3         76                                             76
4         120                                            120
5         22                                             22
6         50                                             50
7         60                                             60
8         125                                            125
9         30, 33, 24, 40, 47, 36, 40, 42, 50, 51, 54     40.64
10        25, 50, 15                                     30
11        150                                            150
12        20                                             20
13        13, 17, 18                                     17.50
14        30, 50                                         40
15        60                                             60
16        64                                             64
17        60, 90                                         75
18        35, 41, 35, 35                                 36.50
19        55                                             55
20        101                                            101
21        55, 63                                         59
22        150                                            150
23        30                                             30
24        38, 26, 46, 56, 40                             41.20
25        40                                             40
26        50                                             50
27        49, 33, 36                                     39.33
28        49, 36, 43, 33                                 40.25
29        690, 290
30        42                                             42
31        38, 46, 26, 30, 43, 43                         37.67
32        60                                             60
33        60                                             60
34        44, 57, 26, 33, 44                             40.80
35        60                                             60
36        20                                             20
37        60, 125                                        92.50
38        30                                             30
39        75                                             75
40        75                                             75
41        20                                             20
42        120                                            120
43        137                                            137
General mean: 64.77
Table 2-3. Sample sizes in the reviewed articles
Columns: article number; sample sizes in paper (sets 1 to 6); mean sample size per paper; indicators for sample size < 500, 500 < sample size < 1000, and sample size > 1000
1 700 300
500.00 1 1 2 200 500
350.00 1 1
3 8912
8912
1 4 3866
3866
1
5 1210 589 352
717.00 1 1 1 6 4000
4000
1
7 500 1000 2000 5000
2125.00 1 1 1 8 1000
1000
1
9 500 2000 8000
3500.00 1 1 10 2000 5000
3500.00
1
11 2000
2000
1 12 5000
5000
1
13 300 500
400.00 1 1 14 1000 1392
1196.00
1 1
15 2000
2000
1 16 570 499 522 495
521.50 1 1
17 1000
1000
1 18 8026 8494
8260.00
1 1
19 3000 1000 5000
3000.00 1 1 20 1000 266
633.00 1 1
21 1996
1996
1 22 466
466 1
23 663 632 537 680 653 561 621.00
1 24 663 632 537 680 653 561 621.00
1
25 544
544
1 26 1000
1000
1
27 985 629 914 682 666 1000 812.67
1 28 1000
1000
1
29 485
485 1 30 3000
3000
1
31 100
100 1 32 1040
1040
1
33 4028
4028
1 34 4028
4028
1
35 500
500
1 36 10 15 25 50
25.00 1
37 3000
3000
1 General mean
2047.22 12 18 20
Standard Deviation
2105.86 32.43% 48.65% 54.05%
General median 681
Table 2-4. Fit indices in reviewed articles
Fit index                    Articles   Percentage
Bias                         5          11.90%
RMSE                         17         40.48%
Reliability coefficient      9          21.43%
Loglikelihood ratio test     17         40.48%
WMSE                         2          4.76%
AIC                          1          2.38%
Other index                  20         47.62%
Total articles               42
Table 2-5. Estimation method in reviewed articles
Estimation method MML MCMC total
Articles 24 5 29
percentage 82.76% 17.24%
Table 2-7. The number of simulation replications applied in the reviewed articles
Replication number   10      100      200      600      1000    Total
Frequency            1       5        2        2        1       11
Percentage           9.09%   45.45%   18.18%   18.18%   9.09%
CHAPTER 3 METHOD
A comprehensive review of the testlet research from 1989 to 2009 provides us with a systematic framework for exploring the performance of three different IRT models for analyzing testlets. These three models are used in the two studies presented in this paper. The first is a series of simulation studies designed to investigate the extent to which variation in testlet conditions (testlet size, local dependence effects, etc.) influences the model fitting results. Simulations are conducted to evaluate model fit, test reliability, and parameter recovery of the three IRT models. Next, a real data analysis of the COMLEX-USA exam dataset is presented as an empirical case by fitting the different models. The three one-parameter IRT models adopted in the study are: the Rasch model, the Partial Credit Model, and the Rasch testlet model.
3.1 Model Used to Generate Data
The current study evaluates the effect of changes in the local effect of testlets on the model fit, ability parameter recovery, and test reliability of three different IRT models. In order to quantify the extent of the local effect, the Rasch testlet model is appropriate for simulating the research data. The Rasch testlet model (Wang & Wilson, 2005) includes a testlet parameter ($\gamma_{jd(i)}$), which is the random effect capturing the interaction of person $j$ with testlet $d(i)$ when the overarching latent trait is held constant. According to the definition of the testlet, the sum of testlet parameters ($\gamma_{jd(i)}$) over examinees within any testlet is zero ($\gamma_{jd(i)} \sim N(0, \sigma^2_{\gamma_{d(i)}})$). Thus, the local effects of testlets in the Rasch testlet model are simulated from the normal distribution with a mean of zero and a standard deviation equal to the square root of the given local effect value ($\sigma^2_{\gamma_{d(i)}}$). The following prior model constraints are used to simulate the responses.
With $v = 1, \ldots, V$ and $V$ the total number of examinees,
$$\sum_{d=1}^{D} \gamma_{vd(i)} = 0 \quad \text{for all } v = 1, \ldots, V \tag{3-1}$$
$$\rho(\theta_v, \gamma_{vd(i)}) = 0 \quad \text{for all } d = 1, \ldots, D \tag{3-2}$$
$$\rho(\gamma_{vd(i)}, \gamma_{vd(j)}) = 0 \quad \text{for all } d = 1, \ldots, D \tag{3-3}$$
3.2 Population Parameters
Population ability parameters for the Rasch testlet model are simulated from the normal distribution with a mean of zero and a standard deviation of one, within a range from negative three to positive three (i.e., $\theta_j \sim N(0, 1)$, $\theta_j \in [-3, 3]$). For each condition, the population item difficulty parameters are likewise generated from a normal distribution with a mean of zero and a standard deviation of one ($b_i \sim N(\mu_b = 0, \sigma_b^2 = 1)$), with a range of $[-3, 3]$. For simplicity, all simulated population parameters are rounded to three decimal places. The population item parameters and population ability parameters are randomly drawn from these two normal distributions ahead of each condition.
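One way such bounded normal draws could be produced is sketched below in Python. The study itself used R, and the text does not say how the range restriction was enforced, so the rejection step (redrawing values outside the range), the function name, and the seed are assumptions made only for illustration.

```python
import random


def draw_truncated_normal(n, mean=0.0, sd=1.0, low=-3.0, high=3.0, rng=None):
    """Draw n normal values, redrawing any value outside [low, high] (assumed rule)."""
    if rng is None:
        rng = random.Random()
    values = []
    while len(values) < n:
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            values.append(round(x, 3))  # three decimal places, as in the text
    return values


rng = random.Random(2010)  # arbitrary seed for reproducibility
abilities = draw_truncated_normal(1000, rng=rng)    # population abilities
difficulties = draw_truncated_normal(60, rng=rng)   # population item difficulties
```

Because only about 0.3% of standard normal draws fall outside the range, the restriction changes the distribution very little.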
3.3 Condition Manipulated
In this study, we examine whether fluctuations of testlet size, local dependence effects, and item difficulty within testlets affect the reliabilities and the model fit of three different IRT models. Our study is a four-factor completely crossed design: 2 (testlet sizes) × 4 (levels of local dependence effect) × 3 (ratios of testlet items to general items in the test) × 3 (sample sizes). Table 3-1 lists all 72 conditions formed by crossing these four factors.
1. The first factor is the testlet size. The testlet sizes chosen for this study are based on the purpose of the study and on the sizes less often discussed in the applied literature. Thus, two testlet sizes, small and medium, are used in this study: (3, 5).
2. The second factor is the local dependence effect. Local dependence effects in the ten reviewed studies are within the range of zero to one (Wainer & Wang, 2000; Wang, 1999; Wang, 2002; Wang, 2005; Habing & Roussos, 2003; Adams, Wilson & Wang, 1997; Wang & Wilson, 2005; DeMars, 2006; Li, 2005; Zenisky, Hambleton & Sireci, 2002). Therefore, four levels of local dependence effect will be examined: $\sigma^2_{\gamma}$ = (0.25, 0.5, 0.75, 1).
3. The third factor is the ratio of testlet items to general items in the test. Among all 60 items in the test, the ratio of testlet items to general items will be (1:3, 1:1, 3:1).
4. The fourth factor is the sample size of the examinees. Of seventy-four study groups in forty-five different articles from the applied literature, the distribution of sample size ranged from 10 to 8912, with two sample sizes greater than 8000 and four sample sizes smaller than 50. By dividing the remaining sixty-eight sample sizes into three groups according to the size ranking, and taking the approximate mean value of the sample sizes in each category, we selected three sample sizes for use in this study: (250, 500, 1000). These quantities represent rounded approximations of the most common sample sizes found in the applied literature.
5. Test length is another issue that must be considered ahead of the research design. The test length of this simulation is set to sixty (60 items per test), the approximate general mean of the test length among the reviewed testlet literature.
6. For each condition, the number of replications is selected based on the most frequent number of iterations in the applied literature. Thus, one hundred replications are applied in each condition.
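The fully crossed design described in the list above can be enumerated mechanically. The following Python sketch builds the 72 conditions and derives the number of testlets implied by each testlet size and item ratio for a 60-item test; the variable names and the dictionary layout are hypothetical.

```python
from itertools import product

TEST_LENGTH = 60
testlet_sizes = (3, 5)
local_effects = (0.25, 0.5, 0.75, 1.0)
testlet_to_general = ((1, 3), (1, 1), (3, 1))  # testlet items : general items
sample_sizes = (250, 500, 1000)

conditions = []
for size, effect, (t, g), n in product(
    testlet_sizes, local_effects, testlet_to_general, sample_sizes
):
    testlet_items = TEST_LENGTH * t // (t + g)  # 15, 30 or 45 items
    conditions.append(
        {
            "testlet_size": size,
            "local_effect": effect,
            "n_testlets": testlet_items // size,
            "n_independent_items": TEST_LENGTH - testlet_items,
            "sample_size": n,
        }
    )
```

With testlets of five items, the three ratios give 3, 6, or 9 testlets; with testlets of three items, they give 5, 10, or 15 testlets.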
3.4 Data Generation
The Rasch testlet model response data are generated using the statistical software R 2.10. Response data were generated for 100 samples from a set of population item parameters (60 items) and population ability parameters (1000 trait values $\theta_j$) for each condition. Local effects were assigned to each testlet accordingly. Each simulee was assigned a known trait value $\theta_j$ from the randomly selected population ability parameters. By comparing the combined effect of the local effects within testlets and the randomly selected population item parameters with the known trait value $\theta_j$ of each simulee, the probability of observing the response matrix $X = (x_1, \ldots, x_N)$ from a sample of $N$ independently responding examinees can be represented as
$$P(X \mid \theta, b) = \prod_{j} P(x_j \mid \theta_j, b, \gamma_{jd(i)}) = \prod_{j} \prod_{i} P(x_{ji} \mid \theta_j, b_i, \gamma_{jd(i)}) \tag{3-4}$$
where $\theta = (\theta_1, \ldots, \theta_N)$, $b = (b_1, \ldots, b_J)$, and $\gamma_{jd(i)}$ are all considered unknown, fixed parameters.
Thus, a response matrix of logical indicators was generated for each replication within every condition. Then, a series of random numbers was drawn from a uniform distribution ranging from 0 to 1 to match the logical response matrix accordingly. If the known trait value $\theta_j$ is less than the combined effect of the item and testlet, the logical indicator is false, which sets the simulee's response to 0. On the other hand, if the known trait value $\theta_j$ is larger than the combined effect of the item and testlet, the logical indicator is true, which sets the simulee's response to 1. This process repeats for every item and every simulee in each of the 100 samples. Thus, 100 simulated response data sets are generated for each condition.
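A common way to turn model probabilities and uniform random numbers into 0/1 responses is to score an item as correct whenever the uniform draw falls below the model probability. The Python sketch below follows that convention for a Rasch testlet model. It is not the R code used in this study: the comparison rule, the function name, the coding of independent items as -1, and the toy dimensions are all assumptions made for illustration.

```python
import math
import random


def simulate_responses(thetas, difficulties, testlet_of_item, testlet_var, rng):
    """Simulate 0/1 responses; testlet_of_item holds a testlet index or -1."""
    n_testlets = max(testlet_of_item) + 1
    data = []
    for theta in thetas:
        # One person-by-testlet effect per testlet, drawn from N(0, testlet_var).
        gammas = [rng.gauss(0.0, math.sqrt(testlet_var)) for _ in range(n_testlets)]
        row = []
        for b, d in zip(difficulties, testlet_of_item):
            gamma = gammas[d] if d >= 0 else 0.0  # independent items get no effect
            p = 1.0 / (1.0 + math.exp(-(theta - b + gamma)))
            row.append(1 if rng.random() < p else 0)  # uniform draw below p -> 1
        data.append(row)
    return data


rng = random.Random(1)
thetas = [rng.gauss(0.0, 1.0) for _ in range(250)]
difficulties = [rng.gauss(0.0, 1.0) for _ in range(8)]
testlet_of_item = [0, 0, 0, 1, 1, 1, -1, -1]  # two testlets of size 3, two free items
responses = simulate_responses(thetas, difficulties, testlet_of_item, 0.5, rng)
```

Because every item in a testlet shares the same person-specific effect, responses within a testlet remain correlated after conditioning on ability, which is the local dependence the design manipulates.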
3.5 Parameter Estimation
In this study, the parameters of the data sets under three different models (PCM, standard Rasch model, and Rasch testlet model) are estimated using Marginal Maximum Likelihood (MML) methods with ConQuest Version 2.0. The most frequently used approaches to item parameter estimation with unknown trait levels are Joint Maximum Likelihood (JML), Conditional Maximum Likelihood (CML), and Marginal Maximum Likelihood (MML). Holland (1990) compared the different sampling theory foundations of these three ML methods.
CML is possible only for the 1PL model and is so computationally intensive as to be impractical in many situations. JML was used extensively in early IRT programs. However, JML estimation also has some drawbacks for estimating IRT models. First, the JML item parameter estimates are biased and inconsistent for fixed-length tests. Second, the JML standard errors are probably too small because they do not account for the unknown person trait levels (Holland, 1990).
The most commonly used method for estimating the parameters of IRT models is MML. In MML estimation, unknown trait levels are handled by expressing the response pattern probabilities as expectations over a population distribution. MML has several advantages over the other two ML methods. First, MML is applicable to all types of IRT models. Second, MML is efficient for tests of different lengths. Third, the MML estimates of item standard errors may be justified as good approximations of the expected sampling variance of the estimates. Fourth, estimates are available for perfect scores. In the previous literature, the MML method is applied in 82.76% of the articles. Therefore, MML was chosen for parameter estimation in this study.
The simplified mechanism of MML is shown below. The knowledge about the examinee distribution ($p(\theta)$) is treated as a prior, and the item difficulty parameter is denoted $\beta$. That is, MML estimates of the item difficulty parameter ($\beta$) maximize
$$L(\beta \mid X) = \prod_{i} \int p(x_i \mid \theta, \beta)\, p(\theta)\, d\theta \tag{3-5}$$
Therefore, a posterior distribution ($p(\beta \mid X)$) is obtained for item parameters by multiplying $L(\beta \mid X)$ by $p(\beta)$ (Mislevy, 1986):
$$p(\beta \mid X) \propto L(\beta \mid X)\, p(\beta) \tag{3-6}$$
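The marginal likelihood that MML maximizes can be approximated numerically by replacing the integral over ability with a weighted sum over a grid of points. The Python sketch below does this for a Rasch model with a standard normal ability distribution. The equally spaced grid, the function names, and the toy data are assumptions for illustration and do not reproduce the algorithm implemented in ConQuest.

```python
import math


def normal_grid(n_nodes=41, low=-4.0, high=4.0):
    """Equally spaced nodes with normalised standard normal weights."""
    step = (high - low) / (n_nodes - 1)
    nodes = [low + k * step for k in range(n_nodes)]
    dens = [math.exp(-0.5 * q * q) for q in nodes]
    total = sum(dens)
    return nodes, [d / total for d in dens]


def marginal_loglik(data, difficulties, nodes, weights):
    """Sum over examinees of the log of the ability-averaged pattern probability."""
    total = 0.0
    for row in data:
        marginal = 0.0
        for q, w in zip(nodes, weights):
            like = 1.0
            for x, b in zip(row, difficulties):
                p = 1.0 / (1.0 + math.exp(-(q - b)))
                like *= p if x == 1 else 1.0 - p
            marginal += like * w
        total += math.log(marginal)
    return total


# Toy data: item 1 is answered correctly far more often than item 2.
data = [[1, 0], [1, 0], [1, 1], [0, 0], [1, 0]]
nodes, weights = normal_grid()
ll_good = marginal_loglik(data, [-1.0, 1.0], nodes, weights)  # item 1 easy, item 2 hard
ll_bad = marginal_loglik(data, [1.0, -1.0], nodes, weights)   # difficulties swapped
```

The difficulty values that match the data (an easy first item, a hard second item) give the larger marginal log-likelihood, which is the quantity an MML routine would climb.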
3.6 Ability Estimation
In this study, the simulees' performance on a test is scored based on their responses to items and the IRT models. The simulees' abilities are estimated by two different approaches: Maximum Likelihood Estimation (MLE; Lord, 1980) and Expected a Posteriori estimation (EAP; Bock & Mislevy, 1982).
Maximum likelihood estimation (MLE) is the most commonly used procedure for estimating examinees' abilities. Based on the examinee's responses on the test, MLE finds the value of the latent trait that maximizes the likelihood of an item response pattern, under the assumption that the item parameter values are known. The likelihood of the latent trait $\theta$, given an item response pattern $(x_1, x_2, \ldots, x_I)$, is denoted as
$$L(x_1, x_2, \ldots, x_I \mid \theta) = \prod_{i=1}^{I} P_{i x_i}(\theta) \tag{3-7}$$
where $P_{i x_i}(\theta)$ represents the probability of the given response $x_i$ to item $i$ and $I$ is the number of items in the test. Although MLE is the most common approach to ability estimation, some drawbacks of MLE must be addressed. First, MLE is not available for all-endorsed or all-not-endorsed item response patterns; for these two patterns, the MLE goes to infinity. Second, MLE may not converge when some response patterns are abnormal (Bock & Mislevy, 1982).
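The behaviour of the maximum likelihood estimate, including its failure for perfect and zero scores, can be seen in a short Newton-Raphson sketch for the Rasch model with known item difficulties. The function name, starting value, and stopping rule are hypothetical choices made for illustration; this is not the estimation routine of ConQuest.

```python
import math


def mle_theta(responses, difficulties, max_iter=50, tol=1e-6):
    """Newton-Raphson ability estimate; returns None when no finite MLE exists."""
    raw = sum(responses)
    if raw == 0 or raw == len(responses):
        return None  # all-wrong or all-right: the likelihood has no finite maximum
    theta = 0.0
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        gradient = raw - sum(probs)                    # first derivative of log-likelihood
        information = sum(p * (1.0 - p) for p in probs)  # negative second derivative
        step = gradient / information
        theta += step
        if abs(step) < tol:
            return theta
    return None  # did not converge within max_iter


difficulties = [-1.0, 0.0, 1.0]
theta_mixed = mle_theta([1, 1, 0], difficulties)    # finite estimate
theta_perfect = mle_theta([1, 1, 1], difficulties)  # None: estimate would be infinite
```

For the Rasch model the estimate is the ability at which the expected raw score equals the observed raw score, which is why a perfect or zero score has no finite solution.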
Expected a posteriori (EAP) estimation is an efficient approach to estimating an examinee's trait level. EAP is a Bayesian estimator with a non-iterative process. Unlike MLE, EAP provides a finite estimate for all-endorsed or all-not-endorsed item response patterns. The EAP estimate is the mean of the posterior distribution. For any test, a set of quadrature nodes ($Q_r$) is defined at a fixed number of specified trait levels $\theta$. There is a probability density $W(Q_r)$ corresponding to each quadrature node. The EAP trait estimate is derived by
$$\hat{\theta} = \frac{\sum_{r=1}^{N} Q_r\, L(Q_r)\, W(Q_r)}{\sum_{r=1}^{N} L(Q_r)\, W(Q_r)} \tag{3-8}$$
where $L(Q_r)$ represents the exponential of the log-likelihood function (i.e., the likelihood) evaluated at each of the $N$ quadrature nodes. However, some shortcomings of EAP should be mentioned. First, there is a tendency for Bayesian estimates to regress toward the mean of the prior distribution (Kim & Nicewander, 1993; Weiss, 1982). Since ConQuest provides EAP estimates both with and without regression, the EAP estimates without regression were applied. The other shortcoming of EAP is that its estimation accuracy is reduced by an improper prior distribution (Bock & Mislevy, 1982). Since both MLE and EAP have their pros and cons, the simulees' abilities in this study are estimated by both methods.
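The weighted-average form of the EAP estimate translates directly into code. The Python sketch below assumes a Rasch likelihood, a standard normal prior, and 41 equally spaced nodes between -4 and 4; these choices and the function name are illustrative assumptions rather than the settings used by ConQuest.

```python
import math


def eap_theta(responses, difficulties, n_nodes=41, low=-4.0, high=4.0):
    """Posterior mean of ability over a grid of quadrature nodes."""
    step = (high - low) / (n_nodes - 1)
    numerator = 0.0
    denominator = 0.0
    for r in range(n_nodes):
        q = low + r * step
        weight = math.exp(-0.5 * q * q)  # unnormalised N(0, 1) density; constant cancels
        like = 1.0
        for x, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(q - b)))
            like *= p if x == 1 else 1.0 - p
        numerator += q * like * weight
        denominator += like * weight
    return numerator / denominator


difficulties = [-1.0, 0.0, 1.0]
eap_mixed = eap_theta([1, 1, 0], difficulties)
eap_all_correct = eap_theta([1, 1, 1], difficulties)  # finite, unlike the MLE
eap_all_wrong = eap_theta([0, 0, 0], difficulties)
```

The estimate for the perfect score is finite, and the estimate for the mixed pattern lies closer to the prior mean of zero than a maximum likelihood estimate would, which illustrates the regression toward the prior mean noted in the text.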
3.7 Analysis
In this study, each simulated data set was analyzed using ConQuest. Since the polytomous response patterns under the partial credit model differ from the dichotomous response patterns, the observed data differ between these two types of models (i.e., polytomous IRT model, dichotomous IRT model). Thus, using the loglikelihood ratio test and Akaike's information criterion (AIC) as measures of model goodness of fit is inappropriate. Therefore, the accuracy of ability parameter estimation under the three models was quantified via bias and root mean square error (RMSE) across all replications. Local item dependence in the real data was examined by the conditional nonparametric tests ($T_1$; Ponocny, 2001). The test reliability coefficients were also calculated for the simulated data and the real data.
3.7.1 Bias
Bias is defined as the average difference between estimated and true parameters across all people or items. An estimate of bias is calculated for each replication in each condition, and these estimates are averaged to obtain the bias of each condition in the simulation. Bias is mathematically defined as:
$$\text{bias} = \frac{\sum_{j=1}^{n} (\hat{\theta}_j - \theta_j)}{n} \tag{3-9}$$
where $\theta_j$ is the true value of an item or person parameter;
$\hat{\theta}_j$ is the estimated value of that parameter;
$n$ is the total number of instances of that type of parameter within a replication (i.e., the sample size for ability $\theta$).
3.7.2 Root Mean Square Error (RMSE)
RMSE is a measure of absolute accuracy in parameter estimation. RMSE is calculated for each parameter type in a replication, and an average is then found within each condition. RMSE is the square root of the average squared difference between estimated and true parameters, and is mathematically defined as:
$$\text{RMSE} = \sqrt{\frac{\sum_{j=1}^{n} (\hat{\theta}_j - \theta_j)^2}{n}} \tag{3-10}$$
where the terms in the equation are defined as they are for bias.
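The two recovery criteria can be computed for a single replication as in the following Python sketch. The difference is taken as estimate minus true value, which is an assumed sign convention, and the function names and numbers are hypothetical.

```python
import math


def bias(estimates, truths):
    """Mean signed difference between estimated and true values."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(truths)


def rmse(estimates, truths):
    """Square root of the mean squared difference."""
    return math.sqrt(
        sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths)
    )


true_theta = [-1.0, 0.0, 1.0, 2.0]
est_theta = [-0.5, 0.5, 0.5, 1.5]  # two overestimates, two underestimates
replication_bias = bias(est_theta, true_theta)
replication_rmse = rmse(est_theta, true_theta)
```

In this toy case the over- and underestimates cancel, so the bias is zero while the RMSE is 0.5. This is why the two criteria are reported together: bias captures systematic direction, and RMSE captures overall error size.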
3.7.3 Reliability
In this study, test reliability coefficients were computed for item responses scored dichotomously under both the Rasch testlet model and the standard Rasch model, as well as for item responses scored polytomously under the Partial Credit Model. As we use MML estimation in ConQuest, the test reliability can be calculated as
$$\text{Test Reliability} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(\mathrm{EAP})} = \frac{S^2(\hat{\theta}) - \mathrm{s.e.}^2(\hat{\theta})}{S^2(\hat{\theta})} \tag{3-11}$$
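A reliability coefficient of this kind is often computed as the variance of the ability estimates minus the average squared standard error, divided by the variance of the ability estimates. The Python sketch below follows that reading; the use of the sample variance, the averaging of squared standard errors, the function name, and the numbers are assumptions for illustration and may differ from what ConQuest reports.

```python
def reliability(estimates, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    n = len(estimates)
    mean = sum(estimates) / n
    observed_var = sum((e - mean) ** 2 for e in estimates) / (n - 1)
    error_var = sum(se ** 2 for se in standard_errors) / n
    return (observed_var - error_var) / observed_var


theta_hat = [-1.0, 0.0, 1.0]   # toy ability estimates, variance 1.0
se_theta = [0.5, 0.5, 0.5]     # toy standard errors, error variance 0.25
rel = reliability(theta_hat, se_theta)
```

With an observed variance of 1.0 and an error variance of 0.25, the coefficient is 0.75. Underestimated standard errors shrink the error variance and push the coefficient upward.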
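Table 3-1. Study design condition with 3 factors
            Testlet size 5                                       Testlet size 3
Condition   Sample size   Testlet number   Local effect    Condition   Sample size   Testlet number   Local effect
1           1000          9                0.25            37          1000          15               0.25
2           1000          9                0.5             38          1000          15               0.5
3           1000          9                0.75            39          1000          15               0.75
4           1000          9                1               40          1000          15               1
5           1000          6                0.25            41          1000          10               0.25
6           1000          6                0.5             42          1000          10               0.5
7           1000          6                0.75            43          1000          10               0.75
8           1000          6                1               44          1000          10               1
9           1000          3                0.25            45          1000          5                0.25
10          1000          3                0.5             46          1000          5                0.5
11          1000          3                0.75            47          1000          5                0.75
12          1000          3                1               48          1000          5                1
13          500           9                0.25            49          500           15               0.25
14          500           9                0.5             50          500           15               0.5
15          500           9                0.75            51          500           15               0.75
16          500           9                1               52          500           15               1
17          500           6                0.25            53          500           10               0.25
18          500           6                0.5             54          500           10               0.5
19          500           6                0.75            55          500           10               0.75
20          500           6                1               56          500           10               1
21          500           3                0.25            57          500           5                0.25
22          500           3                0.5             58          500           5                0.5
23          500           3                0.75            59          500           5                0.75
24          500           3                1               60          500           5                1
25          250           9                0.25            61          250           15               0.25
26          250           9                0.5             62          250           15               0.5
27          250           9                0.75            63          250           15               0.75
28          250           9                1               64          250           15               1
29          250           6                0.25            65          250           10               0.25
30          250           6                0.5             66          250           10               0.5
31          250           6                0.75            67          250           10               0.75
32          250           6                1               68          250           10               1
33          250           3                0.25            69          250           5                0.25
34          250           3                0.5             70          250           5                0.5
35          250           3                0.75            71          250           5                0.75
36          250           3                1               72          250           5                1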
CHAPTER 4 RESULTS
4.1 MLE Non-convergence Issue
In using the MLE and EAP methods to estimate the simulees' abilities, a large number of non-convergence cases arose in the Rasch testlet model results under MLE estimation in the 1000 sample size condition. The same pattern of non-convergence also occurred across the other sample sizes. The number and percentage of non-convergence cases are displayed in Tables 4-1 and 4-2. Inspection of the response patterns of the non-convergent cases found neither all-not-endorsed nor all-endorsed response patterns. Therefore, this phenomenon may occur because of the complexity of the multidimensionality of the Rasch testlet model. So, for precision purposes, only EAP estimates are used in this study.
4.2 Test Reliability
A summary of the test reliability analyses is presented in Tables 4-3 and 4-4. Three columns of estimates, one per model, are provided for each condition. For most of the conditions, the reliability estimates from the standard Rasch model are higher than those from both the Partial Credit model and the Rasch testlet model. The associations between test reliability and other factors are described below.
First, the difference in test reliability estimates between the standard Rasch model and the other two models indicates a strong association between the ratio of independent items to testlet items within a test and test reliability overestimation. In general, the test reliability from the standard Rasch model is higher than the corresponding coefficient from the other two models (by 0.01 to 0.08). As the ratio of independent to testlet items within a test decreases (i.e., a greater proportion of testlet items is included in a test), the extent of test reliability overestimation due to ignoring the local item dependence increases. This overestimation occurs because the standard Rasch model assumes that items within a test are locally independent, whereas the testlet items are locally dependent. Second, the difference in test reliability estimates between the standard Rasch model and the other two models across different sample sizes indicates a strong association between sample size and test reliability overestimation. As the sample size increases, the extent of test reliability overestimation increases as well. No evident patterns were found linking test reliability to local effect or testlet size.
As for the test reliability comparison between Partial Credit model and the Rasch
testlet model, no obvious differences were observed between the test reliability
estimates computed from these two models when the testlet size was held to three
items. Moreover, as the testlet size was set to five, under most of the circumstances,
the test reliability estimates from the Partial Credit model are slightly smaller than their
corresponding estimates from the Rasch testlet model, but the differences are generally
smaller than 0.01. This is because more parameters are dropped from the polytomous
model compared to the individual item-scoring models as the testlet size increases,
thereby decreasing the effective test length and decreasing the estimation of the test
reliability (Zenisky, Hambleton & Sireci, 2002). In addition, any association between the
magnitude of local item effect and the variation of the test reliability estimates was not
obvious in the results from this study.
The results of the Spearman-Brown prophecy formula are listed in Tables 4-5 and 4-6. The Spearman-Brown values from the three models also indicate the effect of the reliability overestimation produced by the standard Rasch model. If a test administrator claims that a testlet-based test satisfies some required reliability level on the basis of the overestimated reliability coefficient from the standard Rasch model, these results provide an estimate of the amount by which the testlet-based test would need to be lengthened to actually achieve that overestimated reliability. For the sample size 1000 conditions, the test length would need to increase by a factor of roughly four to five (from 3.984 to 5.138) in the testlet-based test (i.e., PCM, Rasch testlet model) to achieve the (overestimated) level of test reliability indicated by the standard Rasch model. When the sample size is halved (500), the required increase in test length is roughly halved as well. When the sample size is reduced to a quarter (250), the required increase in test length is at its minimum but still positive.
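The lengthening factor discussed above follows from the standard Spearman-Brown prophecy formula, sketched below in Python. The function names and the reliability values in the example are hypothetical and are not taken from Tables 4-5 and 4-6.

```python
def lengthening_factor(current_reliability, target_reliability):
    """Factor by which test length must grow to move from current to target reliability."""
    return (target_reliability * (1.0 - current_reliability)) / (
        current_reliability * (1.0 - target_reliability)
    )


def projected_reliability(current_reliability, factor):
    """Spearman-Brown reliability of a test lengthened by the given factor."""
    return factor * current_reliability / (1.0 + (factor - 1.0) * current_reliability)


# Hypothetical values: 0.80 under a testlet-aware model, 0.90 under the standard Rasch model.
factor = lengthening_factor(0.80, 0.90)
check = projected_reliability(0.80, factor)
```

In this example the test would have to be 2.25 times as long for the testlet-aware reliability to reach the value reported by the model that ignores local dependence, and plugging the factor back into the prophecy formula recovers 0.90.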
4.3 Standard Error of Measurement
The magnitude of the standard error of measurement (SEM) for the three models was also used for model comparison in this study. Tables 4-7 and 4-8 list the mean SEM for all 72 conditions over 100 replications. In general, the mean SEM values obtained from the standard Rasch model were smaller than those obtained from the other two models (i.e., Partial Credit model, Rasch testlet model), but the differences were generally smaller than 0.02. This phenomenon occurs because ignoring the local dependency within a testlet leads to an underestimate of the standard errors.
The magnitude of the SEM underestimation appears to depend on the testlet size. In
Tables 4-7 and 4-8, even at the same independent/testlet item ratio, a larger testlet size
(i.e. testlet size 5) led, on average, to greater underestimation of the SEM than a smaller
testlet size (i.e. testlet size 3). The difference in SEM underestimation attributable to
testlet size was generally around 0.01. No obvious association was observed between the
SEM estimates and the variation in local effect across conditions.
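As a rough illustration of the testlet size effect, the gap between the standard Rasch SEM and the SEM of the other two models can be computed for one matched pair of conditions. The sketch below uses the first row of Table 4-7 and the first row of Table 4-8 (sample size 1000, local effect 0.25); the dictionary keys and the helper name are hypothetical.

```python
# Mean SEM for the first condition of each table (sample size 1000, local effect 0.25).
sem_size3 = {"testlet": 0.300381, "pcm": 0.305033, "rasch": 0.289589}  # Table 4-7
sem_size5 = {"testlet": 0.299785, "pcm": 0.304136, "rasch": 0.275109}  # Table 4-8


def rasch_gap(sem: dict, reference: str) -> float:
    """How far the standard Rasch SEM falls below a reference model's SEM."""
    return sem[reference] - sem["rasch"]


for ref in ("testlet", "pcm"):
    gap3 = rasch_gap(sem_size3, ref)
    gap5 = rasch_gap(sem_size5, ref)
    print(f"{ref}: size 3 gap = {gap3:.4f}, size 5 gap = {gap5:.4f}, "
          f"increase = {gap5 - gap3:.4f}")
```

For this pair of conditions the gap widens by a little over 0.01 when the testlet size goes from 3 to 5, in line with the figure quoted above; other rows of the tables would need to be checked in the same way before generalizing.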
4.4 Bias and RMSE
Tables 4-9 and 4-10 list the mean bias estimates and the mean RMSE estimates for all 72
conditions over 100 replications. Each table contains three sets of estimates, one for
each model (i.e. standard Rasch model, partial credit model, Rasch testlet model), for
the testlet size 3 and testlet size 5 conditions respectively. The overall mean RMSE
values for testlet sizes 3 and 5 (between 0.01 and 0.02 for every model) are small,
especially when compared with the ability range of -3 to 3. For all three models, the
magnitude of the RMSE estimates is therefore satisfactory.
However, over the entire ability interval (i.e. [-3.0, 3.0]), the bias of all three models
is relatively high in some conditions. In general, no obvious association was found
between testlet size and the magnitude of the bias estimates. Nor was any association
found between the independent/testlet item ratio within a test (i.e. the number of
testlets) and the bias estimates.
To show how bias and RMSE change as a function of ability, the ability range was split
into 6 intervals and the bias and RMSE estimates were calculated for each interval.
Tables 4-11 to 4-16 display the mean bias of the ability ($\theta$) estimates (i.e. EAP
estimates) in the 6 ability intervals for the three models over all 72 conditions.
According to these tables, relatively large positive bias was observed in the lowest
ability interval ($\theta$ below $-2.0$) for all three models across almost all conditions.
Likewise, relatively large negative bias was found in the highest ability interval
($\theta$ above $2.0$) for all three models (i.e. standard Rasch model, partial credit
model, Rasch testlet model) across almost all conditions. Because EAP estimation tends to
pull ability estimates toward the mean of the ability distribution, the use of EAP
estimates is a possible cause of the large bias at both ends of the ability range. Apart
from this pattern, no obvious patterns or associations between the mean bias and the
major factors in this study were found for any of the three models.
In addition, Tables 4-17 to 4-22 display the RMSE of the ability ($\theta$) estimates in
the 6 ability intervals for the three models (i.e. standard Rasch model, partial credit
model, Rasch testlet model) over all 72 conditions. As with the bias estimates, apart from
the relatively large RMSE at both ends of the ability range, no obvious patterns or
associations between the RMSE estimates and the major factors in this study were found
for any of the three models.
In sum, all three models (i.e. standard Rasch model, partial credit model, Rasch testlet
model) recovered the ability estimates fairly well, as indicated by the relatively low
bias and RMSE estimates in these results.
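The interval-wise summaries described above can be sketched as follows. This is only an illustration under stated assumptions: bias is taken as the mean of (estimate minus true value) and RMSE as the square root of the mean squared error, which are the textbook definitions and may differ from the exact formulas defined earlier in this thesis; the function name, the bin edges, and the toy data are hypothetical. The toy estimates are shrunk 10% toward 0 to mimic the pull toward the mean that EAP estimation can produce.

```python
import math


def bias_rmse_by_interval(true_theta, est_theta, edges=(-3, -2, -1, 0, 1, 2, 3)):
    """Return [(lower, upper, bias, rmse), ...] for each ability interval.

    bias = mean(est - true) and rmse = sqrt(mean((est - true)**2)); these
    textbook definitions are an assumption about the thesis's formulas.
    """
    results = []
    last = len(edges) - 2
    for k in range(len(edges) - 1):
        lo, hi = edges[k], edges[k + 1]
        # Half-open bins [lo, hi); the final bin also includes its upper edge.
        errors = [e - t for t, e in zip(true_theta, est_theta)
                  if lo <= t < hi or (k == last and t == hi)]
        if not errors:
            results.append((lo, hi, float("nan"), float("nan")))
            continue
        bias = sum(errors) / len(errors)
        rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
        results.append((lo, hi, bias, rmse))
    return results


# Toy data: estimates shrunk 10% toward a prior mean of 0.
true_theta = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
est_theta = [0.9 * t for t in true_theta]
for lo, hi, bias, rmse in bias_rmse_by_interval(true_theta, est_theta):
    print(f"{lo} to {hi}: bias = {bias:+.2f}, rmse = {rmse:.2f}")
```

With shrinkage of this kind the bias is positive in the lowest interval and negative in the highest, and both bias and RMSE are largest at the two ends, which is the qualitative pattern reported for all three models.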
4.5 An Empirical Case
The National Board of Osteopathic Medical Examiners (NBOME) offers computer-based
COMLEX-USA exams online. This computer-based exam series is designed to assess the
osteopathic medical knowledge and clinical skills considered essential for osteopathic
generalist physicians to practice medicine without supervision. COMLEX-USA exam
responses have been analyzed with the standard Rasch IRT model. The 2008 NBOME
COMLEX-USA Level-2 exam data are used as an empirical case for this study. The
COMLEX-USA Level-2 exam consists of 350 items in 7 blocks, including 141 independent
items and 209 testlet items grouped in 95 testlets (all testlet sizes are between 2 and
4 items). Each item is classified by type (i.e. A: single item; D: single item with
graph; B: matching item; S: testlet item; F: testlet item with graph). The B, S, and F
type items are categorized as testlet items. Among the 95 testlets, there are 4 testlets
with matching items and 9 testlets with a graph. A total of 450 examinees were included
in the examinee population, and there are no missing data. The data from the first block
of this exam (Block-1) are used for this study. Block-1 contains 50 items, including 27
independent items and 23 testlet items within 10 testlets.
The data set was analyzed separately using the standard Rasch model, the partial credit
model, and the Rasch testlet model. Table 4-23 lists the weighted mean square errors
(WMSE) of the 50 items for the three models. In the output of the Rasch testlet model,
the WMSE ranges from 0.86 to 1.19 (M = 1.04, SD = 0.06). The two items with the most
extreme WMSE in the Rasch testlet model are item 47 (WMSE = 0.86) and item 5
(WMSE = 1.19), both with a non-significant p-value (i.e. an item with WMSE > 1.96 is
treated as an item with a bad fit). That is, the item fit for all 50 items in this block
is acceptable. In the output of the partial credit model, the WMSE ranges from 0.97 to
1.07 (M = 1.01, SD = 0.02). The two items with the most extreme WMSE in the partial
credit model are item 40 (WMSE = 0.97) and item 35 (WMSE = 1.07). In the output of the
standard Rasch model, the WMSE ranges from 0.97 to 1.18 (M = 1.01, SD = 0.03). The two
items with the most extreme WMSE in the standard Rasch model are item 8 (WMSE = 0.97)
and item 37 (WMSE = 1.07). The item WMSE estimates from all three models indicate that
these 50 items fit acceptably and should therefore be retained in the test.
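For readers unfamiliar with weighted fit statistics, the following sketch shows one common way an information-weighted mean square is computed for a single dichotomous Rasch item. It is an assumption that the WMSE reported here follows this general form; the estimation software used in the study may compute it differently (for example for polytomous testlet scores), and the function names are hypothetical.

```python
import math


def rasch_prob(theta: float, b: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def weighted_mean_square(responses, thetas, b):
    """Information-weighted mean square for one item.

    Sum of squared residuals (x - P) divided by the sum of the model
    variances P(1 - P); values near 1 indicate that the observed
    variation matches what the model expects.
    """
    squared_residuals = 0.0
    model_variance = 0.0
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, b)
        squared_residuals += (x - p) ** 2
        model_variance += p * (1.0 - p)
    return squared_residuals / model_variance


# Four examinees located exactly at the item difficulty: every P equals 0.5,
# so the statistic equals 1 regardless of the response pattern.
print(weighted_mean_square([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0], b=0.0))
```

Values well above 1 point to more unexplained variation than the model predicts, and values well below 1 to responses that are more predictable than expected.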
The estimates of test reliability for the overarching latent trait are 0.899 for the
Rasch testlet model, 0.909 for the partial credit model, and 0.936 for the standard
Rasch model. Thus, the standard Rasch model appears to overestimate the test
reliability because it ignores the local item dependence within testlets.
The Spearman-Brown prophecy formula,
$N = \dfrac{\rho^{*}_{xx'}\,(1-\rho_{xx'})}{\rho_{xx'}\,(1-\rho^{*}_{xx'})}$, where
$\rho_{xx'}$ is the reliability obtained under the Rasch testlet model or the partial
credit model and $\rho^{*}_{xx'}$ is the target reliability, is used to compute how much
the test length would have to increase for the Rasch testlet model and the partial credit
model to achieve the standard Rasch model's overestimated test reliability (0.936). For
the Rasch testlet model, the test length would have to be increased by approximately
63.65% (32 items) to achieve the overestimated test reliability. For the partial credit
model, the test length would have to be increased by approximately 47.22% (24 items) to
achieve the overestimated test reliability.
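As a rough check of these percentages, the prophecy formula (lengthening factor = target reliability times one minus current reliability, divided by current reliability times one minus target reliability) can be evaluated with the rounded reliabilities reported above. The subscripts below are labels introduced only for this sketch. Because the inputs are rounded to three decimals, the results are close to, but not identical with, the reported 63.65% and 47.22%.

```latex
\begin{align*}
N_{\text{testlet}} &= \frac{0.936\,(1-0.899)}{0.899\,(1-0.936)}
                    = \frac{0.094536}{0.057536} \approx 1.643,\\[4pt]
N_{\text{PCM}}     &= \frac{0.936\,(1-0.909)}{0.909\,(1-0.936)}
                    = \frac{0.085176}{0.058176} \approx 1.464.
\end{align*}
```

A factor of about 1.64 applied to the 50-item block corresponds to roughly 32 additional items, and a factor of about 1.46 to roughly 23 or 24 additional items, consistent with the item counts given in the text.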
The NBOME COMLEX-USA exam has previously been analyzed for local item dependence using
the Q3 statistic (Shen & Yen, 1997). In this study, local item dependence was instead
detected using the 'NPtest' function in the 'eRm' package in R (Mair & Hatzinger, 2007).
'NPtest' performs the nonparametric Rasch model tests proposed by Ponocny (2001). The
method used here is "T1", which checks for local dependence via increased inter-item
correlations: for each item pair, the cases with equal responses on both items are
counted. A significance level of 0.01 was applied for the 'NPtest' T1 test in this study.
In this 50-item test block there are 1,225 possible item pairs, of which 27 were detected
to have significant local item dependence. Results are provided in Table 4-24. For items
within the same testlet, local item dependence is evident (e.g., item pairs 28-30, 28-31,
47-48, etc.). Fourteen of the twenty-seven item pairs (51.85%) showing local item
dependence consist of items within testlets.
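The counting step behind the T1 method, and the pair counts quoted above, can be sketched in a few lines. This is a simplified illustration, not the 'eRm' implementation: it computes only the observed statistic for each item pair, whereas the p-value in the nonparametric test comes from comparing that statistic with its distribution over sampled response matrices, which is not reproduced here. The response matrix and the function name are hypothetical.

```python
from itertools import combinations
from math import comb


def t1_statistic(data, i, j):
    """Number of examinees giving the same response to items i and j."""
    return sum(1 for row in data if row[i] == row[j])


# Hypothetical response matrix: 5 examinees (rows) by 3 items (columns).
data = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
]
observed = {pair: t1_statistic(data, *pair) for pair in combinations(range(3), 2)}
print(observed)

print(comb(50, 2))              # number of distinct item pairs in a 50-item block
print(round(14 / 27 * 100, 2))  # percentage of flagged pairs that lie within testlets
```

An unusually high count of equal responses for a pair, relative to what the Rasch model allows given the row and column totals, is what flags that pair as locally dependent.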
An overall test statistic (T11) for local dependence across the whole test is also
available among the nonparametric Rasch model tests (Ponocny, 2001). The global test of
local dependence for the entire test block was conducted via the option "T11", which sums
the absolute deviations of all inter-item correlations $r_{ij}$ in the test from their
expected values $\rho_{ij}$. The one-sided p-value for this 50-item block is 0.371
(significance level 0.05), which indicates that the global test of local dependence is
non-significant and that this test (i.e. NBOME COMLEX-USA Level-2 exam Block-1)
satisfies the local independence assumption for the entire test block.
In sum, the partial credit model and the Rasch testlet model are the better model choices
for analyzing NBOME COMLEX exams. The data from NBOME COMLEX-USA Level-2 exam Block-1
are better modeled using the PCM and the Rasch testlet model than the standard Rasch
model. In addition, the test reliability discrepancy between the PCM and the Rasch
testlet model for the NBOME COMLEX data is about 0.01, whereas the discrepancy between
the standard Rasch model and the other two models is roughly 0.03 to 0.04. This result
also supports the conclusion that the PCM and the Rasch testlet model are the better
model choices for analyzing NBOME COMLEX exams.
Table 4-1. MLE nonconvergence cases and rate per condition (testlet size 3)
Condition  Sample size  Testlet No.  Local effect  Nonconvergence cases  Percentage
37         1000         15           0.25          8469                  8.47%
38         1000         15           0.5           9701                  9.70%
39         1000         15           0.75          8010                  8.01%
40         1000         15           1             9955                  9.96%
41         1000         10           0.25          1216                  1.22%
42         1000         10           0.5           1007                  1.01%
43         1000         10           0.75          526                   0.53%
44         1000         10           1             680                   0.68%
45         1000         5            0.25          45                    0.05%
46         1000         5            0.5           131                   0.13%
47         1000         5            0.75          29                    0.03%
48         1000         5            1             99                    0.10%
Table 4-2. MLE nonconvergence cases and rate per condition (testlet size 5)
Condition  Sample size  Testlet No.  Local effect  Nonconvergence cases  Percentage
1          1000         9            0.25          3002                  3.00%
2          1000         9            0.5           3658                  3.66%
3          1000         9            0.75          2406                  2.41%
4          1000         9            1             3016                  3.02%
5          1000         6            0.25          289                   0.29%
6          1000         6            0.5           551                   0.55%
7          1000         6            0.75          262                   0.26%
8          1000         6            1             376                   0.38%
9          1000         3            0.25          26                    0.03%
10         1000         3            0.5           38                    0.04%
11         1000         3            0.75          40                    0.04%
12         1000         3            1             42                    0.04%
Table 4-3. Test reliability (testlet size 3 conditions)
Condition  Sample size  Testlet No.  Local effect  Testlet model  Partial Credit  Standard Rasch
1          1000         15           0.25          0.90631        0.90382         0.97843
2          1000         15           0.5           0.90521        0.90474         0.97785
3          1000         15           0.75          0.89974        0.90513         0.97649
4          1000         15           1             0.90663        0.90466         0.97828
5          1000         10           0.25          0.90406        0.90360         0.97821
6          1000         10           0.5           0.90201        0.90441         0.97772
7          1000         10           0.75          0.89167        0.90630         0.97463
8          1000         10           1             0.90059        0.90464         0.97725
9          1000         5            0.25          0.90274        0.90359         0.97809
10         1000         5            0.5           0.89987        0.90364         0.97740
11         1000         5            0.75          0.90053        0.90388         0.97756
12         1000         5            1             0.89820        0.90425         0.97692
13         500          15           0.25          0.89851        0.90440         0.95188
14         500          15           0.5           0.91206        0.90303         0.95908
15         500          15           0.75          0.89450        0.90663         0.94967
16         500          15           1             0.89666        0.90578         0.95095
17         500          10           0.25          0.90822        0.90235         0.95873
18         500          10           0.5           0.90155        0.90453         0.95481
19         500          10           0.75          0.89423        0.90583         0.95092
20         500          10           1             0.89932        0.90444         0.95408
21         500          5            0.25          0.90015        0.90488         0.95395
22         500          5            0.5           0.88961        0.90540         0.94962
23         500          5            0.75          0.90373        0.90395         0.95599
24         500          5            1             0.89208        0.90526         0.95064
25         250          15           0.25          0.91113        0.90304         0.91925
26         250          15           0.5           0.89811        0.90557         0.91223
27         250          15           0.75          0.90328        0.90645         0.90647
28         250          15           1             0.89212        0.90576         0.90664
29         250          10           0.25          0.89754        0.90427         0.90634
30         250          10           0.5           0.89718        0.90453         0.90629
31         250          10           0.75          0.89223        0.90594         0.90972
32         250          10           1             0.90038        0.90551         0.90789
33         250          5            0.25          0.90464        0.90451         0.91279
34         250          5            0.5           0.89704        0.90466         0.90720
35         250          5            0.75          0.89793        0.90554         0.90664
36         250          5            1             0.90469        0.90347         0.91438
Table 4-4. Test reliability (testlet size 5 conditions)
Condition  Sample size  Testlet No.  Local effect  Testlet model  Partial Credit  Standard Rasch
1          1000         9            0.25          0.90040        0.89809         0.97711
2          1000         9            0.5           0.90295        0.89787         0.97770
3          1000         9            0.75          0.90271        0.89796         0.97708
4          1000         9            1             0.90099        0.89850         0.97669
5          1000         6            0.25          0.89223        0.89873         0.97540
6          1000         6            0.5           0.90099        0.89812         0.97761
7          1000         6            0.75          0.88920        0.90030         0.97430
8          1000         6            1             0.90349        0.89769         0.97830
9          1000         3            0.25          0.90076        0.89734         0.97780
10         1000         3            0.5           0.89936        0.89840         0.97718
11         1000         3            0.75          0.89962        0.89817         0.97735
12         1000         3            1             0.89177        0.89918         0.97513
13         500          9            0.25          0.89885        0.89767         0.95250
14         500          9            0.5           0.89788        0.89884         0.95185
15         500          9            0.75          0.89535        0.89899         0.95086
16         500          9            1             0.90691        0.89752         0.95719
17         500          6            0.25          0.90195        0.89698         0.95603
18         500          6            0.5           0.89905        0.89893         0.95362
19         500          6            0.75          0.90251        0.89752         0.95585
20         500          6            1             0.90697        0.89763         0.95755
21         500          3            0.25          0.89304        0.89841         0.95112
22         500          3            0.5           0.90059        0.89817         0.95487
23         500          3            0.75          0.89500        0.89832         0.95236
24         500          3            1             0.90214        0.89692         0.95632
25         250          9            0.25          0.90941        0.89648         0.91701
26         250          9            0.5           0.89676        0.89995         0.90275
27         250          9            0.75          0.90618        0.89721         0.91367
28         250          9            1             0.90682        0.89774         0.91279
29         250          6            0.25          0.91078        0.89575         0.92096
30         250          6            0.5           0.90108        0.89767         0.91065
31         250          6            0.75          0.90156        0.89722         0.91182
32         250          6            1             0.88919        0.89977         0.90674
33         250          3            0.25          0.90417        0.89557         0.92314
34         250          3            0.5           0.89564        0.89826         0.90485
35         250          3            0.75          0.88549        0.90013         0.90393
36         250          3            1             0.89959        0.89755         0.90990
Table 4-5. Results of the Spearman-Brown prophecy (testlet size 3)
Condition  Sample size  Testlet No.  Local effect  Spearman-Brown (Testlet)  Spearman-Brown (Partial Credit)
1          1000         15           0.25          4.689                     4.827
2          1000         15           0.5           4.623                     4.648
3          1000         15           0.75          4.628                     4.353
4          1000         15           1             4.639                     4.747
5          1000         10           0.25          4.764                     4.789
6          1000         10           0.5           4.767                     4.638
7          1000         10           0.75          4.667                     3.972
8          1000         10           1             4.742                     4.528
9          1000         5            0.25          4.810                     4.763
10         1000         5            0.5           4.812                     4.612
11         1000         5            0.75          4.812                     4.633
12         1000         5            1             4.797                     4.482
13         500          15           0.25          2.234                     2.091
14         500          15           0.5           2.260                     2.517
15         500          15           0.75          2.225                     1.943
16         500          15           1             2.234                     2.017
17         500          10           0.25          2.348                     2.514
18         500          10           0.5           2.307                     2.230
19         500          10           0.75          2.292                     2.014
20         500          10           1             2.326                     2.195
21         500          5            0.25          2.298                     2.178
22         500          5            0.5           2.339                     1.969
23         500          5            0.75          2.314                     2.308
24         500          5            1             2.330                     2.016
25         250          15           0.25          1.110                     1.222
26         250          15           0.5           1.179                     1.084
27         250          15           0.75          1.038                     1.000
28         250          15           1             1.174                     1.010
29         250          10           0.25          1.105                     1.024
30         250          10           0.5           1.108                     1.021
31         250          10           0.75          1.217                     1.046
32         250          10           1             1.091                     1.029
33         250          5            0.25          1.103                     1.105
34         250          5            0.5           1.122                     1.030
35         250          5            0.75          1.104                     1.013
36         250          5            1             1.125                     1.141
Table 4-6. Results of the Spearman-Brown prophecy (testlet size 5)
Condition  Sample size  Testlet No.  Local effect  Spearman-Brown (Testlet)  Spearman-Brown (Partial Credit)
1          1000         9            0.25          4.722                     4.844
2          1000         9            0.5           4.712                     4.987
3          1000         9            0.75          4.594                     4.844
4          1000         9            1             4.604                     4.733
5          1000         6            0.25          4.789                     4.468
6          1000         6            0.5           4.798                     4.953
7          1000         6            0.75          4.724                     4.198
8          1000         6            1             4.816                     5.138
9          1000         3            0.25          4.853                     5.039
10         1000         3            0.5           4.792                     4.843
11         1000         3            0.75          4.815                     4.892
12         1000         3            1             4.759                     4.396
13         500          9            0.25          2.257                     2.286
14         500          9            0.5           2.248                     2.225
15         500          9            0.75          2.262                     2.174
16         500          9            1             2.295                     2.553
17         500          6            0.25          2.364                     2.497
18         500          6            0.5           2.309                     2.312
19         500          6            0.75          2.339                     2.472
20         500          6            1             2.314                     2.573
21         500          3            0.25          2.331                     2.200
22         500          3            0.5           2.336                     2.399
23         500          3            0.75          2.345                     2.263
24         500          3            1             2.375                     2.516
25         250          9            0.25          1.101                     1.276
26         250          9            0.5           1.069                     1.032
27         250          9            0.75          1.096                     1.213
28         250          9            1             1.075                     1.192
29         250          6            0.25          1.141                     1.356
30         250          6            0.5           1.119                     1.162
31         250          6            0.75          1.129                     1.185
32         250          6            1             1.212                     1.083
33         250          3            0.25          1.273                     1.401
34         250          3            0.5           1.108                     1.077
35         250          3            0.75          1.217                     1.044
36         250          3            1             1.127                     1.153
Table 4-7. Mean standard error of measurement for each condition (testlet size 3)
Condition  Sample size  Testlet No.  Local effect  Testlet model  Partial Credit  Standard Rasch
37         1000         15           0.25          0.300381       0.305033        0.289589
38         1000         15           0.5           0.296793       0.303545        0.288309
39         1000         15           0.75          0.295281       0.302921        0.287677
40         1000         15           1             0.291839       0.303677        0.288298
41         1000         10           0.25          0.301639       0.305362        0.289899
42         1000         10           0.5           0.299473       0.304075        0.288841
43         1000         10           0.75          0.298098       0.301055        0.285724
44         1000         10           1             0.296793       0.303715        0.288325
45         1000         5            0.25          0.303638       0.305379        0.290094
46         1000         5            0.5           0.303610       0.305306        0.289880
47         1000         5            0.75          0.302271       0.304918        0.289561
48         1000         5            1             0.301956       0.304322        0.288957
49         500          15           0.25          0.301526       0.304080        0.288464
50         500          15           0.5           0.297063       0.306254        0.290488
51         500          15           0.75          0.293822       0.300528        0.285145
52         500          15           1             0.293095       0.301892        0.286377
53         500          10           0.25          0.302328       0.307330        0.291530
54         500          10           0.5           0.299370       0.303877        0.288227
55         500          10           0.75          0.297841       0.301809        0.286284
56         500          10           1             0.297281       0.304030        0.288398
57         500          5            0.25          0.302365       0.303326        0.287746
58         500          5            0.5           0.303011       0.302495        0.286941
59         500          5            0.75          0.301661       0.304797        0.289131
60         500          5            1             0.301875       0.302717        0.287157
61         250          15           0.25          0.301321       0.306252        0.291301
62         250          15           0.5           0.297011       0.302224        0.286611
63         250          15           0.75          0.293531       0.300802        0.285815
64         250          15           1             0.294556       0.301925        0.286726
65         250          10           0.25          0.302382       0.304287        0.288809
66         250          10           0.5           0.300390       0.303883        0.288478
67         250          10           0.75          0.298503       0.301642        0.286330
68         250          10           1             0.295590       0.302319        0.286844
69         250          5            0.25          0.301868       0.303920        0.288364
70         250          5            0.5           0.302962       0.303675        0.288801
71         250          5            0.75          0.300649       0.302262        0.287081
72         250          5            1             0.301612       0.305566        0.290187
Table 4-8. Mean standard error of measurement for each condition (testlet size 5)
Condition  Sample size  Testlet No.  Local effect  Testlet model  Partial Credit  Standard Rasch
1          1000         9            0.25          0.299785       0.304136        0.275109
2          1000         9            0.5           0.295323       0.304440        0.275416
3          1000         9            0.75          0.292322       0.304307        0.275406
4          1000         9            1             0.289530       0.303489        0.274878
5          1000         6            0.25          0.302390       0.303154        0.274489
6          1000         6            0.5           0.298983       0.304078        0.275321
7          1000         6            0.75          0.297526       0.300803        0.272350
8          1000         6            1             0.296710       0.304706        0.275876
9          1000         3            0.25          0.303190       0.305241        0.276391
10         1000         3            0.5           0.301505       0.303647        0.274911
11         1000         3            0.75          0.301083       0.303993        0.275250
12         1000         3            1             0.300625       0.302478        0.273795
13         500          9            0.25          0.300613       0.304737        0.275479
14         500          9            0.5           0.295684       0.303001        0.273933
15         500          9            0.75          0.292964       0.302767        0.273778
16         500          9            1             0.288456       0.304967        0.275759
17         500          6            0.25          0.302246       0.305767        0.276519
18         500          6            0.5           0.298544       0.302856        0.273895
19         500          6            0.75          0.297919       0.304972        0.275769
20         500          6            1             0.294441       0.304801        0.275576
21         500          3            0.25          0.303535       0.303637        0.274653
22         500          3            0.5           0.300839       0.303986        0.274908
23         500          3            0.75          0.301813       0.303762        0.274750
24         500          3            1             0.300681       0.305853        0.276625
25         250          9            0.25          0.299585       0.306513        0.277211
26         250          9            0.5           0.294056       0.301339        0.272416
27         250          9            0.75          0.292320       0.305417        0.276170
28         250          9            1             0.288759       0.304640        0.275400
29         250          6            0.25          0.302050       0.307584        0.278153
30         250          6            0.5           0.299601       0.304732        0.275591
31         250          6            0.75          0.297984       0.305409        0.276203
32         250          6            1             0.296080       0.301589        0.272642
33         250          3            0.25          0.302384       0.307852        0.278436
34         250          3            0.5           0.301966       0.303866        0.274839
35         250          3            0.75          0.301620       0.301048        0.272269
36         250          3            1             0.300667       0.304903        0.275727
Table 4-9. Bias and RMSE of ability estimate recovery (EAP), testlet size 3
                                                   Testlet model         Partial credit model  Standard Rasch model
Condition  Sample size  Testlet No.  Local effect  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE
37         1000         15           0.25          0.16783    0.01002    0.09522    0.00897    0.14925    0.01012
38         1000         15           0.5           0.04346    0.00711    0.03334    0.00690    0.02660    0.00687
39         1000         15           0.75          0.09725    0.00713    0.05011    0.00631    0.02341    0.00665
40         1000         15           1             -0.09176   0.00853    -0.17260   0.00970    -0.13382   0.00917
41         1000         10           0.25          0.06458    0.00752    0.15846    0.00704    0.13656    0.00695
42         1000         10           0.5           0.10475    0.00796    0.10288    0.00792    0.11166    0.00792
43         1000         10           0.75          -0.14755   0.00626    -0.00742   0.00691    0.08861    0.00867
44         1000         10           1             -0.00417   0.01003    0.02655    0.01034    -0.01398   0.01000
45         1000         5            0.25          0.11195    0.00713    0.11097    0.00700    0.10368    0.00724
46         1000         5            0.5           0.11073    0.00726    0.06216    0.00687    0.03382    0.00674
47         1000         5            0.75          -0.07001   0.00769    -0.10917   0.00841    -0.15884   0.00922
48         1000         5            1             -0.09311   0.00689    -0.13323   0.00728    -0.18511   0.00840
49         500          15           0.25          0.11591    0.01240    0.07243    0.01268    0.14060    0.01139
50         500          15           0.5           0.19790    0.01497    0.06774    0.01321    0.16854    0.01560
51         500          15           0.75          -0.07199   0.01209    -0.01459   0.01071    -0.07251   0.01139
52         500          15           1             -0.13480   0.01271    -0.07473   0.01080    -0.05624   0.01054
53         500          10           0.25          0.14430    0.01326    0.10452    0.01271    0.17322    0.01378
54         500          10           0.5           0.01525    0.01222    -0.05002   0.01292    0.10895    0.00991
55         500          10           0.75          -0.04809   0.00981    -0.01907   0.00944    -0.02956   0.00953
56         500          10           1             0.03209    0.01093    -0.03091   0.01121    -0.15904   0.01326
57         500          5            0.25          -0.16260   0.01096    -0.19737   0.01107    -0.22554   0.01164
58         500          5            0.5           0.18784    0.01273    0.22612    0.01345    0.24754    0.01406
59         500          5            0.75          0.13105    0.01311    0.09802    0.01204    0.24434    0.01617
60         500          5            1             0.11728    0.00977    0.03991    0.00996    0.14934    0.00980
61         250          15           0.25          0.03807    0.01502    -0.04629   0.01401    0.03043    0.01421
62         250          15           0.5           0.29854    0.02358    0.24066    0.02088    0.10983    0.01594
63         250          15           0.75          -0.07253   0.01541    -0.05782   0.01348    -0.14003   0.01505
64         250          15           1             -0.03604   0.01275    -0.05762   0.01021    0.04707    0.01062
65         250          10           0.25          0.07542    0.01475    0.00293    0.01605    0.00552    0.01547
66         250          10           0.5           -0.24154   0.02061    -0.13173   0.01512    -0.20653   0.01843
67         250          10           0.75          -0.05263   0.01521    0.02047    0.01530    0.05522    0.01517
68         250          10           1             -0.03395   0.01608    -0.18840   0.02246    -0.25207   0.02576
69         250          5            0.25          0.14591    0.01773    0.15485    0.01743    0.22109    0.01918
70         250          5            0.5           0.25183    0.01889    0.20715    0.01614    0.20095    0.01647
71         250          5            0.75          0.05471    0.01360    -0.02376   0.01360    0.04982    0.01329
72         250          5            1             --         --         --         0.02027    0.23503    0.02058
MAX                                                0.29854    0.02358    0.24066    0.02246    0.24754    0.02576
Overall mean                                       0.02846    0.01181    0.01140    0.01148    0.02249    0.01197
Standard deviation                                 0.12285    0.00420    0.11664    0.00415    0.14342    0.00440
Table 4-10. Bias and RMSE of ability estimate recovery (EAP), testlet size 5
                                                   Testlet model         Partial credit model  Standard Rasch model
Condition  Sample size  Testlet No.  Local effect  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE  mean.Bias  mean.RMSE
1          1000         9            0.25          -0.19164   0.00867    -0.19664   0.00843    -0.15635   0.00793
2          1000         9            0.5           0.15941    0.00980    0.11543    0.00955    0.07367    0.00901
3          1000         9            0.75          -0.11687   0.01159    -0.03946   0.00972    0.09474    0.00840
4          1000         9            1             0.18283    0.00951    0.16801    0.00965    0.13127    0.00905
5          1000         6            0.25          0.00801    0.00638    -0.06287   0.00583    0.03093    0.00670
6          1000         6            0.5           0.50256    0.02672    0.32653    0.01931    0.12354    0.01294
7          1000         6            0.75          0.10712    0.00810    0.15731    0.00737    0.04488    0.00689
8          1000         6            1             -0.00526   0.00877    -0.08249   0.00795    -0.12016   0.00827
9          1000         3            0.25          0.06007    0.00712    0.00125    0.00679    0.00938    0.00665
10         1000         3            0.5           -0.04465   0.00653    -0.05095   0.00658    -0.19904   0.00868
11         1000         3            0.75          0.26770    0.01191    0.20402    0.01028    0.25305    0.01157
12         1000         3            1             0.25933    0.00972    0.14790    0.00791    0.15701    0.00782
13         500          9            0.25          -0.13080   0.01029    -0.22012   0.01225    -0.10559   0.00996
14         500          9            0.5           -0.08486   0.01197    0.01553    0.01106    -0.17483   0.01453
15         500          9            0.75          0.01278    0.01215    -0.06039   0.01137    -0.01830   0.01170
16         500          9            1             -0.23682   0.01767    -0.30371   0.02006    -0.31090   0.02059
17         500          6            0.25          0.05779    0.00887    0.15835    0.01077    0.06598    0.00909
18         500          6            0.5           0.03734    0.01203    0.08598    0.01175    -0.00903   0.01127
19         500          6            0.75          -0.14255   0.01127    -0.17055   0.01082    -0.24755   0.01176
20         500          6            1             -0.23726   0.01445    -0.13707   0.01232    -0.00799   0.01118
21         500          3            0.25          -0.23119   0.01149    -0.14785   0.00943    -0.16435   0.00959
22         500          3            0.5           -0.01187   0.00864    -0.07000   0.00769    -0.15143   0.00828
23         500          3            0.75          -0.18909   0.01173    -0.15529   0.01066    -0.19302   0.01106
24         500          3            1             -0.17958   0.01117    -0.09252   0.00928    -0.07032   0.00904
25         250          9            0.25          -0.03497   0.01264    0.08511    0.01317    -0.13961   0.01437
26         250          9            0.5           0.09268    0.01438    0.04789    0.01380    -0.10150   0.01634
27         250          9            0.75          -0.01476   0.01506    -0.03546   0.01554    0.10886    0.01443
28         250          9            1             -0.38487   0.02121    -0.28872   0.01855    -0.28307   0.01810
29         250          6            0.25          0.05265    0.01544    0.13213    0.01716    0.03681    0.01452
30         250          6            0.5           0.10775    0.01580    0.09121    0.01380    -0.03764   0.01291
31         250          6            0.75          -0.10867   0.01352    -0.19622   0.01602    -0.31813   0.02217
32         250          6            1             0.07402    0.01357    0.10687    0.01220    0.03626    0.01079
33         250          3            0.25          0.28391    0.01869    0.11141    0.01456    0.11499    0.01452
34         250          3            0.5           0.24272    0.10725    0.13467    0.11244    -0.02008   0.12209
35         250          3            0.75          -0.11239   0.01639    -0.04690   0.01527    0.08978    0.01543
36         250          3            1             -0.04974   0.01230    0.03783    0.01222    -0.02647   0.01219
MIN                                                -0.38487   0.00638    -0.30371   0.00583    -0.31813   0.00665
MAX                                                0.50256    0.10725    0.32653    0.11244    0.25305    0.12209
Overall mean                                       0.00144    0.01516    -0.00765   0.01455    -0.04165   0.01479
Standard deviation                                 0.18094    0.01634    0.14899    0.01718    0.14163    0.01878
Table 4-11. Rasch testlet model (testlet size 3): bias of ability ($\theta$) estimate recovery (EAP) in 6 ability intervals
Ability intervals: mean.bias1, $\theta$ below -2.0; mean.bias2, -2.0 to -1.0; mean.bias3, -1.0 to 0.0; mean.bias4, 0.0 to 1.0; mean.bias5, 1.0 to 2.0; mean.bias6, $\theta$ above 2.0
Condition  Sample size  Testlet No.  Local effect  mean.bias1  mean.bias2  mean.bias3  mean.bias4  mean.bias5  mean.bias6
37         1000         15           0.25          0.473167    0.248143    0.183455    0.149797    0.091889    -0.111914
38         1000         15           0.5           0.354534    0.114570    0.052191    0.027920    -0.028258   -0.190016
39         1000         15           0.75          0.370821    0.154818    0.096164    0.085521    0.046400    -0.086360
40         1000         15           1             0.172154    -0.039595   -0.095668   -0.104551   -0.138436   -0.289969
41         1000         10           0.25          0.275666    0.152291    0.094051    0.046317    -0.037948   -0.209459
42         1000         10           0.5           0.321742    0.195670    0.127418    0.087044    0.014702    -0.146453
43         1000         10           0.75          0.005361    -0.072230   -0.123801   -0.164971   -0.230953   -0.376841
44         1000         10           1             0.175261    0.077036    0.011355    -0.021059   -0.070775   -0.163715
45         1000         5            0.25          0.435356    0.205839    0.137949    0.095536    0.012910    -0.141757
46         1000         5            0.5           0.395263    0.187794    0.127652    0.095096    0.028981    -0.206536
47         1000         5            0.75          0.219189    0.015793    -0.053097   -0.089053   -0.145936   -0.299091
48         1000         5            1             0.242773    -0.016552   -0.076447   -0.111252   -0.172157   -0.361526
49         500          15           0.25          0.384715    0.183385    0.137153    0.108710    0.044345    -0.187429
50         500          15           0.5           0.477372    0.267986    0.211695    0.193633    0.131396    -0.150944
51         500          15           0.75          0.237362    -0.021731   -0.067369   -0.082894   -0.132175   -0.317696
52         500          15           1             0.097377    -0.097434   -0.134401   -0.139968   -0.177940   -0.290971
53         500          10           0.25          0.356236    0.228351    0.168639    0.126469    0.055644    -0.135595
54         500          10           0.5           0.212429    0.102587    0.040898    -0.002519   -0.078594   -0.239629
55         500          10           0.75          0.166690    0.043431    -0.030665   -0.070784   -0.135619   -0.289627
56         500          10           1             0.211727    0.102454    0.045510    0.011864    -0.035988   -0.183253
57         500          5            0.25          0.110223    -0.060440   -0.136362   -0.186604   -0.257486   -0.462181
58         500          5            0.5           0.462159    0.251401    0.213998    0.171382    0.075050    -0.151453
59         500          5            0.75          0.358289    0.211521    0.164480    0.120848    0.032428    -0.219213
60         500          5            1             0.444407    0.190858    0.133521    0.100451    0.028398    -0.132833
61         250          15           0.25          0.425444    0.088707    0.038860    0.014501    -0.043279   -0.203645
62         250          15           0.5           0.539062    0.356469    0.319489    0.290043    0.250876    0.020486
63         250          15           0.75          0.216494    -0.002026   -0.075173   -0.093850   -0.143783   -0.231325
64         250          15           1             0.174028    -0.016443   -0.042212   -0.030380   -0.074964   -0.255932
65         250          10           0.25          0.303218    0.178595    0.097837    0.056333    -0.012798   -0.177952
66         250          10           0.5           -0.043971   -0.156407   -0.226762   -0.259869   -0.310529   -0.473968
67         250          10           0.75          0.142809    0.024524    -0.040944   -0.074493   -0.129313   -0.256615
68         250          10           1             0.141291    0.054279    -0.019997   -0.056531   -0.113450   -0.224637
69         250          5            0.25          0.368075    0.246382    0.178723    0.124299    0.040473    -0.265577
70         250          5            0.5           0.555768    0.318019    0.273727    0.238187    0.132023    0.025779
71         250          5            0.75          0.306779    0.136065    0.077814    0.035790    -0.039109   -0.258084
72         250          5            1             0.352490    0.160919    0.116425    0.086989    0.011101    -0.285605
Overall mean                                       0.290049    0.111529    0.053503    0.021610    -0.042024   -0.220320
Standard deviation                                 0.144891    0.124567    0.127104    0.124312    0.118086    0.107127
Table 4-12. Partial credit model (testlet size 3): bias of ability ($\theta$) estimate recovery (EAP) in 6 ability intervals
Ability intervals: mean.bias1, $\theta$ below -2.0; mean.bias2, -2.0 to -1.0; mean.bias3, -1.0 to 0.0; mean.bias4, 0.0 to 1.0; mean.bias5, 1.0 to 2.0; mean.bias6, $\theta$ above 2.0
Condition  Sample size  Testlet No.  Local effect  mean.bias1  mean.bias2  mean.bias3  mean.bias4  mean.bias5  mean.bias6
37         1000         15           0.25          0.336622    0.203823    0.118733    0.070286    0.009691    -0.214690
38         1000         15           0.5           0.315843    0.153878    0.058782    0.002898    -0.070801   -0.259818
39         1000         15           0.75          0.324340    0.176789    0.072556    0.012976    -0.055217   -0.232093
40         1000         15           1             0.079975    -0.054760   -0.153193   -0.210201   -0.273728   -0.474976
41         1000         10           0.25          0.370687    0.259955    0.189054    0.139130    0.056038    -0.192075
42         1000         10           0.5           0.355115    0.221533    0.130672    0.078644    0.000630    -0.233303
43         1000         10           0.75          0.214934    0.116662    0.026076    -0.037418   -0.125863   -0.358937
44         1000         10           1             0.247635    0.149033    0.052803    -0.000008   -0.069264   -0.220998
45         1000         5            0.25          0.381780    0.223662    0.135451    0.088816    0.016893    -0.128480
46         1000         5            0.5           0.313348    0.168042    0.079470    0.038035    -0.023125   -0.252448
47         1000         5            0.75          0.155233    0.008629    -0.087356   -0.136890   -0.195870   -0.349476
48         1000         5            1             0.174679    -0.018619   -0.113778   -0.163059   -0.225781   -0.413138
49         500          15           0.25          0.292574    0.178334    0.107760    0.053391    -0.024299   -0.321069
50         500          15           0.5           0.263480    0.169531    0.095938    0.049898    -0.016077   -0.315868
51         500          15           0.75          0.306953    0.110805    0.016032    -0.053743   -0.136533   -0.383503
52         500          15           1             0.166832    0.041945    -0.043852   -0.108448   -0.185566   -0.335328
53         500          10           0.25          0.290882    0.192752    0.127494    0.087437    0.025865    -0.211482
54         500          10           0.5           0.156712    0.059820    -0.019303   -0.072568   -0.150827   -0.372046
55         500          10           0.75          0.229094    0.111338    0.010573    -0.053234   -0.139269   -0.376540
56         500          10           1             0.181234    0.075932    -0.007133   -0.062984   -0.129794   -0.355068
57         500          5            0.25          0.035895    -0.075835   -0.171228   -0.228269   -0.291249   -0.486819
58         500          5            0.5           0.478965    0.325959    0.257307    0.199297    0.099948    -0.137633
59         500          5            0.75          0.280798    0.197895    0.130120    0.078434    0.003367    -0.224228
60         500          5            1             0.324746    0.154661    0.061032    0.009318    -0.064787   -0.239826
61         250          15           0.25          0.282271    0.032899    -0.039813   -0.077479   -0.134475   -0.290433
62         250          15           0.5           0.472681    0.359737    0.289173    0.218702    0.150106    -0.141069
63         250          15           0.75          0.217801    0.073962    -0.043076   -0.106858   -0.176700   -0.288534
64         250          15           1             0.189103    0.057994    -0.030791   -0.088302   -0.178435   -0.440313
65         250          10           0.25          0.248906    0.128714    0.029408    -0.018945   -0.093579   -0.359625
66         250          10           0.5           0.095283    -0.016983   -0.111733   -0.154756   -0.216678   -0.491812
67         250          10           0.75          0.280561    0.145899    0.043611    -0.017089   -0.092554   -0.342466
68         250          10           1             0.027517    -0.066781   -0.164813   -0.221067   -0.295618   -0.509271
69         250          5            0.25          0.343087    0.267476    0.181118    0.129846    0.061416    -0.212631
70         250          5            0.5           0.474667    0.308877    0.230810    0.184408    0.079458    -0.034207
71         250          5            0.75          0.208783    0.092116    0.002041    -0.052994   -0.126155   -0.350564
72         250          5            1             0.446835    0.315588    0.241987    0.204057    0.137092    -0.138479
Overall mean                                       0.265718    0.134757    0.047276    -0.006076   -0.079215   -0.296923
Standard deviation                                 0.114902    0.112429    0.119329    0.120615    0.117579    0.114145
Table 4-13. Standard Rasch model (testlet size 3): bias of ability ($\theta$) estimate recovery (EAP) in 6 ability intervals
Ability intervals: mean.bias1, $\theta$ below -2.0; mean.bias2, -2.0 to -1.0; mean.bias3, -1.0 to 0.0; mean.bias4, 0.0 to 1.0; mean.bias5, 1.0 to 2.0; mean.bias6, $\theta$ above 2.0
Condition  Sample size  Testlet No.  Local effect  mean.bias1  mean.bias2  mean.bias3  mean.bias4  mean.bias5  mean.bias6
37         1000         15           0.25          0.377349    0.256007    0.174512    0.123467    0.066063    -0.158090
38         1000         15           0.5           0.288100    0.142818    0.053952    -0.003419   -0.074942   -0.265710
39         1000         15           0.75          0.278403    0.145130    0.047520    -0.012507   -0.080783   -0.257701
40         1000         15           1             0.101816    -0.018816   -0.113572   -0.170834   -0.232747   -0.427665
41         1000         10           0.25          0.335641    0.235327    0.168284    0.116694    0.036510    -0.205762
42         1000         10           0.5           0.344568    0.225106    0.140802    0.087759    0.012315    -0.217320
43         1000         10           0.75          0.300918    0.212029    0.123561    0.057606    -0.030350   -0.256467
44         1000         10           1             0.193358    0.105935    0.014537    -0.041603   -0.108161   -0.261854
45         1000         5            0.25          0.358078    0.212635    0.130214    0.080768    0.012205    -0.136275
46         1000         5            0.5           0.270038    0.136912    0.052790    0.009639    -0.051126   -0.274681
47         1000         5            0.75          0.088620    -0.045237   -0.134873   -0.186832   -0.242925   -0.400196
48         1000         5            1             0.106223    -0.073708   -0.163902   -0.214979   -0.276585   -0.462143
49         500          15           0.25          0.367581    0.248949    0.175592    0.120463    0.044052    -0.251761
50         500          15           0.5           0.374032    0.274207    0.197322    0.149253    0.082147    -0.219028
51         500          15           0.75          0.244773    0.050650    -0.042050   -0.110614   -0.193649   -0.444443
52         500          15           1             0.188429    0.061434    -0.025102   -0.090149   -0.168285   -0.320278
53         500          10           0.25          0.364199    0.263265    0.196480    0.155625    0.092914    -0.146044
54         500          10           0.5           0.322251    0.221667    0.140289    0.085397    0.005789    -0.217023
55         500          10           0.75          0.226423    0.103118    0.000191    -0.064436   -0.151322   -0.390154
56         500          10           1             0.056398    -0.050612   -0.134877   -0.191533   -0.259697   -0.487368
57         500          5            0.25          0.010390    -0.103156   -0.199253   -0.256550   -0.320193   -0.517764
58         500          5            0.5           0.503030    0.348689    0.279222    0.220309    0.119510    -0.119608
59         500          5            0.75          0.434021    0.346219    0.276517    0.224106    0.148551    -0.080009
60         500          5            1             0.439148    0.265474    0.170702    0.118328    0.043324    -0.133062
61         250          15           0.25          0.333248    0.103105    0.038560    0.001722    -0.053561   -0.214062
62         250          15           0.5           0.334339    0.229383    0.161915    0.085442    0.019557    -0.273484
63         250          15           0.75          0.115012    -0.012460   -0.124758   -0.185956   -0.258118   -0.375915
64         250          15           1             0.276731    0.158015    0.074835    0.017737    -0.073405   -0.329492
65         250          10           0.25          0.237986    0.128453    0.034612    -0.017241   -0.092265   -0.345118
66         250          10           0.5           0.012850    -0.090193   -0.183208   -0.231655   -0.295516   -0.564457
67         250          10           0.75          0.303445    0.177303    0.079462    0.018418    -0.060361   -0.294325
68         250          10           1             -0.045749   -0.130230   -0.226555   -0.285692   -0.361091   -0.571902
69         250          5            0.25          0.401088    0.333323    0.249247    0.194643    0.126869    -0.137617
70         250          5            0.5           0.446812    0.297214    0.225627    0.179150    0.076715    -0.035870
71         250          5            0.75          0.266258    0.162389    0.077149    0.020855    -0.051654   -0.275809
72         250          5            1             0.447285    0.327455    0.257638    0.216147    0.151092    -0.120725
Overall mean                                       0.269530    0.145772    0.060927    0.006098    -0.066642   -0.283033
Standard deviation                                 0.138194    0.138171    0.145588    0.147425    0.144426    0.137488
Table 4-14. Rasch testlet model (testlet size 5): bias of ability ($\theta$) estimate recovery (EAP) in 6 ability intervals
Ability intervals: mean.bias1, $\theta$ below -2.0; mean.bias2, -2.0 to -1.0; mean.bias3, -1.0 to 0.0; mean.bias4, 0.0 to 1.0; mean.bias5, 1.0 to 2.0; mean.bias6, $\theta$ above 2.0
Condition  Sample size  Testlet No.  Local effect  mean.bias1  mean.bias2  mean.bias3  mean.bias4  mean.bias5  mean.bias6
1          1000         9            0.25          0.094350    -0.095255   -0.174461   -0.213346   -0.281807   -0.428692
2          1000         9            0.5           0.436489    0.237122    0.179649    0.147421    0.082498    -0.098116
3          1000         9            0.75          0.122463    -0.050085   -0.104181   -0.127667   -0.196337   -0.321018
4          1000         9            1             0.442925    0.250876    0.195557    0.171773    0.105662    -0.004987
5          1000         6            0.25          0.379519    0.091610    0.030039    -0.010922   -0.073238   -0.322554
6          1000         6            0.5           0.806265    0.587031    0.530736    0.488181    0.419653    0.115971
7          1000         6            0.75          0.421201    0.186316    0.125033    0.086932    0.021690    -0.247222
8          1000         6            1             0.357342    0.082943    0.013761    -0.023014   -0.081773   -0.347509
9          1000         3            0.25          0.362618    0.155299    0.084661    0.030937    -0.039237   -0.213096
10         1000         3            0.5           0.312980    0.065283    -0.015409   -0.071428   -0.131333   -0.272589
11         1000         3            0.75          0.560732    0.350580    0.290498    0.234278    0.171695    0.001140
12         1000         3            1             0.551760    0.339561    0.285196    0.233684    0.176530    -0.038358
13         500          9            0.25          0.145720    -0.033425   -0.114898   -0.143125   -0.207488   -0.400981
14         500          9            0.5           0.195187    -0.009050   -0.066950   -0.102276   -0.156287   -0.310182
15         500          9            0.75          0.315093    0.077165    0.023492    0.001618    -0.062240   -0.183802
16         500          9            1             0.084424    -0.159926   -0.233283   -0.259885   -0.321462   -0.424321
17         500          6            0.25          0.456119    0.138420    0.074700    0.034685    -0.039098   -0.245486
18         500          6            0.5           0.290231    0.128894    0.062876    0.020589    -0.044900   -0.247926
19         500          6            0.75          0.171234    -0.057490   -0.132013   -0.164068   -0.217203   -0.436617
20         500          6            1             0.150614    -0.149303   -0.226614   -0.265776   -0.326516   -0.478286
21         500          3            0.25          0.174309    -0.132377   -0.207516   -0.258158   -0.313038   -0.455341
22         500          3            0.5           0.255474    0.088922    0.017856    -0.042086   -0.097710   -0.226965
23         500          3            0.75          0.094036    -0.091653   -0.173193   -0.223266   -0.274488   -0.389311
24         500          3            1             0.148404    -0.087851   -0.155808   -0.202989   -0.257186   -0.429154
25         250          9            0.25          0.223756    0.047142    -0.020527   -0.060633   -0.128812   -0.269732
26         250          9            0.5           0.428441    0.184559    0.111482    0.065017    -0.005950   -0.094928
27         250          9            0.75          0.237617    0.041268    0.002304    -0.029471   -0.083137   -0.206376
28         250          9            1             -0.121960   -0.317932   -0.390940   -0.413389   -0.458683   -0.459374
29         250          6            0.25          0.309336    0.135764    0.074997    0.035037    -0.040391   -0.297998
30         250          6            0.5           0.380690    0.168035    0.126960    0.096969    0.033264    -0.191007
31
0.75 0.253362 -0.020784 -0.083982 -0.115322 -0.160088 -0.540590 32
1 0.314140 0.133842 0.093918 0.053564 0.002582 -0.247434
33
3 0.25 0.576460 0.384467 0.313977 0.263214 0.192252 0.044460 34
0.5 3.193850 1.675706 0.751623 -0.237276 -1.192136 -1.719127
35
0.75 0.117816 -0.015918 -0.074050 -0.135772 -0.192788 -0.400226
Table 4-14. Continued
Columns: condition, sample size, number of testlets, local effect, then mean.bias1 to mean.bias6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
36 1 0.198452 0.029064 -0.021486 -0.078501 -0.140366 -0.262858
Overall mean 0.373374 0.121078 0.033167 -0.033735 -0.119941 -0.306961
Standard Deviation 0.514662 0.318963 0.218118 0.181103 0.253235 0.289446
Table 4-15. Partial credit model (testlet size 5): bias of ability (θ) estimate recovery (EAP) with 6 different ability intervals
Columns: condition, sample size, number of testlets, local effect, then mean.bias1 to mean.bias6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
1 1000 9 0.25 0.057694 -0.074844 -0.173165 -0.226777 -0.293046 -0.493063 2
0.5 0.357073 0.227225 0.146618 0.092851 0.026046 -0.211425
3
0.75 0.183253 0.067466 -0.015145 -0.064679 -0.142554 -0.330352 4
1 0.414649 0.288070 0.197109 0.140516 0.057407 -0.106933
5
6 0.25 0.254284 0.056994 -0.037817 -0.089405 -0.166245 -0.386558 6
0.5 0.566313 0.442467 0.356654 0.305465 0.225716 -0.009284
7
0.75 0.445097 0.286063 0.183949 0.122392 0.034249 -0.192302 8
1 0.179819 0.039463 -0.057599 -0.109310 -0.178169 -0.357742
9
3 0.25 0.255866 0.110652 0.021129 -0.021659 -0.100628 -0.311067 10
0.5 0.255503 0.078655 -0.022398 -0.072849 -0.147754 -0.331184
11
0.75 0.465425 0.307731 0.224380 0.172734 0.095564 -0.136830 12
1 0.424246 0.261036 0.176704 0.121079 0.041691 -0.259527
13 500 9 0.25 0.005684 -0.103587 -0.196281 -0.239268 -0.298955 -0.544116 14
0.5 0.257783 0.125218 0.046717 -0.013190 -0.070077 -0.279283
15
0.75 0.214159 0.048767 -0.033883 -0.088606 -0.162287 -0.352665 16
1 -0.052872 -0.202811 -0.289531 -0.336450 -0.396747 -0.559555
17
6 0.25 0.435278 0.257183 0.176986 0.133877 0.062489 -0.090830 18
0.5 0.290807 0.206287 0.116569 0.060745 -0.011217 -0.173629
19
0.75 0.059018 -0.062851 -0.155885 -0.200646 -0.256321 -0.387814 20
1 0.118874 -0.030462 -0.119601 -0.169003 -0.232470 -0.328117
21
3 0.25 0.198857 -0.028118 -0.121680 -0.173547 -0.245166 -0.437178 22
0.5 0.133467 0.044319 -0.042115 -0.095718 -0.158739 -0.316826
23
0.75 0.089688 -0.035384 -0.139598 -0.189674 -0.257481 -0.409490 24
1 0.162975 0.012675 -0.068295 -0.111421 -0.172786 -0.376298
25 250 9 0.25 0.283943 0.171075 0.102312 0.057526 0.003210 -0.172672 26
0.5 0.322473 0.171561 0.077414 0.008952 -0.061644 -0.198336
27
0.75 0.165109 0.050177 -0.006297 -0.061494 -0.109914 -0.271395 28
1 -0.058034 -0.199634 -0.286395 -0.329072 -0.374657 -0.446141
29
6 0.25 0.291949 0.217848 0.150383 0.116097 0.052439 -0.114647 30
0.5 0.293698 0.177147 0.115804 0.074011 0.006296 -0.170862
31
0.75 0.045159 -0.091565 -0.168066 -0.213752 -0.266624 -0.512562 32
1 0.321524 0.210881 0.136044 0.071290 0.000412 -0.199886
33
3 0.25 0.303234 0.202762 0.132833 0.100714 0.039509 -0.107170 34
0.5 3.077542 1.569886 0.643957 -0.344942 -1.302150 -1.829526
35
0.75 0.149291 0.083300 0.000826 -0.071642 -0.152472 -0.447450 36
1 0.244549 0.134084 0.064922 0.012723 -0.058589 -0.223248
Table 4-15. Continued
Columns: condition, sample size, number of testlets, local effect, then mean.bias1 to mean.bias6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
Overall mean 0.311483 0.139437 0.031599 -0.045337 -0.138102 -0.335443
Standard Deviation 0.496357 0.285833 0.182890 0.157178 0.245830 0.291249
Table 4-16. Standard Rasch model (testlet size 5): bias of ability (θ) estimate recovery (EAP) with 6 different ability intervals
Columns: condition, sample size, number of testlets, local effect, then mean.bias1 to mean.bias6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
1 1000 9 0.25 0.099971 -0.034221 -0.133403 -0.186551 -0.252010 -0.452877 2
0.5 0.318107 0.186509 0.104965 0.050271 -0.015544 -0.253090
3
0.75 0.310948 0.202532 0.120125 0.068132 -0.008049 -0.195075 4
1 0.360180 0.246459 0.160534 0.104280 0.025703 -0.138183
5
6 0.25 0.333569 0.147097 0.057703 0.004335 -0.071557 -0.289829 6
0.5 0.348352 0.236751 0.154771 0.100999 0.026762 -0.203307
7
0.75 0.319864 0.171062 0.072551 0.009943 -0.077491 -0.297234 8
1 0.126029 -0.002200 -0.093594 -0.147499 -0.213030 -0.391729
9
3 0.25 0.248496 0.116344 0.031513 -0.013947 -0.091865 -0.298587 10
0.5 0.091450 -0.073315 -0.168065 -0.222511 -0.293491 -0.471152
11
0.75 0.502321 0.354719 0.274381 0.222110 0.145063 -0.079361 12
1 0.418671 0.266539 0.186816 0.130330 0.051680 -0.236261
13 500 9 0.25 0.135963 0.020694 -0.078640 -0.127076 -0.193024 -0.445134 14
0.5 0.080448 -0.057799 -0.140910 -0.206017 -0.267658 -0.482120
15
0.75 0.266265 0.095822 0.009320 -0.048039 -0.124395 -0.318598 16
1 -0.049814 -0.205177 -0.294881 -0.344937 -0.410871 -0.582785
17
6 0.25 0.344499 0.166983 0.086702 0.040895 -0.034596 -0.191486 18
0.5 0.200928 0.113445 0.022554 -0.034379 -0.109346 -0.276489
19
0.75 -0.012288 -0.136464 -0.230731 -0.278090 -0.339425 -0.480841 20
1 0.260057 0.105714 0.011334 -0.042154 -0.110330 -0.210707
21
3 0.25 0.184735 -0.044013 -0.138392 -0.190226 -0.261472 -0.453125 22
0.5 0.056858 -0.034257 -0.122437 -0.177865 -0.243289 -0.404036
23
0.75 0.054648 -0.072215 -0.177272 -0.227534 -0.296165 -0.450183 24
1 0.186734 0.036123 -0.045174 -0.089354 -0.152590 -0.358227
25 250 9 0.25 0.074001 -0.047596 -0.123366 -0.169803 -0.222756 -0.401669 26
0.5 0.190891 0.030451 -0.069547 -0.142738 -0.219874 -0.367389
27
0.75 0.317890 0.197999 0.141188 0.083001 0.026529 -0.145503 28
1 -0.038382 -0.186940 -0.277641 -0.325517 -0.383315 -0.475526
29
6 0.25 0.206191 0.127465 0.055987 0.019072 -0.046796 -0.218904 30
0.5 0.173974 0.052087 -0.012475 -0.056456 -0.124616 -0.303180
31
0.75 -0.064167 -0.206921 -0.289422 -0.337861 -0.390709 -0.638718 32
1 0.265819 0.147722 0.067491 -0.002090 -0.077987 -0.285482
33
3 0.25 0.313663 0.208113 0.136700 0.103557 0.041626 -0.105400 34
0.5 2.922879 1.415147 0.489106 -0.499631 -1.456934 -1.983828
35
0.75 0.294057 0.222541 0.138180 0.064553 -0.017974 -0.315192
Table 4-16. Continued
Columns: condition, sample size, number of testlets, local effect, then mean.bias1 to mean.bias6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
36 1 0.185198 0.072606 0.002331 -0.052120 -0.127573 -0.298190
Overall mean 0.278583 0.106661 -0.001992 -0.081137 -0.175483 -0.374983
Standard Deviation 0.473497 0.264457 0.167108 0.160390 0.260040 0.305778
Table 4-17. Rasch testlet model (testlet size 3): RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals
Columns: condition, sample size, number of testlets, local effect, then mean.RMSE1 to mean.RMSE6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
1000 15 0.25 0.078337 0.026778 0.013562 0.013996 0.022898 0.048603 38
0.5 0.058554 0.023512 0.013291 0.011807 0.020247 0.043903
39
0.75 0.109081 0.025007 0.015425 0.011851 0.024913 0.015921 40
1 0.042240 0.021694 0.016620 0.015158 0.024771 0.045446
41
10 0.25 0.078190 0.019787 0.011747 0.011594 0.021015 0.003677 42
0.5 0.093387 0.022832 0.013078 0.013353 0.021377 0.005111
43
0.75 0.073321 0.016861 0.013579 0.011982 0.026210 0.057016 44
1 0.070156 0.027408 0.014366 0.014792 0.020166 0.083354
45
5 0.25 0.084415 0.019277 0.012643 0.012258 0.018854 0.027661 46
0.5 0.084047 0.023888 0.012558 0.012447 0.020792 0.084284
47
0.75 0.059131 0.019724 0.011052 0.012973 0.024332 0.122048 48
1 0.066213 0.018199 0.012070 0.011800 0.021689 0.081700
49 500 15 0.25 0.192334 0.033274 0.019610 0.017669 0.023263 0.023885 50
0.5 0.147215 0.039461 0.020473 0.022246 0.027377 0.028104
51
0.75 0.088903 0.030005 0.018717 0.020319 0.038214 0.321320 52
1 0.086655 0.032271 0.022941 0.021482 0.034426 0.075319
53
10 0.25 0.119339 0.029762 0.020720 0.024285 0.024943 0.150809 54
0.5 0.080430 0.033565 0.018110 0.014816 0.041293 0.006453
55
0.75 0.082724 0.025981 0.017898 0.015908 0.038934 0.015708 56
1 0.098487 0.033379 0.020264 0.018472 0.030859 0.023337
57
5 0.25 0.086273 0.028639 0.018110 0.020713 0.035449 0.026388 58
0.5 0.143987 0.037974 0.019936 0.021455 0.032242 0.071056
59
0.75 0.156287 0.036786 0.022283 0.019618 0.028308 0.131062 60
1 0.184270 0.038350 0.019849 0.016660 0.029845 0.013516
61 250 15 0.25 0.292779 0.039214 0.027430 0.023809 0.034594 0.087950 62
0.5 0.218673 0.084971 0.042775 0.036884 0.055996 0.086146
63
0.75 0.144418 0.049953 0.025300 0.027950 0.047892 0.264669 64
1 0.140580 0.042917 0.021739 0.025272 0.049209 0.330666
65
10 0.25 0.178133 0.057100 0.027217 0.021679 0.040607 0.114075 66
0.5 0.120084 0.049634 0.029591 0.037874 0.055365 0.092475
67
0.75 0.113360 0.051918 0.024542 0.025663 0.053502 0.196613 68
1 0.155949 0.036788 0.026793 0.027741 0.042362 0.004325
69
5 0.25 0.134323 0.057182 0.030049 0.026274 0.042960 0.265263 70
0.5 0.182210 0.067420 0.030899 0.032589 0.045741 0.266494
71
0.75 0.119217 0.047717 0.022420 0.024159 0.041848 0.088415
Table 4-17. Continued
Columns: condition, sample size, number of testlets, local effect, then mean.RMSE1 to mean.RMSE6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
72 1 0.108693 0.045403 0.029070 0.027282 0.041949 0.020853
Overall mean 0.118678 0.035962 0.020465 0.020134 0.033457 0.092323
Standard Deviation 0.052386 0.015056 0.006978 0.007164 0.011086 0.092709
Table 4-18. Partial credit model (testlet size 3): RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals
Columns: condition, sample size, number of testlets, local effect, then mean.RMSE1 to mean.RMSE6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
37 1000 15 0.25 0.065677 0.023965 0.011388 0.011902 0.023170 0.067682 38
0.5 0.060728 0.023976 0.013179 0.011456 0.020410 0.043749
39
0.75 0.099028 0.024728 0.013634 0.010496 0.023721 0.010887 40
1 0.048958 0.019700 0.017045 0.017225 0.031831 0.086938
41
10 0.25 0.094635 0.027001 0.014954 0.013781 0.019667 0.012397 42
0.5 0.090130 0.024572 0.013120 0.013289 0.021730 0.009448
43
0.75 0.083380 0.018589 0.011610 0.010228 0.022013 0.024883 44
1 0.078642 0.028244 0.014527 0.013482 0.020061 0.096752
45
5 0.25 0.082899 0.019888 0.012796 0.012045 0.019864 0.030363 46
0.5 0.080121 0.023337 0.011771 0.011788 0.020182 0.079555
47
0.75 0.064722 0.019505 0.011721 0.014189 0.025794 0.124530 48
1 0.067890 0.017737 0.012473 0.012460 0.024051 0.097091
49 500 15 0.25 0.154701 0.035439 0.017829 0.016028 0.022127 0.020737 50
0.5 0.105886 0.034800 0.017357 0.017181 0.026016 0.025455
51
0.75 0.097656 0.027177 0.016228 0.018008 0.034151 0.270534 52
1 0.086244 0.028811 0.018696 0.018263 0.034193 0.068944
53
10 0.25 0.111764 0.029271 0.018249 0.023287 0.025120 0.139872 54
0.5 0.078844 0.030477 0.018031 0.014985 0.043723 0.016498
55
0.75 0.090798 0.028041 0.017230 0.015321 0.039667 0.031949 56
1 0.098187 0.032181 0.018453 0.018955 0.032947 0.000127
57
5 0.25 0.091645 0.029466 0.018295 0.022806 0.038532 0.037234 58
0.5 0.147778 0.045405 0.021106 0.022665 0.033154 0.080828
59
0.75 0.137374 0.036145 0.020466 0.019162 0.028722 0.122420 60
1 0.154455 0.036054 0.016647 0.016980 0.029782 0.033431
61 250 15 0.25 0.228860 0.042316 0.025572 0.026715 0.039057 0.039881 62
0.5 0.190427 0.086004 0.037864 0.029954 0.044274 0.067882
63
0.75 0.132743 0.045510 0.021947 0.024455 0.049834 0.315181 64
1 0.139310 0.033277 0.017402 0.021319 0.052945 0.171415
65
10 0.25 0.169489 0.053985 0.023519 0.021089 0.044170 0.138256 66
0.5 0.121261 0.048928 0.024269 0.027797 0.046095 0.047430
67
0.75 0.120247 0.054696 0.024694 0.022111 0.050509 0.222708 68
1 0.136154 0.038438 0.026148 0.033798 0.059179 0.047885
69
5 0.25 0.138147 0.061613 0.029549 0.026912 0.045897 0.267080 70
0.5 0.175470 0.065379 0.026202 0.027847 0.042611 0.178891
71
0.75 0.110521 0.045314 0.022411 0.023566 0.046565 0.070210 72
1 0.147110 0.059226 0.037350 0.033784 0.045651 0.060984
Table 4-18. Continued
Columns: condition, sample size, number of testlets, local effect, then mean.RMSE1 to mean.RMSE6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
Overall mean 0.113386 0.036089 0.019270 0.019315 0.034095 0.087781
Standard Deviation 0.040422 0.015425 0.006583 0.006572 0.011399 0.079865
Table 4-19. Standard Rasch model (testlet size 3): RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals
Columns: condition, sample size, number of testlets, local effect, then mean.RMSE1 to mean.RMSE6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
37 1000 15 0.25 0.071970 0.027048 0.012636 0.012719 0.023061 0.060829 38
0.5 0.060926 0.023909 0.012678 0.011401 0.020533 0.045204
39
0.75 0.090384 0.023542 0.013484 0.011052 0.022812 0.019934 40
1 0.050341 0.019446 0.016044 0.016289 0.029408 0.083912
41
10 0.25 0.086566 0.025718 0.014281 0.012950 0.019438 0.008700 42
0.5 0.086802 0.025173 0.012795 0.013299 0.021531 0.011279
43
0.75 0.094025 0.023338 0.011782 0.010458 0.020305 0.004227 44
1 0.073421 0.027330 0.014314 0.014277 0.020211 0.100202
45
5 0.25 0.080224 0.019463 0.012707 0.012450 0.019174 0.027102 46
0.5 0.075284 0.022845 0.011400 0.011556 0.020183 0.071554
47
0.75 0.062711 0.020036 0.012618 0.015555 0.026669 0.131456 48
1 0.063622 0.018998 0.013537 0.014393 0.026115 0.105073
49 500 15 0.25 0.174828 0.040648 0.019626 0.017030 0.022111 0.000227 50
0.5 0.127802 0.041100 0.018820 0.018756 0.026461 0.005880
51
0.75 0.093012 0.025979 0.016592 0.019151 0.037344 0.303639 52
1 0.089175 0.028986 0.018560 0.017811 0.033151 0.060292
53
10 0.25 0.123712 0.033044 0.020933 0.025251 0.024102 0.154511 54
0.5 0.098728 0.040028 0.018633 0.015801 0.039165 0.021575
55
0.75 0.089436 0.027861 0.017481 0.015466 0.040531 0.034278 56
1 0.088494 0.027750 0.020219 0.022417 0.041304 0.037713
57
5 0.25 0.089638 0.029695 0.019246 0.024006 0.040744 0.044998 58
0.5 0.153260 0.047723 0.022317 0.023691 0.034507 0.088560
59
0.75 0.180469 0.048970 0.027495 0.022214 0.031881 0.162449 60
1 0.186296 0.042333 0.019634 0.016705 0.029095 0.007079
61 250 15 0.25 0.244394 0.045048 0.025937 0.023603 0.034617 0.039916 62
0.5 0.147001 0.065525 0.028917 0.025894 0.037003 0.022912
63
0.75 0.113811 0.045457 0.024075 0.027296 0.053748 0.376276 64
1 0.157073 0.034502 0.018100 0.019511 0.046642 0.240287
65
10 0.25 0.167921 0.055283 0.023304 0.021270 0.042570 0.130323 66
0.5 0.123050 0.048287 0.027207 0.033872 0.052401 0.093415
67
0.75 0.125161 0.056347 0.024480 0.022407 0.048606 0.187861 68
1 0.134140 0.041803 0.028784 0.038777 0.067885 0.084995
69
5 0.25 0.154649 0.069085 0.032520 0.030000 0.047157 0.292877 70
0.5 0.168907 0.064514 0.025804 0.028420 0.042436 0.180015
71
0.75 0.121488 0.053159 0.021906 0.023698 0.041709 0.099905
Table 4-19. Continued
Columns: condition, sample size, number of testlets, local effect, then mean.RMSE1 to mean.RMSE6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
72 1 0.149801 0.060850 0.037591 0.034294 0.045378 0.056767
Overall mean 0.116626 0.037523 0.019902 0.020104 0.034166 0.094340
Standard Deviation 0.044101 0.014734 0.006456 0.007159 0.011939 0.092177
Table 4-20. Rasch testlet model (testlet size 5): RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals
Columns: condition, sample size, number of testlets, local effect, then mean.RMSE1 to mean.RMSE6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
1 1000 9 0.25 0.048535 0.020547 0.013542 0.014905 0.025987 0.132773 2
0.5 0.104665 0.027705 0.016009 0.016919 0.019188 0.098599
3
0.75 0.048658 0.022476 0.013048 0.016071 0.030985 0.114531 4
1 0.081613 0.028248 0.015910 0.014095 0.021880 0.069132
5
6 0.25 0.070898 0.022451 0.010947 0.011000 0.024396 0.062862 6
0.5 0.172489 0.050252 0.031796 0.026443 0.036537 0.004791
7
0.75 0.093437 0.024567 0.012562 0.013693 0.022836 0.080497 8
1 0.119667 0.021661 0.015078 0.010693 0.021798 0.100457
9
3 0.25 0.087055 0.020796 0.012494 0.011860 0.019987 0.034308 10
0.5 0.088673 0.022802 0.011603 0.010394 0.022048 0.026025
11
0.75 0.114683 0.029492 0.020703 0.015341 0.022141 0.175543 12
1 0.119646 0.030957 0.016737 0.016704 0.028018 0.003056
13 500 9 0.25 0.073423 0.028199 0.017699 0.019940 0.030755 0.112060 14
0.5 0.104189 0.034109 0.018287 0.020583 0.034834 0.062598
15
0.75 0.110667 0.032705 0.017843 0.016505 0.026520 0.141207 16
1 0.072086 0.029727 0.024652 0.026943 0.048265 0.173372
17
6 0.25 0.093659 0.037446 0.020469 0.014655 0.026255 0.077015 18
0.5 0.088630 0.032820 0.018278 0.020875 0.030910 0.080459
19
0.75 0.098950 0.028540 0.021342 0.022284 0.033999 0.126608 20
1 0.071756 0.040521 0.024634 0.027753 0.049917 0.147327
21
3 0.25 0.146400 0.027316 0.019713 0.022029 0.054734 0.101774 22
0.5 0.094459 0.034377 0.014855 0.018803 0.030531 0.068490
23
0.75 0.083904 0.028614 0.019628 0.022143 0.038706 0.030053 24
1 0.072267 0.031387 0.019253 0.018611 0.040400 0.066370
25 250 9 0.25 0.104921 0.036662 0.020843 0.030739 0.053081 0.013633 26
0.5 0.137340 0.059429 0.025746 0.024807 0.045477 0.180335
27
0.75 0.136443 0.040754 0.027155 0.025382 0.051492 0.075550 28
1 0.107111 0.047915 0.050709 0.045143 0.084078 0.391390
29
6 0.25 0.144626 0.045561 0.025586 0.027354 0.038179 0.026828 30
0.5 0.138057 0.055004 0.026784 0.026009 0.041708 0.362921
31
0.75 0.199151 0.049567 0.024088 0.024056 0.045059 0.011575 32
1 0.127239 0.046378 0.022620 0.025210 0.046614 0.234785
33
3 0.25 0.183628 0.067105 0.045098 0.031144 0.051958 0.060797 34
0.5 1.788592 0.244154 0.092046 0.186136 0.065807 0.192168
35
0.75 0.156430 0.046850 0.026329 0.027011 0.046847 0.061260 36
1 0.105701 0.042695 0.021221 0.021594 0.048607 0.138145
Table 4-20. Continued
Columns: condition, sample size, number of testlets, local effect, then mean.RMSE1 to mean.RMSE6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
Overall mean 0.155268 0.041383 0.023203 0.025662 0.037793 0.106647
Standard Deviation 0.282199 0.036636 0.014413 0.028410 0.014431 0.087627
Table 4-21. Partial credit model (testlet size 5): RMSE of ability (θ) estimate recovery (EAP) with 6 different ability intervals
Columns: condition, sample size, number of testlets, local effect, then mean.RMSE1 to mean.RMSE6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
1 1000 9 0.25 0.055078 0.019749 0.012548 0.014505 0.026118 0.128978 2
0.5 0.093575 0.027011 0.014290 0.015383 0.020417 0.085914
3
0.75 0.062688 0.019097 0.011141 0.013724 0.025965 0.079033 4
1 0.090919 0.029425 0.016153 0.012023 0.019412 0.075747
5
6 0.25 0.071424 0.022914 0.010005 0.011628 0.028386 0.072207 6
0.5 0.124677 0.038863 0.023997 0.018060 0.025961 0.021936
7
0.75 0.105384 0.028632 0.013467 0.012453 0.024295 0.116020 8
1 0.069542 0.019988 0.013672 0.012453 0.024686 0.120254
9
3 0.25 0.071793 0.019053 0.011908 0.011416 0.021645 0.052494 10
0.5 0.080230 0.022653 0.011691 0.010857 0.022459 0.044464
11
0.75 0.100348 0.027188 0.017871 0.013633 0.020760 0.158864 12
1 0.100129 0.025238 0.013624 0.013346 0.022882 0.052848
13 500 9 0.25 0.087658 0.030786 0.021063 0.021593 0.037533 0.131595 14
0.5 0.120225 0.035837 0.017506 0.018423 0.032209 0.072691
15
0.75 0.108335 0.030604 0.014933 0.016627 0.026840 0.091080 16
1 0.075110 0.030694 0.026972 0.032415 0.054812 0.181353
17
6 0.25 0.119522 0.049433 0.021642 0.017807 0.026338 0.109247 18
0.5 0.104570 0.037916 0.018646 0.020389 0.030663 0.073772
19
0.75 0.094708 0.027392 0.021657 0.023170 0.036143 0.138768 20
1 0.091668 0.037460 0.020997 0.021922 0.042133 0.108095
21
3 0.25 0.153340 0.032996 0.016169 0.017654 0.047098 0.080737 22
0.5 0.094481 0.032526 0.013223 0.019982 0.033307 0.096095
23
0.75 0.096467 0.026004 0.019063 0.019189 0.036804 0.054503 24
1 0.087235 0.031068 0.017224 0.015474 0.034156 0.001936
25 250 9 0.25 0.133246 0.043864 0.021702 0.027811 0.039083 0.049585 26
0.5 0.145823 0.058810 0.024323 0.023814 0.043880 0.212430
27
0.75 0.117474 0.038922 0.023324 0.026192 0.049893 0.067516 28
1 0.117828 0.041894 0.037692 0.036645 0.068336 0.253170
29
6 0.25 0.111735 0.051914 0.028435 0.029285 0.036509 0.048822 30
0.5 0.135587 0.055161 0.023390 0.024883 0.040852 0.313178
31
0.75 0.256954 0.051364 0.026250 0.028499 0.056166 0.027842 32
1 0.153590 0.046054 0.020337 0.025486 0.048274 0.178885
33
3 0.25 0.117107 0.047913 0.032525 0.024273 0.046733 0.025307 34
0.5 1.727757 0.225916 0.080002 0.195150 0.050771 0.121486
35
0.75 0.170151 0.046210 0.023578 0.025170 0.042063 0.056915 36
1 0.127922 0.043790 0.021082 0.021712 0.044274 0.156180
Table 4-21. Continued
Columns: condition, sample size, number of testlets, local effect, then mean.RMSE1 to mean.RMSE6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
Overall mean 0.154841 0.040398 0.021169 0.024807 0.035774 0.101665
Standard Deviation 0.272132 0.033626 0.011842 0.029903 0.011965 0.066253
Table 4-22. Standard Rasch model (testlet size 5): RMSE of ability (θ) estimate recovery with 6 different ability intervals
Columns: condition, sample size, number of testlets, local effect, then mean.RMSE1 to mean.RMSE6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
1 1000 9 0.25 0.054505 0.019157 0.011612 0.013639 0.024171 0.123000 2
0.5 0.086199 0.025481 0.013019 0.015002 0.020782 0.077019
3
0.75 0.079972 0.019311 0.012218 0.013808 0.022459 0.055386 4
1 0.082276 0.027552 0.015155 0.011565 0.018216 0.072618
5
6 0.25 0.080914 0.025310 0.011499 0.010704 0.024551 0.058512 6
0.5 0.083499 0.026217 0.015746 0.011670 0.020758 0.055891
7
0.75 0.086500 0.023057 0.011474 0.011640 0.026323 0.098057 8
1 0.062397 0.020569 0.014227 0.013574 0.026134 0.123626
9
3 0.25 0.070911 0.019228 0.011657 0.011668 0.021136 0.050452 10
0.5 0.064980 0.022725 0.015423 0.013908 0.028543 0.072700
11
0.75 0.107659 0.030363 0.020102 0.015079 0.020511 0.172489 12
1 0.098714 0.025765 0.013472 0.013395 0.022484 0.045555
13 500 9 0.25 0.086103 0.027197 0.017137 0.019369 0.030920 0.104854 14
0.5 0.092530 0.034024 0.017266 0.022220 0.042295 0.115485
15
0.75 0.114248 0.031499 0.015170 0.015623 0.025589 0.100740 16
1 0.074132 0.030469 0.027548 0.032850 0.056242 0.195890
17
6 0.25 0.102211 0.040934 0.020133 0.015024 0.026044 0.075459 18
0.5 0.092445 0.032911 0.017773 0.019562 0.030586 0.092825
19
0.75 0.090080 0.029781 0.025252 0.026233 0.043281 0.153303 20
1 0.101604 0.039027 0.019060 0.018473 0.035121 0.066013
21
3 0.25 0.150032 0.032273 0.016446 0.018240 0.048618 0.085607 22
0.5 0.088842 0.031174 0.014248 0.023129 0.038932 0.118433
23
0.75 0.096743 0.026986 0.020221 0.020743 0.040151 0.041003 24
1 0.088962 0.030766 0.016913 0.015060 0.033199 0.003464
25 250 9 0.25 0.108200 0.037172 0.023692 0.035562 0.060802 0.055020 26
0.5 0.118448 0.050578 0.023559 0.028191 0.052439 0.293215
27
0.75 0.152811 0.044221 0.025725 0.024325 0.041710 0.035497 28
1 0.112522 0.040882 0.037117 0.035945 0.068377 0.269680
29
6 0.25 0.097628 0.045396 0.024069 0.027508 0.038001 0.009918 30
0.5 0.115448 0.046363 0.021887 0.025772 0.045353 0.245964
31
0.75 0.261414 0.057456 0.033796 0.039445 0.070626 0.037632 32
1 0.131436 0.040350 0.017983 0.023282 0.051087 0.221120
33
3 0.25 0.118245 0.047826 0.032667 0.024198 0.046625 0.028095 34
0.5 1.664506 0.200187 0.062951 0.211889 0.040309 0.053299
35
0.75 0.210820 0.057664 0.026479 0.025437 0.038700 0.123125
Table 4-22. Continued
Columns: condition, sample size, number of testlets, local effect, then mean.RMSE1 to mean.RMSE6 for the ability (θ) intervals below -2.0, -2.0 to -1.0, -1.0 to 0.0, 0.0 to 1.0, 1.0 to 2.0, and above 2.0.
36 1 0.118271 0.041488 0.021033 0.021938 0.047101 0.131175
Overall mean 0.148506 0.038371 0.020659 0.025713 0.036894 0.101726
Standard Deviation 0.262839 0.029611 0.009735 0.032806 0.013939 0.070900
Table 4-23. NBOME LEVEL-2 Block 1 item WMSE
Item   Testlet Model   Rasch Model   Partial Credit Model
1      1.03            1.01          1
2      1.03            1.03          1.02
3      1.01            1.01          1
4      1.19            1.04          1.06
5      1.06            1.03          1.02
6      0.99            0.99          0.99
7      0.99            0.98          0.97
8      1.08            1             1
9      0.97            0.98          0.98
10     1.04            1.02          1.01
11     1.03            1             0.99
12     1.03            1.01          1
13     1.04            1.01          1
14     1.04            1.04          1.02
15     1.09            1.01          0.99
16     1.03            1.03          1
17     1.05            1.03          1
18     1.1             1.04          1.01
19     1.04            1.01          0.99
20     1               1.01          1
21     1               0.98          0.98
22     1.02            1.04          1.01
23     1.06            1             0.99
24     1.04            1.03          1.01
25     1.07            1.02          1.02
26     1.05            1.02          1.02
27     1.02            1.02          1
28     1.14            1.02          1.03
29     1.03            0.99          1.03
30     0.97            1             1.01
31     1.06            1             1.01
32     1.01            0.99          0.97
33     1.09            0.99          0.99
34     1.13            1.18          1.03
35     1.04            1             1.02
36     1               1.04          1.07
37     0.93            0.99          1.02
38     1.04            1.01
39     1.05            0.97
40     1.07            0.98
41     1.03            1
42     1.01            0.97
43     0.96            1
44     0.96            1.03
45     1.01            0.99
46     0.86            0.99
47     1.06            1.02
48     1.16            1.02
49     1.02            1.01
50     1.08            1.02
MAX    1.19            1.18          1.07
MIN    0.86            0.97          0.97
MEAN   1.0362          1.012         1.007027
SD     0.055729        0.0309        0.021064
Table 4-24. COMLEX-Level 2 2008 block-1 local item dependence detection results
Significant results
Sequence   Item pair         P-value
1          item3, item23     0.002
2          item5, item42     0.009
3          item6, item19     0.000
4          item6, item21     0.000
5          item7, item49     0.007
6          item9, item39     0.000
7          item9, item40     0.004
8          item10, item36    0.009
9          item11, item15    0.002
10         item11, item35    0.009
11         item12, item43    0.009
12         item19, item21    0.004
13         item20, item30    0.002
14         item28, item31    0.000
15         item29, item30    0.000
16         item29, item39    0.002
17         item30, item31    0.009
18         item32, item33    0.000
19         item35, item42    0.000
20         item37, item38    0.004
21         item37, item40    0.000
22         item39, item40    0.000
23         item41, item42    0.000
24         item43, item44    0.000
25         item45, item46    0.000
26         item47, item48    0.000
Note: Number of sampled matrices: 450; Number of Item-Pairs tested: 1225;
Item-Pairs with one-sided p < 0.01
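The count of 1225 item pairs in the note to Table 4-24 is consistent with testing every unordered pair among the 50 items of Block 1 (Table 4-23). As a rough reading aid, and assuming the 1225 tests could be treated as independent (an assumption made here for illustration only, not a claim from the study), the number of pairs expected to fall below p < 0.01 by chance can be compared with the 26 pairs listed. A minimal Python check:

```python
from math import comb

n_items = 50      # items in Block 1 (Table 4-23)
alpha = 0.01      # one-sided significance level from the table note
flagged = 26      # item pairs listed in Table 4-24

# Number of unordered item pairs among n_items items.
n_pairs = comb(n_items, 2)

# Expected number of flagged pairs under the null, assuming
# independent tests (illustrative assumption only).
expected_by_chance = alpha * n_pairs

print(n_pairs)
print(round(expected_by_chance, 2))
print(flagged > expected_by_chance)
```

The pair tests share items and are therefore not independent, so the comparison is only indicative.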
CHAPTER 5 DISCUSSION
5.1 General Discussion
Several empirical findings related to testlet modeling emerged from the simulation results
and the empirical case results. First, our results suggest that the Partial Credit model and
the Rasch testlet model performed better than the standard Rasch model under the small and
medium testlet size circumstances. There is insufficient evidence to indicate whether the
Partial Credit model or the Rasch testlet model performs better. The results also show that
sample size has a significant effect on the analysis results for the three models. As the
sample size increases, the discrepancies between the model estimates and the real data set
increase. Also, the degree to which the standard Rasch model overestimates test reliability
increases as the sample size increases. In addition, when the testlet size is kept at the
medium level (i.e., testlet size <5), the ratio of independent items to testlet items within a
test, in terms of the number of testlets, plays a major role in how efficiently the models
recover examinees’ ability parameters.
Second, the findings show no obvious difference in the test reliability estimates between
the Partial Credit model and the Rasch testlet model. This “no difference” finding indicates
that, under the small testlet size circumstance, applying a polytomous IRT model to testlets
does not result in a reduction in test reliability. Previous concerns about test reliability
reduction from applying a polytomous IRT model to testlets (Keller, Swaminathan, & Sireci,
2003) are not as severe as we expected in the small and medium testlet size situations. We
believe that the small number of parameters dropped when the polytomous items are formed
does not drastically hurt the reliability estimates for the entire test when the testlet size
is small.
Third, the standard error of measurement results from the ability parameter estimation
suggest that the standard Rasch model underestimates the standard error of measurement
compared with the other two models.
Fourth, the bias and RMSE results from the ability parameter recovery reveal no evident
pattern linking the factor variations (i.e., testlet size, sample size, and the number of
testlets within a test) to changes in bias or RMSE. The magnitude of the local item effects
does not have an evident impact on the accuracy of the ability estimation. However, this
study only investigates a small range of local dependence effects (i.e., [0,1]). A broader
range of local dependence effects is worthy of further investigation. Using EAP estimates has
major effects on bias and RMSE at both tails of the ability distribution. In sum, because
these three models are all Rasch-type models, the precision of the ability parameter recovery
for these three models is relatively good. All three Rasch-type models show some robustness
when faced with a violation of the local item independence assumption.
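The tail behavior noted above can be illustrated with a small sketch. It groups simulated examinees into six ability intervals and computes bias and RMSE within each interval using the textbook definitions (mean error, and the square root of the mean squared error), which may differ from the exact computation behind the tables in Chapter 4. The cut points at -2, -1, 0, 1, and 2, the shrinkage factor of 0.8, and the noise level are assumptions chosen for illustration; the estimator is a crude stand-in for the shrinkage of EAP estimates toward the prior mean, not the ConQuest procedure used in this study.

```python
import math
import random

def interval_index(theta, cuts=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Return the 0-based index of the ability interval containing theta."""
    return sum(theta >= c for c in cuts)

def bias_rmse_by_interval(true_thetas, estimates, n_intervals=6):
    """Bias (mean error) and RMSE of the estimates within each ability interval."""
    errors = [[] for _ in range(n_intervals)]
    for t, e in zip(true_thetas, estimates):
        errors[interval_index(t)].append(e - t)
    result = []
    for errs in errors:
        if not errs:  # empty interval: nothing to summarize
            result.append((float("nan"), float("nan")))
            continue
        bias = sum(errs) / len(errs)
        rmse = math.sqrt(sum(x * x for x in errs) / len(errs))
        result.append((bias, rmse))
    return result

random.seed(2010)
true_thetas = [random.gauss(0.0, 1.0) for _ in range(20000)]
# Hypothetical shrinkage estimator: pulls each estimate toward the prior mean of 0.
estimates = [0.8 * t + random.gauss(0.0, 0.3) for t in true_thetas]

for k, (bias, rmse) in enumerate(bias_rmse_by_interval(true_thetas, estimates), start=1):
    print(f"interval {k}: bias={bias:+.3f}  rmse={rmse:.3f}")
```

Under this kind of shrinkage, the lowest interval shows positive bias and the highest shows negative bias, the same sign pattern as the overall means in the bias tables of Chapter 4.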
5.2 Limitations and Suggestions for Future Research
Although there is no obvious discrepancy in the test reliability estimates between the
Rasch testlet model and the Partial Credit model, some parameters are dropped when the
polytomous IRT model is applied in place of the dichotomous IRT model. Because of this
parameter dropping, a decrease in reliability is still expected (Sireci, et al., 1991). The
question is whether the decrease in test reliability is due to the change of test format
(i.e., from dichotomous items to a polytomous item within a testlet) or to the local item
dependence within testlets. To address this question, the “fake testlet” was proposed by Yen
(1993) and applied by Zenisky, et al. (2002). A “fake testlet” is formed randomly from the
independent dichotomous items in a test. Comparing the test reliability estimates between the
test with “real testlets” and the test with “fake testlets” can answer this question and
identify the true cause of the reduction in test reliability. Because of limited time, we did
not generate “fake testlets” in this study to compare the test reliability under these two
situations. Future research should investigate the reliability discrepancies between the
“fake testlet” and the “real testlet” situations.
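A fake testlet of the kind described above could be assembled as in the following sketch: independent dichotomous items are randomly partitioned into groups of a fixed size, and the item scores in each group are summed into one polytomous score. The function name, the layout of the response matrix, and the dropping of leftover items are assumptions made for illustration; the sketch does not reproduce the detailed procedures of Yen (1993) or Zenisky, et al. (2002).

```python
import random

def make_fake_testlets(responses, testlet_size, seed=0):
    """Randomly group dichotomous items into fake testlets and sum their scores.

    responses: list of examinee rows, each a list of 0/1 item scores.
    Returns (groups, scores): the item indices in each fake testlet and,
    per examinee, one polytomous score (0..testlet_size) per fake testlet.
    Items left over after forming full groups are dropped (an assumption).
    """
    n_items = len(responses[0])
    order = list(range(n_items))
    random.Random(seed).shuffle(order)  # random assignment of items to groups
    n_groups = n_items // testlet_size
    groups = [order[g * testlet_size:(g + 1) * testlet_size] for g in range(n_groups)]
    scores = [[sum(row[i] for i in grp) for grp in groups] for row in responses]
    return groups, scores

# Toy data: 4 examinees by 6 independent dichotomous items (made up for illustration).
responses = [
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
groups, scores = make_fake_testlets(responses, testlet_size=3, seed=42)
print(groups)
print(scores)
```

The resulting polytomous scores could then be calibrated with the Partial Credit model in the same way as real testlets, so that any difference in reliability reflects the format change rather than local item dependence.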
5.3 Conclusion
This study compares the performance of three different models in small and medium testlet
size situations across changes in sample size, variations in the ratio of independent items
to testlet items within a test, and changes in the local item effects. The study findings
indicate that using the polytomous IRT model for testlet item analyses is still efficient for
small testlet sizes and non-adaptive tests. Although, under this small testlet size
situation, the Rasch testlet model and the Partial Credit model both perform better than the
standard Rasch model, having a large proportion of testlet items in a test makes the Rasch
testlet model unstable, as shown by the large number of MLE non-convergence occurrences in
the Rasch testlet model application. For small testlet sizes, polytomous IRT models are more
stable than the Rasch testlet model when a large number of testlets are included in a test.
This instability may be caused by the multidimensionality of the Rasch testlet model. The
relationship between the model instability and its multidimensionality is worthy of further
investigation.
Furthermore, computational efficiency should also be considered when selecting a
model for testlet analysis. The simulations were run on a personal computer with a
2.83-GHz Intel Xeon processor. Completing the data simulation and analysis for 12
conditions took 1459771.40 seconds (approximately 405.5 hours) with the Rasch testlet
model, but only 48659.05 seconds (approximately 13.5 hours) with the Partial Credit
model for the corresponding conditions. Typically, a single ConQuest calibration took
approximately 45 to 70 minutes for the Rasch testlet model but only approximately 5 to
8 minutes for the Partial Credit model.
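As a quick check on the reported run times, the conversion from seconds to hours and
the relative cost of the two models can be worked out directly. The sketch below uses
only the totals reported above; the variable names are illustrative, and the
per-condition figure assumes the 12 conditions took roughly equal time, which the
study does not state.

```python
SECONDS_PER_HOUR = 3600
N_CONDITIONS = 12

# Total run times reported in the text, in seconds.
rasch_testlet_seconds = 1459771.40
partial_credit_seconds = 48659.05

rasch_testlet_hours = rasch_testlet_seconds / SECONDS_PER_HOUR
partial_credit_hours = partial_credit_seconds / SECONDS_PER_HOUR

# How many times longer the Rasch testlet model took overall.
runtime_ratio = rasch_testlet_seconds / partial_credit_seconds

# Average hours per condition (assumes roughly equal time per condition).
rasch_testlet_hours_per_condition = rasch_testlet_hours / N_CONDITIONS

print(f"Rasch testlet model:  {rasch_testlet_hours:.1f} hours")
print(f"Partial Credit model: {partial_credit_hours:.1f} hours")
print(f"Ratio:                {runtime_ratio:.1f}")
print(f"Rasch testlet, per condition: {rasch_testlet_hours_per_condition:.1f} hours")
```

The totals imply that the Rasch testlet model took roughly 30 times as long as the
Partial Credit model for the same set of conditions.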
This investigation of models for analyzing testlet items under small testlet sizes
provides guidance for model selection in future analyses of testlet-type data. The
polytomous IRT model and the Rasch testlet model offer an advantage over the standard
Rasch model: they avoid underestimating the standard error of measurement and yield
better ability parameter estimates in small testlet size situations.
LIST OF REFERENCES
Adams, R. J., Wilson, M., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1–23.
Armstrong, R. D. (2004). Computerized Adaptive Testing With Multiple-Form Structures. Applied Psychological Measurement, 28(3), 147-164.
Andrich, D. (1978). Application of a psychometric model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.
Ariel, A., Veldkamp, B. P., & Breithaupt, K. (2006). Optimal Testlet Pool Assembly for Multistage Testing Designs. Applied Psychological Measurement, 30(3), 204-215.
Baldwin, S.G. (2007). A review of Testlet response theory and its applications. Journal of Educational and Behavioral Statistics, 32(3), 333-336.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431-444.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.
Brandt, S. (2008). Estimation of a Rasch Model Including Subdimensions. IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 1, 51-70.
Breithaupt, K., Ariel, A., & Veldkamp, B. P. (2005). Automated Simultaneous Assembly for Multistage Testing. International Journal of Testing, 5(3), 319-330.
Breithaupt, K. (2007). Automated Simultaneous Assembly of Multistage Testlets for a High-Stakes Licensing Examination. Educational and Psychological Measurement, 67(1), 5-20.
Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289.
Davis, L. L. (2003). Item Exposure Constraints for Testlets in the Verbal Reasoning Section of the MCAT. Applied Psychological Measurement, 27(5), 335-356.
DeMars, C. E. (2006). Application of the Bi-Factor Multidimensional Item Response Theory Model to Testlet-Based Tests. Journal of Educational Measurement, 43(2), 145–168.
Feldt, L. S. (2002). Estimating the internal consistency reliability of tests composed of testlets varying in length. Applied Measurement in Education, 15(1), 33-48.
Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests [Introduction to mental test theory]. Berne: Huber.
Gessaroli, M. E., & Folske, J. C. (2002). Generalizing the Reliability of Tests Comprised of Testlets. International Journal of Testing, 2(3-4), 277-295.
Habing, B., & Roussos, L. A. (2003). On the need for negative local item dependence. Psychometrika, 68(3), 435-451.
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 65-110). Westport, CT: American Council on Education and Praeger.
Hambleton, R. K., & Murray, L. N. (1983). Some goodness of fit investigations for item response models. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 71-94). Vancouver, BC: Educational Research Institute of British Columbia.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Norwell, MA: Kluwer Academic Publishers.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory (Vol. 2). Newbury Park, CA: Sage Publications.
Hendrickson, A. (2007). An NCME instructional module on multistage testing. Educational Measurement: Issues and Practice, 26(2), 44-52.
Holland, P. W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55, 577-602.
Ip, E. H., Smits, D. J. M., & De Boeck, P. (2009). Locally dependent linear logistic test model with person covariates. Applied Psychological Measurement, 33(7), 555-569.
Jang, E. E., & Roussos, L. (2007). An investigation into the dimensionality of TOEFL using conditional covariance-based nonparametric approach. Journal of Educational Measurement, 44(1), 1-21.
Keller, L. A., Swaminathan, H., & Sireci, S. G. (2003). Evaluating Scoring Procedures for Context-Dependent Item Sets. Applied Measurement in Education, 16(3), 207-222.
Kim, J.K. & Nicewander, W. A. (1993). Ability estimation for conventional tests. Psychometrika, 58, 587-599.
Lee, G. M., & Frisbie, D. A. (1999). Estimating reliability under a generalizability theory model for test scores composed of testlets. Applied Measurement in Education, 12(3), 237-255.
Lee, G. M. (2000). A comparison of methods of estimating conditional standard errors of measurement for testlet-based test scores using simulation techniques. Journal of Educational Measurement, 37(2), 91-112.
Lee, G.M. (2000). Estimating conditional standard errors of measurement for tests composed of testlets. Applied Measurement in Education, 13(2), 161-180.
Lee, G. M. (2001). Comparison of dichotomous and polytomous item response models in equating scores from tests composed of testlets. Applied Psychological Measurement, 25(4), 357-372.
Lee, G. M., Dunbar, S. B., & Frisbie, D. A. (2001). The relative appropriateness of eight measurement models for analyzing scores from tests composed of testlets. Educational and Psychological Measurement, 61(6), 958-975.
Li, Y. M. (2005). A Test Characteristic Curve Linking Method for the Testlet Model. Applied Psychological Measurement, 29(5), 340-356.
Li, Y. M. (2006). A Comparison of Alternative Models for Testlets. Applied Psychological Measurement, 30(1), 3-21.
Loevinger, J. (1947). A systematic approach to the construction and evaluation of tests of ability. (Psychological Monographs 61, No.4). Richmond, VA: Psychometric Society.
Lord, F. M. (1952). A theory of test scores. Psychometric Monograph, No. 7.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Luecht, R., Brumfield, T., & Breithaupt, K. (2006). A Testlet Assembly Design for Adaptive Multistage Tests. Applied Measurement in Education, 19(3), 189-202.
Mair, P., & Hatzinger, R. (2007). Extended Rasch Modeling: The eRm Package for the Application of IRT Models in R. Journal of Statistical Software, 20(9), 1-20.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Meijer, R. R. (2004). Using Patterns of Summed Scores in Paper-and-Pencil Tests and Computer-Adaptive Tests to Detect Misfitting Item Score Patterns. Journal of Educational Measurement, 41(2), 119-136.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.
Pitt, M. A., Kim, W., & Myung, I. J. (2003). Flexibility versus Generalizability in Model Selection. Psychonomic Bulletin & Review, 10, 29-44.
Pomplun, M., & Ritchie, T. (2004). An Investigation of Context Effects for Item Randomization within Testlets. Journal of Educational Computing Research, 30(3), 243-254.
Ponocny, I. (2001). Nonparametric goodness-of-fit tests for the Rasch model. Psychometrika, 66(3), 437-460.
Puhan, G., Moses, T. P., Grant, M. C., & McHale, F. (2009). Small-sample equating using a single-group nearly equivalent test (SiGNET) design. Journal of Educational Measurement, 46(3), 344-362.
R Development Core Team (2006). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org
Rae, G. (2008). A note on using alpha and stratified alpha to estimate the reliability of a test composed of item parcels. British Journal of Mathematical and Statistical Psychology, 61(2), 515-525.
Rivera, C., & Stansfield, C. W. (2003). The effect of linguistic simplification of science test items on score comparability. Educational Assessment, 9(3-4), 79-105.
Rosenbaum, P. R. (1988). Item bundles. Psychometrika, 53, 349-359.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 17, 1-100.
Schmitt, N. (2002). Do reactions to tests produce changes in the construct measured? Multivariate Behavioral Research, 37(1), 105-126.
Sheehan, K. M., & Lewis, C. (1992). Computerized mastery testing with nonequivalent testlets. Applied Psychological Measurement, 16(1), 65-76.
Shen, L., & Yen, J. (1997). Item dependency in medical licensing examinations. Academic Medicine, 72(Suppl.), S19-S21.
Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237-247.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Investigating the effects of local dependence on the accuracy of IRT ability estimation. Technical Report Series two. American Institute of Certified Public Accountants.
Steinberg, L., & Thissen, D. (1996). Uses of item response theory and the testlet concept in the measurement of psychopathology. Psychological Methods, 1(1), 81-97.
Thissen, D., Billeaud, K., McLeod, L., & Nelson, L. (1997). A brief introduction to item response theory for items scored in more than two categories. Paper presented at the National Assessment Governing Board Achievement Levels Workshop, Boulder, CO.
Thissen, D., Steinberg, L., & Mooney, J. (1989). Trace lines for testlets: A use of multiple-categorical response models. Journal of Educational Measurement, 26, 247-260.
Thissen, D. (2008). Review of 'Testlet response theory and its applications.' Journal of Educational Measurement, 45(3), 305-308.
Tokar, D. M.; Fischer, A.R., Snell, A.F., Harik-Williams, N. (1999). Efficient assessment of the five-factor model of personality: Structural validity analyses of the NEO Five-Factor.
Tong, Y., & Kolen, M. J. (2007). Comparisons of methodologies and results in vertical scaling for educational achievement tests. Applied Measurement in Education, 20(2), 227-253.
Van den Wollenberg, A. L. (1982). Two new test statistics for the Rasch model. Psychometrika, 47, 123-140.
Vitacco, M. J. (2005). A Comparison of Factor Models on the PCL-R with Mentally Disordered Offenders: The Development of a Four-Factor Model. Criminal Justice and Behavior, 32(5), 526-545.
Wainer, H., & Lewis, C. (1990). Toward a Psychometrics for Testlets. Journal of Educational Measurement, 27(1), 1-14.
Wainer, H., Lewis, C., Kaplan, B., & Braswell, J. (1991). Building algebra testlets: A comparison of hierarchical and linear structures. Journal of Educational Measurement, 28(4), 311-323.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201.
Wainer, H., & Mislevy, R. J. (2000). Item response theory, item calibration, and proficiency estimation. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection. Journal of Educational Measurement, 28(3), 197-219.
Wainer, H. (1995). Precision and differential item functioning on a testlet-based test: The 1991 Law School Admissions Test as an example. Applied Measurement in Education, 8, 157-186.
Wainer, H., & Wang, X. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37, 203-220.
Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245-269). London: Kluwer.
Wang, W.-C., & Wilson, M. (2005a). Exploring local item dependence using a random-effects facet model. Applied Psychological Measurement, 29(4), 296–318.
Wang, W.-C., & Wilson, M. (2005b). The Rasch testlet model. Applied Psychological Measurement, 29(2), 126–149.
Wang, W.-C. (2005). Assessment of Differential Item Functioning in Testlet-Based Items Using the Rasch Testlet Model. Educational and Psychological Measurement, 65(4), 549-579.
Wang, X., Bradlow, E. T., & Wainer, H. (2002). A general Bayesian model for testlets: Theory and applications. Applied Psychological Measurement, 26, 109-128.
Wang, X. H. (2002). A general Bayesian model for testlets: Theory and applications. Applied Psychological Measurement, 26(1), 109-128.
Weaver, C. M., Meyer, R. G., Van Nort, J. J., & Tristan, L. (2006). Two-, Three-, and Four-Factor PCL-R Models in Applied Sex Offender Risk Assessments. Assessment, 13(2), 208-216.
Wilson, M., & Adams, R. J. (1995). Rasch Models for Item Bundles. Psychometrika, 60(2), 181-198.
Wu, M. L., Adams, R. J., & Wilson, M. R. (1998). ACER ConQuest: Generalized item response modeling software. Melbourne, VIC: Australian Council for Educational Research.
Yang, W. L., & Gao, R. (2008). Invariance of Score Linkings Across Gender Groups for Forms of a Testlet-Based College-Level Examination Program Examination. Applied Psychological Measurement, 32, 45
Yen, W. (1993). Scaling performance assessment: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-213.
Zenisky, A. L., Hambleton, R. K., & Sireci, S. G. (2002). Identification and Evaluation of Local Item Dependencies in the Medical College Admissions Test. Journal of Educational Measurement, 39(4), 291-309.
Zwick, R. (2002). Application of an empirical Bayes enhancement of Mantel-Haenszel differential item functioning analysis to a computerized adaptive test. Applied Psychological Measurement, 26(1), 57-77.
BIOGRAPHICAL SKETCH
Ou Zhang was born in Chengdu, China. He received his Bachelor of Science in
computer science from Chengdu University of Technology in 2001 and his Master of
Education in Educational Research, Measurement, and Evaluation from Boston College
in 2007. He received his Master of Arts in Education degree from the Research and
Evaluation Methodology program at the University of Florida in the fall of 2010. He is
currently enrolled in the Ph.D. program in Research and Evaluation Methodology at the
University of Florida.