
Page 1

2- and 3-Parameter IRT Models

Ricardo Primi
USF, 2019
Programa de Pós-Graduação Stricto Sensu em Psicologia

Page 2

Chapter 10

Educational Measurement for Applied Researchers

Margaret Wu · Hak Ping Tam · Tsung-Hau Jen

Theory into Practice

Page 3

The 2-parameter model: b_i and a_i

Chapter 10: Two-Parameter IRT Models

Introduction

The Rasch model is sometimes also called the one-parameter IRT model, in that the probability of success as a function of the ability θ has only one parameter (the item difficulty parameter) estimated for each item, as shown in Eq. (10.1).

$$p = P(X = 1) = \frac{\exp(\theta - \delta)}{1 + \exp(\theta - \delta)} \qquad (10.1)$$

The Rasch model assumes that items all have the same discrimination, in that the item characteristic curves are parallel, as shown in Chap. 7. In contrast, a discrimination parameter can be incorporated into Eq. (10.1), thereby extending it to a more general mathematical model, as seen in Eq. (10.2).

$$p = P(X = 1) = \frac{\exp(a(\theta - \delta))}{1 + \exp(a(\theta - \delta))} \qquad (10.2)$$

In Eq. (10.2), the parameter a is called the discrimination parameter (or slope parameter), in addition to the item difficulty parameter δ. This model is usually known as the 2PL model (2-parameter logistic model). The parameter a is a scale factor of the ability scale, since it multiplies (θ − δ), where θ and δ are in logit units on the ability scale. Such a multiplying factor has the effect of stretching or shrinking the ability scale, in the same way as one imagines using the Windows re-size tool (⇔) to change the horizontal scale of a picture, as illustrated in Fig. 10.1. Two items with the same item difficulty (δ = 0.8) but different discrimination parameters, a = 0.6 and a = 1.7 respectively, are shown in the left and right graphs of Fig. 10.1.
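To make Fig. 10.1 concrete, here is a minimal sketch in R (R is the language of the course materials linked later in this deck; the parameter values are the ones quoted above) that computes and plots the two ICCs:

```r
# 2PL probability of success, Eq. (10.2)
p_2pl <- function(theta, a, d) plogis(a * (theta - d))

theta <- seq(-4, 4, by = 0.1)               # grid of abilities (logits)
p_flat  <- p_2pl(theta, a = 0.6, d = 0.8)   # left graph of Fig. 10.1
p_steep <- p_2pl(theta, a = 1.7, d = 0.8)   # right graph of Fig. 10.1

# Both curves pass through P = 0.5 at theta = 0.8;
# the a = 1.7 curve is much steeper around that point.
plot(theta, p_steep, type = "l", xlab = "theta", ylab = "P(X = 1)")
lines(theta, p_flat, lty = 2)
```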

It can be seen from Fig. 10.1 that the larger the value of a, the steeper the ICC, and the more discriminating an item is. In contrast, a very flat ICC indicates that low and high ability students have similar chances of obtaining the correct answer, so the item is not very discriminating.


To make the left-side curve in Fig. 10.1 steeper, we need to shrink the scale. To make the right-side curve flatter, the graph needs to be stretched horizontally. In this way, it can be seen that highly discriminating items can separate students more than low discrimination items can.

Discrimination Parameter as Score of an Item

In Chap. 9, partial credit item scoring is discussed. In particular, the maximum score of a partial credit item is a weight of the item in the whole test, and this maximum score should be set in relation to the item discrimination (not item difficulty). Recall that in Chap. 9, the probability of scoring a 2 on a 3-category (0, 1, 2) item is

$$p_2 = \Pr(X = 2) = \frac{\exp(2\theta - (\delta_1 + \delta_2))}{1 + \exp(\theta - \delta_1) + \exp(2\theta - (\delta_1 + \delta_2))} \qquad (10.3)$$

In Eq. (10.3), it can be seen that the maximum score of a partial credit item (in this case, 2) relates to the multiplier of θ (2θ in the numerator of Eq. (10.3)), similar to the a parameter in the 2PL model. In this sense, the a parameter of the 2PL model can be regarded as the score of an item. One difference between the 2PL model and the partial credit model is that item scores are estimated from the item response data in 2PL, and not set by the test writer as for the partial credit model.

More generally, item "scores" or item weights in 2PL are estimated for every item, including dichotomous and partial credit items. The following is an example of the differences between the Rasch model and the 2PL model for a set of dichotomously scored items.
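As a toy illustration of "slopes as item scores" (the response vector and slope values below are invented for the example): under the Rasch model every correct answer adds 1 to the score, while a 2PL-style score adds the item's estimated a.

```r
x <- c(1, 0, 1, 1, 0)            # one student's responses to five items (invented)
a <- c(0.4, 1.1, 1.6, 0.9, 1.3)  # estimated 2PL slopes for the same items (invented)

sum(x)       # Rasch-style raw score: each item weighted 1  -> 3
sum(a * x)   # 2PL-style weighted score: 0.4 + 1.6 + 0.9    -> 2.9
```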

Fig. 10.1 2PL ICC with a = 0.6 (left graph) and a = 1.7 (right graph)


Page 4

The a parameter as the maximum item score, and its relation to the PCM


Page 5

Example analysis with the 1- and 2-parameter models: the Rasch model

An Example Analysis of Dichotomous Items Using Rasch and 2PL Models

The data set of this example contains item responses of 2987 students to a mathematics test with 13 multiple-choice items. First, a Rasch analysis is carried out. In this analysis, each item is scored 1 for a correct answer and 0 for an incorrect answer. Table 10.1 shows the item statistics for the 13 items from the Rasch and CTT analyses.

Table 10.1 Rasch and CTT item statistics of a set of dichotomous items

Item   Difficulty   Infit MS   Infit t   Pt-bis corr
M_1    −1.43        1.03        1.24     0.45
M_2     0.80        1.30       13.66     0.32
M_3    −0.44        0.96       −2.37     0.58
M_4    −0.02        0.90       −5.66     0.62
M_5    −0.73        0.88       −6.58     0.62
M_6     0.28        0.92       −4.34     0.61
M_7    −0.28        0.99       −0.31     0.55
M_8     0.13        0.97       −1.70     0.57
M_9     0.46        0.99       −0.49     0.56
M_10   −0.18        0.94       −3.61     0.59
M_11    0.19        1.00       −0.16     0.56
M_12    0.53        0.96       −1.97     0.57
M_13    0.67        1.20        9.72     0.38

It can be seen from Table 10.1 that at least a few items do not fit the Rasch model well. For example, item 2 and item 13 have large fit mean squares and lower discrimination indices. In contrast, items 4, 5, and 6 "over-fit" the model, with high discrimination indices. Figure 10.2 shows the ICCs of item 2 and item 5 as two example items.

Fig. 10.2 ICC of two example items showing "under-fit" and "over-fit"

Figure 10.2 shows that an "under-fit" item corresponds to low discrimination, or a flatter observed ICC, and an "over-fit" item corresponds to high discrimination, or a steeper observed ICC. As discussed in Chap. 8, the residual-based fit statistics reflect the slope of the observed ICC against the theoretical ICC.

Item 2 and item 5 are shown in Fig. 10.3 to offer some suggestions as to why one item is not very discriminating while the other one is.

Item 2: Mike reached Sydney on 13th June in the morning and left on 4th August in the night. For how many days did Mike stay in Sydney?
(1) 53 days  (2) 52 days  (3) 51 days  (4) 50 days

Item 5: Solve the following: 7895 − 5704 = ______
(1) 1191  (2) 2191  (3) 2101  (4) 1101

Fig. 10.3 Item 2 and item 5 in the example test

The first observation is that item 2 is a much more difficult item than item 5 (see the item difficulty estimates in Table 10.1). Item 2 is a word problem, requiring students to know the number of days in June and July. Further, it is unclear whether the two end days should both be counted. That is, if a person arrives on the 1st of June and leaves on the 2nd, it is unclear whether this is counted as one day or two days. In this test, the correct answer is 53 days, which includes both end days. In contrast, item 5 is a computation item. It requires students to know subtraction procedures without borrowing. There is a clear correct answer. It is not a difficult item, and yet it is more discriminating than item 2. That is, the computation item can separate low and high ability students better than the "number of days in the calendar" item. One may also conjecture that knowing the number of days in June and July may not be directly related to a student's general mathematics ability.

Such a post hoc analysis of items can suggest reasons for item difficulty and item discrimination. However, prior to the administration of test items, it may be difficult to guesstimate item discrimination. Test writers can often gauge the item difficulty from an analysis of the cognitive load of an item or from the curriculum progression of an item's topic. But item discrimination is not so easy to predict. In general, constructed response items have higher discrimination than multiple-choice items (irrespective of item difficulty) because of the chance of guessing in multiple-choice items.

We further note that the item difficulty parameter and the item slope parameter are two different and unrelated statistics.

Page 6

Example analysis with the 1- and 2-parameter models: the 2-parameter model

Unlike the CTT item discrimination/point-biserial correlation statistics, which relate to item difficulty, the IRT δ and a parameters are "unrelated" in the sense that an easy or difficult item can have high or low values of a. Under IRT, the notions of item difficulty and item discrimination are two distinct concepts. We discuss this distinction further in the latter part of this chapter.

2PL Analysis

A 2PL analysis is carried out on the same data set. Table 10.2 shows estimated item difficulties, slope parameters and fit indices.

Figure 10.4 shows the ICCs of item 2 and item 5 using the 2PL model to fit the item responses.

Comparing the 2PL model with the Rasch model, a number of observations can be made. First, the item characteristic curves in Fig. 10.4 show that the theoretical (or modelled) expected score curves can have different slopes across different items. As a result, the theoretical curves from a 2PL analysis fit the observed curves better than for the Rasch model. Checking the fit statistics in Table 10.2, it can be seen that all items show good fit. In fact, since the residual-based fit statistics detect departure of the slope of the observed ICC from the expected ICC, these fit statistics will necessarily show good fit when the theoretical ICC can have varying slopes to match the observed ICC. Consequently, residual-based fit statistics are not useful for checking item fit for 2PL models.
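An analysis along the lines of Tables 10.1 and 10.2 could be run as follows. This is a sketch using the R package mirt rather than whatever software the book's authors used, and it assumes resp is the 2987 × 13 matrix of 0/1 scored responses:

```r
library(mirt)

# resp: persons x items matrix of 0/1 responses (assumed available)
rasch <- mirt(resp, model = 1, itemtype = "Rasch")
twopl <- mirt(resp, model = 1, itemtype = "2PL")

coef(rasch, IRTpars = TRUE, simplify = TRUE)$items  # difficulties (slopes fixed)
coef(twopl, IRTpars = TRUE, simplify = TRUE)$items  # slopes a and difficulties b

# Residual-based infit: informative for the Rasch fit, but close to 1
# by construction under 2PL, as argued in the text above.
itemfit(rasch, fit_stats = "infit")
```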

To further demonstrate the relationship between residual-based fit statistics and item discrimination, the Rasch infit mean squares are plotted against the 2PL slope parameters across all items. Figure 10.5 shows this plot.

Figure 10.5 shows that the larger the residual-based fit statistic, the lower the 2PL slope parameter. In other words, when an item "under-fits" the Rasch model, the item is not as discriminating as the Rasch model expects, and the 2PL model assigns a lower weight (score) to the item. Essentially, what the 2PL model does is to estimate weights for the items according to the discriminating power of the items. The "worse" (i.e., the less discriminating) an item is, the smaller the item weight. In the case of the example, item 5 has a much larger weight (score) than item 2 (see Table 10.2). This makes a great deal of sense. If we believe that item 2 has an ambiguous correct answer, lowering the weight of this item is one way to provide "fairer" scoring. In summary, while each item under the Rasch model contributes equally towards the total test score, items under the 2PL model contribute differently according to the discriminating power of the items.

Table 10.2 2PL item statistics of a set of dichotomous items

Item   Difficulty   Slope parameter   Infit MS   Infit t
M_1    −1.33        1.05              0.99       −0.28
M_2     0.64        0.43              1.00       −0.09
M_3    −0.50        1.61              1.02        0.91
M_4    −0.04        1.94              1.00       −0.11
M_5    −0.95        2.07              0.99       −0.23
M_6     0.32        1.74              1.00       −0.11
M_7    −0.28        1.28              1.00        0.24
M_8     0.13        1.44              0.98       −1.02
M_9     0.47        1.34              1.01        0.50
M_10   −0.20        1.51              1.00        0.16
M_11    0.19        1.33              1.01        0.42
M_12    0.56        1.49              1.00       −0.19
M_13    0.55        0.58              1.00        0.17

A plot of the CTT discrimination index against the 2PL slope parameter for each item shows again that the slope parameter is a measure of item discrimination. See Fig. 10.6.

Fig. 10.4 ICC of item 2 and item 5 using 2PL model

Fig. 10.5 Rasch fit statistics plotted against 2PL slope parameter


Page 7

To further clarify the relationship between the Rasch model and the 2PL model, two more plots are shown. The first is a plot of the item difficulty estimates obtained from the Rasch model and the 2PL model. See Fig. 10.7.

Figure 10.7 shows that the item difficulty parameters obtained from the Rasch model and the 2PL model correlate well.

Figure 10.8 shows a plot of item difficulty against slope parameter under the 2PL model.

Figure 10.8 shows that there is no discernible relationship between item difficulty and the discrimination parameter. This observation reinforces the recommendation in Chap. 9 that the maximum score of an item should not be dependent on the item difficulty. In fact, Fig. 10.8 shows that for two difficult items, their slope parameters are low and hence their weights (scores) are lowered.

Fig. 10.6 CTT point-biserial correlation plotted against 2PL slope parameter

Fig. 10.7 Rasch item difficulty plotted against 2PL item difficulty


Page 8

Finally, to illustrate the difference between the Rasch model and the 2PL model, a set of theoretical ICCs are plotted in the same graph (i.e., overlaid) to show the parallel and non-parallel nature of the curves for the Rasch and 2PL models respectively.

Figure 10.9 shows that the Rasch model fits parallel theoretical ICCs irrespective of the slopes of the observed ICCs, while the 2PL model fits theoretical ICCs with different slopes to match those of the observed data.

A Note on the Constraints of Estimated Parameters

In the discussion of the Rasch model in Chap. 7, the indeterminacies of the location and scale of the latent trait measures are explained. For the 2PL model, similar issues of indeterminacy apply. In particular, the scale factor of the latent trait

Fig. 10.8 2PL item difficulty against 2PL slope parameter

Fig. 10.9 Overlay of theoretical ICCs for the Rasch model (left graph) and the 2PL model (right graph)


Page 9

Relativity of the scale metric (a and the standard deviation); constraints for identification

• mean of the a parameters = 1
• standard deviation of the latent distribution = 1
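These constraints resolve a genuine indeterminacy: multiplying all slopes by a constant while dividing the ability scale (and the difficulties) by the same constant leaves every 2PL probability unchanged. A quick base-R check of this claim, with invented values:

```r
p_2pl <- function(theta, a, d) plogis(a * (theta - d))

theta <- 1.2; a <- 0.8; d <- -0.5; k <- 2.5
p_2pl(theta, a, d)              # original parameterisation
p_2pl(theta / k, a * k, d / k)  # rescaled parameterisation: same probability
```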

Page 10


A Generalized Partial Credit Model: Application of an EM Algorithm
Eiji Muraki, Educational Testing Service

The partial credit model (PCM) with a varying slope parameter is developed and called the generalized partial credit model (GPCM). The item step parameter of this model is decomposed to a location and a threshold parameter, following Andrich's (1978) rating scale formulation. The EM algorithm for estimating the model parameters is derived. The performance of this generalized model is compared on both simulated and real data to a Rasch family of polytomous item response models. Simulated data were generated and then analyzed by the various polytomous item response models. The results demonstrate that the rating formulation of the GPCM is quite adaptable to the analysis of polytomous item responses. The real data used in this study consisted of the National Assessment of Educational Progress (Johnson & Allen, 1992) mathematics data that used both dichotomous and polytomous items. The PCM was applied to these data using both constant and varying slope parameters. The GPCM, which provides for varying slope parameters, yielded better fit to the data than did the PCM. Index terms: item response model, National Assessment of Educational Progress, nominal response model, partial credit model, polytomous response model, rating scale model.

If responses to a test item are classified into two categories, dichotomous item response models can be applied. When responses to an item have more than two categories, a polytomous item response model is appropriate for the analysis of the responses. If the options on a rating scale are successively ordered, applicable models include the graded response model (GRM) (Samejima, 1969) and its rating scale version (Muraki, 1990a), or the partial credit model (PCM) (Masters, 1982) and its rating scale version (Andrich, 1978). For a test item in which the response options are not necessarily ordered, Bock (1972) proposed the nominal response model (NRM). The dichotomous item response model can be thought of as a special case of the polytomous item response model in which the number of categories is two.

The Partial Credit Model

A Rasch Family of Polytomous Item Response Models

Although the Rasch (1960) dichotomous model was developed independently of the latent trait models of Birnbaum (1968) and Lord (1980), the basic difference between the Rasch and the other models is the introduction of the assumption about the discriminating power of test items. These models share the following common form:

which expresses the probability of person i, whose ability is parameterized by the latent trait θ, correctly responding to an item j (U_j = 1). The parameter b_j usually refers to item difficulty. If the


Page 11

Choosing Between the Rasch Model and 2PL Model

In Chaps. 6 and 7, the desirable measurement properties of the Rasch model have been presented. So why would one choose the 2PL model over the Rasch model? Here are some possible reasons. When the Rasch model is used, the good properties of the model can only be realised if the data fit the model. Given a particular data set, such as the example provided in this chapter, mis-fitting items do not get "fixed" by running a Rasch analysis. That is, the properties of the Rasch model are not attained, since the observed ICCs are not "parallel" even though the theoretical ICCs are (forced to be). Under such circumstances, one may decide to choose a model that fits the data better, and make the best use of the information available, since choosing a mis-fitting model will not provide the properties of the model. Rasch models are useful for the construction of an instrument if there are possibilities of modifying and deleting items. If a test has already been administered and the data do not fit the Rasch model, there is no gain in using the Rasch model. In fact, there are some gains in using a model that fits the data.

In real life, no item response data will fit a theoretical model perfectly, since the models are mathematical functions. The more parameters a model has, the more likely it is that the data will fit the model. There is no real data set that will fit the Rasch model perfectly (nor a 2PL model, for that matter), so we need to make an assessment of how good a fit is good enough. From a practical point of view, it probably makes little difference whether the Rasch model or the 2PL model is fitted if we have quality items in a test. If the data fit the Rasch model well, then a 2PL model fitted to the data set will also have similar slopes across items. From a theoretical point of view, though, the good properties of measurement should be upheld at least as a goal to achieve when instruments are constructed. In practice, there need not be a clear demarcation when it comes to choosing IRT models, but a good understanding of the implications of each model and of model fit is important.

2PL Models for Partial Credit Items

An extension of the 2PL to partial credit items is the generalised partial credit model (GPCM) (Muraki 1992). As for the dichotomous case, a discrimination parameter is added to the partial credit model presented in Chap. 9. Eq. (10.3) shows the GPCM.

$$\Pr(X_{ni} = x) = \frac{\exp \sum_{k=0}^{x} a_i(\theta_n - \delta_{ik})}{\sum_{h=0}^{m_i} \exp \sum_{k=0}^{h} a_i(\theta_n - \delta_{ik})} \qquad (10.3)$$

Dropping the index i for item number for simplicity, a 3-category partial credit item has the following probabilities (see Eqs. (10.4)–(10.6)).


$$p_0 = \Pr(X = 0) = \frac{1}{1 + \exp(a(\theta - \delta_1)) + \exp(a(2\theta - \delta_1 - \delta_2))} \qquad (10.4)$$

$$p_1 = \Pr(X = 1) = \frac{\exp(a(\theta - \delta_1))}{1 + \exp(a(\theta - \delta_1)) + \exp(a(2\theta - \delta_1 - \delta_2))} \qquad (10.5)$$

$$p_2 = \Pr(X = 2) = \frac{\exp(a(2\theta - \delta_1 - \delta_2))}{1 + \exp(a(\theta - \delta_1)) + \exp(a(2\theta - \delta_1 - \delta_2))} \qquad (10.6)$$
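Equations (10.4)–(10.6) are easy to verify numerically; a minimal R sketch for one 3-category item (the parameter values are made up):

```r
# GPCM category probabilities for a 3-category (0, 1, 2) item, Eqs. (10.4)-(10.6)
gpcm3 <- function(theta, a, d1, d2) {
  num <- c(1,                              # category 0
           exp(a * (theta - d1)),          # category 1
           exp(a * (2 * theta - d1 - d2))) # category 2
  num / sum(num)
}

p <- gpcm3(theta = 0.5, a = 1.2, d1 = -0.3, d2 = 0.8)
p        # p0, p1, p2
sum(p)   # the three probabilities sum to 1
```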

An Example Data Set

As an example, the data set analysed in Chap. 9 is re-run using the generalised partial credit model. Table 10.5 shows estimates of the slope parameters, a, and the item difficulty parameters for the first 10 items. The slope parameters have been transformed (scaled up) so the maximum score on the test is 68, matching that of the Rasch model.

Under the generalised partial credit model, a slope (or discrimination) parameter is estimated for every item, regardless of the number of response categories within the item. In the example given in Table 10.5, the value of the slope parameter varies across items in relation to the discriminating power of the item. As an example to illustrate the differences between the 2PL (GPCM) and the Rasch model (PCM), Fig. 10.10 shows a comparison of the ICCs between the two models for item 23 in the data set.

The slope parameter for item 23 is 0.49, indicating that the weight estimated for this item under the GPCM is smaller than that assigned by the PCM. That is, the item is not as discriminating as the PCM assumes. As with the 2PL dichotomous items, a weight is applied to scale up or down the contribution of a PCM item to the total score. Within the item, though, the weights of the categories are still in integer multiples, i.e., 1, 2, 3, etc.

Table 10.5 Slope parameter and item difficulty parameters

Item   a (transformed)   d1      d2      d3      d4
1      0.76              −1.58
2      1.04              −1.93
3      0.95               0.70    0.73
4      0.70               0.24    1.03    0.58   −1.88
5      0.55               2.07   −2.76
6      1.14              −1.56
7      0.94              −0.74
8      1.30              −1.88
9      1.04               0.47
10     0.86               0.28    0.42


Page 12

For example, for item 23, the weight (or score) of category 1 is 0.49 (the slope parameter), and the weights of categories 2 and 3 are 2 × 0.49 and 3 × 0.49, respectively. That is, the coefficient of θ in Eqs. (10.3)–(10.6) is ak, where a is the slope parameter and k is the item category number.

For this data set, the test reliability has increased a little, from 0.82 for the Rasch model (PCM) to 0.83 for the 2PL (GPCM) model. Using the 2PL will often increase the test reliability a little, as more weight is assigned to more discriminating items.

A More Generalised Partial Credit Model

For the GPCM, one discrimination parameter is estimated for each item. That is, in Eq. (10.3), the slope parameter a_i has a subscript i for item i. However, if the subscript of the slope parameter is ik, then a_ik denotes a slope parameter for category k of item i.

$$\Pr(X_{ni} = x) = \frac{\exp \sum_{k=0}^{x} a_{ik}(\theta_n - \delta_{ik})}{\sum_{h=0}^{m_i} \exp \sum_{k=0}^{h} a_{ik}(\theta_n - \delta_{ik})} \qquad (10.7)$$

In this case, there is a weight (or score) assigned to each score category of an item, and not just one weight for the whole item. This is a more general model than the GPCM, since different categories within an item can have different weights. Such models have been implemented in the TAM (Kiefer et al. 2013) and ConQuest (Wu et al. 2007) software programs. In this book, we will call this model KPCM, for category-level PCM. This model is similar in idea to Bock's nominal response model (Bock 1972).

As an illustration, we use the data set in Chap. 9 and re-analyse it using the KPCM. The corresponding results for the first 10 items are shown in Table 10.6.

Fig. 10.10 Item I23 ICC–PCM model (left graph) and GPCM (right graph)



Page 14

Table 10.6 shows the slope parameters at the item category level. As discussed, the parameters are regarded as weights, or scores, for the item categories. To put these into perspective, we compare the scores of item 4 under the PCM, GPCM and KPCM models. This item has 5 response categories: 0, 1, 2, 3, 4. Under the partial credit model, the scores are assigned and not estimated, so the scores for the five categories are 0, 1, 2, 3, 4. Under the generalised partial credit model, the scores are (see Table 10.5) 0, 0.70, 1.40 (0.70 × 2), 2.10 (0.70 × 3), 2.80 (0.70 × 4) for the five categories. Under the KPCM, the scores are estimated separately for the item categories, and these are (0), 1.12, 3.10, 2.64, 2.95.

In Chap. 9, it is shown that item 4 in the data set shows mis-fit under the PCM and that the item is not as discriminating as the Rasch model expects using the assigned category scores. Consequently, collapsing categories and lowering the maximum score lead to a better fit of the item. In this chapter, the estimated scores under the GPCM and KPCM suggest that the maximum score should be lower. In particular, under the KPCM, the scores for categories 2, 3 and 4 are similar. If the PCM is still the preferred model, then at least the KPCM can suggest how the categories can be collapsed. In contrast, the GPCM lowers the maximum score of the item, but keeps the relative weights at the category levels in integer multiples (i.e., 0, 1, 2, 3, 4).

Since the KPCM has more parameters, the model will necessarily provide a better fit to the data. Figure 10.11 shows the expected score curve for item 4 under the KPCM.

Using KPCM, the reliability has increased slightly to 0.835.

A Note About Item Difficulty and Item Discrimination

Occasionally, there is confusion between the concepts of item difficulty and item discrimination. In particular, such confusion may arise when item-person maps are interpreted. Figure 10.12 shows an item-person map for two hypothetical partial credit items.

The item thresholds are plotted for two partial credit items, where 1.1 refers to item 1, step 1, and 1.2 refers to item 1, step 2, etc. For this example, it does not

Table 10.6 Discrimination parameters at item category level under KPCM

Item i   a_i1   a_i2   a_i3   a_i4
1        0.76
2        0.98
3        0.58   2.18
4        1.12   3.10   2.64   2.95
5        0.84   1.08
6        1.13
7        0.91
8        1.24
9        1.02
10       0.93   1.71


Page 15

How are IRT scores calculated? Item scoring weights

Avaliação Psicológica, 2018, 17(4), pp. 473-483

ABSTRACT
This paper studied the application of an extended version of Item Response Theory, the 4-parameter model (4PL), to the item analysis of the Human Figure Drawing (HFD) test. The HFD scores drawing details as indicators of cognitive development. This model incorporates an upper asymptote parameter (parameter d), admitting the possibility that children with high ability have a probability lower than 1 of drawing a certain HFD detail (item). This is often observed in HFD. We ran the IRT model three times, using the 1-, 2- and 4-parameter models, and compared their model fit indexes. The latent trait correlations estimated by these three models were very high (r = 0.98), suggesting that children's abilities did not change substantially when using the 4-parameter model. A limitation is pointed out concerning the correct way of modeling test item dimensionality, considering that there is a hierarchical structure among the items.
Keywords: four-parameter Item Response Theory; optimal scoring; psychometrics; intelligence assessment.


Using Four-Parameter Item Response Theory to Model Human Figure Drawings

Ricardo Primi1

Universidade São Francisco, Campus Swift, Campinas-SP, Brasil

Tatiana de Cassia Nakano, Solange Muglia Wechsler
Pontifícia Universidade Católica de Campinas, Campinas-SP, Brasil

1 Correspondence address: Universidade São Francisco. Rua Waldemar César da Silveira, 105, 13045-510, Campinas, SP. E-mail: [email protected]

ARTICLE – DOI: http://dx.doi.org/10.15689/ap.2018.1704.7.07

Ever since the 19th century, research has revealed empirical interest in children's drawing abilities. In 1926, Goodenough developed a test to assess children's intellectual development from human figure drawings (HFD). The method is based on findings that general developmental stages exist, and that they impact drawings performed by children. Since then this method has been widely employed (Cronin, Gross, & Hayne, 2017). At the age of four, for example, authors argue that while some children can only make scribbles, others produce drawings in the form of tadpoles, while others elaborate human figures with head, torso and four separated

Page 16

Reprinted with permission. JOURNAL OF EDUCATIONAL MEASUREMENT, Volume 14, No. 2, Summer 1977

RB-77-3

PRACTICAL APPLICATIONS OF ITEM CHARACTERISTIC CURVE THEORY*
Frederic M. Lord, Educational Testing Service

Much of classical test theory deals with an entire test; this theory is applicable even if the test is not composed of items. If a test consists of separate items and if the test score is a (possibly weighted) sum of item scores, then statistics describing the test scores of a certain group of examinees can be expressed algebraically in terms of statistics describing the individual item scores for the same group of examinees. Insofar as it relates to tests, classical item theory (this is only a part of classical test theory) consists of such algebraic tautologies. Such a theory makes no assumptions about matters that are beyond the control of the psychometrician. This is actuarial science. It cannot predict how individuals will respond to items unless the items have previously been administered to similar individuals.

In practical test development work, we often need to predict the statistical properties of a test composed of items for a target group of examinees that is somewhat different from the groups to which the separate items have previously been administered. We need to be able to describe the items by using item parameters, and the examinees by using examinee parameters, in such a way that we can predict probabilistically the response of any examinee to any item, even if similar examinees have never taken similar items before.

This involves making predictions about things beyond the control of the psychometrician: about how people will behave in the real world. Thus some assumption must be made (and verified) as to how a person's ability or skill determines his performance on items measuring that ability or skill. Item response theory (item characteristic curve theory) provides these assumptions.

ESTIMATING THE STATISTICAL CHARACTERISTICS OF A TEST FOR ANY SPECIFIED GROUP

How can item response theory be applied? It is sometimes asserted that item response theory allows us to answer any question that we are entitled to ask about the characteristics predicted for a test composed of items with known item parameters. The significance of this vague statement arises from the fact that item response theory provides us with the frequency distribution f of test score for examinees having a specified level θ of ability or skill.

Except where otherwise noted, consideration here will be limited to the number-right score, denoted by x. Suppose the n items in a test all had identical item response curves P ≡ P(θ) (item characteristic curves; see Hambleton & Cook (1977) for a detailed definition). The distribution of x for a person at ability level θ would then be the familiar binomial distribution $f(x \mid \theta) = \binom{n}{x} P^x Q^{n-x}$, where $Q \equiv 1 - P$.
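For concreteness, the binomial score distribution Lord describes is one line of R (n and P below are illustrative values, not from the paper):

```r
# f(x | theta) = choose(n, x) * P^x * Q^(n - x), with Q = 1 - P
n <- 10; P <- 0.7                     # 10 identical items, P(theta) = 0.7 (assumed)
f <- dbinom(0:n, size = n, prob = P)  # distribution of the number-right score x
round(f, 3)
```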

*This paper includes portions of the writer's 1976 presidential address to Division 5 of APA, titled "Applications of item response theory to practical testing problems." Also portions of the writer's talk, titled "Test theory in the public interest," at the 1976 ETS Invitational Conference.


• Optimum scoring weights


Curve 8 shows the efficiency of a hypothetical "peaked" test composed of items that differ from the actual test items only in difficulty, b_i. For curve 8, all the b_i are equal. The actual SAT Verbal test differs from the peaked test because the actual test contains many easy items (for the benefit of the low ability students) and many hard items (for the high ability students).

OPTIMAL SCORING WEIGHTS FOR ITEMS

If items are scored 0 or 1, as assumed throughout this paper unless otherwise specified, and if test score y is a weighted sum of item scores, then the score information function given by Equation (8) has the form given by Hambleton and Cook (1977, Equation (1)). If optimal weights (Hambleton & Cook, Equation (4)) are used, then the score information function is found to be identical with the maximal information function I(θ, θ̂), given here as Equation (9). This means that an optimally weighted composite score is asymptotically as good a measure of ability as is the maximum likelihood estimator θ̂.

The optimal item-scoring weight is a function of examinee ability. The functions w_j(θ) for the five items of Hambleton's Figure 2 are shown in Figure 3 (reproduced from Lord, 1968, by permission of Educational and Psychological Measurement). The main point to note here is that the items that should be heavily weighted for measuring high-ability examinees are not necessarily the items that should be heavily weighted for measuring low-ability examinees.

For measuring high-ability examinees under the three-parameter logistic model, the optimal scoring weight for most items (for items that are easy for such examinees) is proportional to the item discriminating power a_i. When measuring low-ability examinees, difficult items should receive near-zero scoring weight, regardless of their a_i parameter (see the discussion of curve 7, Figure 2 in the preceding section).

Item-scoring weights that are optimal for a particular examinee can never be determined exactly, since we do not know the examinee's ability θ exactly. For the logistic model the optimal weight is

$$w_i(\theta) = \frac{D a_i}{1 - c_i} \cdot \frac{P_i(\theta) - c_i}{P_i(\theta)} \qquad (10)$$

A crude procedure for obtaining item-scoring weights is to substitute the conventional item difficulty p_i (the proportion of correct answers in the total group of examinees) for P_i(θ) in Equation (10). A crude procedure would use the resulting weight for scoring item i on all answer sheets regardless of examinee ability level. Since D = 1.7 is a constant, we can drop it and use the weight

$$w_i = \frac{a_i}{1 - c_i} \cdot \frac{p_i - c_i}{p_i} \qquad (11)$$

This same item-scoring weight, except for the a_i, was recommended on other grounds by Chernoff (see Lord & Novick, 1968, p. 310). If there is no guessing, c_i = 0 and the item-scoring weight is equal to a_i, the item discriminating power. If there is guessing, Equation (11) gives low scoring weight to difficult items. The justification is that a correct response to a difficult item may indicate lucky guessing rather than high ability.

For D = 1:
1PL: w_i = 1
2PL: w_i = a_i
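Equation (11) with D dropped is a one-liner in R; the sketch below uses invented a, c and p values and reproduces the two limiting cases just noted (no guessing gives w = a; a hard multiple-choice item with guessing gets a small weight):

```r
# Lord's crude item-scoring weight, Eq. (11): w = a / (1 - c) * (p - c) / p
lord_weight <- function(a, c, p) a / (1 - c) * (p - c) / p

lord_weight(a = 1.2, c = 0.00, p = 0.60)  # no guessing: weight = a = 1.2
lord_weight(a = 1.2, c = 0.20, p = 0.25)  # hard item with guessing: weight = 0.3
```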

Page 17

http://www.labape.com.br/rprimi/R/3pl_ENEM.html

Page 18

…transformations are carried out. To get an idea of the overall discriminating power of a test instrument, the test reliability index is a better indicator.

A Note on the Parameterisation of Item Difficulty Parameters Under the 2PL Model

In Eq. (10.2), the numerator, exp(a(θ − δ)), expresses the slope parameter as a multiplier of (θ − δ). The argument of the exponential function can also be expanded as (aθ − aδ). In some software packages, the item difficulty parameter reported for the 2PL is aδ rather than δ. Check the software documentation for the parameterisation, as different parameterisations can lead to different interpretations of the item difficulty parameters.
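As one concrete case of this warning, the R package mirt reports a slope a and an intercept d with the logit written as aθ + d, so the classical difficulty must be recovered as b = −d/a. A sketch of the conversion (the numbers are invented):

```r
# intercept form:  logit = a * theta + d   (as reported by some software)
# classical form:  logit = a * (theta - b)
a <- 1.4; d <- -0.7
b <- -d / a                                    # classical difficulty: b = 0.5
c(plogis(a * 0.5 + d), plogis(a * (0.5 - b)))  # same probability either way
```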

Impact of Different Item Weights on Ability Estimates

Under the Rasch model, raw scores on a test are "sufficient statistics" for ability estimates (see Chap. 7). That is, students with the same raw score on a test will have the same ability estimate, irrespective of which items they answered correctly, since all items have the same weight in the test. Under the 2PL model, students with the same raw score may not necessarily have the same ability estimate; it will depend on the particular set of items a student answered correctly. If the items answered correctly have larger weights, then the ability estimate will be higher. This makes sense in that, if an item does not discriminate among students (say, the responses are random guesses for that item), then obtaining the correct answer on this item does not indicate a more able student. So this item "counts" less towards the ability estimate. Table 10.4 shows weighted likelihood ability estimates (WLE) for selected students from the example data set.

The ability estimates from 2PL models are likely to be closer to students' "true" abilities than Rasch ability estimates are, since the estimation takes into account the amount of "information" provided by each item. However, providing different ability estimates for the same raw score may pose a problem for examination officials who may need to explain to the layperson how ability estimates are derived. In providing such explanations, it is inevitable to acknowledge that items are of varying "quality" in the test. For high-stakes examinations, this issue needs to be considered.

Table 10.4 Ability estimates for selected students with a raw score of 10 out of 13

Student id   Rasch WLE ability   2PL WLE ability
15           1.20                0.88
17           1.20                1.16
18           1.20                1.26
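Estimates like those in Table 10.4 can be reproduced along these lines; a sketch with the R package mirt (not the book's software), again assuming resp holds the 0/1 responses:

```r
library(mirt)

rasch <- mirt(resp, model = 1, itemtype = "Rasch")
twopl <- mirt(resp, model = 1, itemtype = "2PL")

wle <- data.frame(raw   = rowSums(resp),
                  rasch = fscores(rasch, method = "WLE")[, 1],
                  twopl = fscores(twopl, method = "WLE")[, 1])

# Same raw score -> one Rasch WLE, but possibly different 2PL WLEs
subset(wle, raw == 10)
```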


Page 20

The 3-parameter model

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{e^{D a_i(\theta - b_i)}}{1 + e^{D a_i(\theta - b_i)}}$$
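In code, the 3PL above is a one-line extension of the 2PL. A minimal base-R sketch (D = 1.7 is the usual scaling constant; the other values are illustrative):

```r
# 3PL: P(theta) = c + (1 - c) * exp(D a (theta - b)) / (1 + exp(D a (theta - b)))
p_3pl <- function(theta, a, b, c, D = 1.7) {
  c + (1 - c) * plogis(D * a * (theta - b))
}

p_3pl(theta = -3, a = 1.2, b = 0.5, c = 0.2)  # low ability: P is close to c = 0.2
p_3pl(theta =  3, a = 1.2, b = 0.5, c = 0.2)  # high ability: P is close to 1
```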

Page 21

Items ordered from easy (left) to hard (right):

Student A: 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0   Score = 5/15
Student B: 0 0 0 0 0 0 0 0 1 0 1 1 1 1 0   Score = 5/15

Will the students get the same score? Who will get the higher score? Why?

ENEM 2015, Mathematics test (blue and yellow forms): 549,253 examinees (9.8% of all participants). It is a diverse sample, with people from 5,029 municipalities in 28 Brazilian states.
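The question on this slide can be explored with a small simulation. Under a 3PL likelihood, the student who gets the easy half right earns a higher ability estimate than the one who gets the same number of hard items right, whose pattern looks like guessing. All item parameters below are invented for the illustration; this is not the ENEM scoring algorithm:

```r
p_3pl <- function(theta, a, b, g) g + (1 - g) * plogis(a * (theta - b))

b <- seq(-2, 2, length.out = 15)       # invented difficulties, easy -> hard
a <- rep(1.5, 15)                      # invented common slope
g <- rep(0.2, 15)                      # invented guessing parameter

loglik <- function(theta, x) {
  p <- p_3pl(theta, a, b, g)
  sum(x * log(p) + (1 - x) * log(1 - p))
}

x_A <- c(1,1,1,1,0,1,0,0,0,0,0,0,0,0,0)  # student A: 5 easy items correct
x_B <- c(0,0,0,0,0,0,0,0,1,0,1,1,1,1,0)  # student B: 5 hard items correct

optimize(loglik, c(-4, 4), x = x_A, maximum = TRUE)$maximum  # moderate theta-hat
optimize(loglik, c(-4, 4), x = x_B, maximum = TRUE)$maximum  # very low theta-hat:
# B's pattern is best explained as guessing on the hard items.
```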

Page 25

suj | mt43 mt45 mt3 mt6 mt9 mt8 mt19 mt38 mt37 mt34 mt42 mt13 mt29 mt26 mt39 mt15 mt24 mt28 mt16 mt27 mt25 mt18 mt21 mt2 mt5 mt44 mt33 mt12 mt36 mt7 mt1 mt31 mt17 mt32 mt20 mt4 mt11 mt35 mt14 mt23 mt10 mt22 mt41 mt40 mt30 | mt_scores acrt_dif NU_NOTA_MT

b   | -0.36 -0.26 -0.04 0.75 0.79 0.83 0.91 1.13 1.22 1.24 1.38 1.52 1.61 1.70 1.71 1.72 1.75 1.78 1.80 1.85 1.88 1.92 1.94 1.99 2.02 2.08 2.13 2.15 2.19 2.29 2.33 2.36 2.42 2.43 2.53 2.53 2.62 2.62 3.14 3.82 4.19 4.42 4.53 5.40 6.92 |
iD  | 0.52 0.52 0.56 0.34 0.33 0.36 0.38 0.34 0.39 0.26 0.32 0.26 0.20 0.33 0.20 0.33 0.16 0.26 0.37 0.28 0.17 0.21 0.24 0.19 0.26 0.20 0.20 0.23 0.24 0.21 0.22 0.12 0.29 0.21 0.14 0.28 0.19 0.21 0.11 0.11 0.12 0.24 0.17 0.21 0.14 |
1   | 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 10 3% 562
2   | 1 1 1 1 1 1 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 10 3% 577
3   | 1 1 1 1 1 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 10 3% 587
4   | 1 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 | 10 3% 570
5   | 1 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 10 3% 575
6   | 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 | 10 3% 578
7   | 1 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 | 10 3% 583
8   | 1 1 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 10 3% 578
9   | 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 | 10 3% 565
10  | 1 0 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 10 3% 571
11  | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 1 | 10 29% 322
12  | 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 | 10 29% 337
13  | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 | 10 29% 331
14  | 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 | 10 29% 338
15  | 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 | 10 29% 325
16  | 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 | 10 29% 337
17  | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 | 10 29% 330
18  | 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 | 10 29% 325
19  | 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 | 10 29% 336
20  | 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 | 10 29% 335


Page 27

Entenda a sua Nota no Enem (Guia do Participante)


NOTE: When we say that a participant got a question right "by guessing", this does not mean that his or her score will decrease; rather, that answer does not carry as much value as if the participant had answered the items with the expected pedagogical coherence. So it is always better to answer a question than to leave it blank: a correct answer always increases the score, and a question left blank is marked as wrong.

[Figure: proficiency scale from 200 to 800, with easy items at the bottom and difficult items at the top; Participant A is placed at score 480 and Participant B at score 310.]

The expected pedagogical coherence is that the participant gets right the questions that are below his or her proficiency level. If Participant B's proficiency were high, the probability of getting the easy items right would be high. However, he got the easy items wrong, so his proficiency should not be high.

Pedagogical coherence .... -> essential unidimensionality

... if the participant gets right a question from a level above, one that draws on knowledge from a level below, we can infer that there is no coherence in the responses and, ..... it is understood .. that they were answered correctly "by guessing". This control is provided by the guessing parameter.

What is the evidence that the test is essentially unidimensional?

And what if group factors exist?

Imagine items on algebra, geometry, combinatorics and probability.

Page 28

Exercise 4: Applying the 2- and 3-parameter models to the ENEM
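A possible skeleton for this exercise, as a hedged sketch with the R package mirt (the course's own R walkthrough is at the URL on page 17, so treat this only as one way to start; enem is assumed to be a matrix of 0/1 scored ENEM mathematics responses):

```r
library(mirt)

mod2 <- mirt(enem, model = 1, itemtype = "2PL")
mod3 <- mirt(enem, model = 1, itemtype = "3PL")

coef(mod3, IRTpars = TRUE, simplify = TRUE)$items  # a, b and g (pseudo-guessing)
anova(mod2, mod3)   # does adding the guessing parameter improve fit?
```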