Maximum Information per Unit Time in Adaptive Testing

Ying ("Alison") Cheng 1, John Behrens 2, Qi Diao 3
1 Lab of Educational and Psychological Measurement, Department of Psychology, University of Notre Dame
2 Center for Digital Data, Analytics, & Adaptive Learning, Pearson
3 CTB/McGraw-Hill
Test Efficiency

Weiss (1982): CAT can achieve the same measurement precision with half the number of items of linear tests when the maximum information (MI) method is used for item selection.

Maximum information method (Lord, 1980): choosing the item that yields the largest amount of information at the most recent ability estimate. In other words, maximum information per item.
Test Efficiency

All tests are timed. Maximum information given a time limit: choosing the item that yields the largest ratio of information to time required. In other words, maximum information per unit time (MIPUT) (Fan, Wang, Chang, & Douglas, 2013).
MI vs. MIPUT

MI: $i_{t+1} = \arg\max_{l \in R_t} I_l(\hat{\theta}_t)$,
where $R_t$ is the eligible set of items after $t$ items have been administered, and $I_l(\hat{\theta}_t)$ is the information of item $l$ evaluated at the current ability estimate $\hat{\theta}_t$.

MIPUT: $i_{t+1} = \arg\max_{l \in R_t} \dfrac{I_l(\hat{\theta}_t)}{E(T_l \mid \hat{\tau})}$,
where the denominator is the expected time required to finish item $l$ given the working speed of the examinee, $\hat{\tau}$.
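The two selection rules above can be sketched in Python. This is a minimal illustration, not the authors' code: the 2PL information function and the lognormal expected-time formula are carried over from later slides, and all function names and parameter values are hypothetical.

```python
import numpy as np

def info_2pl(a, b, theta):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def expected_time(beta, tau, alpha=1.0):
    """Expected response time under the lognormal model:
    exp(beta - tau + 1 / (2 alpha^2))."""
    return np.exp(beta - tau + 1.0 / (2.0 * alpha ** 2))

def select_mi(a, b, theta_hat, eligible):
    """MI: pick the eligible item with maximum information at theta_hat."""
    idx = list(eligible)
    return idx[int(np.argmax(info_2pl(a[idx], b[idx], theta_hat)))]

def select_miput(a, b, beta, tau_hat, theta_hat, eligible):
    """MIPUT: pick the eligible item maximizing information per expected second."""
    idx = list(eligible)
    ratio = info_2pl(a[idx], b[idx], theta_hat) / expected_time(beta[idx], tau_hat)
    return idx[int(np.argmax(ratio))]
```

MI ignores time and picks the most informative item; MIPUT trades information against expected time, so a moderately informative but fast item can win.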
Implementation of MIPUT

Log-normal model for response time (van der Linden, 2006):
$f(t_{ij};\, \tau_i, \alpha_j, \beta_j) = \dfrac{\alpha_j}{t_{ij}\sqrt{2\pi}} \exp\left\{ -\tfrac{1}{2}\left[ \alpha_j \left( \ln t_{ij} - (\beta_j - \tau_i) \right) \right]^2 \right\}$

So $\beta_j$ (time intensity) and $\tau_i$ (working speed) can be estimated from response-time data, and the expected time follows as
$E(T_j \mid \tau_i) = \exp\!\left( \beta_j - \tau_i + \tfrac{1}{2\alpha_j^2} \right)$.
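A quick simulation check of the expected-time formula, with made-up parameter values for illustration: under the lognormal model, $\ln T$ is normal with mean $\beta - \tau$ and standard deviation $1/\alpha$.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical time discrimination, time intensity, and working speed.
alpha, beta, tau = 1.5, 4.0, 0.5

# Under the lognormal model, ln T ~ Normal(beta - tau, 1/alpha^2).
t = rng.lognormal(mean=beta - tau, sigma=1.0 / alpha, size=200_000)

# The sample mean should approach E(T) = exp(beta - tau + 1/(2 alpha^2)).
expected = np.exp(beta - tau + 1.0 / (2.0 * alpha ** 2))
```

The agreement between the simulated mean and the closed-form expectation is what lets MIPUT plug an estimated $\hat{\tau}$ into the denominator.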
Performance of MIPUT

Fan et al. (2013) showed that, compared to the MI method, MIPUT leads to:
i) shorter testing time;
ii) a small loss of measurement precision;
iii) visibly worse item pool usage.

Fan et al. (2013) used a-stratification (Chang & Ying, 1999) with the MIPUT method to balance item pool usage and found it effective.
a-stratification

Item information under the 2PL: $I_j(\theta) = a_j^2 P_j(\theta)\,[1 - P_j(\theta)]$.
Items with a high discrimination parameter $a$ are over-used under the MI method.
a-stratification restricts item selection to low-$a$ items early in the test, and high-$a$ items later.
Apparently, high-$a$ items are still over-used under MIPUT, which is why a-stratification helps balance item usage under MIPUT.
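A minimal sketch of a-stratified selection: the pool is split into strata of increasing discrimination, and within the current stratum the item whose difficulty best matches the ability estimate is chosen (the equal-sized strata and b-matching rule follow Chang & Ying, 1999; the function names are mine).

```python
import numpy as np

def a_stratify(a, n_strata):
    """Split item indices into strata of increasing discrimination."""
    order = np.argsort(a)
    return np.array_split(order, n_strata)

def select_astrat(strata, b, theta_hat, stage, used):
    """Within the current stratum, pick the unused item whose difficulty
    is closest to the current ability estimate."""
    candidates = [i for i in strata[stage] if i not in used]
    return min(candidates, key=lambda i: abs(b[i] - theta_hat))
```

Early stages draw from the low-a stratum, saving the high-a items for later in the test when the ability estimate is more stable.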
Questions that Remain

Fan et al. (2013) simulated items in which:
- item difficulty and time intensity are either correlated or uncorrelated;
- item discrimination and difficulty are uncorrelated;
- item discrimination and time intensity are uncorrelated.

In reality, item discrimination and difficulty are positively correlated (~.4-.6) (Chang, Qian, & Ying, 2001).

Q1: What about item discrimination and time intensity?
Follow-Up Questions

Q2: If item discrimination and time intensity are indeed related:
- Will MIPUT still lead to worse item pool usage than MI?
- If so, is that still due to highly discriminating items or to highly time-saving items?

Q3: Under the 1PL model, where the item discrimination parameter is not a factor:
- Will MIPUT still lead to worse item pool usage than MI?
- If so, is that due to highly time-saving items?
- If so, how can we control item exposure?
Q1: Item Discrimination and Time Intensity

Calibration of a large item bank:
- Online math testing data
- 595 items
- Over 2 million entries of testing data
- 3PL and 2PL models; the following analysis focuses on the 2PL
- Time intensity measured by the log-transformed average time on each item
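The time-intensity measure can be computed as below. Whether "log-transformed average time" means the log of the mean time or the mean of the log times is ambiguous in the slide, so this sketch takes the log of the average observed time per item; the function name is mine.

```python
import math

def time_intensity(times):
    """Time intensity of an item: log of the average observed response time
    (one reading of 'log-transformed average time')."""
    return math.log(sum(times) / len(times))
```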
Q1: Item Discrimination and Time Intensity

Correlations among estimated item parameters and time intensity (** denotes a statistically significant correlation):

                 2PL_a     2PL_b     3PL_a     3PL_b     3PL_c     Time Intensity
2PL_a            1         .111**    .702**    .009      -.584**   .139
2PL_b            .111**    1         .387**    .935**    -.369**   .562
3PL_a            .387**    .702**    1         .350**    -.363**   .205
3PL_b            .935**    .009      .350**    1         -.226**   .564
3PL_c            -.369**   -.584**   -.363**   -.226**   1         -.0425
Time Intensity   .522**    .080      .153**    .522**    -.355**   1
Q2

So item discrimination and time intensity are indeed related. Then:
- Will MIPUT still lead to worse item pool usage than MI?
- If so, is that still due to highly discriminating items or to highly time-saving items?
A Simplified Version of MIPUT

$i_{t+1} = \arg\max_{l \in R_t} \dfrac{I_l(\hat{\theta}_t)}{\bar{t}_l}$,
where the denominator $\bar{t}_l$ is the average time required to finish item $l$ and is not individualized. This may be more robust against violations of the model assumptions.
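The simplified rule swaps the individualized expected time for the item's observed average time; a sketch under the same illustrative conventions as before (names hypothetical):

```python
import numpy as np

def select_miput_simplified(info, avg_time, eligible):
    """Simplified MIPUT: maximize information divided by the item's average
    observed time, ignoring the individual examinee's working speed."""
    idx = list(eligible)
    ratio = np.asarray(info)[idx] / np.asarray(avg_time)[idx]
    return idx[int(np.argmax(ratio))]
```

Because it only needs per-item average times, this version requires no response-time model calibration at all.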
Simulation Details

- Test length: 20 or 40
- First item randomly chosen from the pool
- 5,000 test takers ~ N(0,1)
- Ability update: EAP with a N(0,1) prior
- No exposure control or content balancing unless specified otherwise
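The EAP update used in the simulation can be sketched via quadrature over an ability grid, with the N(0,1) prior stated above; a 2PL likelihood and all names here are illustrative, not the authors' implementation.

```python
import numpy as np

def eap_update(a, b, responses, grid=np.linspace(-4, 4, 81)):
    """EAP ability estimate with a N(0,1) prior, by numerical integration
    over a grid of quadrature points (2PL likelihood)."""
    prior = np.exp(-0.5 * grid ** 2)          # N(0,1) density up to a constant
    like = np.ones_like(grid)
    for aj, bj, x in zip(a, b, responses):
        p = 1.0 / (1.0 + np.exp(-aj * (grid - bj)))
        like *= p if x == 1 else (1.0 - p)
    post = prior * like
    return float(np.sum(grid * post) / np.sum(post))  # posterior mean
```

With no responses the estimate is the prior mean (0); each observed response tilts the posterior toward or away from the item's difficulty.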
Q2

                      20-Item               40-Item
                      MI_2PL    MIPUT_2PL   MI_2PL    MIPUT_2PL
Bias                  .002      .003        .001      .002
MSE                   .019      .020        .012      .012
Correlation (θ̂, θ)    .991      .990        .994      .994
Chi-square            95.29     96.41       81.00     84.10
No exposure           73.1%     71.9%       50.9%     52.1%
Underexposed (<.02)   77.8%     78.2%       57.0%     58.2%
Overexposed (>.20)    4.37%     5.04%       11.6%     11.8%
Average time (mins)   38.596    34.434      79.346    70.112
Min testing time      17.857    16.374      37.039    34.383
Max testing time      84.035    79.969      79.346    146.773
Findings

- On average, MIPUT leads to shorter tests than MI: by about 4 minutes (10%) when test length is 20, and by about 9 minutes (11%) when test length is 40.
- MIPUT leads to slightly worse exposure control.
- When item discrimination and time intensity are positively related, MIPUT's disadvantage in exposure control becomes less conspicuous.
- MI and MIPUT show a negligible difference in measurement precision.
- Over-exposure is still largely attributable to highly discriminating items.
Q3

Under the 1PL model, where the item discrimination parameter is not a factor:
- Will MIPUT still lead to worse item pool usage than MI?
- If so, is that due to highly time-saving items?
- If so, how can we control item exposure?
Test Length = 20

                      MI_1PL    MIPUT_1PL   MIPUTR5_1PL   MIPUTPR_1PL
Bias                  -.001     .001        .007          -.004
MSE                   .074      .078        .081          .075
Correlation (θ̂, θ)    .963      .962        .960          .963
Chi-square            23.31     152.67      25.59         90.21
No exposure           26.1%     78.3%       36.1%         43.2%
Underexposed (<.02)   59.2%     82.5%       56.0%         75.8%
Overexposed (>.20)    0         5.38%       0             4.03%
Average time (mins)   38.909    17.594      26.643        20.023
Min testing time      17.909    11.547      16.079        12.388
Max testing time      94.407    53.557      73.127        63.044
Findings if Test Length = 20

MI vs. MIPUT:
- Negligible difference in measurement precision.
- MIPUT reduces testing time by about 21 minutes for a 20-item test (a 55% reduction).
- But MIPUT leads to much worse exposure control: items that are highly time-saving are favored.
  - Correlation between exposure rate and time intensity under MI-1PL: -.240 (an artifact of the item bank).
  - Correlation between exposure rate and time intensity under MIPUT-1PL: -.398.
Exposure Control

- a-stratification is not going to work (under the 1PL there is no discrimination parameter to stratify on).
- Randomesque (Kingsbury & Zara, 1989): randomly choose one out of the n best items, e.g., n = 5 (MIPUT-R5).
- Progressive Restricted (Revuelta & Ponsoda, 1998): a weighted index of a random number and the time-adjusted item information, with the weight determined by the stage of the test; higher weight is given to the time-adjusted item information later in the test (MIPUT-PR).
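Both strategies can be sketched as follows. This shows the plain progressive weighting; the "restricted" variant additionally caps item exposure rates, which is omitted here, and all function names and parameters are illustrative.

```python
import random
import numpy as np

def select_randomesque(ratio, eligible, n=5, rng=random):
    """Randomesque (Kingsbury & Zara, 1989): choose at random among the n
    eligible items with the highest time-adjusted information."""
    best = sorted(eligible, key=lambda i: ratio[i], reverse=True)[:n]
    return rng.choice(best)

def select_progressive(ratio, eligible, stage, test_length, rng=None):
    """Progressive weighting (after Revuelta & Ponsoda, 1998): a random
    component plus the time-adjusted information, with the information
    weight growing as the test progresses."""
    rng = np.random.default_rng() if rng is None else rng
    w = stage / test_length                  # information weight grows with stage
    idx = list(eligible)
    r = np.array([ratio[i] for i in idx])
    score = (1 - w) * rng.uniform(0, r.max(), size=len(idx)) + w * r
    return idx[int(np.argmax(score))]
```

Early in the test selection is nearly random, spreading exposure across the pool; by the final items the rule converges to pure MIPUT.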
Findings if Test Length = 20

MIPUT_R5:
- Maintains measurement precision.
- Much better exposure control.
- Reduces testing time by about 12 minutes on average (>30% reduction).

MIPUT_PR:
- Maintains measurement precision.
- Better exposure control, but still not quite as good.
- Reduces testing time by about 18 minutes on average (a reduction of almost half).
Test Length = 40

                      MI_1PL    MIPUT_1PL   MIPUTR5_1PL   MIPUTPR_1PL
Bias                  -.003     -.001       .003          .004
MSE                   .038      .040        .045          .040
Correlation (θ̂, θ)    .981      .981        .978          .981
Chi-square            30.75     135.37      17.73         99.31
No exposure           3.4%      60.7%       7.7%          15.6%
Underexposed (<.02)   25.7%     66.9%       18.3%         59.2%
Overexposed (>.20)    3.9%      13.4%       0             12.3%
Average time (mins)   77.660    41.192      67.903        45.223
Min testing time      39.451    27.300      48.718        29.181
Max testing time      162.889   132.986     136.585       137.489
Findings if Test Length = 40

The same findings are replicated when the test length doubles: MIPUT leads to much worse item pool usage because of its overreliance on time-saving items.

MIPUT_R5:
- Maintains measurement precision.
- Much better exposure control.
- Reduces testing time by 13% on average.

MIPUT_PR:
- Maintains measurement precision.
- Better exposure control, but still not quite as good.
- Reduces testing time by 41% on average.
Overall Summary

- MIPUT's time-saving advantage is more conspicuous under the 1PL.
- MIPUT leads to much worse item pool usage than MI, relying heavily on time-saving items.
- MIPUT_R5 is a promising method: it maintains measurement precision, balances item pool usage, and still keeps the time-saving advantage.
Future Directions

- Develop an exposure control method under MIPUT parallel to a-stratification: stratifying by time intensity.
- Investigate the performance of the simplified MIPUT and the original MIPUT in the presence of violations of the log-normal model for response time.
- More data analysis to explore the relationship between time intensity and item parameters.
- Control total testing time (van der Linden & Xiong, 2013).
Thank You!

Supported by a CTB/McGraw-Hill 2014 R&D Grant.
For questions or the paper, please visit irtnd.wikispaces.com