Maximum Information per Unit Time in Adaptive Testing

Ying ("Alison") Cheng 1, John Behrens 2, Qi Diao 3
1 Lab of Educational and Psychological Measurement, Department of Psychology, University of Notre Dame
2 Center for Digital Data, Analytics, & Adaptive Learning, Pearson
3 CTB/McGraw-Hill
Test Efficiency

Weiss (1982): CAT can achieve the same measurement precision with half the number of items of linear tests when the maximum information (MI) method is used for item selection.

Maximum information method (Lord, 1980): choosing the item that yields the largest amount of information at the most recent ability estimate. In other words, maximum information per item.
Test Efficiency

All tests are timed. Maximum information given a time limit: choosing the item that yields the largest ratio of information to time required. In other words, maximum information per unit time (MIPUT) (Fan, Wang, Chang, & Douglas, 2013).
MI vs. MIPUT

MI: $i_{t+1} = \arg\max_{l \in R_t} I_l(\hat{\theta}_t)$,
where $R_t$ is the eligible set of items after $t$ items have been administered, and $I_l(\hat{\theta}_t)$ is the information of item $l$ evaluated at the current ability estimate $\hat{\theta}_t$.

MIPUT: $i_{t+1} = \arg\max_{l \in R_t} \dfrac{I_l(\hat{\theta}_t)}{E(T_l \mid \hat{\tau})}$,
where the denominator is the expected time required to finish item $l$ given the working speed of the examinee, $\hat{\tau}$.
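The two selection rules above can be sketched in Python. This is a minimal illustration, not the authors' code: the 2PL information function and the lognormal expected-time formula are carried over from later slides, and all function names and parameter values are hypothetical.

```python
import numpy as np

def info_2pl(a, b, theta):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def expected_time(beta, tau, alpha=1.0):
    """Expected response time under the lognormal model:
    exp(beta - tau + 1 / (2 alpha^2))."""
    return np.exp(beta - tau + 1.0 / (2.0 * alpha ** 2))

def select_mi(a, b, theta_hat, eligible):
    """MI: pick the eligible item with maximum information at theta_hat."""
    idx = list(eligible)
    return idx[int(np.argmax(info_2pl(a[idx], b[idx], theta_hat)))]

def select_miput(a, b, beta, tau_hat, theta_hat, eligible):
    """MIPUT: pick the eligible item maximizing information per expected second."""
    idx = list(eligible)
    ratio = info_2pl(a[idx], b[idx], theta_hat) / expected_time(beta[idx], tau_hat)
    return idx[int(np.argmax(ratio))]
```

MI ignores time and picks the most informative item; MIPUT trades information against expected time, so a moderately informative but fast item can win.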
Implementation of MIPUT

Log-normal model for response time (van der Linden, 2006):
$f(t_{ij};\, \tau_i, \alpha_j, \beta_j) = \dfrac{\alpha_j}{t_{ij}\sqrt{2\pi}} \exp\left\{ -\tfrac{1}{2}\left[ \alpha_j \left( \ln t_{ij} - (\beta_j - \tau_i) \right) \right]^2 \right\}$

So $\beta_j$ (time intensity) and $\tau_i$ (working speed) can be estimated from response-time data, and the expected time follows as
$E(T_j \mid \tau_i) = \exp\!\left( \beta_j - \tau_i + \tfrac{1}{2\alpha_j^2} \right)$.
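A quick simulation check of the expected-time formula, with made-up parameter values for illustration: under the lognormal model, $\ln T$ is normal with mean $\beta - \tau$ and standard deviation $1/\alpha$.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical time discrimination, time intensity, and working speed.
alpha, beta, tau = 1.5, 4.0, 0.5

# Under the lognormal model, ln T ~ Normal(beta - tau, 1/alpha^2).
t = rng.lognormal(mean=beta - tau, sigma=1.0 / alpha, size=200_000)

# The sample mean should approach E(T) = exp(beta - tau + 1/(2 alpha^2)).
expected = np.exp(beta - tau + 1.0 / (2.0 * alpha ** 2))
```

The agreement between the simulated mean and the closed-form expectation is what lets MIPUT plug an estimated $\hat{\tau}$ into the denominator.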
Performance of MIPUT

Fan et al. (2013) showed that, compared to the MI method, MIPUT leads to:
i) shorter testing time;
ii) a small loss of measurement precision;
iii) visibly worse item pool usage.

Fan et al. (2013) used a-stratification (Chang & Ying, 1999) with the MIPUT method to balance item pool usage and found it effective.
a-stratification

Item information under the 2PL: $I_j(\theta) = a_j^2 P_j(\theta)\,[1 - P_j(\theta)]$.
Items with a high discrimination parameter $a$ are over-used under the MI method.
a-stratification restricts item selection to low-$a$ items early in the test, and high-$a$ items later.
Apparently, high-$a$ items are still over-used under MIPUT, which is why a-stratification helps balance item usage under MIPUT.
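A minimal sketch of a-stratified selection: the pool is split into strata of increasing discrimination, and within the current stratum the item whose difficulty best matches the ability estimate is chosen (the equal-sized strata and b-matching rule follow Chang & Ying, 1999; the function names are mine).

```python
import numpy as np

def a_stratify(a, n_strata):
    """Split item indices into strata of increasing discrimination."""
    order = np.argsort(a)
    return np.array_split(order, n_strata)

def select_astrat(strata, b, theta_hat, stage, used):
    """Within the current stratum, pick the unused item whose difficulty
    is closest to the current ability estimate."""
    candidates = [i for i in strata[stage] if i not in used]
    return min(candidates, key=lambda i: abs(b[i] - theta_hat))
```

Early stages draw from the low-a stratum, saving the high-a items for later in the test when the ability estimate is more stable.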
Questions that Remain

Fan et al. (2013) simulated items in which:
- item difficulty and time intensity are either correlated or uncorrelated;
- item discrimination and difficulty are uncorrelated;
- item discrimination and time intensity are uncorrelated.

In reality, item discrimination and difficulty are positively correlated (~.4-.6) (Chang, Qian, & Ying, 2001).

Q1: What about item discrimination and time intensity?
Follow-Up Questions

Q2: If item discrimination and time intensity are indeed related:
- Will MIPUT still lead to worse item pool usage than MI?
- If so, is that still due to highly discriminating items or to highly time-saving items?

Q3: Under the 1PL model, where the item discrimination parameter is not a factor:
- Will MIPUT still lead to worse item pool usage than MI?
- If so, is that due to highly time-saving items?
- If so, how can we control item exposure?
Q1: Item Discrimination and Time Intensity

Calibration of a large item bank:
- Online math testing data
- 595 items
- Over 2 million entries of testing data
- 3PL and 2PL models; the following analysis focuses on the 2PL
- Time intensity measured by the log-transformed average time on each item
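The time-intensity measure can be computed as below. Whether "log-transformed average time" means the log of the mean time or the mean of the log times is ambiguous in the slide, so this sketch takes the log of the average observed time per item; the function name is mine.

```python
import math

def time_intensity(times):
    """Time intensity of an item: log of the average observed response time
    (one reading of 'log-transformed average time')."""
    return math.log(sum(times) / len(times))
```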
Q1: Item Discrimination and Time Intensity

Correlations among estimated item parameters and time intensity (** denotes a statistically significant correlation):

                 2PL_a     2PL_b     3PL_a     3PL_b     3PL_c     Time Intensity
2PL_a            1         .111**    .702**    .009      -.584**   .139
2PL_b            .111**    1         .387**    .935**    -.369**   .562
3PL_a            .387**    .702**    1         .350**    -.363**   .205
3PL_b            .935**    .009      .350**    1         -.226**   .564
3PL_c            -.369**   -.584**   -.363**   -.226**   1         -.0425
Time Intensity   .522**    .080      .153**    .522**    -.355**   1
Q2

So item discrimination and time intensity are indeed related. Then:
- Will MIPUT still lead to worse item pool usage than MI?
- If so, is that still due to highly discriminating items or to highly time-saving items?
A Simplified Version of MIPUT

$i_{t+1} = \arg\max_{l \in R_t} \dfrac{I_l(\hat{\theta}_t)}{\bar{t}_l}$,
where the denominator $\bar{t}_l$ is the average time required to finish item $l$ and is not individualized. This may be more robust against violations of the model assumptions.
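The simplified rule swaps the individualized expected time for the item's observed average time; a sketch under the same illustrative conventions as before (names hypothetical):

```python
import numpy as np

def select_miput_simplified(info, avg_time, eligible):
    """Simplified MIPUT: maximize information divided by the item's average
    observed time, ignoring the individual examinee's working speed."""
    idx = list(eligible)
    ratio = np.asarray(info)[idx] / np.asarray(avg_time)[idx]
    return idx[int(np.argmax(ratio))]
```

Because it only needs per-item average times, this version requires no response-time model calibration at all.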
Simulation Details

- Test length: 20 or 40
- First item randomly chosen from the pool
- 5,000 test takers ~ N(0,1)
- Ability update: EAP with a N(0,1) prior
- No exposure control or content balancing unless specified otherwise
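The EAP update used in the simulation can be sketched via quadrature over an ability grid, with the N(0,1) prior stated above; a 2PL likelihood and all names here are illustrative, not the authors' implementation.

```python
import numpy as np

def eap_update(a, b, responses, grid=np.linspace(-4, 4, 81)):
    """EAP ability estimate with a N(0,1) prior, by numerical integration
    over a grid of quadrature points (2PL likelihood)."""
    prior = np.exp(-0.5 * grid ** 2)          # N(0,1) density up to a constant
    like = np.ones_like(grid)
    for aj, bj, x in zip(a, b, responses):
        p = 1.0 / (1.0 + np.exp(-aj * (grid - bj)))
        like *= p if x == 1 else (1.0 - p)
    post = prior * like
    return float(np.sum(grid * post) / np.sum(post))  # posterior mean
```

With no responses the estimate is the prior mean (0); each observed response tilts the posterior toward or away from the item's difficulty.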
Q2

                      20-Item               40-Item
                      MI_2PL    MIPUT_2PL   MI_2PL    MIPUT_2PL
Bias                  .002      .003        .001      .002
MSE                   .019      .020        .012      .012
Correlation (θ̂, θ)    .991      .990        .994      .994
Chi-square            95.29     96.41       81.00     84.10
No exposure           73.1%     71.9%       50.9%     52.1%
Underexposed (<.02)   77.8%     78.2%       57.0%     58.2%
Overexposed (>.20)    4.37%     5.04%       11.6%     11.8%
Average time (mins)   38.596    34.434      79.346    70.112
Min testing time      17.857    16.374      37.039    34.383
Max testing time      84.035    79.969      79.346    146.773
Findings

- On average, MIPUT leads to shorter tests than MI: by about 4 minutes (10%) when test length is 20, and by about 9 minutes (11%) when test length is 40.
- MIPUT leads to slightly worse exposure control.
- When item discrimination and time intensity are positively related, MIPUT's disadvantage in exposure control becomes less conspicuous.
- MI and MIPUT show a negligible difference in measurement precision.
- Over-exposure is still largely attributable to highly discriminating items.
Q3

Under the 1PL model, where the item discrimination parameter is not a factor:
- Will MIPUT still lead to worse item pool usage than MI?
- If so, is that due to highly time-saving items?
- If so, how can we control item exposure?
Test Length = 20

                      MI_1PL    MIPUT_1PL   MIPUTR5_1PL   MIPUTPR_1PL
Bias                  -.001     .001        .007          -.004
MSE                   .074      .078        .081          .075
Correlation (θ̂, θ)    .963      .962        .960          .963
Chi-square            23.31     152.67      25.59         90.21
No exposure           26.1%     78.3%       36.1%         43.2%
Underexposed (<.02)   59.2%     82.5%       56.0%         75.8%
Overexposed (>.20)    0         5.38%       0             4.03%
Average time (mins)   38.909    17.594      26.643        20.023
Min testing time      17.909    11.547      16.079        12.388
Max testing time      94.407    53.557      73.127        63.044
Findings if Test Length = 20

MI vs. MIPUT:
- Negligible difference in measurement precision.
- MIPUT reduces testing time by about 21 minutes for a 20-item test (a 55% reduction).
- But MIPUT leads to much worse exposure control: items that are highly time-saving are favored.
  - Correlation between exposure rate and time intensity under MI-1PL: -.240 (an artifact of the item bank).
  - Correlation between exposure rate and time intensity under MIPUT-1PL: -.398.
Exposure Control

- a-stratification is not going to work (under the 1PL there is no discrimination parameter to stratify on).
- Randomesque (Kingsbury & Zara, 1989): randomly choose one out of the n best items, e.g., n = 5 (MIPUT-R5).
- Progressive Restricted (Revuelta & Ponsoda, 1998): a weighted index of a random number and the time-adjusted item information, with the weight determined by the stage of the test; higher weight is given to the time-adjusted item information later in the test (MIPUT-PR).
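Both strategies can be sketched as follows. This shows the plain progressive weighting; the "restricted" variant additionally caps item exposure rates, which is omitted here, and all function names and parameters are illustrative.

```python
import random
import numpy as np

def select_randomesque(ratio, eligible, n=5, rng=random):
    """Randomesque (Kingsbury & Zara, 1989): choose at random among the n
    eligible items with the highest time-adjusted information."""
    best = sorted(eligible, key=lambda i: ratio[i], reverse=True)[:n]
    return rng.choice(best)

def select_progressive(ratio, eligible, stage, test_length, rng=None):
    """Progressive weighting (after Revuelta & Ponsoda, 1998): a random
    component plus the time-adjusted information, with the information
    weight growing as the test progresses."""
    rng = np.random.default_rng() if rng is None else rng
    w = stage / test_length                  # information weight grows with stage
    idx = list(eligible)
    r = np.array([ratio[i] for i in idx])
    score = (1 - w) * rng.uniform(0, r.max(), size=len(idx)) + w * r
    return idx[int(np.argmax(score))]
```

Early in the test selection is nearly random, spreading exposure across the pool; by the final items the rule converges to pure MIPUT.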
Findings if Test Length = 20

MIPUT_R5:
- Maintains measurement precision.
- Much better exposure control.
- Reduces testing time by about 12 minutes on average (>30% reduction).

MIPUT_PR:
- Maintains measurement precision.
- Better exposure control, but still not quite as good.
- Reduces testing time by about 18 minutes on average (a reduction of almost half).
Test Length = 40

                      MI_1PL    MIPUT_1PL   MIPUTR5_1PL   MIPUTPR_1PL
Bias                  -.003     -.001       .003          .004
MSE                   .038      .040        .045          .040
Correlation (θ̂, θ)    .981      .981        .978          .981
Chi-square            30.75     135.37      17.73         99.31
No exposure           3.4%      60.7%       7.7%          15.6%
Underexposed (<.02)   25.7%     66.9%       18.3%         59.2%
Overexposed (>.20)    3.9%      13.4%       0             12.3%
Average time (mins)   77.660    41.192      67.903        45.223
Min testing time      39.451    27.300      48.718        29.181
Max testing time      162.889   132.986     136.585       137.489
Findings if Test Length = 40

The same findings are replicated when the test length doubles: MIPUT leads to much worse item pool usage because of its overreliance on time-saving items.

MIPUT_R5:
- Maintains measurement precision.
- Much better exposure control.
- Reduces testing time by 13% on average.

MIPUT_PR:
- Maintains measurement precision.
- Better exposure control, but still not quite as good.
- Reduces testing time by 41% on average.
Overall Summary

- MIPUT's time-saving advantage is more conspicuous under the 1PL.
- MIPUT leads to much worse item pool usage than MI, relying heavily on time-saving items.
- MIPUT_R5 is a promising method: it maintains measurement precision, balances item pool usage, and still keeps the time-saving advantage.
Future Directions

- Develop an exposure control method under MIPUT parallel to a-stratification: stratifying by time intensity.
- Investigate the performance of the simplified MIPUT and the original MIPUT in the presence of violations of the log-normal model for response time.
- More data analysis to explore the relationship between time intensity and item parameters.
- Control total testing time (van der Linden & Xiong, 2013).
Thank You!

Supported by a CTB/McGraw-Hill 2014 R&D Grant.
For questions or the paper, please visit irtnd.wikispaces.com