Multiple Perspectives on CAT for K-12 Assessments: Possibilities and Realities Alan Nicewander Pacific Metrics 1

Upload: cornelius-miller

Post on 04-Jan-2016


Page 1

Multiple Perspectives on CAT for K-12 Assessments:

Possibilities and Realities

Alan Nicewander

Pacific Metrics

Page 2

• The following are some putative advantages of CAT relative to paper-based tests and to so-called linear tests delivered by computer:

– CAT allows a pool of items to be used for on-demand testing of students and/or repeated testing of the same student and, at the same time, preserves the security of the item bank.

– CAT presents test items at a level appropriate for a student’s ability.

– CAT eliminates test booklets, which can be stolen or lost, thereby compromising test security.

Page 3

• CATs are significantly shorter and have higher measurement efficiency than linear tests. They are shorter because time is not wasted by:

– Presenting low-proficiency students with difficult items, which leads to many incorrect responses from which little is learned about student proficiency.

– Presenting highly proficient students with items that are so easy that the extent of their knowledge is not revealed.

• Even though CATs are shorter than linear tests, they are capable of increasing the reliability of measurement in the extremes of the proficiency distribution relative to linear tests of greater length.
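The point about mismatched items can be illustrated with the standard item information function of the three-parameter logistic (3PL) model. The sketch below is only an illustration: the item parameters are hypothetical, and the scaling constant D = 1.7 is an assumption rather than something stated on the slides.

```python
import math

D = 1.7  # assumed logistic scaling constant (not stated on the slides)


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def info_3pl(theta, a, b, c):
    """Fisher information contributed by one 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


# Hypothetical item of medium difficulty (a, b, c chosen for illustration).
a, b, c = 1.6, 0.0, 0.15

info_low = info_3pl(-3.0, a, b, c)      # item far too hard for this examinee
info_matched = info_3pl(0.0, a, b, c)   # item matched to the examinee
info_high = info_3pl(3.0, a, b, c)      # item far too easy for this examinee

for label, value in (("theta=-3", info_low), ("theta=0", info_matched), ("theta=3", info_high)):
    print(label, value)
```

Under these assumptions the matched item carries far more information than the same item given to an examinee three units below or above its difficulty, which is the efficiency argument made above.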

Page 4

Comments

• Any type of on-demand or repeated testing carries the risk of item exposure.

• A crucial factor in the risk of item exposure during on-demand, repeated testing is the degree to which the test is high-stakes: the higher the stakes, the greater the exposure pressure on the item pool.

• The prime example here is the CAT-GRE, which was abandoned partly because of item security issues.

Page 5

• To ensure reasonable levels of CAT security, two methods have been found to be most effective in simulations:

– Stochastic exposure control using the Sympson-Hetter method (or a similar method).

– Increasing the number of items in the pool.

• These findings are from initial R&D done for development of the CAT-ASVAB.
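In Sympson-Hetter exposure control, each item carries an exposure-control parameter between 0 and 1. When the selection rule picks an item, it is actually administered only with that probability; otherwise the next-best item is considered. The sketch below shows only this administration step. The item labels and parameter values are hypothetical, and the simulation-based calibration of the parameters is not shown.

```python
import random
from collections import Counter


def administer(ranked_items, k, rng):
    """Return the item to administer.

    ranked_items: item ids ordered from most to least informative.
    k: dict mapping item id -> exposure-control parameter in [0, 1].
    rng: a random.Random instance.
    """
    for item in ranked_items:
        # Administer the selected item only with probability k[item].
        if rng.random() < k[item]:
            return item
    # Fallback for this sketch: if every item is rejected, give the last one.
    return ranked_items[-1]


# Hypothetical example: "A" is the most informative item but is throttled hardest.
ranked = ["A", "B", "C"]
k = {"A": 0.3, "B": 0.6, "C": 1.0}

rng = random.Random(0)
n_trials = 10_000
counts = Counter(administer(ranked, k, rng) for _ in range(n_trials))
rates = {item: counts[item] / n_trials for item in ranked}
print(rates)
```

Without the control, item "A" would be given to every examinee at this ability level; with it, the administration rate of "A" is held near its parameter value and the remainder is spread over the next-best items.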

Page 6

Comments

• It is true that CATs can be considerably shorter in length. For example, the CAT-ASVAB is 1/3 shorter than the paper-based version (129 vs. 200 items), and the reliability coefficients run about 15% higher.

– However, the CAT-ASVAB has moderate exposure control and very little content balancing imposed on optimum item selection.

– Increasing the levels of exposure-and-content controls can lead to longer test lengths and BATs (barely adaptive tests).

– Increased levels of exposure control and content balancing lead to longer tests with lower reliability.

Page 7

• Existing test forms can be used to produce item pools for CAT.

Page 8

Comments

• CAT item pool development can be a daunting task. As an illustration, suppose a current, paper-based testing program is administered with three forms of a 50-item test. [Note that the item exposure rate for the current procedure is 1/3 (each time a test is given, 1/3 of the total collection of items is exposed).]

• If it is assumed that a CAT system can reduce test length to 35 items, how many items need to be developed to form the pools needed?

• A general rule is to have pool size five times the length of the CAT; this leads to 175 items in each pool in this example.

Page 9

• Now, further assume that students will be allowed to take the CAT three times during a year. How many item pools are needed to attain the same exposure rate as the 50-item paper-based test being replaced?

– Three pools will be needed to achieve the same theoretical exposure rate as the paper-based test.

• Also, a statistical exposure control (such as Sympson-Hetter) will be needed to overcome the fact that, within a pool, certain items are selected very frequently by a procedure that maximizes test information.

Page 10

• So, we are left with these numbers for item-pool development: 3 pools of 175 items each = 525 items. Using the general rule that one must write twice as many items as are needed, 1,050 items must be written for this rather modest CAT project.

• Or, perhaps more realistically, (525 – 150) × 2 = 750 new items will have to be written if all 150 paper-based items are used in the item pools.
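The item-development arithmetic from this example can be laid out step by step. The variable names are illustrative; the numbers and rules of thumb are the ones given in the example.

```python
# Figures taken from the example in the slides.
paper_forms = 3
paper_form_length = 50
cat_length = 35
pool_multiplier = 5   # rule of thumb: pool size = 5 x CAT length
n_pools = 3
write_ratio = 2       # rule of thumb: write twice as many items as needed

existing_items = paper_forms * paper_form_length           # 150
paper_exposure_rate = paper_form_length / existing_items   # 1/3

pool_size = pool_multiplier * cat_length                   # 175
total_pool_items = n_pools * pool_size                     # 525

# Items to write if every pool item is new.
items_to_write_all_new = write_ratio * total_pool_items    # 1050

# Items to write if all existing paper-based items are reused.
items_to_write_reusing = write_ratio * (total_pool_items - existing_items)  # 750

print(pool_size, total_pool_items, items_to_write_all_new, items_to_write_reusing)
```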

Page 11

• The bottom line is that CAT:

– can provide tests at a level appropriate for a student’s ability.

– can save testing time and increase test reliability.

– is unlikely to save money because it can be a giant, item-eating machine.

– offers the possibility of greater protection of the items from compromise than would be possible by the computer administration of a current paper-based test.

Page 12

Evaluating a CAT Item Pool using Optimal Adaptive Tests (OATs)

• We are now going to construct some adaptive tests in an optimal way in order to illustrate some problems and to indicate an interesting possibility for implementing CAT.

• If one knew a person’s standing on the latent trait, θ, it would be easy to choose a fixed number of items (from some item pool) that will maximize the test information.

— We call such a test an “optimal adaptive test” (OAT) in that no other test from this item pool, and of the same length, could exceed this test’s measurement accuracy.

Page 13

• The use of OATs for evaluating an item pool is now illustrated using an operational item pool for mathematics.

– This item pool contains 84 items and is used to construct 15-item adaptive tests for various values of the latent trait.

– The items in the pool have:

  • an average a-value of 1.61 (S.D. = .51)

  • an average b-value of -.06 (S.D. = 1.10)

  • an average c-value of .15 (S.D. = .07)

– For its intended purpose, this is an excellent item bank.

Page 14

• Using a grid of θ’s from -3 to 3 at intervals of .5, 13 OATs were constructed from the 84-item bank.

• In order to illustrate the item-overlap in this collection of OATs, three of these were designated as focal OATs.

– These focal OATs were those at θ = -1.5, 0 and 1.5.

– One might think of these as the optimal tests for three cut-scores.

– In the next three slides (one for each of the focal OATs), the overlap with neighboring OATs is shown.

– Accuracy of the OATs is indicated with information functions and reliability coefficients.

Page 15

OAT at θ = -1.5 and Overlap with Neighboring OATs

Theta          -3           -2.5             -2           -1.5             -1
            Items Info.    Items Info.    Items Info.    Items Info.    Items Info.
                1 0.04641      1 0.1505       1 0.3655       1 0.63316      1 0.73234
                2 0.07012      2 0.16719      5 0.66628      2 0.55005      2 0.66594
                5 0.13762      5 0.40163     12 0.44439      5 0.61413     11 0.64113
               12 0.13777     12 0.282       13 0.61745     12 0.51887     16 0.55656
               13 0.15398     13 0.39049     15 0.36296     13 0.59487     46 0.8022
               14 0.10668     14 0.22847     16 0.39786     16 0.55072     47 0.58048
               15 0.07031     15 0.18718     46 0.37141     46 0.73021     49 0.5616
               16 0.09615     16 0.2189      47 0.51934     47 0.67994     50 0.55776
               47 0.09194     47 0.26129     74 0.59481     49 0.51411     51 1.56191
               49 0.05738     49 0.15774     75 1.14763     75 1.05888     52 0.73292
               74 0.29611     74 0.55697     76 0.59026     76 1.46162     76 1.10421
               75 0.08371     75 0.471       78 0.47144     77 0.5133      77 1.25759
               78 0.20636     78 0.38526     80 1.39435     79 0.56519     79 0.87044
               80 0.05043     80 0.43173     81 0.47526     80 1.32425     81 0.65866
               82 0.27267     82 0.38417     82 0.37743     81 0.79822     83 0.59127

Test Info.        1.8776         4.6745         8.7963         11.1075        11.8750
Reliability       0.6529         0.8237         0.8979         0.9174         0.9223
OAT(-1.5)         0.4965         0.7681         0.8878         0.9174         0.9089
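The Reliability row appears consistent with converting test information I into a reliability coefficient as I / (I + 1), which corresponds to a latent trait with unit variance. This relationship is inferred from the tabulated numbers and is not stated on the slides, so the sketch below should be read as a check under that assumption.

```python
def reliability_from_info(test_info):
    """Reliability implied by test information, assuming unit latent variance."""
    return test_info / (test_info + 1.0)


# Test information values copied from the table for theta = -2, -1.5 and -1.
for info in (8.7963, 11.1075, 11.8750):
    print(info, round(reliability_from_info(info), 4))
```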

Page 16

OAT for θ = 0 and Neighboring OATs

Theta        -0.5              0            0.5
            Items Info.    Items Info.    Items Info.
                6 0.64766      6 1.08164     28 1.92147
               11 0.65243     24 1.47013     32 2.3066
               19 0.69925     25 2.4096      33 2.30979
               24 0.64689     26 1.37743     34 2.14517
               25 1.19992     31 1.89917     38 2.04069
               26 0.72767     32 1.67205     40 2.00697
               44 1.58144     44 1.91164     41 1.90624
               51 2.40629     52 1.23469     43 3.18838
               52 1.7944      53 1.88879     53 2.32357
               56 0.66504     54 1.11195     54 2.76355
               57 1.06812     56 1.38887     61 1.88903
               58 0.92946     57 1.13203     62 3.74146
               59 0.68789     62 1.5257      63 2.08051
               77 0.92247     64 2.9732      64 2.44572
               79 0.67854     65 1.12879     70 2.06511

Test Info.        15.3075        24.2057        35.1343
Reliability       0.9387         0.9603         0.9723
OAT(0)            0.9121         0.9603         0.9571

Page 17

OAT for θ = 1.5 and Neighboring OATs

Theta           1            1.5              2            2.5              3
            Items Info.    Items Info.    Items Info.    Items Info.    Items Info.
               28 1.53919      3 1.13983      3 0.45161      3 0.12694      3 0.03231
               33 2.124       17 1.53923      4 0.33535      4 0.26237      4 0.16677
               34 1.77975     20 0.97318     10 0.25826     10 0.1473      10 0.07545
               35 3.35738     22 1.71518     17 3.40987     17 1.07741     17 0.16494
               36 2.17064     23 0.77671     20 0.96636     20 0.48872     20 0.17674
               37 2.32081     27 0.98121     21 0.48797     21 0.39597     21 0.24441
               38 2.88038     35 1.39005     22 1.56596     22 0.55974     22 0.13937
               39 3.38557     36 2.94918     23 1.24509     23 0.85198     23 0.34086
               40 1.71708     39 1.80579     27 0.30236     27 0.06867     27 0.01453
               42 2.03306     45 2.09891     36 0.62812     36 0.08192     36 0.00996
               43 2.25752     60 0.97349     39 0.28148     45 0.13509     45 0.02322
               60 2.48717     68 0.75244     45 0.70132     68 0.03909     68 0.0081
               63 1.42694     69 1.66372     69 0.97469     69 0.28709     69 0.06823
               68 1.85907     72 0.96606     72 0.4508      72 0.15591     72 0.0485
               70 2.23114     73 1.8975      73 1.62414     73 0.45751     73 0.09133

Test Info.        33.5697        21.6224        13.6833        5.1357         1.6047
Reliability       0.9710         0.9558         0.9319         0.8370         0.6161
OAT(1.5)          0.9536         0.9558         0.9205         0.8159         0.5314

Page 18

Focal OATs Derived Using the Rasch Model

Theta        -1.5           -1.5              0              0            1.5            1.5
Model         1PL            3PL            1PL            3PL            1PL            3PL
            Items Info.    Items Info.    Items Info.    Items Info.    Items Info.    Items Info.
                0 0.67121      1 0.63316     31 0.71917      6 1.08164      3 0.50536      3 1.13983
                1 0.67489      2 0.55005     32 0.70854     24 1.47013      4 0.63154     17 1.53923
               12 0.70988      5 0.61413     40 0.68068     25 2.4096      17 0.68444     20 0.97318
               14 0.62767     12 0.51887     41 0.71363     26 1.37743     20 0.71491     22 1.71518
               15 0.69896     13 0.59487     53 0.69831     31 1.89917     21 0.7225      23 0.77671
               16 0.70696     16 0.55072     54 0.69419     32 1.67205     22 0.72229     27 0.98121
               24 0.63287     46 0.73021     55 0.72112     44 1.91164     23 0.68994     35 1.39005
               46 0.7199      47 0.67994     56 0.68608     52 1.23469     35 0.54217     36 2.94918
               47 0.71271     49 0.51411     61 0.71213     53 1.88879     36 0.6379      39 1.80579
               49 0.70902     75 1.05888     63 0.68367     54 1.11195     39 0.5543      45 2.09891
               50 0.67925     76 1.46162     64 0.71868     56 1.38887     45 0.60129     60 0.97349
               76 0.69996     77 0.5133      65 0.70998     57 1.13203     60 0.45861     68 0.75244
               77 0.71838     79 0.56519     66 0.69992     62 1.5257      69 0.66773     69 1.66372
               79 0.70734     80 1.32425     67 0.71819     64 2.9732      72 0.5458      72 0.96606
               83 0.6989      81 0.79822     71 0.70932     65 1.12879     73 0.71576     73 1.8975

Test Info.        10.3679        11.10752       10.57361       24.20568       9.39454        21.62248
Reliability       0.912033       0.917407       0.913597       0.960326       0.903796       0.955796
Exp_scor          7.23           7.96           6.97           8.05           9.38           10.15

Page 19

Conclusions

• The previous slides indicate that there will be considerable overlap in the CATs constructed from this item bank, in spite of the fact that there is considerable variability in the difficulty of the items.

– Hence, many of the items will be “overly-exposed” and subject to compromise.

• In the actual use of this item bank, the exposure of items is controlled using the Sympson-Hetter Exposure Control method.
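The overlap can be counted directly from the item lists tabulated for the focal OAT at θ = -1.5 and its immediate neighbors at θ = -2 and θ = -1. The item numbers below are copied from that table; the variable names are illustrative.

```python
# Item numbers of the 15-item OATs, copied from the table for the focal OAT at -1.5.
oat_m20 = {1, 5, 12, 13, 15, 16, 46, 47, 74, 75, 76, 78, 80, 81, 82}   # theta = -2
oat_m15 = {1, 2, 5, 12, 13, 16, 46, 47, 49, 75, 76, 77, 79, 80, 81}    # theta = -1.5
oat_m10 = {1, 2, 11, 16, 46, 47, 49, 50, 51, 52, 76, 77, 79, 81, 83}   # theta = -1

shared_with_m20 = oat_m15 & oat_m20
shared_with_m10 = oat_m15 & oat_m10

print(len(shared_with_m20), "of 15 items shared with the OAT at theta = -2")
print(len(shared_with_m10), "of 15 items shared with the OAT at theta = -1")
```

The focal OAT shares 11 of its 15 items with the neighbor at θ = -2 and 10 with the neighbor at θ = -1, which is the kind of overlap that makes exposure control necessary.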

Page 20

• The previous slides also indicate that the three focal OATs, optimal for θ = -1.5, 0 and 1.5, do a rather remarkable job of providing accurate measurement across the θ-continuum even though they contain only 15 items each.

• OATs—and by implication, CATs in general—will differ depending on the IRT model used in development and implementation.
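How much the tests differ by model can be read off the comparison table of focal OATs under the 1PL (Rasch) and 3PL models. The sketch below counts the items common to both versions at each focal θ. Item numbers are copied from that table as they appear in the transcript, including the first 1PL item at θ = -1.5, which is listed as 0.

```python
# Item numbers per focal theta, copied from the 1PL vs. 3PL comparison table.
focal_oats = {
    -1.5: {
        "1PL": {0, 1, 12, 14, 15, 16, 24, 46, 47, 49, 50, 76, 77, 79, 83},
        "3PL": {1, 2, 5, 12, 13, 16, 46, 47, 49, 75, 76, 77, 79, 80, 81},
    },
    0.0: {
        "1PL": {31, 32, 40, 41, 53, 54, 55, 56, 61, 63, 64, 65, 66, 67, 71},
        "3PL": {6, 24, 25, 26, 31, 32, 44, 52, 53, 54, 56, 57, 62, 64, 65},
    },
    1.5: {
        "1PL": {3, 4, 17, 20, 21, 22, 23, 35, 36, 39, 45, 60, 69, 72, 73},
        "3PL": {3, 17, 20, 22, 23, 27, 35, 36, 39, 45, 60, 68, 69, 72, 73},
    },
}

# Number of items the 1PL and 3PL OATs have in common at each focal theta.
common = {
    theta: len(models["1PL"] & models["3PL"])
    for theta, models in focal_oats.items()
}

for theta, n_common in common.items():
    print(theta, n_common, "of 15 items in common")
```

On these lists the two models agree on 9, 7 and 13 of the 15 items at θ = -1.5, 0 and 1.5 respectively, so the choice of model changes the test most in the middle of the scale.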

Page 21

• This also suggests that a two-stage CAT procedure would work quite well with this item bank.

– In a two-stage CAT, an initial Stage 1 test is administered either in a CAT mode or as a fixed, medium-difficulty test.

– Scores on the Stage 1 test are used to assign examinees to one of several Stage 2 tests which vary in overall difficulty from easy to difficult, for example one of the three focal OATs described above.

– In this case (and perhaps in most cases), a pure CAT, where items are selected “on the fly”, does not seem to have any advantages over the pre-selected, optimal Stage 2 tests.
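The Stage 2 assignment step can be sketched as a simple routing function. The slides do not specify a routing rule, so the nearest-focal-θ rule and the function name below are assumptions made for illustration; an operational program might instead route on number-correct cut points from the Stage 1 test.

```python
FOCAL_THETAS = (-1.5, 0.0, 1.5)  # the three focal OATs serve as Stage 2 tests


def route_to_stage2(stage1_theta, focal_thetas=FOCAL_THETAS):
    """Assign an examinee to the Stage 2 test whose focal theta is closest
    to the Stage 1 ability estimate (hypothetical routing rule)."""
    return min(focal_thetas, key=lambda t: abs(t - stage1_theta))


# Example Stage 1 ability estimates (hypothetical).
for estimate in (-2.3, 0.4, 1.0):
    print(estimate, "->", route_to_stage2(estimate))
```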
