
Page 1: What can we learn from the application of computer based assessment to the military?

What can we learn from the application of computer based assessment to the military?

Daniel O. Segall
Kathleen E. Moreno

Defense Manpower Data Center

Invited presentation at the conference on Computers and Their Impact on State Assessment: Recent History and Predictions for the Future, University of Maryland, October 18–19, 2010

Views expressed here are those of the authors and not necessarily those of the DoD or U.S. Government.

Page 2: What can we learn from the application of computer based assessment to the military?

Presentation Outline

Provide some history of CBT research and operational use in the military.

Talk about some lessons learned over the past three decades.
–Many of these lessons deal not only with computer-based testing but with computerized adaptive testing (CAT).

End with some expectations about what lessons are yet to be learned.

Page 3: What can we learn from the application of computer based assessment to the military?

ASVAB History

Armed Services Vocational Aptitude Battery (ASVAB)
–Before 1976, each military Service administered its own battery.
–Starting in 1976, a single ASVAB was administered to all military applicants.
–Used to qualify applicants for entry into the military and for select jobs within each Service.
–The ASVAB is a good predictor of training success.

Page 4: What can we learn from the application of computer based assessment to the military?

ASVAB Compromise

The early ASVAB (late 1970s) was prone to compromise and coaching.

On-demand scheduling at over 500 testing locations.

Cheating was suspected from both applicants and recruiters.

Congressional hearings were held on the topic of ASVAB compromise.

One proposed solution: a computerized adaptive testing version of the ASVAB.
–Physical loss of CAT items is less likely than loss of P&P test booklets.
–Sharing information about the test items is less profitable for CAT than for P&P.

Page 5: What can we learn from the application of computer based assessment to the military?

Initiation of CAT-ASVAB Research

Marine Corps Exploratory Development Project – 1977

Research Questions
–First, could a suitable adaptive-testing delivery system be developed?
–Second, would empirical data confirm the anticipated benefits?

Findings
–Data from recruits confirmed CAT's increased measurement efficiency.
–Hardware suitability? Minicomputers were slow and expensive.

Page 6: What can we learn from the application of computer based assessment to the military?

Joint Service CAT-ASVAB Project

Initiated in 1979 to provide additional information about CAT.

Anticipated benefits:
–Reduced test compromise
–Shorter tests
–Greater precision
–Flexible start/stop times
–Online calibration
–Standardized test administration (instructions/time limits)
–Reduced scoring errors (from hand or scanner scoring)
–Possibility of administering new types of tests

Page 7: What can we learn from the application of computer based assessment to the military?

Early CAT-ASVAB Development

Early development (1979) was divided into two projects:
–Contractor delivery system development (hardware and software to administer CAT-ASVAB). Commercially available hardware was inadequate for CAT-ASVAB, and there was competition among vendors to develop suitable hardware. The competition was abandoned by the mid-1980s because by then commercially available computers were suitable for CAT-ASVAB.
–Psychometric development and evaluation of CAT-ASVAB.

Page 8: What can we learn from the application of computer based assessment to the military?

ASVAB Miscalibration

A faulty equating of the first ASVAB in 1976 led to the enlistment of over 350,000 unqualified recruits over a five-year period.

As a result, a congressionally mandated oversight committee was commissioned: the Defense Advisory Committee on Military Personnel Testing.

A central focus of the committee and of military test development was to implement sound equating and score-scaling methodologies.

Random equivalent groups equating methodology was implemented for the development of ASVAB forms and was used as the "gold standard" for all future ASVAB equatings.

This heightened sensitivity to score-scale and context effects formed the backdrop for the next three decades of computer-based test development:
–Mode of administration
–CAT-to-paper equating
–Effects of different computer hardware on test scores

Page 9: What can we learn from the application of computer based assessment to the military?

Experimental CAT-ASVAB System

Developed 1979 – 1985.

The experimental CAT-ASVAB was used to study adaptive testing algorithms and test development procedures.

A full-battery CAT version of the P&P-ASVAB for experimental use.

Development Efforts
–Psychometric development
–Item pool development
–Delivery system development

The experimental system used Bayesian ability estimation, maximum-information item selection, and a rudimentary exposure control algorithm (a sketch of Bayesian scoring follows).
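As context for the Bayesian scoring mentioned above, here is a minimal sketch of one standard approach – expected a posteriori (EAP) estimation under a 3PL model on a quadrature grid. The function names, the N(0,1) prior, and the D = 1.7 scaling are illustrative assumptions, not the experimental system's actual code.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (D = 1.7 scaling assumed)."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def eap_estimate(responses, a, b, c, grid=np.linspace(-4, 4, 81)):
    """Expected a posteriori (EAP) ability estimate with a N(0,1) prior.

    responses, a, b, c are parallel arrays: 0/1 item scores and the
    discrimination, difficulty, and guessing parameters of the items taken.
    """
    posterior = np.exp(-0.5 * grid ** 2)          # unnormalized normal prior
    for u, ai, bi, ci in zip(responses, a, b, c):
        p = p_3pl(grid, ai, bi, ci)
        posterior *= p ** u * (1 - p) ** (1 - u)  # multiply in each item's likelihood
    return float(np.sum(grid * posterior) / np.sum(posterior))
```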

Page 10: What can we learn from the application of computer based assessment to the military?

Joint-Service Validity Study

Large-Scale Validity Study: 1982 – 1984

Sample
–Predictor and training success data.
–N = 7,500 recruits training in one of 23 military jobs.

Results showed that
–CAT-ASVAB and P&P-ASVAB predict training success equally well.
–Equivalent validity could be obtained by a CAT that administered about 40 percent fewer items than its P&P counterpart.

Strong support for the operational implementation of CAT-ASVAB.

Page 11: What can we learn from the application of computer based assessment to the military?

Operational CAT System Development

1985 – Present

Addressed a number of issues:
–Item Pools
–Exposure Control
–Calibration Medium
–Item Selection
–Time Limits
–Penalty for Incomplete Tests
–Seeding Tryout Items
–Hardware Requirements
–Usability Considerations
–Reliability and Construct Validity
–Equating
–Hardware Effects
–Test Compromise
–Score Scale
–New Form Development
–Internet Testing
–Software/Hardware Maintenance Issues
–Multi-Mode Testing Programs

Page 12: What can we learn from the application of computer based assessment to the military?

Item Pool Development (1980s)

CAT-ASVAB Forms 1 and 2: the first two operational forms.
–The P&P reference form (8A) formed the basis of the test specifications, but alterations were made: the adaptive pools covered a wider range of item difficulties.
–Pretest items: about 3,600 items.
–Items were screened on the basis of small-sample IRT item parameter estimates.
–The surviving 2,118 items were administered to a large applicant sample: N = 137,000.
–Items were divided into two pools with about 100 items per subtest.

Page 13: What can we learn from the application of computer based assessment to the military?

Item Pool Features

CAT item pools do not need to be extraordinarily large to obtain adequate precision and security.

Exposure can be handled by a combination of exposure control imposed by item selection and the use of multiple test forms consisting of multiple (distinct) item pools.

The use of multiple item pools (with examinees assigned at random to the pools) is an effective way to reduce item exposure rates and overlap among examinees (see the sketch below).
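A trivial sketch of the random-assignment idea (names are illustrative): with k distinct pools, two examinees can see common items only when they draw the same pool, so expected overlap falls by a factor of 1/k.

```python
import random

def assign_pool(num_pools: int = 3) -> int:
    """Route an examinee to one of `num_pools` distinct item pools at random.

    Two examinees land in the same pool with probability 1/num_pools, so the
    expected item overlap between them drops by the same factor.
    """
    return random.randrange(num_pools)
```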

Page 14: What can we learn from the application of computer based assessment to the military?

Exposure Control

Experimental CAT-ASVAB system – some items had very high exposure rates.

5-4-3-2-1 strategy (Wetzel & McBride, 1985)
–Guards against remembering response sequences.
–Does not guard against a sharing strategy.

Sympson and Hetter (see the sketch below)
–Places an upper limit on the exposure rate of the most informative items and reduces the predictability of item presentation.
–Usage of items of moderate difficulty is reduced; there is little or no usage restriction for items of extreme difficulty or lesser discrimination.
–Only a small loss of precision when compared to optimal unrestricted item selection.
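A minimal sketch of the Sympson-Hetter screen described above; the data structures and the way the K parameters are obtained are illustrative assumptions, not the operational CAT-ASVAB implementation.

```python
import random

def sympson_hetter_select(candidates, info, K):
    """Select one item under a Sympson-Hetter exposure screen (a sketch).

    candidates: set of item ids still eligible for this examinee
    info:       dict mapping item id -> information at the current ability
    K:          dict mapping item id -> exposure parameter in (0, 1], tuned
                offline by simulation so that each item's administration
                rate stays below a target ceiling

    Items are tried from most to least informative; an item that fails its
    probability screen is set aside for this examinee, which both caps the
    exposure of the best items and makes presentation less predictable.
    """
    for item in sorted(candidates, key=lambda i: info[i], reverse=True):
        candidates.discard(item)           # each item is considered only once
        if random.random() <= K[item]:     # passes the exposure screen
            return item
    raise RuntimeError("item pool exhausted")
```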

Page 15: What can we learn from the application of computer based assessment to the military?

Calibration Medium

Calibration Medium Concern
–Could data collected from paper-and-pencil booklets be used to calibrate items that would eventually be administered in a computerized adaptive format?

Because CAT was not yet implemented, calibration of CAT items on computers was not feasible.

Some favorable results existed from other adaptive tests that had relied on P&P calibrations.

A systematic treatment of this issue was conducted for the development of the operational CAT-ASVAB forms, using data collected from 3,000 recruits.
–Calibration medium had no practical impact on the distributions or precision of adaptive test scores.

Page 16: What can we learn from the application of computer based assessment to the military?

Calibration Medium

Reading speed is a primary cause of medium effects.

Viewing/reading questions on a computer is generally slower than viewing/reading the same questions in a printed paper-based format.

To the degree that tests are speeded (time-pressured), medium is likely to have a larger impact.

To the degree that tests are speeded, greater within-medium effects can also occur.

ASVAB approach: for power tests, reduce the time pressure by extending the time limits.
–Reducing time pressure for ASVAB power tests did not alter the construct measured (cross-correlation-check study).

Page 17: What can we learn from the application of computer based assessment to the military?

Item Selection Rules

Based on maximum item information (contingent upon passing an exposure control screen); a sketch follows.

Some consideration was given to content balancing, but primary emphasis was placed on measurement precision.

More recently, provisions have been made for item enemies.

Maximizing precision was – and remains – the primary emphasis of the CAT-ASVAB item selection algorithm.
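A sketch of the maximum-information rule under a 3PL model; the exposure screen shown earlier would be layered on top, and the parameter arrays and names are illustrative assumptions.

```python
import numpy as np

D = 1.7  # logistic scaling constant

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def most_informative(theta_hat, a, b, c):
    """Index of the item with maximum information at the current estimate.

    a, b, c are arrays of item parameters for the eligible items.
    """
    return int(np.argmax(info_3pl(theta_hat, a, b, c)))
```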

Page 18: What can we learn from the application of computer based assessment to the military?

Time Limits

CAT-ASVAB Time Limits
–Administrative requirements.
–A separate time limit for each adaptive power test.

IRT Model
–The standard IRT model does not model the effects of time pressure on item responding.

Alternate Approaches for Specifying Time Limits
–Use the per-item time allowed on the P&P-ASVAB.
–Use the distribution of completion times from an experimental version (which was untimed) and set the limits so that 95% of the group would finish (see the sketch below).
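The second approach is straightforward to state concretely. A one-function sketch, assuming an array of completion times recorded from the untimed experimental administration (names are illustrative):

```python
import numpy as np

def time_limit_for(completion_times_sec, coverage=0.95):
    """Set a limit so `coverage` of the untimed norming group finishes.

    completion_times_sec: observed completion times from an untimed
    administration (illustrative; the operational analysis differed in detail).
    """
    return float(np.percentile(completion_times_sec, 100 * coverage))
```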

Page 19: What can we learn from the application of computer based assessment to the military?

Specifying Time Limits

Untimed Pilot Study
–Supported the use of the longer limits.

For reasoning tests, high-ability examinees took more time than low-ability examinees.

High-ability examinees would be most affected by shortened time limits, since they received more difficult questions, which required more time to answer.

This is the opposite of the relation between ability and test time observed in most traditional P&P tests.
–In linear testing, low-ability examinees generally take longer than high-ability examinees.

Page 20: What can we learn from the application of computer based assessment to the military?

Penalty for Incomplete Tests

A penalty procedure is required for incomplete tests
–because of the implementation of time limits.

Bayesian Estimates
–Biased in the direction of the population mean.
–The bias is stronger for shorter tests.

Compromise Strategy
–Below-average applicants could exploit this bias by answering only the minimum number of items.

Penalty Procedure
–Used to score incomplete adaptive tests.
–Discourages the potential compromise strategy.
–Provides a final ability estimate equivalent to the expected score obtained by guessing at random on the unanswered items.

Page 21: What can we learn from the application of computer based assessment to the military?

Penalty for Incomplete Tests

Simulations
–Used to determine penalty functions (for each subtest and possible test length); a sketch of the idea follows.

Penalty Procedure Features
–The size of the penalty grows with the number of unfinished items.
–Applicants who have answered the same number of items and have the same provisional ability estimate receive the same penalty.
–With this approach, test-takers should be indifferent between guessing and leaving answers blank when time has nearly expired.

Generous Time Limits Implemented
–Permit over 98 percent of test-takers to complete.
–Avoid disproportionately punishing high-ability test-takers.
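A Monte Carlo sketch of the penalty idea, not the operational procedure: the penalized score is the mean final estimate obtained when the unanswered items are completed by random guessing. The `simulate_finish` callback is hypothetical and stands in for running the adaptive algorithm forward from the provisional state.

```python
import random
import statistics

def penalized_score(theta_provisional, items_remaining, n_options,
                    simulate_finish, n_rep=1000):
    """Monte Carlo sketch of the penalty for incomplete adaptive tests.

    `simulate_finish` is a hypothetical callback that completes the test
    from the provisional state, drawing each remaining response from the
    supplied random-guess generator, and returns the final ability estimate.
    The mean of those estimates is the score expected from guessing at
    random on the unanswered items, so guessing and leaving blanks are
    equivalent in expectation.
    """
    guess = lambda: random.random() < 1.0 / n_options  # P(correct) by chance
    finals = [simulate_finish(theta_provisional, items_remaining, guess)
              for _ in range(n_rep)]
    return statistics.mean(finals)
```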

Page 22: What can we learn from the application of computer based assessment to the military?

Seeding Tryout Items

CAT-ASVAB administers unscored tryout items.

Tryout items are administered as the 2nd, 3rd, or 4th item in the adaptive sequence.

Item Position
–Randomly determined.

Advantages over Historical ASVAB Tryout Methods
–Tryout data come from operationally motivated examinees.
–No booklet printing is required.
–No special data collection study is required.

Page 23: What can we learn from the application of computer based assessment to the military?

Hardware Requirements

Customized Hardware Platform – 1984
–Abandoned in favor of an off-the-shelf system.

The Hewlett-Packard (HP) Integral computer was selected as the first operational system:
–Superior portability (17 pounds)
–Large random-access memory (1.5 megabytes)
–Fast CPU (8 MHz Motorola 68000)
–Advanced graphics display (9-inch electroluminescent monitor with a resolution of 512 by 255 pixels)
–UNIX-based operating system; supported the C programming language
–Floppy diskette drive (no internal hard drive)
–Cost about $5,000 (in 1984 dollars)

Lesson: Today's computers can easily handle the item selection and scoring calculations required by CAT.

Challenge: Even though today's computers are thousands of times more powerful, they are not proportionately cheaper than the computers of yesteryear.

Page 24: What can we learn from the application of computer based assessment to the military?

User Acceptance Testing

Importance of User Acceptance Testing
–Software development is obviously important.
–User acceptance testing is equally important.

Acceptance Testing versus Software Testing
–Software testing – typically performed by software developers.
–Acceptance testing – typically performed by those who are most familiar with the system requirements.

CAT-ASVAB Development
–Time spent on acceptance testing exceeded the time programmers spent developing and debugging code.

Page 25: What can we learn from the application of computer based assessment to the military?

Usability

Computer Usage: 1975 – 1985
–Limited primarily to those with specialized interests.

Concerns
–Deficient computer experience would lower CAT-ASVAB reliability and validity.
–Although instructions had been tested on recruits, they had not been tested with applicants, many of whom score in the lower ability ranges.
–In addition, the instructions had been revised extensively from the experimental system.

Approach
–Test the instructions on a broad, representative group of test-takers who had no prior exposure to the ASVAB.

Page 26: What can we learn from the application of computer based assessment to the military?

Usability

Usability Study (1986)
–231 military applicants and 73 high school students.

Issues Addressed
–Computer familiarity, instruction clarity, and attitudes toward CAT-ASVAB.

Method of Data Collection
–Questionnaires and structured interviews.

Findings
–Test-takers felt very comfortable using the computer, exhibited positive attitudes toward CAT-ASVAB, and preferred a computerized test over P&P – regardless of their level of computer experience.
–Test-takers strongly agreed that the instructions were easy to understand.
–Negative outcome: most test-takers wanted the ability to review and modify previously answered questions. Because of the requirements of the adaptive testing algorithm, this feature was not implemented.

Lesson: Today, with a well-designed interface, variation in computer familiarity among (young adult) test-takers should not be an impediment to computer-based testing.

Page 27: What can we learn from the application of computer based assessment to the military?

Usability Lessons

Stay in tune with the computer proficiency of the test-takers.
–Tailor instructions accordingly.

Do not give verbal instructions.
–Keep all instructions on the computer.

Keep the user interface simple and intuitive.

Page 28: What can we learn from the application of computer based assessment to the military?

Reliability and Construct Validity

CAT reliability and validity depend on:
–The contents and quality of the item pool
–The item selection, scoring, and exposure algorithms
–The clarity of test instructions

Item Response Theory
–Provides a basis for making theoretical predictions about these psychometric properties.
–However, most assumptions are violated, at least to some degree.

Empirical Test of Assumptions
–To test the validity of key model-based assumptions, an empirical verification of CAT-ASVAB's precision and construct equivalence with the P&P-ASVAB was conducted.
–If the assumptions held true, then the large body of predictive validity evidence accumulated on the P&P version would apply directly to CAT-ASVAB.
–Construct equivalence would also support the exchangeability of the CAT-ASVAB and P&P-ASVAB versions.

Page 29: What can we learn from the application of computer based assessment to the military?

Reliability and Construct Validity

Study Design
–Two random equivalent groups.
–Group 1 (N = 1,033) received two P&P-ASVAB forms.
–Group 2 (N = 1,057) received two CAT-ASVAB forms.
–All participants also received an operational P&P-ASVAB.

Analyses
–Alternate-forms correlations were used to estimate reliabilities.
–Construct equivalence was evaluated from disattenuated correlations between the CAT-ASVAB and operational P&P-ASVAB versions (the formula is given below).
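For reference, the disattenuated correlation used in this design corrects the observed cross-version correlation for unreliability in each measure:

```latex
\hat{\rho}_{T_C T_P} \;=\; \frac{r_{CP}}{\sqrt{r_{CC'}\, r_{PP'}}}
```

where r_CP is the observed correlation between CAT-ASVAB and P&P-ASVAB scores, and r_CC' and r_PP' are the alternate-forms reliabilities. A disattenuated value near 1.0 indicates the two versions measure the same construct.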

Page 30: What can we learn from the application of computer based assessment to the military?

Reliability and Construct Validity

Results – Reliability
–Seven of the ten CAT-ASVAB tests displayed significantly higher reliability coefficients than their P&P-ASVAB counterparts.
–The three other subtests displayed non-significant differences.

Results – Construct Validity
–All but one disattenuated correlation between CAT-ASVAB and P&P-ASVAB was equal to 1.0.
–Coding Speed displayed a disattenuated correlation substantially less than one (.86).
–However, composites that contained this subtest had high disattenuated correlations approaching 1.0.

Discussion
–Results confirmed the expectations based on theoretical IRT predictions.
–CAT-ASVAB measured the same constructs as P&P-ASVAB with equivalent or greater precision.

Page 31: What can we learn from the application of computer based assessment to the military?

Equating CAT and P&P Versions

1980s – Equating was viewed as a major psychometric hurdle to CAT-ASVAB implementation.

Scale Differences between CAT-ASVAB and P&P-ASVAB
–P&P-ASVAB used a number-correct score scale.
–CAT-ASVAB produces scores on the natural (IRT) ability metric.
–Equating must be done to place CAT-ASVAB scores on the P&P-ASVAB scale.

Equating Objective
–Transform CAT-ASVAB scores so that their distribution matches the P&P-ASVAB score distributions (a sketch follows).
–The transformation allows scores on the two versions to be used interchangeably, without affecting applicant qualification rates.
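A minimal sketch of equipercentile equating in a random-equivalent-groups design; the smoothing step used operationally is omitted, and all names are illustrative.

```python
import numpy as np

def equipercentile_transform(cat_scores, pp_scores):
    """Return a function mapping CAT scores onto the P&P score scale.

    Each CAT score is mapped to the P&P score with the same percentile
    rank, using score samples from two random equivalent groups.
    """
    cat_sorted = np.sort(np.asarray(cat_scores))
    pp_sorted = np.sort(np.asarray(pp_scores))

    def transform(x):
        # percentile rank of x in the CAT distribution
        p = np.searchsorted(cat_sorted, x, side="right") / len(cat_sorted)
        # P&P score at that same percentile
        return float(np.quantile(pp_sorted, min(max(p, 0.0), 1.0)))

    return transform
```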

Page 32: What can we learn from the application of computer based assessment to the military?

Equating Concerns

Overall Qualification Rates
–An equipercentile equating procedure was used to obtain the required transformations.
–Distribution smoothing procedures were applied.
–The equivalence of composite distributions was verified.
–Distributions of composites were sufficiently similar across P&P-ASVAB and CAT-ASVAB.

Page 33: What can we learn from the application of computer based assessment to the military?

Equating Concerns

Subgroup Differences
–Concern that subgroup members not be placed at a disadvantage by CAT-ASVAB relative to P&P-ASVAB.
–Existing subgroup differences might be magnified by precision and dimensionality differences between the CAT and P&P versions.

Approach
–Apply the equating transformation (based on the entire group) to subgroup members taking CAT-ASVAB.
–Compare subgroup means across the CAT and P&P versions.

Results
–No difference of practical significance for qualification rates was found.

Page 34: What can we learn from the application of computer based assessment to the military?

Online Calibration and Equating

All data can be collected seamlessly in an operational environment to both calibrate and equate new CAT-ASVAB forms.

This is in contrast to earlier form development, which required special data collections using special populations.

Page 35: What can we learn from the application of computer based assessment to the military?

Hardware Effects Study

Hardware Effects Concern (1990s)
–Differences among computer hardware could influence item functioning.
–Speeded tests are especially sensitive to small changes in test presentation format.

Dependent Measures
–Score scale
–Precision
–Construct validity

Sample
–Data were gathered from 3,062 subjects.
–Each subject was randomly assigned to one of 13 conditions.

Hardware Dimensions
–Input device
–Color scheme
–Monitor type
–CPU speed
–Portability

Results
–Adaptive power tests were robust to differences among computer hardware.
–Speeded tests are likely to be affected by several hardware characteristics.

Page 36: What can we learn from the application of computer based assessment to the military?

Stakes by Medium Interaction

Equating Study
–Equating of desktop and notebook computers.

Two Phases
–Recruits – Develop a provisional transformation; random-groups design with about 2,500 respondents per form.
–Applicants – Develop a final transformation from applicants to provide operational scores. The sample size for this second phase was about 10,000 per form.

Page 37: What can we learn from the application of computer based assessment to the military?

Stakes by Medium Interaction

Comparison of equatings based on recruits (non-operational motivation) and applicants (operational motivation).

Differences in the CAT-to-P&P equating transformations were observed.

The difference was in a direction suggesting that, in the first (non-operational) equating, CAT examinees were more motivated than P&P examinees (possibly due to shorter test lengths or the novel/interactive medium).

It was hypothesized that the motivation/fatigue difference between the CAT and P&P groups was larger in the non-operational recruit sample than in the operational applicant sample.

Findings suggested that the results of a cross-medium equating may differ depending on whether the respondents are operationally motivated.

For future equatings, this problem was avoided by:
–performing equatings in operationally motivated samples, or
–performing within-medium equatings when test-takers were non-operationally motivated (and using a chained transformation, if necessary, to link back to the desired cross-medium scale).

Page 38: What can we learn from the application of computer based assessment to the military?

Test Compromise Concerns

The Sympson-Hetter algorithm assumes a particular known ability distribution; usage rates might be higher for some items if the actual ability distribution departs from the assumed one.

Since CAT tests tend to be shorter than P&P tests, each adaptively administered item can have a greater impact on the final score.

A preview of a fixed number of CAT items may therefore produce a larger score gain than a preview of the same number of P&P items.

Page 39: What can we learn from the application of computer based assessment to the military?

Test Compromise Simulations

Simulation Study – Conditions
–Transmittal mechanism (sharing among friends or item banking)
–Correlation between the cheater and informant ability levels
–Method used by the informant to select items for disclosure

Dependent Measure
–Score gain (mean gain for a group of cheaters over a group of non-cheaters at the same fixed ability level)

Results
–Score gains for CAT were larger than those for the corresponding P&P conditions.

Implications
–More stringent controls should be imposed on CAT-ASVAB item exposure.
–The introduction of a third item pool (with examinees randomly assigned to one of three pools) reduced score gains for CAT to levels equivalent to or less than those observed for six forms of the P&P-ASVAB under all compromise strategies.
–These results led to the decision to implement an additional CAT-ASVAB form.

Page 40: What can we learn from the application of computer based assessment to the military?

Score Scale

For testing programs that run two parallel modes of administration (i.e., paper-based and CAT), equating and measurement precision can be enhanced by scoring all tests (including the paper-based test) with IRT methods (a sketch follows).

IRT scoring of the paper-based tests provides distributions of test scores that more closely match their CAT counterparts (i.e., it helps make them more normal).

IRT scoring also reduces the ceiling and floor effects of paper-based number-right distributions, which can attenuate the precision of equated CAT-ASVAB scores.

An underlying theta (natural ability) scale can facilitate equating and new-form development.
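A sketch of what IRT scoring of a fixed paper form can look like – EAP estimation under a 3PL model applied to every examinee's complete response vector. The model, prior, and names are illustrative assumptions.

```python
import numpy as np

def irt_score_fixed_form(resp, a, b, c, grid=np.linspace(-4, 4, 81)):
    """EAP-score examinees on a fixed (paper) form under a 3PL model.

    resp: (n_examinees, n_items) 0/1 response matrix; a, b, c: item
    parameter arrays. Returns theta estimates on the IRT scale, the
    common metric that simplifies equating with CAT scores.
    """
    # p[g, i] = P(correct | theta = grid[g]) for item i
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (grid[:, None] - b)))
    prior = np.exp(-0.5 * grid ** 2)                 # N(0,1) prior
    # likelihood of each examinee's response pattern at each grid point
    loglik = resp @ np.log(p).T + (1 - resp) @ np.log(1 - p).T
    post = prior * np.exp(loglik)                    # (n_examinees, n_grid)
    return (post @ grid) / post.sum(axis=1)          # posterior means
```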

Page 41: What can we learn from the application of computer based assessment to the military?

New Form Development

The implementation of CAT-ASVAB on a large scale has enabled considerable streamlining of new form development.

DoD has eliminated all special form-development data-collection studies by replacing them with online calibration and equating.

Under this approach, new item data are collected by seeding tryout items among operational items.

These data are used to estimate IRT item parameters.

The parameters are in turn used to construct future forms and to estimate provisional equating transformations.

These provisional (theoretical) equatings are then updated after the forms are used operationally to test random equivalent groups.

Thus, the entire cycle of form development is seamlessly integrated into operational test administrations.

Page 42: What can we learn from the application of computer based assessment to the military?

Internet Testing

DoD Internet Testing
–Defense Language Proficiency Tests
–Defense Language Aptitude Battery
–Armed Forces Qualification Test
–CAT-ASVAB

Implications for Cost-Benefits

Implications for Software Development
–Desktop lockdown
–Client side: unanticipated effects on test delivery from browser, operating system, and security updates
–Server side: unanticipated effects of operating system updates on test delivery, and interactions with other applications running on the same server

Page 43: What can we learn from the application of computer based assessment to the military?

Internet Testing

Internet testing can defray much of the cost of computer-based testing, since the cost of computers and their maintenance is shared or eliminated.

Strict standardization of administration format (line breaks, resolution, etc.) in Internet testing is difficult (and sometimes impossible) to enforce.

Lesson: With Internet testing you do not have to pay for the computers' purchase and maintenance, but you do pay a price in reduced control over the system.

Page 44: What can we learn from the application of computer based assessment to the military?

Software/Hardware Maintenance Issues

Generations of CAT-ASVAB Hardware/Software
–Apple III
–HP
–DOS
–Windows I
–Windows II
–Internet

Early generations of hardware/software could be treated as static entities, much like test booklets.

Later Windows and Internet generations require treatment more like living entities: they require continuous care and attention (security, operating system, and software updates).

Page 45: What can we learn from the application of computer based assessment to the military?

Multi-Mode Testing Programs

When transitioning from paper-based testing to computer-based testing, decide ahead of time whether the two media of administration will run in parallel for an extended period, or whether paper-based testing will be phased out after a fixed period of time.

If the latter, make sure this is communicated and that there is strong policy support for the elimination of all paper-based testing.

There are different resourcing requirements and cost drivers for dual-mode and single-mode testing programs.

Page 46: What can we learn from the application of computer based assessment to the military?

Future Lessons?

Intranet- versus Internet-based testing.

Computer hardware effects on test scores.

How to test speeded abilities on unstandardized hardware?

Can emerging technologies (such as mobile computing devices) provide additional or different benefits (above and beyond computers) for large-scale assessments?