Post on 16-Dec-2015

Standard Setting for Professional Certification

Brian D. Bontempo
Mountain Measurement, Inc.

brian@mountainmeasurement.com

(503) 284-1288 ext 129

Overview

• Definition of Standard Setting
• Management Issues relating to Standard Setting
• Standard Setting Process
• Methods of Standard Setting
• Using multiple methods of Standard Setting

Definition of Standard Setting

• Standard setting is a process whereby decision makers render judgments about the performance level required of minimally competent examinees

Types of Standards

• Relative Standard (Normative Standards)
  – Top 70% of scores pass
  – 20 points above average
• Criterion-Referenced Standard (Absolute Standards)
  – 70% of the items correct
  – 600 out of 800 scaled score
  – .05 logits
  – 20 items correct

Why do we conduct Standard Setting?

• To objectively involve stakeholders in the test decision making process
• To connect the expectations of employers to the test decision making process
• To connect the reality of training to the test decision making process
• To ensure psychometric soundness & legal defensibility

When to (re)set a passing standard

• For a new exam, after Beta Test data have been analyzed, typically after “Live” Test Forms have been constructed
• For exam revisions, when the expectations of a job role have changed
  – Practice has changed
  – Content domain has changed
  – It is not appropriate to change the passing standard whenever a test or training has been revised.
  – It is not appropriate to change the passing standard because of supply and demand issues (too many/few certified professionals)

Who should lead a standard setting panel?

• An experienced Psychometrician
  – Insider perspective, familiar with your certification and exam development
  – Outsider perspective, not familiar with your certification and exam development

How rigid should you be in your direction to the Psychometrician?

• I recommend a conversation between the Psychometrician and the Test Sponsor to figure out what works best. Typically a test sponsor will specify a framework (e.g., Angoff) and let the Psychometrician dictate the specifics.

Outcomes of Standard Setting

• A conceptual (qualitative) definition of minimal competency
• A proposed numeric (quantitative) passing standard
• A set of alternate passing standards based on errors in the process
• Expected passing rate(s) from each standard
• A report documenting the process and the psychometric quality of the process

Standard Setting Process

• Gather test data
• Assemble a group of judges
  – Define minimal competency
  – Train judges on the method
  – Render judgments on the performance of borderline examinees
• Calculate the passing standard by aggregating the judgments
• Evaluate the outcome by calculating the expected passing rate
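The last step, checking a proposed standard against real test data, can be sketched in a few lines. This is a minimal illustration only: the function name, the "pass if score is at or above the cut" rule, and the beta-test scores are assumptions made for the example, not part of the original slides.

```python
def expected_pass_rate(scores, cut_score):
    """Share of examinees whose score is at or above the cut score."""
    if not scores:
        raise ValueError("scores must not be empty")
    passed = sum(1 for s in scores if s >= cut_score)
    return passed / len(scores)


# Hypothetical raw scores from a beta test (illustrative values only).
beta_scores = [12, 15, 18, 20, 21, 22, 25, 27, 28, 30]
rate = expected_pass_rate(beta_scores, cut_score=20)
print(f"Expected passing rate: {rate:.0%}")
```

Running the same function over each alternate passing standard gives the set of expected passing rates that the panel report would list.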

Selecting your judges

• Representative Sample
  – Hiring Managers
  – Trainers
  – Entry-Level Practitioners
• How many judges is enough?
  – For a low stakes exam
    • at least 8 judges
  – For a medium stakes exam
    • at least 12 judges
  – For a high stakes exam
    • at least 16 judges

Developing a Definition of Minimal Competency

• Identify 3 common tasks within each domain of the test blueprint (an easy, a hard, and a “Borderline” task)
• Characterize the performance of minimally competent examinees on each of the major tasks
• Write text that summarizes these discussions

Training Judges

• Instruct them on their task
• Practice rating items
  – Two sets of practice items
• Practice discussing items
• Explain the stats that you will be providing them
• Set the tone and boundaries for good ‘group psychology’

Standard Setting Methods

Types of Standard Setting Methods

• Examinee-Centered Methods
  – Judges use external criteria, such as on the job performance, to evaluate the competency of real examinees
• Test-Centered Methods
  – Judges evaluate the performance of imaginary examinees on real test items
• Adjustments
  – In order to account for inaccuracy in the standard setting process, Psychometricians use real test data to provide a range of probable values for the passing standard

Examinee-Centered Methods

• Borderline group
  – Using external criteria (such as performance on the job), judges identify a group of examinees that they think are borderline examinees. The average score of this group is the passing standard
• Contrasting groups
  – Using external criteria, judges classify examinees as passers or failers. The passing standard is established by determining the point which discriminates the best between the scores of both groups
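Both examinee-centered calculations can be sketched as follows. The slides do not say how "discriminates the best" is measured, so this example assumes the cut score that yields the fewest misclassified examinees, with ties resolved toward the lower score. The function names and all scores are hypothetical.

```python
from statistics import mean


def borderline_group_cut(borderline_scores):
    """Borderline group: the average score of the judged-borderline examinees."""
    return mean(borderline_scores)


def contrasting_groups_cut(passer_scores, failer_scores):
    """Contrasting groups: the observed score that misclassifies the fewest examinees.

    An examinee passes when score >= cut. Assumption: "best discrimination"
    means the fewest total misclassifications; ties keep the lowest cut.
    """
    candidates = sorted(set(passer_scores) | set(failer_scores))
    best_cut, best_errors = None, None
    for cut in candidates:
        false_fails = sum(1 for s in passer_scores if s < cut)
        false_passes = sum(1 for s in failer_scores if s >= cut)
        errors = false_fails + false_passes
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


# Hypothetical test scores for examinees classified by the judges.
borderline = [58, 60, 62, 64]
passers = [62, 66, 70, 74, 80]
failers = [45, 50, 55, 60, 64]

print(borderline_group_cut(borderline))
print(contrasting_groups_cut(passers, failers))
```

Other definitions of the best discriminating point exist (for example, the intersection of smoothed score distributions); the misclassification count is used here only because it is easy to verify by hand.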

Test-Centered

• Modified-Angoff
  – Angoff, W.H. (1971). Scales, norms, and equivalent scores. In R.L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, DC: American Council on Education.
• Bookmark
  – Mitzel, H.C., Lewis, D.M., Patz, R.J., & Green, D.R. (2001). The Bookmark Procedure: Psychological perspectives. In G.J. Cizek (Ed.), Setting Performance Standards. Mahwah, NJ: Lawrence Erlbaum Associates.

Basic Angoff Process

• Judges evaluate each item
  – What percentage of minimally competent (MC) examinees would get the item correct?
• Feedback/Discussion
• Judges make adjustments to their ratings
• Average of all items is the judge’s passing standard
• Average of all judges’ standards is the passing standard
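The two averaging steps above can be sketched directly. This is a minimal illustration with made-up ratings: each row holds one judge's final ratings, expressed as the proportion of MC examinees expected to answer each item correctly. The function name and data layout are assumptions for the example.

```python
from statistics import mean


def angoff_passing_standard(ratings):
    """ratings[j][i] is judge j's rating for item i, as a proportion (0 to 1).

    Returns each judge's standard (mean over items) and the panel standard
    (mean over judges), both as proportions of items correct.
    """
    judge_standards = [mean(judge_ratings) for judge_ratings in ratings]
    return judge_standards, mean(judge_standards)


# Hypothetical final ratings from 3 judges on a 4-item form.
ratings = [
    [0.9, 0.6, 0.7, 0.4],
    [0.8, 0.7, 0.6, 0.5],
    [0.9, 0.5, 0.8, 0.6],
]

judge_standards, panel_standard = angoff_passing_standard(ratings)
raw_cut = panel_standard * len(ratings[0])  # expected number of items correct
print(judge_standards, round(panel_standard, 3), round(raw_cut, 2))
```

Multiplying the panel standard by the number of items converts the proportion into a raw-score cut; converting it to a scaled score would depend on the exam's own scaling.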

Common Angoff Issues

• What percentage of
  – MCs vs. all
  – MCs is correct
• candidates
  – “would” vs. “should”
  – “would” is correct
• get the item correct?

Common Angoff Issues

• What type of ratings should judges make?
  – 1/0 (Yes/No)
  – Percentage of Borderline examinees
    • Round to 1 decimal (.9)
    • Round to 2 decimals (.92)
  – NEVER use percentage of all examinees

Common Angoff Issues

• Types of Feedback to provide
  – Group Discussion
    • Relate to conceptual definition of minimal competency
      – Typical or atypical content
      – Relevancy
    • Relate to item nuances
      – Item Stem
      – Item Distractors
    • “I expect a lot of the MC because this is core content and the item is straightforward.”
    • “I would like to cut the MC some slack because this is not covered well in training and the scenario is a little abstract.”

Common Angoff Issues

• Types of Feedback to provide
  – Empirical Data
    • Answer Key – Yes!
    • Percentage of Borderline examinees answering the item correctly – If possible yes
    • P-Value (Percentage of examinees answering the item correctly) – Only if the percentage of Borderline examinees is not available

Common Angoff Issues

• When to provide feedback?
  – Initial Rating
  – Discuss items
  – Secondary Rating
  – Provide Empirical Data
  – Tertiary Rating

Bookmark

• Test is divided up into subtests
  – By domain OR
  – Equal variance of difficulty across subtests
• Items are sorted from easiest to hardest
  – By judges OR
  – By actual value
• Judges bookmark the subtest at the point where the MC examinee would stop getting items correct and start getting them incorrect
  – The lowest possible standard
  – The expected standard
  – The highest possible standard
• Judges discuss ratings & make adjustments
• Passing standard is average # of items answered correctly

Common Bookmark Issues

• How many Ordered Item Booklets (OIB)?
  – One for each content domain
  – An equivalent number that meet the test plan

Common Bookmark Issues

• How should I select Items for the OIB?
  – Minimize the distance in difficulty between any two adjacent items.
    • Ensure that there are enough items at all difficulty levels for each OIB
    • Ensure that the variance in item difficulty is the same for each OIB

Common Bookmark Issues

• How should I sort the item booklets?
  – Easiest to Hardest
  – Hardest to Easiest

Common Bookmark Issues

• How do I know when the MC would stop getting items correct and start getting them incorrect? (What is the appropriate RP value?)
  – .5
  – .67* Most Common
  – .75
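To see what the RP value changes in practice, here is a minimal sketch that assumes item difficulties are calibrated with the Rasch model, in logits. The slides do not state which model is used, so this is an assumption, as is the function name. Under Rasch, the ability at which an examinee answers an item of difficulty b correctly with probability RP is b + ln(RP / (1 − RP)).

```python
import math


def rp_location(item_difficulty, rp=0.67):
    """Ability (in logits) at which P(correct) equals rp for a Rasch item.

    Assumes the Rasch model: P(correct) = 1 / (1 + exp(-(theta - b))).
    """
    if not 0.0 < rp < 1.0:
        raise ValueError("rp must be strictly between 0 and 1")
    return item_difficulty + math.log(rp / (1.0 - rp))


# Same hypothetical item (difficulty 0.0 logits) under the three RP values.
for rp in (0.5, 0.67, 0.75):
    print(rp, round(rp_location(0.0, rp), 3))
```

With RP = .5 the location equals the item difficulty itself; higher RP values shift every item, and therefore the resulting standard, upward on the ability scale.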

Common Bookmark Issues

• How do I convert the bookmark to a passing standard?
  – Previous Item (PI) – Take the difficulty of the easier of the two items on either side of the bookmark
  – Between Item (BI) – Take the average of difficulty of the two items
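The two conversion rules can be sketched as follows. The example assumes the booklet is sorted easiest to hardest and that a bookmark is recorded as the number of items in front of it; the function name, that encoding, and the difficulty values are all hypothetical.

```python
def bookmark_to_cut(ordered_difficulties, bookmark_index, method="BI"):
    """Convert one judge's bookmark into a cut score on the difficulty scale.

    ordered_difficulties: item difficulties sorted easiest to hardest.
    bookmark_index: number of items before the bookmark, so the bookmark
        sits between items bookmark_index - 1 and bookmark_index.
    method: "PI" (previous item) or "BI" (between items).
    """
    if not 1 <= bookmark_index < len(ordered_difficulties):
        raise ValueError("bookmark must fall between two items")
    easier = ordered_difficulties[bookmark_index - 1]
    harder = ordered_difficulties[bookmark_index]
    if method == "PI":
        return easier
    if method == "BI":
        return (easier + harder) / 2
    raise ValueError("method must be 'PI' or 'BI'")


# Hypothetical ordered item booklet (difficulties in logits).
oib = [-1.5, -0.8, -0.2, 0.4, 1.1]
print(bookmark_to_cut(oib, 3, "PI"))
print(bookmark_to_cut(oib, 3, "BI"))
```

The closer adjacent items are in difficulty, the less the choice between PI and BI matters, which is one reason to minimize the gaps when building the booklet.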

Compare Angoff and Bookmark

• Angoff requires less preparation
  – Select a real test form as opposed to building the OIBs
• Judges understand Bookmark better
  – Rating the difficulty of an item is a difficult task
• Bookmark requires more test items
  – I’d recommend an item pool of at least 40 solid test items per content domain

Other Test Centered Methods

• Ebel
• Nedelsky
• Jaeger
• Rasch Item Mapping

Ebel

• Judges sort each item into piles
  – How difficult is this item for the MC examinee?
    • Easy, moderate, or hard
  – How relevant is this content for practice?
    • Critical, Moderately important, Not relevant
• Judges then estimate the percentage of items in each pile that MC examinees would get correct
• The passing standard is then determined by multiplying the number of items in each cell by the percentage and summing all values
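The Ebel calculation is a weighted sum over the difficulty-by-relevance cells. A minimal sketch, with hypothetical item counts and percentages and an assumed function name:

```python
def ebel_passing_standard(cell_counts, cell_percentages):
    """Sum of (items in cell) x (proportion MC examinees get correct).

    Both arguments are dicts keyed by (difficulty, relevance) cell.
    Returns the raw cut score and the cut as a proportion of all items.
    """
    if set(cell_counts) != set(cell_percentages):
        raise ValueError("both inputs must cover the same cells")
    expected_correct = sum(
        cell_counts[cell] * cell_percentages[cell] for cell in cell_counts
    )
    total_items = sum(cell_counts.values())
    return expected_correct, expected_correct / total_items


# Hypothetical 20-item form sorted into three occupied cells.
counts = {
    ("easy", "critical"): 10,
    ("moderate", "critical"): 8,
    ("hard", "moderately important"): 2,
}
percentages = {
    ("easy", "critical"): 0.9,
    ("moderate", "critical"): 0.7,
    ("hard", "moderately important"): 0.5,
}

raw_cut, proportion_cut = ebel_passing_standard(counts, percentages)
print(round(raw_cut, 1), round(proportion_cut, 2))
```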

Nedelsky

• Judges determine which response options are unrealistic for each item

• The probability of a guessed correct response is calculated

• The sum of the probabilities is the passing standard
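A minimal sketch of the Nedelsky arithmetic, assuming multiple-choice items with one correct answer, and assuming that the MC examinee rules out the options judged unrealistic and guesses at random among the rest. The function name and the counts are hypothetical.

```python
def nedelsky_passing_standard(options_per_item, eliminated_per_item):
    """Sum over items of 1 / (options left after eliminating unrealistic ones).

    options_per_item: total response options for each item.
    eliminated_per_item: distractors one judge marked as unrealistic.
    """
    if len(options_per_item) != len(eliminated_per_item):
        raise ValueError("inputs must have one entry per item")
    total = 0.0
    for n_options, n_eliminated in zip(options_per_item, eliminated_per_item):
        remaining = n_options - n_eliminated
        if n_eliminated < 0 or remaining < 1:
            raise ValueError("the correct answer must always remain")
        total += 1.0 / remaining
    return total


# Hypothetical judge: four 4-option items with 3, 2, 1 and 0 options eliminated.
cut = nedelsky_passing_standard([4, 4, 4, 4], [3, 2, 1, 0])
print(round(cut, 3))
```

This gives one judge's standard in raw-score units; averaging across judges, as in the Angoff process, is one way to reach a panel value.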

Jaeger

• Judges evaluate each item
  – Yes/No - “Should every entry-level practitioner answer this item correctly?”
• Judges discuss ratings & make adjustments
• Judges are provided passing rate based on standard & make adjustments
• Passing standard is calculated by summing the number of “Yes” responses
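The final Jaeger step can be sketched as below. The slide says only that the "Yes" responses are summed; combining the judges' sums with the median is an assumption made here (a mean could be substituted), and the function name and ratings are hypothetical.

```python
from statistics import median


def jaeger_passing_standard(yes_no_ratings):
    """yes_no_ratings[j][i] is 1 if judge j answered "Yes" for item i, else 0.

    Each judge's standard is the count of "Yes" ratings. Assumption: the
    panel standard is the median of the judges' standards.
    """
    judge_standards = [sum(judge_ratings) for judge_ratings in yes_no_ratings]
    return judge_standards, median(judge_standards)


# Hypothetical final ratings from 3 judges on a 5-item form.
final_ratings = [
    [1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 1, 0],
]

per_judge, panel_cut = jaeger_passing_standard(final_ratings)
print(per_judge, panel_cut)
```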

Test-Centered Options

• What the ratings are based on
  – Should or would MC get this right
• How ratings are made
  – Yes/No, Percentage
• Relevance adjustments
• Guessing adjustments
• What kind of feedback is provided
  – Passing rate
  – Other judges’ ratings
  – Actual item difficulty

Using Multiple Methods of Standard Setting

Why use Multiple Methods?

• There is error in every standard setting
• Allows policymakers to “decide” on the standard rather than science simply documenting the outcomes of a panel
• Allows for the recovery of standard setting sessions that go awry
• Involves more stakeholders

Adjustments

• Simple Stats – Calculate the confidence interval around the estimate

• Beuk – Judges provide an expected passing score and an expected passing rate. Calculations are made that are based on the variability in these two estimates

• De Gruijter – Similar to Beuk, judges also provide an estimate of the uncertainty of their judgments.

• Hofstee – Judges indicate the highest and lowest passing score and passing rate. These values are plotted along with the cumulative frequency distribution and the point of intersection is the passing standard
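Two of these adjustments are easy to sketch. The first function computes a normal-approximation confidence interval around the mean of the judges' standards. The second is a simplified Hofstee calculation expressed in fail rates (1 minus the passing rate): it scans whole-number cut scores and returns the one where the observed fail rate lies closest to the line running from (lowest cut, highest fail rate) to (highest cut, lowest fail rate). Function names, the z value of 1.96, the integer-score scan, and all data are assumptions for illustration.

```python
import math
from statistics import mean, stdev


def cut_score_interval(judge_standards, z=1.96):
    """Approximate confidence interval around the mean judge standard."""
    centre = mean(judge_standards)
    std_error = stdev(judge_standards) / math.sqrt(len(judge_standards))
    return centre - z * std_error, centre + z * std_error


def hofstee_cut(scores, min_cut, max_cut, min_fail_rate, max_fail_rate):
    """Integer cut score where the observed fail rate is nearest the Hofstee line."""
    if max_cut <= min_cut:
        raise ValueError("max_cut must exceed min_cut")
    best_cut, best_gap = None, None
    for cut in range(min_cut, max_cut + 1):
        observed_fail = sum(1 for s in scores if s < cut) / len(scores)
        fraction = (cut - min_cut) / (max_cut - min_cut)
        line_fail = max_fail_rate + fraction * (min_fail_rate - max_fail_rate)
        gap = abs(observed_fail - line_fail)
        if best_gap is None or gap < best_gap:
            best_cut, best_gap = cut, gap
    return best_cut


# Hypothetical judge standards (proportion correct) and examinee scores.
low, high = cut_score_interval([0.65, 0.65, 0.70])
exam_scores = [50, 52, 55, 58, 60, 61, 62, 64, 65, 66,
               67, 70, 72, 73, 75, 78, 80, 85, 88, 90]
compromise = hofstee_cut(exam_scores, min_cut=60, max_cut=70,
                         min_fail_rate=0.10, max_fail_rate=0.50)
print(round(low, 3), round(high, 3), compromise)
```

With few judges the normal approximation is rough, so the interval is better read as a range of plausible standards than as a precise bound.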

Survey of Hiring Managers

• Ask hiring managers about the workforce
  – What percentage of certified persons do you believe to be minimally competent?
  – Are your certified persons more competent than your uncertified persons?
• Expands the reach of your exam

Triangulating results

• Psychometrician should present the outcome of each method and the passing rate associated with each outcome
  – A range of possible values
• Policymakers can use this information and “their professional experience” to set the actual passing standard

Wrap-Up

3 Vital Recommendations

• Have more judges at standard setting
• Spend more time training your judges
• With each standard setting ensure that you take the time to define minimal competency conceptually and don’t forget to document this definition.

Concluding Remarks

• Many people like to think of test makers as big bad people, which is obviously not true. Standard setting is one example of how inclusive the scientific process of test development can be. I encourage folks to make this process light and fun.

Thank you for paying attention!

Questions & Comments: brian@mountainmeasurement.com