Meta-Analysis of Personnel Selection Tests & Overview of Situational Judgment Tests

Michael A. McDaniel, Virginia Commonwealth University, [email protected]
Deborah L. Whetzel, Human Resources Research Organization
Nhung Nguyen, Towson University, [email protected]

Prepared for: International Workshop on "Emerging Frameworks and Issues for S&T Recruitments"
Society for Reliability Engineering, Quality and Operations Management (SREQOM)
Delhi, India, September 2008
Overview
- Introduction: Dr. McDaniel
- Introduction: Dr. Whetzel
- Introduction: Dr. Nguyen
- Meta-analysis
- Meta-analysis results
- What are SJTs?
- Brief history of SJTs
- Item characteristics, response instructions, and item heterogeneity
- Steps in developing SJTs
- Scoring SJTs
Dr. McDaniel's Department
Department of Management, School of Business, Virginia Commonwealth University, Richmond, Virginia, 100 miles south of Washington, DC
Dr. McDaniel's Department
The PhD program in Management emphasizes organizational behavior and human resources.
The Center for the Advancement of Research Methods and Analysis (CARMA) is a non-profit unit of the School of Business at Virginia Commonwealth University (VCU):
- Established in 1997 by Dr. Larry Williams
- Has hosted over 60 events and 100 presentations on research methods topics for faculty and doctoral students worldwide
- Interdisciplinary focus, with emphasis on topics relevant to the social and organizational sciences (www.pubinfo.vcu.edu/carma/)
Dr. McDaniel's Research Theme
Applications of meta-analysis to examine the validity of personnel selection methods:
- Cognitive ability tests
- Interviews
- Reviews of training and experience
- Customer service tests
- Firefighter tests
- Short-term memory tests
- Job experience
- Job knowledge
- Situational judgment tests (SJTs)
Dr. Whetzel's Organization
Human Resources Research Organization (HumRRO), Alexandria, Virginia, 5 miles from Washington, DC
The Human Resources Research Organization (HumRRO)
- Independent non-profit research organization
- Established in 1951 as part of the U.S. Army; became independent in 1969
- Headquarters in Alexandria, VA
- Diverse staff: industrial/organizational psychologists, instructional designers, statisticians, management analysts, web programmers
- 100 professional staff, 20 support staff
- Strong history in selection, assessment, training, and evaluation
Deborah Whetzel
Experience in personnel selection research and development.
Areas of expertise include conducting job analyses, developing competency models, developing performance appraisal systems, and developing and validating assessment processes, including structured interviews and SJTs.
Dr. Nguyen's Department
Towson University, Towson, Maryland, 60 miles north of Washington, DC
Nhung Nguyen
Experience in personnel selection research and development.
Areas of expertise include:
- Situational judgment test research, including subgroup differences
- Wrote a monograph for the International Personnel Management Association Assessment Council (IPMAAC)
Meta-Analyses of Personnel Selection Tests
What is Meta-Analysis?
Meta-analysis is the quantitative combination of information from multiple empirical studies to produce an estimate of the overall magnitude of a relationship between an employment test and job performance.
Meta-analysis uses statistical procedures to determine the best estimate of the correlation between test and job performance.
Meta-Analysis
Although meta-analysis can be statistically complicated, conceptually it is simple.
In a meta-analysis of employment tests, one averages the correlations between the test and job performance across studies. The correlations are typically called "validity coefficients."
Studies are weighted so that studies with greater numbers of participants have more effect on the average (see the sketch below).
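To make this concrete, here is a minimal Python sketch of the sample-size-weighted average described above; the three studies and their values are hypothetical, not data from the presentation:

```python
# Hypothetical validity coefficients from three studies.
studies = [
    {"n": 68,  "r": 0.21},   # n = sample size, r = observed validity
    {"n": 120, "r": 0.34},
    {"n": 45,  "r": 0.12},
]

# Weight each study's correlation by its sample size, so larger
# studies have more effect on the average.
total_n = sum(s["n"] for s in studies)
mean_r = sum(s["n"] * s["r"] for s in studies) / total_n
print(f"Sample-size-weighted mean validity: {mean_r:.3f}")  # ~0.26
```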
Meta-Analysis
In addition to calculating the mean (average) validity, one also looks at the variability around the mean.
Some of the variability is due to random causes (sampling error).
Particularly with smaller sample studies, results will vary due to some samples being more representative of the population than other samples.
Meta-Analysis
The next few slides may be difficult for those without statistical training. Do not worry if you do not understand all of it.
We will get to the typical validities of different types of selection tests soon. These typical validities are what one needs to make informed decisions.
Meta-Analysis and Sampling Error
[Figure: sampling error plotted as a function of sample size.]
Meta-Analysis and Sampling Error
- The relationship between sampling error and sample size is asymptotic.
- Increasing sample size results in decreasing random sampling error.
- As sample size increases, one gets diminishing returns in the reduction of random sampling error (see the sketch below).
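The shape of this relationship follows from the standard approximation SE(r) ≈ (1 - r²)/√(N - 1). A small sketch, with an illustrative correlation of .25:

```python
import math

r = 0.25  # assumed population correlation (illustrative)
for n in (25, 100, 400, 1600):
    # Quadrupling N only halves the sampling error: diminishing returns.
    se = (1 - r**2) / math.sqrt(n - 1)
    print(f"N = {n:4d}: sampling error SD ~ {se:.3f}")
```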
Meta-Analysis and Sampling Error
Meta-analysis combines the results from many different studies.
Sampling error across studies tends to cancel out (sampling errors in one direction will be balanced by the sampling errors in the other direction).
The meta-analytic result gives a close approximation to the population value.
Meta-Analysis and Moderators
Some of the variability is likely due to the test working better for some jobs than others. These job effects are moderators of the correlation between the test and job performance.
- Extraversion is a better predictor for sales jobs than for most other jobs.
- Cognitive ability tests are good predictors for all jobs but work best for cognitively demanding jobs (e.g., S&T jobs).
Meta-Analysis
Meta-analysis tries to partition variability in the studies to better understand it.
[Figure: total variability partitioned into sampling error, artifactual variance in addition to sampling error, and moderator variance.]
Meta-Analysis
So far, we have briefly talked about random sampling error and moderators.
But the figure shows some variance called "artifactual variance in addition to sampling error." Differences across studies in measurement error and range restriction make up this additional artifactual variance.
Meta-Analysis: Measurement Error
All measures (e.g., job performance ratings) have some measurement error.
The more the measurement error, the lower the reliability of the measure.
Measurement error causes an observed effect size to underestimate its population parameter.
Differences across studies in measurement error cause variance.
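A minimal sketch of the classical correction for attenuation as it is commonly applied in selection research, where only the criterion side is corrected; the reliability and correlation values are illustrative:

```python
import math

r_observed = 0.30  # observed test-criterion correlation (illustrative)
ryy = 0.52         # reliability of supervisor ratings (illustrative)

# Disattenuation: the observed r underestimates the population value
# in proportion to the square root of the criterion reliability.
r_corrected = r_observed / math.sqrt(ryy)
print(f"Corrected for criterion unreliability: {r_corrected:.2f}")  # ~0.42
```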
Meta-Analysis: Range Restriction
The personnel selection literature has range restriction due to pre-selection of the sample: we seek to know the value of a predictor of performance, but we only have job performance data for the sample members who scored high on the predictor.
Range-restricted samples tend to underestimate the validity coefficient (see the sketch below).
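One common adjustment is Thorndike's Case II correction for direct range restriction. A minimal sketch with illustrative values:

```python
import math

r = 0.25  # validity observed in the range-restricted (incumbent) sample
u = 1.5   # SD of predictor among applicants / SD among incumbents

# Thorndike Case II: corrects r upward toward its unrestricted value.
r_corrected = (u * r) / math.sqrt(1 + (u**2 - 1) * r**2)
print(f"Corrected for range restriction: {r_corrected:.2f}")  # ~0.36
```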
Meta-Analysis and Artifactual Variance
Meta-analysis methods try to estimate the effects of measurement error and range restriction when determining the mean validity of an employment test, and the extent to which these artifacts account for variability in validities (see the sketch below).
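A minimal sketch of this kind of variance partitioning in the Hunter-Schmidt style, comparing the observed variance in validities against the variance expected from sampling error alone; all study values are hypothetical:

```python
# Hypothetical (sample size, observed r) pairs.
studies = [(68, 0.21), (120, 0.34), (45, 0.12), (90, 0.28), (200, 0.25)]

total_n = sum(n for n, _ in studies)
mean_r = sum(n * r for n, r in studies) / total_n

# N-weighted observed variance of the validities.
var_obs = sum(n * (r - mean_r) ** 2 for n, r in studies) / total_n

# Variance expected from sampling error alone.
n_bar = total_n / len(studies)
var_err = (1 - mean_r**2) ** 2 / (n_bar - 1)

print(f"Observed variance:       {var_obs:.5f}")
print(f"Sampling error variance: {var_err:.5f}")
print(f"Residual (moderators, other artifacts): {max(var_obs - var_err, 0.0):.5f}")
```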
Meta-Analysis: Publication Bias
A growing concern in the meta-analysis literature is whether the studies available for averaging are representative.
Some employment test publishers have been known to suppress studies that make their testing products look bad.
This causes the studies available to the reviewer to overestimate the validity of the test.
Meta-Analysis: Know what you are predicting
Another consideration is what measures of job performance are used.
For example, integrity test validity studies often use a self-report of employee theft as the measure of job performance. Far fewer studies correlate the tests with job performance as measured by more common criteria (e.g., supervisor ratings).
Meta-Analysis: Summary of Validity
Schmidt and Hunter (1998) summarized 85 years of research findings
The next table shows validity for predicting job performance (typically, supervisor ratings)
All validities are corrected for the downward bias due to measurement error and for the range restriction arising from the use of incumbent samples
A second table presents results for personality tests.
Meta-Analysis Results

Employment test (source)                                                 Validity (r)
General mental ability (Hunter, 1980)                                    .51
Interviews (McDaniel et al., 1994)                                       .38 unstructured; .51 structured
Job knowledge tests (Hunter & Hunter, 1984)                              .48
Training and experience, behavioral consistency (McDaniel et al., 1988a) .45
Assessment centers (Gaugler et al., 1987)                                .37
Biodata (Schmidt & Hunter, 1998)                                         .35
Situational judgment tests (McDaniel et al., 2007;
  no range restriction correction)                                       .26
Meta-Analysis Results for Personality

Personality test     Validity (r)
Conscientiousness    .20
Emotional stability  .13
Agreeableness        .11
Extraversion         .09
Openness             .06

From Hurtz & Donovan (2000)
Situational Judgment Tests
What Are SJTs?
An applicant is presented with a situation and several response options and is asked to evaluate the responses.
SJT items are typically in a multiple choice format.
Everyone in your work group has received a new computer except you. What is the best action to take?
A. Assume it was a mistake and speak to your supervisor.
B. Ask your supervisor why you are being treated unfairly.
C. Take a new computer from a co-worker’s desk.
D. Complain to human resources.
E. Quit.
Brief History
Judgment scale in the George Washington University Social Intelligence Test (Moss, 1926)
Used in World War II by psychologists working for the US military
Practical Judgment Test (Cardall, 1942)
Brief History continued
- How Supervise? (File & Remmers, 1948)
- Test of Supervisory Judgment (Richardson, Bellows & Henry, 1949)
- 1960s: SJTs were in use in the U.S. Civil Service System (Greenberg, 1963)
Brief History continued
- 1990s: Motowidlo reinvigorated interest in SJTs as "low fidelity" simulations (Motowidlo et al., 1990; Motowidlo & Tippins, 1993)
- 1990s: Sternberg's "tacit knowledge" tests (Sternberg et al., 1993, 1995; Wagner & Sternberg, 1991)
Brief History continued
Today, SJTs are used in many organizations, are promoted by various consulting firms, and are researched by many.
Brief History continued
Current popularity is based on assertions that SJTs:
- Have low adverse impact (subgroup differences)
- Have good acceptance by applicants
- Assess job-related knowledge or skills not readily tapped by other measures
Item Characteristics
Item stems can be distinguished along five characteristics:
- Fidelity
- Length
- Complexity
- Comprehensibility
- Nested and un-nested stems
Item Characteristics continued
Fidelity: the extent to which the format of the stem is consistent with how the situation would be encountered in a work setting.
- High fidelity: the situation is conveyed through a short video.
- Low fidelity: the situation is presented in written form.
Item Characteristics continued
Length:
- Some stems are very short (How Supervise?, File & Remmers, 1971).
- Other stems present very detailed descriptions of situations (Tacit Knowledge Inventory, Wagner & Sternberg, 1991).
Item Characteristics continued
Complexity: stems vary in the complexity of the situation presented.
- Low complexity: one has difficulty with a new assignment and needs instructions.
- High complexity: one has multiple supervisors who are not cooperating with each other and who are providing conflicting instructions concerning which of one's assignments has the highest priority.
Item Characteristics continued
Comprehensibility: it is more difficult to understand the meaning and import of some situations than others.
- Sacco, Schmidt & Rogg (2000) examined the comprehensibility of item stems using a reading formula.
Item Characteristics continued
Length, complexity, and comprehensibility of the stems are interrelated and probably drive the cognitive loading of the items.
Item Characteristics continued
Nested stems:
- Some situational judgment tests (Clevenger & Halland, 2000; Parker, Golden & Redmond, 2000) present an overall situation followed by subordinate situations.
- Subordinate stems are the stems linked to the responses.
Item Characteristics: Nature of Responses
Unlike item stems, which vary widely in format, item responses are usually presented in written form and are relatively short.
Even SJTs that use video to present the situation often present the responses in written form, sometimes accompanied by an audio presentation.
Item Characteristics: Response Instructions
The various item instructions can be described in a two-dimensional taxonomy:
(1) Behavioral tendency vs. knowledge: how do you typically behave vs. what is the most effective response
(2) Number of scorable responses
Item Characteristics: Response Instructions

Behavioral tendency instructions:
- One scorable response: What would you most likely do?
- Two scorable responses: What would you most likely do? What would you least likely do?
- As many scorable responses as response options: Rate each response for the likelihood you would perform the response, or rank the responses from the most likely to the least likely.

Knowledge instructions:
- One scorable response: Pick the best answer. What should you do?
- Two scorable responses: Pick the best answer and pick the worst answer, or pick the best and second best.
- As many scorable responses as response options: Rate each response for effectiveness, or rank the responses from the best to the worst.
Item Heterogeneity
Item Heterogeneity
SJT items tend to be construct heterogeneous at the item level. This means that they measure many things. They are typically correlated with one or more of the following:
- Cognitive ability
- Agreeableness
- Conscientiousness
- Emotional stability
Scenario A
You assigned a very high profile project to one of your project managers. During each of the project update meetings, your project manager indicates that everything is going as scheduled. Now, one week before the project is due, your project manager informs you that the project is less than 50% complete.

Correlations of each response with g (n = 448-450), Conscientiousness (C; n = 1196-1222), and Agreeableness (A; n = 1196-1222):
- Personally take over the project and meet with the customer to determine critical requirements: g = .10*, C = .01, A = -.13*
- Meet with the customer to extend the deadline. Talk with the project manager about how the lack of communication has jeopardized the company's relationship with the customer: g = .11*, C = -.03, A = -.05
- Fire the project manager and take over the project yourself: g = .08, C = .00, A = -.16*
- Coach the project manager on how to handle the project more efficiently: g = -.17*, C = .01, A = .09
- Do not assign any high profile jobs to this project manager in the future: g = .13*, C = .07, A = -.08
Scenario B
You lead a project that requires specific, accurate data to make decisions. The data-capturing method currently being used does not provide you with the information you need. Another department promised to provide you with the information, but failed to do so at the last minute. This setback delayed your project and you are certain that you will require the information to complete your project accurately.

Correlations of each response with g (n = 448-450), Conscientiousness (C; n = 1196-1222), and Agreeableness (A; n = 1196-1222):
- Do the time-consuming work yourself even though it is not technically your responsibility: g = .07, C = .11*, A = -.08
- Temporarily allocate some members of your team to capture the data: g = -.01, C = .11*, A = .00
- Ask the customer for a deadline extension and explain that the other department failed to provide the necessary information: g = .12*, C = .06, A = -.02
- Ask your manager to pressure the other department to deliver the information: g = .17*, C = .02, A = -.10*
Degree of item heterogeneity
Implications of item heterogeneity:
- Difficult to get interpretable factor analyses
- Difficult to build subscales that show discriminant validity
- A set of items intended to measure conscientiousness will likely be correlated with conscientiousness, but also with cognitive ability, agreeableness, and emotional stability
Degree of item heterogeneity
More implications of item heterogeneity:
- Difficult to specify the constructs assessed by the test
- Difficult to defend the use of internal consistency as a reliability estimate; best to use test-retest reliability
Degree of item heterogeneity
Probably best to think of SJTs as a measurement method in which you can and typically do measure multiple constructs.
Overview of SJT Test Development
Overview of SJT Test Development
- Identify a job or job class for which an SJT is to be developed
- Write critical incidents
- Sort critical incidents
- Turn selected critical incidents into item stems
- Generate item responses
- Edit item responses
- Determine response instructions
- Develop a scoring key
Development Issues
Identify a job or job class
Get clarification on the job(s) for which the SJT is intended.
Determine if supervisor jobs are included in a job “class” and if separate supervisor items are needed.
Development Issues
Critical Incidents
Motowidlo et al. (1990, 1997) recommended having subject matter experts write critical incidents to generate stems, and using additional subject matter experts to generate responses.
- A subject matter expert is someone who is very knowledgeable about the job (e.g., an incumbent or supervisor).
- Some test authors just write items.
Development Issues
Critical Incidents
We recommend collecting critical incidents from subject matter experts: it is unlikely that an item writer can come up with the richness and breadth of scenarios that can be generated by a group of subject matter experts writing critical incidents.
Development Issues
Critical Incident Workshops: Critical Incident Form
1. What was the situation leading up to the event? [Describe the context.]
2. What did the employee do?
3. What was the outcome or result of the employee's action?
4. What competency category is most relevant for this incident?
5. Circle the number below that best reflects the level of performance that this event exemplifies.
   1   2   3   4   5   6   7
   Low                     High
Development Issues
Critical Incident Workshops: Example of a Completed Critical Incident Form
1. What was the situation leading up to the event? [Describe the context.] I was the lead scientist on a complex project that had a very short timeline. My team and I identified some creative solutions to an ongoing problem addressed by the project. When the project was nearly completed, my supervisor told me that someone in another agency was assigned the same project and it was completed last week.
2. What did the employee do? Although I was disappointed, I asked if our creative solutions could be used by the other team. I explained the issue to my co-workers and asked them to stop work. I then asked my supervisor for the next project.
3. What was the outcome or result of the employee's action? My supervisor appreciated my flexibility and willingness to provide our solutions to the other team and thanked me for my willingness to shift focus toward a different project.
4. What competency category is most relevant for this incident? Adapting to change
5. Circle the number below that best reflects the level of performance that this event exemplifies. (1 = Low, 7 = High)
Development Issues
Critical Incident Workshops
Provide plenty of room, privacy, and anonymity:
- Critical incidents are often embarrassing to someone ("My boss did this stupid thing…").
- Anonymity permits these critical incidents to be offered.
Raise the comfort level:
- Spelling is not important.
- We are interested in the story, not the quality of the writing.
Development Issues
Critical Incident Workshops
Prompts for generating critical incidents:
- Think about a time when someone did a really good job.
- Think about a time when someone could have done something differently.
- Think of a recent work challenge you faced and how you handled it.
- Think of something you did in the past of which you were proud.
Development Issues
Critical Incident Workshops
- Give individual feedback on initial critical incidents; reinforce productivity.
- Consider laptops: many people are more comfortable typing for three hours than writing with a pen.
Development Issues
Critical Incident Workshops
Conduct at least two critical incident workshops:
- In the first workshop, let participants write about whatever they want.
- In the following workshops, direct them away from topics that have already been covered and toward topics that need better coverage.
Development Issues
Sort Critical Incidents
- Sort incidents into categories based on similarity of content.
- Provide a dimension name for each category.
- Have others sort the incidents using the dimension names to determine agreement (retranslation).
- Typical content piles are listed on the next page.
Development Issues
Sort Critical Incidents: Typical Content Piles
- Too much work
- Unpleasant work
- Changing work
- New procedures are bad
- Challenging work
- Work that is not usually part of your job
- Problematic boss
- Problematic co-workers
- Problematic subordinates
- Problematic upper management
- Problematic other departments/vendors
- Problematic customers
Development Issues
Sort Critical Incidents
Goals of sorting:
- Identify dimensions for which item stems will be written
- Identify duplicate or near-duplicate critical incidents
- Check for gaps in coverage
Development Issues
Sort Critical Incidents: Goals continued
Identify content that is inappropriate for items (content that you do not want to share with job applicants). For example:
- Ethnic discrimination
- Workplace violence
- Topics that are sources of conflict within the organization (crashing stock price, unpopular new policy)
Development Issues
Sort Critical Incidents
Developing item stems from critical incidents is the next step. This is labor intensive.
If you will ultimately drop a stem due to its content, make that decision now so you do not waste time turning the critical incident into a stem.
Development Issues
Turn Critical Incidents into Item Stems
- Write item stems using the first part of each critical incident.
- The stem needs to be appropriate and job-related for all jobs covered by the SJT.
Development Issues
Example stem from the critical incident described above:
You are the lead scientist on a complex project that had a very short timeline. You and your team have identified some creative solutions to an ongoing problem addressed by the project. When the project was nearly completed, your supervisor told you that someone in another agency was assigned the same project and it was completed last week.
Development Issues
Turn Critical Incidents into Item Stems
For technical jobs, a critical incident may concern difficulty learning a new software package for inventory control.
- If not all jobs require the use of this software, make the stem refer to "new software for your job."
- If not all jobs involve software, make the stem refer to "difficulty in learning a new work procedure."
Development Issues
Turn Critical Incidents into Item Stems
- Stems need to be edited for clarity and brevity. Stems with ambiguous meanings will result in disagreement concerning the effectiveness of the responses.
- Standardize the use of terms (boss vs. supervisor, co-worker vs. team member, etc.). Making these decisions early will reduce editing time.
Development Issues
Generate item responses
- Assemble a survey of item stems with space for respondents to write potential responses to each stem.
- The critical incident from which a stem was developed probably contained one response to the situation.
Development Issues
Generate item responses
Have multiple subject matter experts write additional responses for each stem.
Prompts for writing responses:
- What would you do?
- What is the best thing to do?
- What is a bad response that you think many people would do?
- Think of a good/poor employee. What would he/she do?
Development Issues
Generate item responses
You are the lead scientist on a complex project that had a very short timeline. You and your team have identified some creative solutions to an ongoing problem addressed by the project. When the project was nearly completed, your supervisor told you that someone in another agency was assigned the same project and it was completed last week.
A.
B.
C.
D.
Development Issues
Generate item responses
Use multiple subject matter experts working independently to get the maximum number of non-redundant responses.
- A given subject matter expert will often be able to generate only 2-3 non-redundant responses.
- A group of subject matter experts working independently can usually generate between 5 and 8 non-redundant responses.
Development Issues
Edit item responses
- Many of the item responses will be redundant.
- You might permit some redundancy in responses to convey a nuance: "Confront your boss about X and …" vs. "Assume X was a mistake and speak with your boss …"
Development Issues
Generate item responses
Screen out responses that will have little variance. These will primarily be very inappropriate responses that no applicant will say they find effective:
- "Tell your boss you think his/her idea was stupid."
Development Issues
Determine Item Response Instructions
- One now has a set of items, each with multiple responses.
- The next step is to determine the response instructions for the test.
- Response instructions tell the respondent how to evaluate the item responses.
- The choices are knowledge instructions or behavioral tendency instructions.
Development Issues
Determine Item Response Instructions
Knowledge instructions ask for the "best" answer and are thus assessments of knowledge of the appropriateness of responses:
- Pick the best response.
- Pick the best response and then the worst response.
- Rate the responses on effectiveness.
Development Issues
Determine Item Response Instructions
Behavioral tendency instructions ask for the applicant's likely behavior:
- What would you most likely do?
- What would you most likely do and what would you least likely do?
- Rate each response on how likely you would be to perform it.
Development Issues
Determine Item Response Instructions
As noted earlier, whether one uses knowledge or behavioral tendency instructions has important implications for:
- Applicant faking
- Magnitude of cognitive and non-cognitive correlates
- Criterion-related validity
- Magnitude of mean ethnic differences
Development Issues
Response Instructions and Faking
- Applicants may recognize that what they would most likely do (behavioral tendency) is not the most effective response.
- Some applicants may choose to misrepresent their behavioral tendency.
- Example: McDaniel keeps a messy desk, but he will report that he would keep his desk clean and tidy.
Development Issues
Response Instructions and Faking
Nguyen, Biderman & McDaniel (2005) showed that it is more difficult to intentionally fake a knowledge item than a behavioral tendency item.
By way of metaphor, compare a personality item (behavioral tendency) to a math item (knowledge).
Behavioral tendency item: How dependable are you?
Knowledge item: What is the cube root of 46,656?
Development Issues
Response Instructions and Construct Validity (correlations with other tests)
- SJTs with knowledge instructions tend to be more correlated with cognitive ability and less correlated with non-cognitive traits.
- SJTs with behavioral tendency instructions tend to be more correlated with non-cognitive traits and less correlated with cognitive ability.
Development Issues
Response Instructions and Construct Validity

Correlate            Knowledge instructions   Behavioral tendency instructions
Cognitive ability    .35                      .19
Conscientiousness    .24                      .34
Agreeableness        .19                      .37
Emotional stability  .12                      .35

From McDaniel, Hartman, Whetzel, & Grubb (2007)
Development Issues
Response Instructions and Correlations with Job Performance
Meta-analysis results:
- Knowledge instructions: ρ = .26 (k = 96)
- Behavioral tendency instructions: ρ = .26 (k = 22)
From McDaniel, Hartman, Whetzel, & Grubb (2007)
Development Issues
Response Instructions and Mean White-Black Differences in SJTs
- Mean differences are somewhat larger for knowledge-instruction SJTs than for behavioral tendency-instruction SJTs.
- Almost all data are for written-presentation (non-video) SJTs.
Development Issues
Response Instructions and Mean White-Black Differences in SJTs

Distribution                   k    N       d
All effect sizes               62   42,178  .38
Written, knowledge             45   36,348  .39
Written, behavioral tendency   17   5,830   .34

From Whetzel, McDaniel, & Nguyen (2008)
Development Issues
Response Instructions and Mean White-Black Differences in SJTs
The correlation of the SJT with cognitive ability explains almost all of the differences across studies in mean white-black differences.
Development Issues
Response Instructions and Mean Ethnic Differences in SJTs
Some employers may want to use a video presentation format or a behavioral tendency response format to reduce mean ethnic differences.
This may help, but even these options are driven somewhat by the cognitive loading of the test, and video-based SJTs can be expensive to develop.
Development Issues
Response Instructions and Mean Ethnic Differences in SJTs
Readily comprehensible and simple situations may have lower cognitive load regardless of presentation format or response instructions.
But will simple items meet your assessment needs?
Development Issues
Mean Sex Differences in SJTs
- Females score slightly higher than males (d = .11) on average.
- The effect is not related to cognitive loading and is not moderated by response instructions.
- There are some mean sex differences in conscientiousness and agreeableness favoring females.
Development Issues
Caveat on response instruction differences
- Most published data on SJTs are based on employee samples (less likely to fake).
- Applicants are more likely to respond to behavioral tendency instructions as if they were knowledge instructions: they will try to give the best answer when asked for their behavioral tendency.
- Thus, more research using applicant samples is needed.
Overview of SJT Scoring
Development Issues
Scoring
One needs to determine what the right answer is to build a scoring key. The options are:
- Rational keys
- Empirical keys
- Hybrid keys
Development Issues
Scoring with Rational Keys
Rational keys:
- SJTs are often keyed based on expert judgment.
- Reject item responses with low inter-rater agreement.
Development Issues
Scoring with Rational Keys
Data-assisted expert keying: collect effectiveness data and make the means, standard deviations, and frequencies of the ratings available to the experts who decide the key.
Development Issues
Scoring with Rational Keys
Data-assisted keying without experts:
- Collect effectiveness data and use the means to make the key.
- Drop response options with high standard deviations.
Development Issues
Scoring with Empirical Keys
Any empirical keying approach for biodata is applicable to SJTs: score items according to their relationship with a criterion variable (see the sketch below).
See Hogan (1994) for a good review of empirical keying procedures.
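A minimal sketch of the empirical keying idea: retain or weight a response option according to its correlation with the criterion. The calibration data here are hypothetical:

```python
import statistics

def pearson_r(x, y):
    """Plain Pearson correlation (no external dependencies)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical calibration data: whether each of 8 incumbents endorsed
# one response option, and their job performance ratings.
endorsed    = [1, 0, 1, 1, 0, 0, 1, 0]
performance = [4, 2, 5, 4, 3, 2, 4, 3]

# An option correlating positively with performance would be keyed
# positively; options near zero would be dropped or left unkeyed.
print(f"Option-criterion r: {pearson_r(endorsed, performance):.2f}")
```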
Development Issues
Scoring with Hybrid Keys
A hybrid key is some mix of rational and empirical keying. For example, you might key empirically but only retain the keyed option if it makes sense.
Development Issues
Scoring Issues
If one uses a Likert rating scale to record responses and uses a rational keying method, what do you do with the responses rated as average?
Suggestion: give the Likert scales an even number of response categories (4 or 6). This way, all response options are forced to be rated either effective or ineffective (or likely to be performed or unlikely to be performed).
Development Issues
Scoring Issues
A Likert scale often uses adjectives: very effective, effective, ineffective, very ineffective.
From a litigation point of view, it makes some people uneasy to try to defend the difference between "very effective" and "effective": your "very effective" might mean the same as my "effective."
Development Issues
Scoring Issues
For the purpose of rational keying, one might consider "very effective" and "effective" to be identical responses. Thus, one could score the item as dichotomous (see the sketch below).
If the scoring key indicates that the response is a good thing to do, a respondent providing a rating of "very effective" or "effective" gets a point; other ratings get zero.
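A minimal sketch of that dichotomous scoring rule, assuming a 4-point scale where 3 = effective and 4 = very effective; the key and ratings are hypothetical:

```python
# Hypothetical rational key: True means the option is keyed "effective".
key_effective = {"A": True, "B": False, "C": True}

def score_item(ratings):
    """ratings: response option -> Likert rating on a 1-4 scale."""
    return sum(
        1
        for option, rating in ratings.items()
        if key_effective[option] and rating >= 3  # effective or very effective
    )

print(score_item({"A": 4, "B": 1, "C": 2}))  # -> 1 (only A earns a point)
```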
Development Issues
Scoring Issues
Many applications of SJTs use discrete points assigned to response options:
- Very effective = 1
- Effective = 1
- Ineffective = 0
- Very ineffective = 0
Development Issues
Scoring Issues
Legree et al. (2005) discussed using mean effectiveness ratings as the correct answer and scoring responses as deviations from the mean (see the sketch below):
- If the mean is 1.5, respondents who provided a rating of 1 or 2 would both receive -.5 as the score on the item.
- Zero is the highest possible score.
- Scores are often inverted to make favorable scores positive.
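A minimal sketch of this consensus-based deviation scoring; the consensus means and the applicant's ratings are hypothetical:

```python
# Mean effectiveness ratings (the consensus key) for one item's options,
# and one applicant's own ratings of the same options.
consensus = [1.5, 3.2, 2.8, 4.1]
applicant = [2,   3,   2,   4]

# Score = negative sum of absolute deviations from the consensus means,
# so zero is the highest possible score.
item_score = -sum(abs(a - c) for a, c in zip(applicant, consensus))
print(f"Item score: {item_score:.1f}")  # -1.6; often inverted afterward
```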
Development Issues
Scoring Issues
Some research shows that mean ratings given by experts are the same as those given by novices; the novices simply have greater standard deviations.
Development Issues
Scoring Issues
Differences across applicants in how they use the rating scale create variance that might not be relevant to predicting the criterion.
Consider standardizing the item responses within person (a within-person z transformation).
Development Issues
Scoring Issues
The within-person standardization makes the mean and standard deviation of all the item responses the same for every applicant (see the sketch below).
This within-person standardization often improves item validity substantially.
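A minimal sketch of the within-person z transformation; the two hypothetical applicants differ only in how they use the scale, so their standardized scores come out identical:

```python
import statistics

def within_person_z(ratings):
    """Standardize one applicant's ratings across all response options,
    removing differences in how applicants use the Likert scale."""
    mean = statistics.mean(ratings)
    sd = statistics.pstdev(ratings)
    if sd == 0:  # applicant gave every option the same rating
        return [0.0] * len(ratings)
    return [(r - mean) / sd for r in ratings]

lenient = [4, 5, 4, 5, 3]  # a rater who uses the top of the scale
harsh   = [2, 3, 2, 3, 1]  # same pattern, shifted down two points
print(within_person_z(lenient))
print(within_person_z(harsh))  # identical to the lenient rater's z-scores
```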
Development Issues
Scoring Issues
A few slides ago, we suggested dropping an item response when the experts could not agree on its effectiveness.
One could also consider dropping the item responses for which the applicants show large variance in their ratings.
Development Issues
Scoring Issues
When Likert scales are used to rate each response option, the response options with large variances typically have mean ratings in the middle of the Likert scale.
Thus, a mid-range mean rating may indicate an item response about whose effectiveness there is much disagreement.
Development Issues
Scoring Issues
The response options with high or low mean Likert ratings often have the highest mean item validities.
Development Issues
Scoring Issues
Although the low and high mean items will likely have the highest validity, the mid-mean items may have some validity.
If one wants to shorten the test for future administrations, consider dropping the mid-mean items.
Development Issues
Scoring Issues
Incumbent vs. applicant differences:
- Incumbents are typically the experts for keying.
- If a company policy guides an action, incumbents will rate behaviors consistent with the policy as effective.
- High-quality applicants might respond differently because they do not know the policy (e.g., in a call center).
Development Issues
Scoring Issues
Cultural differences are possible with keys:
- Americans will often question their supervisor's judgment openly; other cultures do not typically question their supervisor's judgments.
- Example stem: You believe your boss has overlooked a key fact when making a decision…
Summary of SJT Development and Scoring
- SJTs are best developed using critical incidents.
- Carefully edit the critical incidents into item stems.
- Use several subject matter experts to suggest item responses.
- Drop response options when subject matter experts cannot agree.
Summary of SJT Development and Scoring
- Have the applicants rate each response on a Likert scale of effectiveness (a knowledge instruction).
- Use a within-person standardization on the applicant ratings; this tends to result in higher validities.
- To shorten SJTs for future administrations, drop the items with mid-range means.
References
Provided in book chapter.
Thank you.
Questions??