true or false? 1 running head: interrater...

True or False? 1

Running Head: INTERRATER RELIABILITY AND AGREEMENT

True or False?: Different Sources of Performance Ratings Don’t Agree

James M. LeBreton, Jennifer R.D. Burgess, E. Kate Atchley

The University of Tennessee

Robert B. Kaiser

Kaplan DeVries, Inc.

Lawrence R. James

The University of Tennessee

True or False? 2

True or False?: Different Sources of Per formance Ratings Don’t Agree

The use of 360-degree feedback has become widespread in management development

activities (Church & Bracken, 1997). The basic idea is that performance ratings are collected

from different rating sources such as superior, peer, direct report, and self. The ratings furnished

by these sources are then examined to understand what (if any) are the distinct views these

sources have regarding the target manager’s performance. Indeed, most prescriptions for the use

of 360-degree data provide for a heavy focus on rating source discrepancies (Dalessio, 1998).

However, 360-degree measures are designed to measure traits believed to underlie

managerial behavior. A trait may be defined as “a disposition or tendency to behave in a

relatively consistent manner over time and across diverse situations” (James & Mazerolle, in

press). Application of this definition to 360-degree appraisal engenders an interesting paradox. If

360-degree appraisals measure cross-situationally consistent traits, then by definition, there

should be fairly strong agreement among observers (i.e., across space and time parameters).

Unfortunately, observed correlations among raters tend to be modest at best (cf. Conway &

Huffcutt, 1997; Harris & Schaubroeck, 1988) suggesting a lack of consistency of behavior. Two

theoretical models may be offered to explain this apparent inconsistency.

The first model will be referred to as the "discrepancy model." This model assumes either

that raters have differential access to behaviors exhibited by the target manager or that the target

manager behaves differently in the presence of different raters. Though semantically similar,

these two assumptions are qualitatively different. In the first instance, the manager engages in

stable trait behavior but only certain raters are privy to these behaviors. For example, the

manager may routinely confront problems with coworkers. However, the majority problems to

be confronted deal with direct reports rather than peers or supervisors. Additionally, when

True or False? 3

addresses problems the target elects to do so in private, rather than public. In this instance,

ratings on the "confrontation" dimensions of managerial effectiveness would like appear as very

discrepant among rating sources (i.e., direct reports vs. peers) and even potentially within rating

sources (e.g., direct reports).

The alternative assumption underlying the discrepancy model is that managers selectively

engage in various behaviors, contingent on with whom they are interacting. For example, a

manager may be compassionate and sensitive when dealing with an upset direct report, but may

be harsh and insensitive when dealing with peers. In this instance, ratings on the "sensitivity"

dimension of managerial effectiveness would again be discrepant, but for a different reason. In

either event, much of the extant empirical research appears, at first glance, to support the notion

of rating source discrepancies.

A meta-analysis by Harris and Schaubroeck (1988) revealed low to moderate correlations

among rating sources. Specifically, correlations between self, peer, and supervisor ratings

(corrected for only sampling error) ranged from .22 to .48. A similar unpublished manuscript by

Kraiger (1986) revealed low convergent correlations among rating sources by various

performance dimensions (e.g., interpersonal skills, job aptitude; mean r's ranged from .07 to .52).

Similar values were reported for convergent intraclass correlations (ICCs) among rating sources

(mean ICCs ranged from .18 to .35). The interested reader is directed to Murphy and Cleveland

(1995) and Cardy and Dobbins (1994) for additional references regarding rating source

discrepancies.

To summarize, the discrepancy model suggests that different rating sources provide useful

and substantially unique information regarding how a target manager is perceived. Additionally,

within the context of 360-degree feedback, target managers may actualize greater improvement

True or False? 4

in performance by identifying and improving upon 1) areas rated uniformly low across sources,

and/or 2) areas rated low within only a specific rating source.

Though the discrepancy model provides a provocative account for observed rating

discrepancies, the purpose of this paper is to introduce an alternative explanation for these

discrepancies. This alternative model is labeled the "restriction of variance model." The

restriction of variance model contains several constraining assumptions. First, it assumes that

target managers engage in (fairly) consistent "trait like' behaviors across space, time, and person

parameters. Second, this model assumes that extant HR, OB, and I/O interventions have been at

least minimally effective; and, due to these interventions the variance in performance ratings is

necessarily attenuated or restricted. The following sample is used to elucidate this second

assumption.

Sample Scenario

Take for instance an individual is admitted to a 4-year university largely based upon her

performance on the Scholastic Aptitude Test. After her first semester, she goes to the counseling

center and completes a series of vocational interest inventories. These inventories suggest the

student has an aptitude for, and interest in, accounting. She agrees and enrolls in the B.S.

program in accounting. Three years latter she completes her degree and the Certified Public

Accounting Exam. After meeting with a campus recruiter she is invited to interview with a Big 6

accounting firm. After she satisfactorily passes a personality test measuring employee reliability,

she goes through several structured panel interviews and is offered employment with the firm.

She accepts the offer. Four years later, and still with the firm, she is accepted into an Executive

MBA program. As part of this program she goes through a developmental assessment designed

at targeting her strengths and weaknesses. For the next 18 months she meets with an executive

True or False? 5

coach to outline and implement a plan designed at leveraging her strengths and improving upon

her weaknesses. At the close of the program, two sets of data are collected from, one contains

two supervisors' ratings and the other contains two peers' ratings of her managerial performance.

If these data were combined with data from other EMBAs with similar backgrounds, what would

we expect the correlation to be between the two sets of ratings? Within each set of ratings?

If even one-half of the I/O related activities and interventions described above were

marginally effective, then we would expect a modest correlation at best. This weak correlation

may not be due to lack of consensus or inconsistency among raters, but rather because the

performance levels of the targets are highly constrained (i.e., after all this training, education,

and development all targets are fairly effective). If this is the case, then indices of rating

similarity assessed via correlational analysis may be inappropriate and alternative indices may

provide a more realistic assessment of rating similarity.

Rating similarity indexed via correlational analyses presupposes that variance exists

between targets. That is, for correlations to be high among multiple raters, the targets of these

ratings must vary (substantially) in terms of their manifest performance. This is easily illustrated

with data in Table 1. When one-way random effects intraclass correlation coefficients (cf.

Bartko, 1976; Shrout & Fleiss, 1979) are computed separately on the first 10 targets (good

performers) and the last 10 targets (poor performers) each ICC (1,1) = .22. When the data are

combined and a single analysis is conducted, the ICC (1,1) = .92. These data are contrived to

illustrate a simple point -- without between-target variance, correlational-based analyses are

ineffectual indicators of rating similarity.

The purpose of this paper is to demonstrate that the restriction of variance model posited

earlier represents a viable alternative to the discrepancy model traditionally embraced by

True or False? 6

researchers and practitioners working in the area of performance evaluation and leadership

development. Below we offer several general predictions based upon these two competing

models, and then derive specific hypotheses for the restriction of variance model based upon

these predictions.

Predictions Derived from the Discrepancy Model

The discrepancy model posits that raters have differential access to behaviors exhibited by

the target manager or that the target manager behaves differently in the presence of different

raters. In either event, this model predicts that raters within-source will provide similar ratings

but raters in different sources will provide dissimilar ratings. Thus, indexing rating similarity

within-sources via intraclass or interclass correlations should result in moderate levels of rating

similarity. However, the application of these statistics to between-source rating similarity should

result in substantially lower values. Likewise, indexing similarity via agreement indices such as

rWG (James, Demaree, & Wolf, 1984; 1993) should result in fairly high within-source rating

similarity. However, the application of this same statistic to assess between-source similarity

should result in substantially lower values.

Predictions Derived from the Restriction of Variance Model

The restriction of variance model posits that target managers engage in stable trait

behaviors and that extant I/O practices and interventions have been somewhat effective. Due to

the effectiveness of these interventions between-target variance in performance is restricted or

attenuated. Thus, this model predicts that ratings within-source will be similar and ratings

between-sources will also be similar. However, due to the restriction in between-target

variability, indexing similarity within- and between-sources via intraclass or interclass

correlations will likely result in low levels of rating similarity. Indexing similarity via agreement

True or False? 7

indices such as rWG (James, Demaree, & Wolf, 1984; 1993) should result in fairly high within-

source and between-source rating similarity. This high agreement is a direct function of the fact

that rWG is not a function of rank ordering among targets and does not require substantial

between-target variance. Rather, rWG examines the extent to which multiple ratings on a single

target deviate (proportionally) from some null distribution. Based on the aforementioned

discussion, we offer the following research hypotheses to test the restriction of variance model:

Hypothesis 1: 360-degree scales will have a restricted range as observed by negatively

skewed and kurtotic scale scores.

Hypothesis 2: Intraclass and interclass correlations will be low (< .70) for all dyadic

between-source comparisons (e.g., peer-boss).

Hypothesis 3: rWG values will be high (>.70) for all dyadic between-source comparisons

(e.g., direct report - peer).

Method

Sample

The data for this study were collected from managers who participated in various leadership

development courses offered at a non-profit training institute between 1992 and 1997. Target

managers were asked to fill out a self-rating form of a 360-degree feedback questionnaire as well

as to request several coworkers to complete an observer form prior to the course. The target

managers represented firms in the manufacturing, financial services, insurance, wholesale trade,

transportation, communications, and utility industries. Data were collected from a total of 8,434

target managers attending a leadership development program between 1992 and 1996.

Performance ratings to be used for developmental feedback were made available from a total of

7,606 supervisors, 34,511 peers, and 31,731 subordinates relative to the target managers.

True or False? 8

Measure

Multi-rater feedback data was collected with Benchmarks (Lombardo & McCauley, 1994),

a 360-degree feedback instrument developed from a series of studies conducted to further the

understanding of executive development. By focusing on the experiential growth aspects of top-

level management careers, it is designed to measure those characteristics critical to managerial

success that are learned over time. The research leading up to the development of Benchmarks is

summarized by McCall, Lombardo, and Morrison, (1988) and Lindsey, Homes, and McCall

(1987). An overview of the content considerations and construction of the instrument can be

found in McCauley and Lombardo (1990).

Benchmarks contains two general sections containing a total of 164 items. For the purposes

of the present study, only the scales in the first section were used (c.f. Brutus et al., 1996;

Fleenor et al., 1996). This section is the more reliable of the two sections (Zedeck, 1995; also

corroborated with local data) and spans 16 conceptually and empirically derived subscales made

up of a total of 106 items concerning the learned skills thought to contribute to success as an

executive. The 16 scales are labeled Resourcefulness, Doing whatever it takes, A quick study;

Building and mending relationships, Leading subordinates, Compassion and sensitivity; Straight-

forwardness and composure, Setting a developmental climate, Confronting problem

subordinates, Team orientation, Balance between personal life and work, Decisiveness, Self-

awareness, Hiring talented staff, Putting people at ease, and Acting with flexibility.

Benchmarks items consist of a stem containing a behavioral description of a specific skill,

competency, or capacity, and the focal manager is rated on a five-point scale (low to high) for

the degree to which the item is characteristic of him or her. This instrument has been subjected to

multiple validation efforts (for a review, see McCauley & Lombardo, 1990) and has received

True or False? 9

favorable overall evaluations as a reliable and valid measure of important aspects of leadership

related to executive development (e.g., Zedeck, 1995). Internal consistency reliability

coefficients for the 16 scales, estimated at the individual rater level of analysis are presented in

Table 2. As this table shows the present sample, generally exceeded recommended standards for

internal consistency (cf. Nunnally, 1978).

Procedures

Both inter-rater agreement and reliability for Benchmarks ratings were assessed to justify

aggregating coworker ratings at the scale level within each source (as recommended by Fleenor,

Fleenor, & Grossnickle, 1996; and Tinsley & Weiss, 1975). For these purposes, three separate

databases—one for superior ratings, one for peer ratings, and one for subordinate ratings—were

independently constructed (all with N = 500). Each database contained a random sampling of

coworkers representing the respective perspective where exactly two superiors, four peers, or

four subordinates provided ratings.

James’ rWG index was used to determine the level of rater agreement (James, Demaree, &

Wolf, 1984; 1993). This statistic is appropriate when a group of raters rate a single target on

several variables thought to be indicators of the same construct and the researcher wants to know

the extent to which the overall level of ratings is similar across the raters. Similar to indices of

reliability, values closer to 1 indicate higher convergence. Separate rWG values were computed

for each individual rating target on each of the 16 Benchmarks scales within each rating source.

Next, the mean rWG across targets for each scale was computed within the three rating

perspectives. The mean level of agreement across the three sources for all scales was high

according to James’ criteria (James et al., 1993): for superior ratings the median scale mean rWG

value was .91 (range .84 to .98), for peer ratings it was .89 (range = .86 to .98), and for

True or False? 10

subordinate ratings it was .86 (range = .81 to .97).

Benchmarks scale ratings were assessed for inter-rater reliability with intraclass correlations

(ICC; Shrout & Fleiss, 1979). Whereas rWG is an index of level of agreement, ICCs provide an

assessment of agreement in terms of rank-ordering (Fleenor et al., 1996). ICCs were computed

separately for the three rating sources. For each of the 16 scales, the reliability of the mean of the

two superiors’ ratings and the reliabilities of the means of the four raters within the peer and

subordinate groups were calculated (ICC [1,2] for superiors and ICC[1,4] for peer and

subordinates; Shrout & Fleiss, 1979). For superiors, these coefficients ranged from .48 to .73 (M

= .64); for peers they ranged from .46 to .72 (M = .61); and they ranged from .48 to .75 (M = .62)

for subordinates. Taken together, the inter-rater agreement and reliability estimates were deemed

high enough to justify aggregating scale scores across raters within each source. Additionally,

given our interest in identifying between-source rating similarity or differences, the extent that

within-source variance may be reduced or obviated enhances the likelihood of identifying these

between-source trends.

Results

Hypothesis 1 - Supported

Table 3 reports the pattern of skewness and kurtosis for supervisor ratings (the other groups

had a nearly identical pattern). The scale distributions are negatively skewed and kurtotic. These

distributions represent prima facie evidence of leniency effects (or biases) among rating sources

(cf. Cardy & Dobbins, 1994; Murphy & Cleveland, 1995). That is, raters (or aggregated raters)

within sources assigned "inflated" scores to the target manager. An alternative explanation (and

one not necessarily implying biased ratings) is that these executives and managers tended to

legitimately score at the top end of the performance distributions. The veridicality of each

True or False? 11

interpretation is subject to debate. For the current study, the key finding, and consistent with

Hypothesis 1, is that observed ratings tended to cluster at the high end of the performance

distributions.

Hypotheses 2 - Supported

To test Hypothesis 2, scores on the 16 Benchmark scales were analyzed for each of six

between-source dyadic comparisons (i.e., self-boss, self-peer, self-direct report, boss-peer, boss-

direct report, and peer-direct report). It is important to reemphasize that scores on the 16 scales

were aggregated within the peer and direct report rating sources prior to comparisons. Thus, in

some instances individual-level ratings are being compared to individual-level ratings (i.e., self-

boss), individual-level ratings are being compared to mean-level ratings (i.e., self-peer, self-

direct report), and mean-level ratings are being compared to mean-level ratings (i.e., peer-direct

report). By aggregating data within source we hoped to minimize the amount of within-source

variance, thus providing us with a better opportunity to identify between group-source

differences or lack thereof.

Results are presented in Tables 4-9. To formally evaluate Hypothesis 2 we computed both

one-way random effects intraclass correlation coefficients (Bartko, 1976; Shrout & Fleiss, 1979)

and Pearson r interclass correlation coefficients. These are the statistics typically reported in

literature examining between-source differences in performance ratings (cf. Harris &

Schraubroeck, 1988; Kraiger, 1986). The pattern of results is consistent across each dyadic

comparison. In general, the between-source ICCs and Pearson r's for each scale were low. In

none of cases did the ICC (1,1), and in only 5 of 96 instances did the ICC (1,2) exceed the

minimum accepted reliability threshold of .70. Similarly, nearly all of the between-source

interclass correlations were low.

True or False? 12

These results suggest that rating similarity, as indexed via correlational analyses was low

across all six between-source comparisons. One possible explanation for these low values was

the restricted variance observed for all 16 scales. In this instance, one might argue that the

observed values are "artificially" deflated due to restriction of variance. Alternatively, these low

values may be a function of true differential response patterns by the different rating sources. In

this instance, one might argue that the observed values accurately reflect the lack of rating

similarity across sources. Hypothesis 3 sought to more fully explore these alternative

explanations.

Hypothesis 3 - Supported

To test Hypothesis 3, scores on the 16 Benchmark scales were again analyzed for each of

six between-source dyadic comparisons. It is important to again note that in several instances

individual level data were compared to aggregated data. First we computed rWG statistics using

the computational formula offered by James, et al. (1984). The basic idea underlying rWG is that

the observed variance among multiple judges is compared to the variance one would obtain from

one or more "null" distributions. Though the uniform or rectangular distribution is the most

common, James et al. provide information about how to model distributions with various

response biases (e.g., leniency bias) and encourage researchers to use such distributions

whenever they suspect their data might be impacted by such effects. Thus, in the interest of

trying to provide an accurate (i.e., uninflated) estimated of between-source rating agreement, we

elected to compute rWGs based upon null distributions containing both a slight skew (σ2 = 1.33

for a 5-point scale) and a moderate skew (σ2 = 0.90). Finally, for comparison purposes we

computed rWGs using the rectangular or uniform distribution (σ2 = 2.0).

True or False? 13

rWGs were computed for each dyadic comparison for each target manager. In the rare

instances when rWGs were negative these values were reset to zero in accordance with

recommendations offered by James et al. (1984). Next we computed mean rWGs across all targets

for each of the 16 Benchmarks scales. These are the values reported in Tables 4 -9. To elucidate

the process further, let us provide a more concrete example. For the comparison between peer

and boss ratings we first aggregated peer ratings for each target manager. Next we computed rWG

values based upon the observed variance between average peer ratings and boss ratings for each

of roughly 6,500 target managers on the 16 scales. Finally, we computed mean rWGs for the 16

scales collapsing across the approximately 6,500 targets.

The results based on the uniform or rectangular null response distribution suggested strong

between-source rating similarity. As expected, the values of the obtained rWGs were lower for the

slightly skewed and moderately skewed null distributions. However, even basing rWGs on a

moderately skewed distribution, the obtained mean rWGs were of sufficient magnitude to suggest

strong between-source agreement for all six dyadic comparisons. These results taken as a whole

suggest that rating similarity as indexed via rWG analyses was high across all between-source

comparisons. These results coupled with the low ICCs obtained in Hypothesis 2 provide initial

support for the restriction of variance model.

Discussion

Summary

The purpose of this paper was to offer an alternative explanation for the discrepancies often

reported between different sources of performance ratings. This explanation labeled the

"restriction of variance hypothesis" suggests that the low intraclass correlations and Pearson

correlations are largely a function of restricted variance in the performance ratings, possibly

True or False? 14

engendered by I/O interventions such as vocational counseling, selection, training, performance

evaluation, culture and climate interventions, and/or other person-environment fit issues (cf.

Kristof, 1996; Schneider, Goldstein, & Smith, 1995).

Results supporting Hypothesis 1 replicated earlier work on performance ratings

demonstrating the 16 Benchmarks scales tended to be kurtotic and negatively skewed. Results

supporting Hypothesis 2 indicated that between-source (e.g., self-boss) intraclass correlation

coefficients tended to be low, suggesting lack of rating similarity across sources. However,

supporting the restriction of variance hypothesis, results for Hypothesis 3 demonstrated high

levels of between-source rating similarity when similarity was indexed via James, et al.'s (1984)

rWG statistic.

These contrasting conclusions may be best explained by distinguishing interrater reliability

from interrater agreement. The former is concerned with the consistency in target rank-orders as

assigned by different raters, while the latter is concerned with consensus in the scores assigned

by different raters. Thus, under conditions where the range of targets has been attenuated or

restricted, interrater reliability coefficients such as ICCs will necessarily be low, irrespective of

whether in the unrestricted population ICCs are high. Restricted samples, however, do not

negatively impact the interrater agreement values obtained via rWG. Additional analyses are

underway using a Monte Carlo methodology varying the levels of selection ratios and empirical

validities and examining these effects on interrater reliability (ICC) and interrater agreement

(rWG) indices. Initial results from these analyses further support the restriction of variance

hypothesis.

Intraclass Correlations, Interclass Correlations, and Criterion Reliability

True or False? 15

We believe that I/O psychologists have potentially placed too much stock in rank-order

based indices of rating similarity such as the intraclass correlation and Pearson’s interclass

correlation. This is especially true for researchers working in the areas of validity generalization

(VG) and meta-analysis. For instance, in a recent review by Ones, Viswesvaran, and Schmidt

(1993) the authors concluded that the average reliability of supervisory ratings is a .52. Stated

alternatively, roughly 50% of the variance in supervisory performance ratings is random error

variance! We sincerely doubt this to be the case. On the contrary we believe the .52 estimate

represents a lower bound estimate that has been grossly attenuated due to restriction of variance

in performance.

Indeed, if researchers insist upon using intraclass and interclass correlations as indices of

rating similarity and reliability and inputting these values into correction formulae, they should at

least correct these values for attenuation due to restriction of range. Such correction formulae

have been available to psychologists for years and have been recognized by the leading VG

researchers. Indeed, Schmidt, Hunter, and Urry (1976) note, "In the typical validation study, the

criterion reliability, as well as the test validity, is available only on the restricted group. Both

coefficients should be corrected first for restriction of range" (p. 475). We agree with Schmidt et

al. and in the future we would like to see such corrections for range restriction routinely included

in validation and meta-analytic studies.

360-Degree Feedback and Rating Source Discrepancies

The results of the current study suggest that individuals working in the area of 360-degree

feedback and leadership development may do well to temper their enthusiasm for “discrepancy”

type models of leader behavior. The current study suggests that different sources of ratings, that

were at times aggregated within-source, tended to agree with other sources of ratings. The levels

True or False? 16

of between-source rating agreement were similar to previous studies on this data examining

within-source rating agreement (cf. Kaiser, 1998). Stated alternatively, it appears that the

between-source ratings are no more "discrepant" than ratings obtained within-source. An

important caveat to note is that neither within-source nor between-source rating agreement every

legitimately approached 1.0. Perhaps, true with- and between-source rating discrepancies still

exist, however based upon the results of the current study we believe these effects to be small.

Conclusion

Ratings between sources showed substantial consensus or agreement on the level (rWG) of

target managers’ performance. Observed correlations between sources showed much less

consistency in the rank-ordering (ICC) of performance. This suggests that different sources of

ratings show much greater similarity than previous research has indicated. The logic behind 360-

degree feedback rests upon the idea that different rating groups provide unique performance

information (Dalessio, 1998). This argument has been supported with research showing the

moderate-at-best correlations between rating sources. The present results question the extent and

magnitude of discrepancies in ratings among different sources (see also, Mount et al., 1998). To

be sure, there is a degree of distinctiveness between sources—their correlations and rWGs are not

unity. But the degree of uniqueness appears to be less extreme than the pundits of 360-degree

feedback have claimed.

True or False? 17

References

Conway, J., & Huffcutt, A. (1997). Psychometric properties of multi-source performance

ratings: A meta-analysis of subordinate, supervisor, peer, and self-ratings. Human Performance,

10, 331-360.

Dalessio, A. T. (1998). Using multi-source feedback for employee development and

personnel decisions. In J. W. Smither (Ed.), Performance Appraisal. San Francisco: Jossey Bass.

Fleenor, J. W., Fleenor, J. B., & Grossnickle, W. F. (1996). Interrater reliability and

agreement of performance ratings: A methodological comparison. Journal of Business and

Psychology, 10, 367-380.

Fleenor, J.W., McCauley, C.D., & Brutus, S. (1996). Self-other rating agreement and leader

effectiveness. Leadership Quarterly, 7, 487-506.

Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self-supervisor, self-peer, and

peer-supervisor ratings. Personnel Psychology, 41, 43-62.

James, L. R., Demaree, R., G. & Wolf, G. (1984). Estimating within-group interrater

reliability with and without response bias. Journal of Applied Psychology, 69, 85-98.

James, L. J., Demaree, R. G., & Wolf, G. (1993). rWG: An Assessment of within-group

interrater agreement. Journal of Applied Psychology, 78, 306-309.

James, L. R., & Mazerolle, M. (in press). Personality in work organizations: An integrative

approach. Sage Publications.

Kaiser, R.B. (1998). Can the MBTI be used to understand self-other rating agreement in

360o feedback? Unpublished master's thesis, I llinois State University, Normal, IL.

Kristof, A. L. (1996). Person-organization fit: An integrative review of its

conceptualizations, measurement, and implications. Personnel Psychology, 49, 1-49.

True or False? 18

Lombardo, M. & McCauley, C.D. (1994). Benchmarks: A manual and trainer’s guide.

Greensboro, NC: Center for Creative Leadership.

McCauley, C. D. & Lombardo, M. (1990). Benchmarks: An instrument for diagnosing

managerial strengths and weaknesses. In K. E. Clark & M. B. Clark (Eds.), Measures of

Leadership. West Orange, NJ: Leadership Library of America.

Mount, M. K., Judge, T. A., Scullen, S. E., Sytsma, M. R., & Hezlett, S. A. (1998). Trait,

rater and level effects in 360-degree performance ratings. Personnel Psychology, 51, 557-576.

Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of

integrity test validities: Findings and implications for personnel selection and theories of job

performance. Journal of Applied Psychology, 78, 679-703.

Shrout, P. E. & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater

reliability. Psychological Bulletin, 86, 420-428.

Schneider, B., Goldstein, H. W., & Smith, D. B. (1995). The ASA framework: An update.

Personnel Psychology, 48, 747-773.

Schmidt, F. L., Hunter, J. E., & Urry, V. W. (1976). Statistical power in criterion-related

validation studies. Journal of Applied Psychology, 61, 473-485.

Zedeck, S. (1995). [Review of Benchmarks]. In J. Conoley & J. Impara (Eds.), The twelfth

mental measurements yearbook (Vol. 1, pp 128-129). Lincoln, NE: Buros Institute of Mental

Measurements.

True or False? 19

Table 1

I llustration of restriction of variance upon the computation of intraclass correlations.

Target Rating 1 Rating 2 1 5 5 2 4 4 3 5 4 4 4 5 5 4 4 6 5 5 7 4 4 8 5 4 9 4 5 10 4 4 11 2 2 12 1 1 13 2 1 14 1 2 15 1 1 16 2 2 17 1 1 18 2 1 19 1 2 20 1 1

ICC(1,1) = 0.22 based upon first 10 targets

ICC(1,1) = 0.22 based upon second 10 targets

ICC(1,1) = 0.92 based upon all 20 targets

Mean rWG(1) =.85 across all 20 targets

True or False? 20

Table 2

Internal consistency (Coefficient Alpha) estimates for 16 Benchmarks scales.

Scale Title Overall Boss Peer DR Self

1 Resourcefulness 0.92 0.91 0.93 0.93 0.85

2 Doing Whatever it Takes 0.90 0.89 0.90 0.91 0.82

3 Being a Quick Study 0.86 0.86 0.86 0.86 0.81

4 Decisiveness 0.79 0.82 0.80 0.77 0.76

5 Leading Employees 0.91 0.90 0.91 0.91 0.91

6 Setting Developmental Climate 0.83 0.79 0.83 0.85 0.71

7 Confronting Problem Employees 0.81 0.82 0.81 0.81 0.77

8 Work Team Orientation 0.76 0.79 0.77 0.75 0.70

9 Hiring Talented Staff 0.80 0.82 0.82 0.78 0.76

10 Building/Mending Relationships 0.92 0.91 0.93 0.93 0.83

11 Compassion/Sensitivity 0.81 0.78 0.82 0.81 0.65

12 Straightforwardness and Composure 0.74 0.73 0.76 0.75 0.63

13 Balance Between Personal- and Work

Life

0.82 0.78 0.81 0.83 0.84

14 Self-Awareness 0.83 0.82 0.83 0.84 0.65

15 Putting People at Ease 0.89 0.88 0.89 0.89 0.82

16 Acting with Flexibility 0.83 0.79 0.83 0.84 0.68

Mean 0.84 0.83 0.84 0.84 0.76

Median 0.83 0.82 0.83 0.84 0.77

True or False? 21

Table 3

Skewness and kurtosis of supervisor ratings of target managers for the 16 Benchmarks scales.

Skew Skew Kurtosis Kurtosis N Mean StDev Statistic Std. Error t-value Statistic Std. Error t-value

Resourcefulness 7510 3.6449 .5350 -.412 .028 -14.71 .347 .057 6.09

Doing Whatever it Takes 7512 3.7956 .5539 -.491 .028 -17.54 .447 .057 7.84

Being a Quick Study 7465 4.0114 .6342 -.549 .028 -19.61 .433 .057 7.60

Decisiveness 7499 3.5987 .7843 -.406 .028 -14.50 -.280 .057 -4.91

Leading Employees 7283 3.5623 .5725 -.370 .029 -12.76 .162 .057 2.84

Setting Developmental Climate 7260 3.7406 .5901 -.498 .029 -17.17 .556 .057 9.75

Confronting Problem Employees 6677 3.3353 .7582 -.292 .030 -9.73 -.257 .060 -4.28

Work Team Orientation 7383 3.5443 .7350 -.409 .029 -14.10 -.135 .057 -2.37

Hiring Talented Staff 6747 3.666 .676 -.375 .030 -12.50 .202 .060 3.37

Building/Mending Relationships 7510 3.5949 .6676 -.470 .028 -16.79 -.009 .057 -0.16

Compassion/Sensitivity 7216 3.6846 .6361 -.470 .029 -16.21 .468 .058 8.07

Straightforwardness and Composure

7498 4.1072 .6279 -.891 .028 -31.823 .811 .057 14.23

Balance Between Personal- and Work Life

7185 3.8116 .7454 -.749 .029 -25.83 .368 .058 6.35

Self-Awareness 7443 3.5421 .7207 -.565 .028 -20.18 .239 .057 4.19

Putting People at Ease 7514 3.8288 .7872 -.478 .028 -17.07 -.152 .056 -2.71

Acting with Flexibility 7506 3.6032 .6249 -.434 .028 -15.50 .246 .057 4.32

True or False? 22

Table 4

Intraclass correlation coefficients, interclass (Pearson) correlation coefficients, and rWGs for self-boss ratings.

Self-Boss mean mean mean

N ICC(1,1) ICC(1,2) Pearson rwg-un-nn rwg-ss-nn rwg-ms-nn SCALE01 Resourcefulness 6826 0.1927 0.3231 0.2110 0.9120 0.8703 0.8174 SCALE02 Doing Whatever it Takes 6825 0.2775 0.4344 0.2900 0.9099 0.8676 0.8137 SCALE03 Being a Quick Study 6776 0.1905 0.3200 0.2150 0.8545 0.7921 0.7194 SCALE04 Decisiveness 6815 0.3563 0.5254 0.3720 0.8408 0.7757 0.7018 SCALE05 Leading Employees 6604 0.1463 0.2553 0.1880 0.8931 0.8441 0.7845 SCALE06 Setting Developmental Climate 6594 0.1437 0.2513 0.1620 0.8817 0.8307 0.7698 SCALE07 Confronting Problem Employees 6007 0.2573 0.4093 0.2600 0.8222 0.7515 0.6724 SCALE08 Work Team Orientation 6712 0.2616 0.4148 0.3290 0.8338 0.7675 0.6930 SCALE09 Hiring Talented Staff 6058 0.1987 0.3316 0.2420 0.8441 0.7800 0.7055 SCALE10 Building/Mending Relationships 6825 0.2811 0.4388 0.3150 0.8855 0.8336 0.7709 SCALE11 Compassion/Sensitivity 6567 0.2385 0.3851 0.2440 0.8731 0.8171 0.7500 SCALE12 Straightforwardness and Composure 6816 0.1265 0.2246 0.1430 0.8679 0.8102 0.7397 SCALE13 Balance Between Personal- and Work Life 6542 0.3273 0.4932 0.3880 0.8040 0.7320 0.6535 SCALE14 Self-Awareness 6764 0.0806 0.1492 0.1110 0.8327 0.7687 0.6980 SCALE15 Putting People at Ease 6828 0.3977 0.5690 0.4060 0.8440 0.7793 0.7044 SCALE16 Acting with Flexibility 6820 0.1131 0.2032 0.1440 0.8695 0.8133 0.7475

Grand Mean 0.2243 0.3580 0.2513 0.8605 0.8021 0.7338 Grand StDev 0.0912 0.1212 0.0920 0.0312 0.0404 0.0484

Note. rWG-u-nn is the rWG based upon the uniform distribution resetting negative values to zero. rWG-ss-nn is the rWG based upon the slightly skewed distribution resetting negative values to zero. rWG-ms-nn is the rWG based upon the slightly skewed distribution resetting negative values to zero.

True or False? 23

Table 5

Intraclass correlation coefficients, interclass (Pearson) correlation coefficients, and rWGs for self-direct report ratings.

Self-DR mean mean mean




True or False? 24

Table 6

Intraclass correlation coefficients, interclass (Pearson) correlation coefficients, and rWGs for self-peer ratings.

Self-Peer mean mean mean




True or False? 25

Table 7

Intraclass correlation coefficients, interclass (Pearson) correlation coefficients, and rWGs for boss-direct report ratings.

Boss-DR mean mean mean




True or False? 26

Table 8

Intraclass correlation coefficients, interclass (Pearson) correlation coefficients, and rWGs for boss-peer ratings.

Boss-Peer mean mean mean




True or False? 27

Table 9

Intraclass correlation coefficients, interclass (Pearson) correlation coefficients, and rWGs for peer-direct report ratings.

Peer-DR mean mean mean




true or false? 1 running head: interrater...

Documents