kessler intro who hpq 2003 - bdms cme online

19
The World Health Organization Health and Work Performance Questionnaire (HPQ) Ronald C. Kessler, PhD Catherine Barber, MPA Arne Beck, PhD Patricia Berglund, M.BA Paul D. Cleary, PhD David McKenas, MD Nico Pronk, PhD Gregory Simon, MD Paul Stang, PhD T. Bedirhan Ustun, MD Phillip Wang, MD, ScD This report describes the World Health Organization Health and Work Performance Questionnaire (HPQ), a self-report instrument designed to estimate the workplace costs of health problems in terms of reduced job performance, sickness absence, and work-related accidents- injuries. Calibration data are presented on the relationship between individual-level HPQ reports and archival measures of work perfor- mance and absenteeism obtained from employer archives in four groups: airline reservation agents (n 441), customer service representatives (n 505), automobile company executives (n 554), and railroad engineers (n 850). Good concordance is found between the HPQ and the archival measures in all four occupations. The paper closes with a brief discussion of the calibration methodology used to monetize HPQ reports and of future directions in substantive research based on the HPQ. (J Occup Environ Med. 2003;45:156 –174) I nterest in the social consequences of illness has broadened in the past decade as epidemiologists have joined health economists and health services researchers to devise meth- ods that rationalize the allocation of health care resources. 1,2 Research showing that untreated and under- treated health problems exact sub- stantial personal costs from the indi- viduals who experience them as well as from their families, employers, and communities has been a central part of this work. 3,4 Among the most important of these results have been those concerning the workplace costs of illness from the perspective of the employer. 5,6 These costs have enor- mous implications for the economy. For example, a recent analysis esti- mated that depression causes an an- nual loss of $33 billion in work absenteeism in the U.S. 7 Given the low rate of depression treatment 8 and the fact that treatment substantially improves role functioning among people with depression, 9,10 such data suggest that it might be cost- effective for employers to increase the proportion of depressed workers who receive treatment. 11 Similar ar- guments have been made for a num- ber of other illnesses, 12–14 the notion being that targeted expansion of em- ployee health care benefits, including an outreach component, can repre- sent an investment opportunity for employers. Only a small minority of employ- ers has as yet been convinced of the business case for targeted investment in employee health care. This is partly a result of the absence of data to evaluate the indirect costs of un- From Harvard Medical School, Boston, Massachusetts (Dr Kessler, Dr Cleary, Dr Wang); Harvard School of Public Health, Boston, Massachusetts (Ms Barber); Kaiser-Permanente, Denver, CO (Dr Beck); University of Michigan, Ann Arbor, MI (Ms Berglund); American Airlines, Fort Worth, TX (Dr McKenas); HealthPartners, Minneapolis, MN (Dr Pronk); Center for Health Studies, Group Health Cooperative, Seattle, WA (Dr Simon); Galt Associates, Blue Bell, PA (Dr Stang); and World Health Organization, Geneva, Switzerland (Dr Ustun). Address correspondence to: R. C. Kessler, Department of Health Care Policy, Harvard Medical School, 180 Longwood Avenue, Boston, MA 02115; e-mail: [email protected]. Copyright © by American College of Occupational and Environmental Medicine DOI: 10.1097/01.jom.0000052967.43131.51 156 The WHO HPQ Kessler et al

Upload: others

Post on 10-Jan-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Kessler Intro Who HPQ 2003 - BDMS CME Online

The World Health Organization Health andWork Performance Questionnaire (HPQ)

Ronald C. Kessler, PhDCatherine Barber, MPAArne Beck, PhDPatricia Berglund, M.BAPaul D. Cleary, PhDDavid McKenas, MDNico Pronk, PhDGregory Simon, MDPaul Stang, PhDT. Bedirhan Ustun, MDPhillip Wang, MD, ScD

This report describes the World Health Organization Health andWork Performance Questionnaire (HPQ), a self-report instrumentdesigned to estimate the workplace costs of health problems in terms ofreduced job performance, sickness absence, and work-related accidents-injuries. Calibration data are presented on the relationship betweenindividual-level HPQ reports and archival measures of work perfor-mance and absenteeism obtained from employer archives in four groups:airline reservation agents (n � 441), customer service representatives(n � 505), automobile company executives (n � 554), and railroadengineers (n � 850). Good concordance is found between the HPQ andthe archival measures in all four occupations. The paper closes with abrief discussion of the calibration methodology used to monetize HPQreports and of future directions in substantive research based on theHPQ. (J Occup Environ Med. 2003;45:156–174)

I nterest in the social consequences ofillness has broadened in the pastdecade as epidemiologists havejoined health economists and healthservices researchers to devise meth-ods that rationalize the allocation ofhealth care resources.1,2 Researchshowing that untreated and under-treated health problems exact sub-stantial personal costs from the indi-viduals who experience them as wellas from their families, employers,and communities has been a centralpart of this work.3,4 Among the mostimportant of these results have beenthose concerning the workplace costsof illness from the perspective of theemployer.5,6 These costs have enor-mous implications for the economy.For example, a recent analysis esti-mated that depression causes an an-nual loss of $33 billion in workabsenteeism in the U.S.7 Given thelow rate of depression treatment8 andthe fact that treatment substantiallyimproves role functioning amongpeople with depression,9,10 such datasuggest that it might be cost-effective for employers to increasethe proportion of depressed workerswho receive treatment.11 Similar ar-guments have been made for a num-ber of other illnesses,12–14 the notionbeing that targeted expansion of em-ployee health care benefits, includingan outreach component, can repre-sent an investment opportunity foremployers.

Only a small minority of employ-ers has as yet been convinced of thebusiness case for targeted investmentin employee health care. This ispartly a result of the absence of datato evaluate the indirect costs of un-

From Harvard Medical School, Boston, Massachusetts (Dr Kessler, Dr Cleary, Dr Wang); HarvardSchool of Public Health, Boston, Massachusetts (Ms Barber); Kaiser-Permanente, Denver, CO (DrBeck); University of Michigan, Ann Arbor, MI (Ms Berglund); American Airlines, Fort Worth, TX (DrMcKenas); HealthPartners, Minneapolis, MN (Dr Pronk); Center for Health Studies, Group HealthCooperative, Seattle, WA (Dr Simon); Galt Associates, Blue Bell, PA (Dr Stang); and World HealthOrganization, Geneva, Switzerland (Dr Ustun).

Address correspondence to: R. C. Kessler, Department of Health Care Policy, Harvard MedicalSchool, 180 Longwood Avenue, Boston, MA 02115; e-mail: [email protected].

Copyright © by American College of Occupational and Environmental Medicine

DOI: 10.1097/01.jom.0000052967.43131.51

156 The WHO HPQ • Kessler et al

Page 2: Kessler Intro Who HPQ 2003 - BDMS CME Online

treated and under-treated healthproblems in the workplace. Employ-ers who have access to integrateddatabases on medical expenditures,pharmacy expenditures, workplaceinjuries, and disability can go partway to resolve this problem, as suchdatabases can be used to evaluate theeffects of changes in health planbenefits over time on a number ofimportant employer costs.15 How-ever, even completely integrated da-tabases typically lack two criticaltypes of information that are neededto make a strong business case forexpanded investment in employeehealth care. First, few companieshave accurate individual-level jobperformance data for most of theiremployees. Even basic data on sick-ness absence are generally availableonly for blue-collar and pink-collarworkers, but not for white-collarworkers, whereas data on job perfor-mance are usually either nonexistent,superficial, or very difficult to obtainin machine-readable form. Second,medical data, when available, typi-cally focus on treated health prob-lems. No information is generallyavailable on untreated health prob-lems other than in companies thatperform routine physical examina-tions on all their employees and linkthese data with information about jobperformance. In the absence of suchlinked data files, it is impossibleeither to estimate the number ofworkers with unmet need for treat-ment or the effects of untreatedhealth problems on work perfor-mance.

Recognizing the need for suchdata, a number of health servicesresearchers have developed self-report measurement tools to collectdata in employee surveys on un-treated health problems and workperformance. Although inferior toobjective performance-based mea-sures, self-report work performancemeasures can be extremely usefulwhen objective measures are un-available. This is especially truewhen the self-report measures arecalibrated against objective measures

in such a way that scores on theself-report measures can be mean-ingfully interpreted.

Lynch and Riedel16 recently re-viewed the most widely used workperformance measures in the litera-ture. Their review showed that thesemeasures generally have good inter-nal consistency reliability and goodface validity, but have not been com-pared to objective data on work per-formance either to demonstrate theirvalidity or to generate calibrationrules. The current report presentsdata of this sort for one such self-report work performance measure,the World Health Organization’s(WHO) Health and Work Perfor-mance Questionnaire (HPQ). Dataare presented from four HPQ calibra-tion surveys, each carried out in aseparate corporation and focused ona single type of worker for whomarchival data were available either onsickness absence, work performance,or both. The four samples includereservation agents working for a ma-jor airline, customer service repre-sentatives working for a large telecom-munications company, executivesworking for a major automobile man-ufacturer, and railroad engineers work-ing for a large railroad company. Dataare presented on the relationships ofindividual-level HPQ job performancemeasures with archival measures usedby the companies to monitor workerperformance. Data are also presentedon comparisons between individual-level HPQ absenteeism data and em-ployer payroll records. The papercloses with a brief discussion of cali-bration methodology and future direc-tions in substantive research based onthe HPQ.

Development of the HPQ

BackgroundThe HPQ was developed as an

expansion of the work role module inthe WHO Disability AssessmentSchedule (WHO-DAS).17 TheWHO-DAS is a self-report measureof role functioning that was createdby WHO for use in community sur-

veys as well as in intervention stud-ies aimed at reducing the role impair-ments associated with untreated orunder-treated health problems. Thefull WHO-DAS includes scales ofrole functioning in each of the coredomains of the newly revised Inter-national Classification of Function-ing,18 whereas the HPQ focuses ex-clusively on the work role domain.

HPQ development began with areview of other existing scales, fol-lowed by pilot interviews, develop-ment of preliminary questions, sys-tematic evaluation and refinement ofthese questions by experts in surveyquestion wording using the methodsdescribed by Converse and Presser,19

and additional pilot testing with cog-nitive debriefing interviews aimed atdetecting and removing ambiguitiesin question wording.20 Full-scale pi-lot surveys were then carried out inthree managed care samples and onelarge corporation sample in order tostudy the psychometric properties ofthe scales and to examine the effectsof various chronic conditions onHPQ measures of work perfor-mance.21 The final HPQ, based on allthese earlier studies, was then admin-istered to the four calibration sam-ples described in the current report.

The complete text of the HPQ isposted at: http://www.hcp.med.har-vard.edu/hpq. Benchmark surveyscores, information on using theHPQ, and updates of ongoing HPQevaluations will be posted on this siteas they become available.

Work PerformanceThree outcomes are traditionally

measured in studies conducted byorganizational researchers on the ef-fects of various workplace produc-tivity enhancement interventions: ab-senteeism, work performance, andjob-related accidents.22 We decidedto measure all three of these out-comes in the HPQ. Work perfor-mance is obviously the most difficultof these three to assess. Indeed, thedecision to develop the HPQ wasbased largely on our failure to findan existing self-report measure of

JOEM • Volume 45, Number 2, February 2003 157

Page 3: Kessler Intro Who HPQ 2003 - BDMS CME Online

work performance that met ourneeds.

The ideal way to assess work per-formance, of course, would be bymeans of objective performance-based assessment rather than self-report. Many employers have devel-oped assessments of this sort for atleast some of their workers.23,24

However, these systems vary enor-mously in coverage as well as insophistication, making them impos-sible to use in broad-based studies ofhealth and work performance. An-other possibility is to use specialperformance-based tests, many ofwhich have been developed in con-junction with the Department of La-bor Occupational Information Net-work (O*NET) system of jobclassification.25 However, these testsassess ability rather than actual per-formance on the job, making it im-possible to evaluate under-perfor-mance by workers with high nativeability who fail to perform up to theirability on the job.26 Based on theseconsiderations, we concluded thatself-report measures are the mostfeasible tools for our purposes.

A comprehensive review of theliterature found a number of usefulself-report measures of work perfor-mance.27–29 The most compelling ofthese, however, focused on singleoccupations and included questionsthat were tailored to the unique de-mands of those occupations.30 –32

The measure we needed, in compar-ison, had to be appropriate across thefull range of the occupational spec-trum. While we found scales of thissort in our review, they all sufferedfrom one of three serious problems:unequal relevance across the fullrange of the occupational spectrum;incomplete coverage of importantperformance domains; or lack oftranslation rules to link domain-specific performance measures withan overall assessment of work per-formance. These problems arebriefly reviewed in the next fourparagraphs.

The problem of unequal relevancestems from the difficulty of develop-

ing concrete self-report work perfor-mance questions that are equally rel-evant to workers across the full rangeof the occupational spectrum. Anumber of work performance scalescan be faulted along these lines. TheWork Productivity Scale,33 for ex-ample, includes questions about put-ting off business phone calls andfailing to attend business meetings,whereas the Stanford PresenteeismScale34 includes questions about be-ing cranky with work subordinatesand failing to find new-creative so-lutions to work problems. Thesequestions are much more relevant towhite-collar workers than to blue-collar or pink-collar workers, intro-ducing a bias into these scales thatcan lead to an overestimation of theprevalence of impaired work perfor-mance among white-collar workerscompared to other workers. Thisbias, in turn, can lead to biased re-sults suggesting that health problemshave greater effects on white-collarworkers than on other workers andthat the health problems most rele-vant to the performance of white-collar workers have greater adverseeffects on work functioning than thehealth problems most relevant to theperformance of other workers.

Other measures of work perfor-mance have been designed explicitlyto overcome the problem of equalrelevance across the occupationalspectrum35,36 by including brief as-sessments of health-related impair-ments in each of a wide number ofbasic domains of role performance(eg, mobility, vision and hearing,fine motor coordination, concentra-tion, communication). The hope isthat this heterogeneous coverage willtap the main job demands of workersin all occupations. However, no sys-tematic attempt is made in thesescales to assess all the importantdomains of work performance thatneed to be covered. As a result,although these scales cover a numberof domains, there is no reason tobelieve that the coverage is eithercomprehensive or comparable acrossall occupations. The depth of this

problem can be seen by examiningthe Dictionary of Occupational Titles(DOT),37 a document prepared bythe Department of Labor (DOL) todescribe the skill sets needed in theover 22,000 occupations in the U.S.labor force. Systematic observationof day-to-day work performance ineach of these occupations by DOLemployees documented over 50 dif-ferent work performance domains.No existing work performance scaleeither assesses all these domains orattempts systematically to sampleacross these domains.

This problem could be overcomeif a small number of global workperformance domains was isolatedempirically. An early psychometricanalysis designed to search for suchglobal domains in the DOT yieldedpromising results.38 More currentwork along the same lines mightevolve from DOL’s new on-lineO*NET system of job classification(www.onetcenter.org). Indeed, onegoal of the O*NET system is toassemble a set of objective perfor-mance-based tests that will cover allthe many different dimensions ofwork performance in the O*NETclassification. Self-report measuresalready exist for some of the O*NETdimensions.39,40 It is conceivablethat self-report measures of the otherO*NET dimensions could be devel-oped. However, a comprehensive setof such scales, if they were everdeveloped, would likely take hoursor even days to administer.

Furthermore, even assuming thatcomprehensive O*NET scales couldbe developed and used, a final prob-lem would remain: that no rules existto combine the separate domainscores into an overall measure ofwork performance that is validacross all occupations. Such combi-nation rules would, at a minimum,require different weights to be ap-plied across domains for differentoccupations. Health-related difficul-ties in the domain of unskilled man-ual labor (eg, digging, lifting, carry-ing), for example, are presumablymuch more impairing to a manual

158 The WHO HPQ • Kessler et al

Page 4: Kessler Intro Who HPQ 2003 - BDMS CME Online

laborer than to a lawyer. This differ-ence would have to be taken intoaccount in combining domain perfor-mance scores into an overall workperformance score that appliesequally well to laborers, lawyers, andto workers in all the thousands ofother occupations in the labor force.In addition, the correct combinationrules are likely to be quite complex,involving different domain weights,nonlinearities, and nonadditivitiesfor particular occupations or classesof occupations. It might be possibleto develop such rules by analyzingextremely large databases containingthe appropriate variables. However,in the absence of such databases andsuch rules, neither of which currentlyexists, it is unclear how one couldarrive at a principled basis for com-bining domain-specific work perfor-mance scores into a valid overallassessment of work performance.

Based on these considerationsabout the current intractability of theabove problems, we decided to use asimple self-report global rating scaleto assess work performance in theHPQ. In this approach, respondentsare asked to rate their overall workperformance during the past fourweeks on a 0-to-10 self-anchoringscale in which 0 is defined as the“worst possible work performance” aperson could have on this job and 10is defined as “top work perfor-mance” on this job. Our reasoning inselecting this simple approach wasthat workers are in a better positionthan researchers to recognize thework performance domains that aremost relevant to their particular oc-cupations, to evaluate their recentperformance in these domains, and toarrive at a rating of their overallwork performance based on thisevaluation.

At the same time, we know fromthe methodological literature that re-sponses to 0-to-10 global ratingscales can be improved by two re-finements, both of which we used inthe HPQ: decomposition41 and inter-nal anchoring.20 Decomposition isone of several strategies that have

been developed by survey methodol-ogists to facilitate active memorysearch in response to complex surveyquestions. Research on the cognitiveprocesses used to arrive at accurateanswers to complex survey questionsshows that active memory search andreview of component experiencessubstantially improve response accu-racy.42 This same research shows,though, that many respondents givesuperficial answers based on generalsemantic memories or other responseheuristics because they are unwillingto engage in serious memorysearch.43 Decomposition addressesthis problem by asking preliminarycomponent questions that force re-spondents to engage in active mem-ory search before being asked thecomplex question.

Decomposition is used in the HPQby asking respondents a series ofquestions that require them to reviewcritical aspects of their work perfor-mance before assigning themselves arating on the global 0-to-10 scale.Specifically, we ask componentquestions about quantity of work(how often during the recall period:the respondent’s speed/productivityof work was lower than expected, therespondent did no work at timeshe/she was expected to be working),quality of work (how often duringthe recall period: the respondent didnot work as carefully as he/sheshould, his/her work quality waslower than expected, he/she was day-dreaming and not concentrating onwork), interpersonal aspects of work(how often during the recall period:the respondent had trouble gettingalong with others at work, had diffi-culty controlling his/her emotions atwork, and avoided interacting withothers at work), special work suc-cesses, special work failures, andaccidents-injuries. All of these ques-tions were explicitly designed to besufficiently general that they apply toall occupations, but sufficiently fo-cused that they facilitate relevantmemory search and review. Theglobal 0-to-10 scale is administered

only after these memory-priming de-composition questions are asked.

Internal anchoring is an especiallyimportant strategy to improve theaccuracy of responses to questionsthat use self-anchoring responsescales. The issue here is that mostself-anchoring scales define only theends of the distribution (eg, 0 definesthe “worst possible performance”whereas 10 defines the “best possibleperformance”), but not intermediatevalues, while the vast majority ofrespondents rate themselves as hav-ing values between these extremesand have no guidelines for selectingamong intermediate values.

Schwarz44 has shown that thisproblem can be addressed by rescal-ing 0-to-10 scales to range from �5to � 5 with a clear 0 point in themiddle. This rescaling substantiallyimproves the accuracy of response toself-anchoring scales in the middlepart of the scale distribution by high-lighting the midpoint. The difficultywith this approach in the case ofrating work performance, however,is that we have no reason to believethat the performance of the averageworker is at a level halfway betweenthe theoretical extremes of worst andbest performance. Indeed, our pilotresearch found that most workersreport average performance in theiroccupation to be substantially abovethe midpoint of this range. Based onthis evidence, we designed the HPQrating so that respondents could gen-erate their own internal anchors be-fore responding. This is done byasking each respondent to give sep-arate ratings for the average workeron their job and for their own usualperformance before rating their re-cent performance. In addition to pro-viding internal anchors, these addi-tional rating questions provideinformation that allows us to calcu-late ipsative scores of recent perfor-mance in comparison to usual perfor-mance and in comparison to theperformance of other workers. Inorder to obtain multiple indicatorsfor self-other comparisons, the HPQalso includes a separate question that

JOEM • Volume 45, Number 2, February 2003 159

Page 5: Kessler Intro Who HPQ 2003 - BDMS CME Online

asks respondents explicitly to com-pare their own recent performancewith that of the average worker onthe same job using a standard seven-point better-worse unfolding scale(ie, better, worse, or about the sameand, if either better or worse, a three-category rating of personal perfor-mance as either a lot, some, or only alittle better/worse than the averageworker).

AbsenteeismMost health and work perfor-

mance instruments assess absentee-ism with a single question about thenumber of days in the past month (orother recall period) the respondentmissed a day of work because ofillness. Previous research has showngood agreement between these self-reports and employer records of ab-senteeism.45 However, the results ofcognitive interviews led us to use amore detailed set of questions aboutabsenteeism in the HPQ. Four refine-ments were involved. First, we de-cided to focus on hours rather thanon days of work during the past fourweeks based on the fact that workersdiffer substantially in the number ofhours they work as well as inwhether they work the same numberof hours each day. Second, in addi-tion to asking about hours missed onsickness absence days, we ask abouthours missed on workdays (ie, com-ing in late or going home early) dueto the fact that a substantial propor-tion of missed work time occurs ondays when people come to work.Third, we ask about extra hours ofwork (ie, coming in early, goinghome late, working on days off)because of the fact that many work-ers put in extra hours to make up forsickness absence. Fourth, althoughwe distinguish between sickness ab-sence and other types of absence (eg,vacation, absence due to a familyemergency etc.), we also create acombined measure of total hours ab-sent because workers who have usedup their allotted sick days often useaccrued personal days or vacationdays when they are too ill to come to

work. In addition, many employersconsolidate the number of paid ab-sence days they allow their workersto take into a single category thatcombines vacation and personal daysand sickness absence days, makingthe distinction among these catego-ries artificial.

The question sequence in the HPQabsenteeism series makes use of thesame decomposition strategy de-scribed above in the discussion of thework performance measure. Specifi-cally, the series begins by askingseparately about number of daysmissed in the past four weeks forvacation and sickness absence, fol-lowed by number of partial work-days, and about days of extra hoursworked. The aim is to focus memorysearch by simplifying the task ofcalculating total lost work hours inresponse to a single question. It isnoteworthy that the decompositionquestions ask about days rather thanhours, even though hours are the unitof ultimate interest, because cogni-tive interviews show that the vastmajority of working people recon-struct work schedules more naturallyin terms of days than hours. At theend of this sequence, we ask aboutoverall hours worked. It is notewor-thy that we ask about hours workedrather than hours missed becausecognitive interviews show that mostworkers think more naturally interms of the former than the latter. Afinal question is asked about thenumber of hours each week the re-spondent is normally expected towork in order to have a denominatorfor calculating a percentage measureof work loss.

Job-Related AccidentsAlthough job-related accidents are

uncommon, they are important be-cause of their potential high cost. Weexplored a number of options forasking fully structured questionsabout accidents. In the end, though,the rarity and great variety of acci-dents led us to include a single open-ended question about job-related ac-cidents-injuries in the final HPQ.

This question is worded in such away that respondents are explicitlyasked to include incidents that ledeither to 1) breakage or other loss ofproperty; 2) delays in production orother decreases in work performanceof the respondents or other workers;3) physical injury of the respondentor others; and 4) serious risk of loss,delay, or injury. The textual re-sponses to these questions are con-verted into general anonymousvignettes and presented to supervi-sors for scoring in terms of theirmonetary cost to the company.Open-ended reports are also obtainedfor responses to questions about spe-cial successes (eg, making a big sale,getting a bonus or a promotion, beingselected as the employee of themonth) and special failures (eg, fail-ing to meet a production quota, rep-rimand from a supervisor, failing toget an expected bonus or promotion).As with accidents-injuries, specialsuccesses and failures, althoughcomparatively uncommon, are veryimportant components of the overallindirect costs of illness and the cost-savings associated with treatment.As with accidents, open-ended re-sponses to the questions about suc-cesses and failures are converted intogeneral anonymous vignettes andpresented to supervisors to obtainestimates of the costs to employersof failures and the values of suc-cesses.

The HPQ Calibration Survey

SamplesCalibration surveys were per-

formed in four occupations to com-pare HPQ work performance andabsenteeism measures with archivaldata obtained from employerrecords. No attempt was made tovalidate the HPQ question about ac-cidents-injuries because of the rarityof these events. The four occupationswere reservation agents working fora major airline, customer service rep-resentatives working for a large tele-communications company, execu-tives working for a major automobile

160 The WHO HPQ • Kessler et al

Page 6: Kessler Intro Who HPQ 2003 - BDMS CME Online

manufacturer, and railroad engineersworking for a large railroad com-pany. Names, home addresses, andhome telephone numbers were ob-tained for initial samples of between1131 and 1491 workers in each oc-cupation. An advance letter (or, inthe case of the executives, an e-mail)was sent to each predesignated re-spondent by the medical director oftheir company. The letter informedrecipients that the company was col-laborating with researchers fromHarvard Medical School (HMS) in asurvey of employee health and workperformance. The letter went on tosay that an HMS telephone inter-viewer would contact them in thenext week to carry out a telephoneinterview. The letter made it clearthat participation was completelyvoluntary and anonymous. An 800number to the HMS study managerwas included in the letter for recipi-ents who had questions or whowanted to opt out of the survey. Aprestamped and pre-addressed returnpostcard was included in the mailingfor recipients who wanted either toreport good times to be reached or toopt out by mail. Professional tele-phone interviewers made 20 attemptsto contact each of those who did notinitially opt out. Verbal informedconsent was obtained before admin-istering the survey. These recruit-ment and consent procedures wereapproved by the Human subjects

Committee of Harvard MedicalSchool.

As shown in Table 1, the tele-phone lists had substantial propor-tions of invalid numbers (15.0 –37.6% across samples) and highproportions of answering machines(10.3–27.8% across samples). Therefusal rate (including initial opt-outs) was in the range 8.2–16.8%across samples, while the coopera-tion rate (the percent of completedinterviews among people who werecontacted) was in the range 64.0 to85.6% across samples. Calibrationinterviews were completed with 441reservation agents, 505 customer ser-vice representatives, 554 executives,and 850 railroad engineers. The de-mographic distribution of the sam-ples is presented in Table 2. Reser-vation agents were largely women,while executives and railroad engi-neers were largely men. The modalage range of reservation agents andcustomer service representatives was30 to 44, whereas executives andrailroad engineers were generallyolder (the mode being in the agerange 45 to 59). Railroad engineershad the lowest education (50.6% hadno more than a high school educa-tion), while executives had the high-est educations (98.7% were collegegraduates).

Once the calibration survey wascompleted, probability subsamplesof 105 reservation agents and 181

customer service representativeswho participated in the calibrationsurvey were recruited into a 1-weekfollow-up Experience SampleMethod (ESM) evaluation46 of mo-ment-to-moment work experience.The ESM design involved givingparticipants a beeper and an ESMdiary to keep with them at all timesduring the seven-day study week.The beeper was programmed by anauto-dialer to be called at five ran-dom times each day, with random-ization beginning at the start of theworkday (or, on regularly scheduleddays off work, 1 hour after the re-spondent reported typically awaken-ing) and ending two hours before therespondent reported typically goingto bed on that day of the week. Aconstraint was imposed on the ran-domization that no call could bemade less than 90 minutes after thepreceding call. The respondent wasasked to fill out the diary as soon asthe beeper went off. The diary askedstructured questions about whetherthe respondent was at work and, ifso, about quantity and quality ofwork at the moment-in-time whenthe beeper went off. The last entry ofeach day asked additional questionsabout the day overall. A separatediary book was provided for each ofthe seven days. Respondents wereasked to mail back each day’s com-pleted book the following morning ina pre-stamped, pre-addressed mailer

TABLE 1Sample Dispositions

ReservationAgents

CustomerService

Representatives ExecutivesRailroad

Engineers

% % % %

Invalid Numbers1 29.3 37.6 15.0 15.8Answering Machines 19.9 18.0 27.8 10.3Refusals 12.2 14.1 8.2 16.8Cooperation Rate 75.9 64.0 85.6 77.1Initial Sample (1143) (1713) (1131) (1491)Completed Interviews2 (441) (505) (554) (850)

1 Invalid numbers are defined as disconnected numbers with no forwarding number, incorrect numbers (e.g. businesses, fax machines,respondent unknown), Good numbers to respondents who report that they are no longer working for the company, and no contact after 20call attempts.

2 Cooperation rate is defined as completed interviews divided by completed interviews plus refusals.

JOEM • Volume 45, Number 2, February 2003 161

Page 7: Kessler Intro Who HPQ 2003 - BDMS CME Online

in order to avoid the problem ofretrospective completion that some-times occurs when diaries are sentback only at the end of the study.47

Reminder phone calls from data col-lection staff were made on theevening of the first, third, and fifthdiary days to encourage respondentsto stick with the task.

With 35 possible diary entries foreach respondent (5 per day � 7days), there were 3675 (105 � 35)logically possible completed ESMdiary entries for the reservationagents and 6335 (181 � 35) for thecustomer service representatives.The response rates for the entries inthe two samples were, respectively,61.3% (n � 2253) and 68.7% (n �4353). Among reservation agents,44.8% of valid entries were madewhile the respondent was at work(n � 1010) and 80.8% of the latterwere made while the respondent wasworking as opposed to on break or atlunch (n � 816). Among customerservice representatives, 56.3% ofvalid entries were made while therespondent was at work (n � 2450)and 78.5% of the latter were madewhile the respondent was working(n � 1926). A 1-week version of theHPQ work performance and absen-teeism questions was administered aspart of the debriefing telephone in-

terview that was administered theday after the end of the diary week.This allowed us to calibrate HPQratings against aggregated ESM re-ports in an effort to evaluate theeffects of recall bias on HPQ reports.The debriefing interviews were com-pleted with 91 of the 105 reservationagents (86.7%) and 172 of the 181customer service representatives(95.0%).

The Archival Work PerformanceMeasures

The four samples considered herewere selected because the perfor-mance of the workers in these occu-pations is evaluated using standard-ized assessments. Customer servicerepresentatives and airline reserva-tion agents receive monthly supervi-sor performance ratings based on acombined score for quantity of work(number of cases resolved, numberof tickets sold) and quality of work(based on supervisor review and cod-ing of audio-taped customer interac-tions). The executives all have 360-peer evaluations of overallleadership based on the CampbellLeadership Index.48 The railroad en-gineers receive performance ratingsfor each trip they make that com-bines information about speed track-

ing (ie, arriving at benchmark pointson the route as close as possible totarget times), brake wear, fuel effi-ciency, and a number of other safetyand efficiency indicators. In addi-tion, the two ESM samples generatedmoment-in-time data on work per-formance that avoid the recall biasinherent in more conventional self-report measures. As a result, thesemeasures were treated as additionaloutcomes in the calibration analysis.This was done by combining ESMratings into a scale with four itemsderived from exploratory factor anal-ysis of moment-in-time performancereports (speed of work, quality ofwork, concentration on work, andperceived success at current worktask), each of which was rated on a 1to 7 self-anchoring scale of either“low” to “high” quality and speed or“not at all” to “very much” concen-trating and succeeding. The Cron-bach’s alpha for this scale was 0.74for the reservation agents and 0.81for the customer service representa-tives.

We were required by our Institu-tional Review Board (IRB) to obtaininformed consent from respondentsbefore we could access their perfor-mance records. This consent was ob-tained in conjunction with the base-line telephone HPQ interviews. As a

TABLE 2Demographic Distributions of the Samples

ReservationAgents

CustomerService

Representatives ExecutivesRailroad

Engineers

% (se) % (se) % (se) % (se)

SexFemale 80.3 (1.9) 47.2 (2.2) 19.3 (1.7) 2.4 (0.5)Male 19.7 (1.9) 52.7 (2.2) 80.7 (1.7) 97.6 (0.5)

Age18–29 6.6 (1.2) 23.9 (1.9) 0.0 (0.0) 4.5 (0.7)30–44 46.7 (2.4) 49.3 (2.2) 22.7 (1.8) 37.3 (1.7)45–59 40.5 (2.4) 25.2 (1.9) 69.5 (2.0) 52.9 (1.7)� 60 6.2 (1.2) 1.6 (0.6) 7.6 (1.1) 5.3 (0.8)

Education�High school 0.2 (0.2) 0.8 (0.4) 0.0 (0.0) 1.9 (0.5)High school 21.9 (2.0) 15.7 (1.6) 0.5 (0.3) 48.7 (1.7)Some college 36.4 (2.4) 43.3 (2.2) 0.7 (0.4) 37.1 (1.7)�College 41.4 (2.4) 40.3 (2.2) 98.7 (0.5) 12.4 (1.1)(n) (441) (505) (554) (850)

162 The WHO HPQ • Kessler et al

Page 8: Kessler Intro Who HPQ 2003 - BDMS CME Online

result, we were not able to evaluatethe completeness of the archival in-formation before selecting and inter-viewing the respondents. This led toconsiderable loss of objective data.The most extreme loss was for thecustomer service representatives,whose archival work performanceand absenteeism data were unusablebecause of a corruption of the iden-tification number link in the em-ployer records. Archival perfor-mance data were obtained, though,for 441 reservation agents, 269 exec-utives, and 847 railroad engineers.The reservation agent data were themost precise with regard to time inthat they were based on supervisorratings for the month prior to theHPQ survey. The executive data, incomparison, were based on peerevaluations made at the end of aleadership-training program that re-spondents participated in for oneweek at variable times up to twoyears before the survey. The ratingsfor railroad engineer were the leastprecise with regard to time in thatthey were aggregated at the end ofeach trip into a summary score thatrepresented the engineer’s cumula-tive performance over many years. Itwas impossible to recover disaggre-gated summary ratings from this file.However, we were able to obtaininformation about serious engineerperformance problems in the monthbefore the interview from a separateperformance action file. As a result,we focused on this measure in theevaluation of the HPQ among rail-road engineers. The EMS ratings,

finally, were evaluated for the 816moments-in-time when the 91 reser-vation agents who completed thepost-ESM debriefing interview wereat work and working (ie, not onbreak or at lunch) and for the 1926moments-in-time when the 172 cus-tomer service representatives whocompleted the post-ESM debriefinginterview were at work and working.

Four of the five outcome workperformance measures, the exceptionbeing the dichotomous railroad engi-neer performance action measure,were transformed to 0–100 scalesfrom their original metrics. Thesetransformations were then used todivide workers into top performers,low performers, and average per-formers. The decision to make thisthree-part division was based on theresults of focus groups with manag-ers, who reported that they use workperformance measures largely to tar-get high performers for reward andlow performers for remediation andthat they generally do not make dis-tinctions within the middle part ofthe range. Top performance was de-fined as the top 20th percentile of therange of each objective performancemeasure, while low performance wasdefined as the bottom twenty percen-tile of each objective performancemeasure. As shown in Table 3, allthe outcome performance measureswere refined enough at both tails ofthe distribution to make distinctionsvery near these target proportions. Itis also noteworthy that the ESMperformance measures have a widerrange than the archival measures,

suggesting that there is some subjec-tive truncation of supervisor ratings.This is especially clear in the case ofthe supervisor ratings of reservationagents, where the lowest score is 79on the 0-to-100 scale. The empiricaldistribution of the ESM scale forthese same workers, in comparison,spans the entire scale range, showingthat the workers themselves makemore subtle distinctions about theirwork performance than do supervi-sors.

This observation raises the ques-tion as to how objective the archivaldata are. Although these data aretreated as objective for purposes ofcalibrating the HPQ, we are awarethat the archival data, especiallythose based on supervisor ratings (inthe case of the reservation agents andcustomer service representatives)and peer ratings (in the case of theexecutives) are not without error.However, these measures are the ac-tual measures used by employers tomonitor the performance of workersand are, in this sense, “real” in anoperational sense. Yet we would notexpect perfect consistency betweenHPQ measures and these archivalmeasures because we recognize thatthe latter are imperfect. Indeed, theHPQ ratings might be more accuratethan the archival data in some re-spects. However, it is nonethelessimportant for us to demonstrate thatthe HPQ ratings are meaningfullyrelated to the archival measures toassure that the HPQ is tapping thesame aspects of workplace perfor-mance as those measured by the

TABLE 3Distributions of the Work Performance Outcome Measures1

Range Low High

(n)Lower Upper % (se) Score % (se) Score

Reservation agent supervisor ratings 79 100 21.1 (2.0) 95 25.6 (2.2) 100 (441)Reservation agent ESM 0 100 19.7 (1.4) 50 22.2 (1.5) 92 (105)Customer service representative ESM 0 100 18.7 (0.9) 46 18.8 (0.9) 83 (181)Executive leadership scores 32 80 20.4 (2.5) 55 20.4 (2.5) 67 (269)Railroad performance actions 0 1 0.8 (0.3) 0 — — — (847)

1 The Railroad Performance Action measure is a dichotomy (yes/no). All other measures are scales that have been transformed to atheoretical range between 0 and 100, with higher scores indicating better performance.

JOEM • Volume 45, Number 2, February 2003 163

Page 9: Kessler Intro Who HPQ 2003 - BDMS CME Online

work performance measures actuallyused by employers.

The Payroll Record Measures ofAbsenteeism

Payroll record measures of absen-teeism were available for three of thefour occupations, the exception be-ing executives. In the case of reser-vation agents, data were obtained onhours scheduled and payroll recorddata were obtained on hours actuallyworked each day during the 4 weeksleading up to the HPQ interview.These data were aggregated intosummary measures of hours workedand hours missed in the 1 week and 4weeks before the interview. In thecase of railroad engineers, who havean erratic work schedule, data wereavailable on the days they werescheduled to work when they eithercalled in sick or were absent forsome other reason. These recordswere aggregated into summary mea-sures of days of work missed in theone week and four weeks before theinterview. In addition, the ESM datafor the reservation agents and cus-tomer services representatives werealso used to derive indirect measuresof work absence. This was possiblebecause one of the first questionsasked in these diaries was whetherthe respondent was at work, at home,in transit between work and home, orelsewhere at the time of the beep.Because of the fact that the ESMdata points were sampled at randommoments-in-time throughout theweek, these reports should providerepresentative data of the proportionof time respondents were at workover the week. Therefore, moment-to-moment reports of whether or notthe respondent was at work at thetime of a random beep were com-pared with HPQ reports about hoursworked and days of work missed toprovide an indirect validation of theHPQ reports.

As with the archival work perfor-mance measures, we recognize thatemployer payroll records can be im-precise because of workers or super-

visors making erroneous reportsabout time spent at work. However,for the occupations considered herepayroll records are likely to have ahigh degree of accuracy.

Analysis Methods: WorkPerformance

Calibration of the HPQ 0-to-10global work performance rating scaleagainst the archival measures andESM measures of high and low workperformance was carried out usinglogistic regression analysis in whichdichotomous measures of either highor low performance based on eitherthe archival measures or the ESMmeasures were the outcome variablesand dummy variables definingranges on the HPQ rating scale werethe predictors. Chi-square tests wereused to evaluate the global signifi-cance of the HPQ rating scale inthese analyses. Received OperatorCharacteristic (ROC) curves wereused to judge the strength of associ-ation between the HPQ ratings andeither the archival measures or theESM measures. Areas under theROC curves and their 95% confi-dence intervals were calculated bythe nonparametric method.49

Analysis methods: absenteeismAnalysis of HPQ absenteeism re-

ports was carried out in two ways.First, linear correlations were calcu-lated between HPQ self-reports andemployer payroll records of absen-teeism in the samples of reservationagents and railroad engineers. In thecase of the reservation agents, thiswas done for hours worked andhours missed. In the case of therailroad engineers, it was done fordays missed. Both one-week andfour-week recall periods were exam-ined. We also compared means forall these outcomes in the HPQ self-reports and the employer payrollrecords. Second, logistic regressionanalysis was used in the ESM per-son-time samples of reservationagents and customer service repre-sentatives to make an indirect evalu-ation of the HPQ self-reports about

absenteeism. The dependent variablewas a dichotomy for whether therespondent was at work on not ateach random moment-in-time, whilethe predictors were the HPQ self-reports of hours worked during theESM week and days missed duringthe week obtained in the post-ESMdebriefing interview. The equationswere estimated using a two-levelrandom-effects model that includedboth person-level controls (age, sex),and within-person controls (numberof days in the ESM study as of thetime of the beep, sequence of thebeep within the day) in order toimprove the precision of estimates.

Results

The Distribution of the GlobalHPQ Work Performance Ratings

The distributions of the HPQ0-to-10 global work performance rat-ings across the five different samplesused in the calibration analysis aswell as in the full customer servicerepresentative sample are presentedin Table 4. Three patterns are note-worthy. First, the lower end of thescale is truncated at 0–7 becauseonly a small minority of respondentsrated themselves less than 7 in any ofthe samples. This truncation im-proves the precision of the calibra-tion procedures described below.50

Second, there is a clear tendency forthe majority of respondents to ratethemselves in the high-but-not-perfect range8,9 much more so thanat the very top of the range.10 Be-tween 61.7% (railroad engineers)and 80.0% (executives) of respon-dents across samples rated them-selves 8 to 9 compared to between11.9% (executives) and 25.9% (rail-road engineers) who rated them-selves 10. The distribution of the fullreservation agent sample is includedin the table even though we have noarchival performance measures forthat sample in order to present acomparison with the distribution in

164 The WHO HPQ • Kessler et al

Page 10: Kessler Intro Who HPQ 2003 - BDMS CME Online

the ESM subsample of reservationagents. The latter purposefully over-sampled respondents from the fullsample who had low ratings in orderto increase statistical power in thatpart of the distribution. This wasdone based on evidence from thereservation agent sample, which wasthe first group we surveyed, that lowobjective work performance is con-centrated among respondents withratings in the 0–7 range on the scale.Third, the distribution in the reserva-tion agent sample is not dramaticallydifferent from the distributions inother samples despite the fact thatthe supervisor ratings of work per-formance in the reservation agentsample are much more truncatedthan the other archival measures.This lack of difference in Table 4 isconsistent with the suggestion men-tioned earlier in the article that res-ervation agent supervisor ratingsmight be truncated due to rater bias.

The Associations of HPQRatings with Archival and ESMPerformance Measures

The results of logistic regressionanalyses linking HPQ ratings withthe five outcome measures of lowwork performance are presented inTable 5. Logistic regression coeffi-cients have been exponentiated andare reported in the table in the formof odds-ratios (ORs). There is a con-sistently monotonic and statisticallysignificant relationship betweenHPQ ratings and the odds of lowarchival/ESM work performance inall five equations. It is important tonote that this association is notcaused exclusively by the differencebetween respondents with HPQ rat-ings of 0 to 9 versus 10, although thatdistinction is important in predictinglow performance in all five equa-tions. The association is also partlydue to the fact that respondents with

HPQ ratings of 0 to 7 have higherodds of low performance than thosewith ratings of 8 and that respon-dents with ratings of 8 (with theexception of customer service repre-sentatives) have higher odds of lowperformance than those with ratingsof 9.

The outcome with the widestrange of odds between workers withhigh and low HPQ ratings (12.3:1) isthe measure of railroad engineer per-formance action. The predictionequation for this outcome could onlybe estimated if we constrained theodds to be the same among respon-dents with HPQ ratings in the range8 to 10 as a result of the fact that thisis a rare outcome that is largelyconfined to engineers who rate them-selves at the low end of the HPQscale. The outcome with the narrow-est range of odds, in comparison, isreservation agent performance (3.2:1). As noted above in the description

TABLE 4Distributions of the HPQ Global Work Performance Ratings

HPQ Ratings

ReservationAgents

ReservationAgent ESM

CustomerService ESM Executives

RailroadEngineers

% (se) % (se) % (se) % (se) % (se)

0–7 14.7 (1.7) 17.0 (4.0) 45.6 (3.8) 8.2 (1.7) 12.4 (1.2)8 31.6 (2.7) 38.6 (5.2) 28.4 (3.5) 43.9 (3.0) 31.4 (1.6)9 34.0 (2.3) 29.5 (4.9) 21.9 (3.2) 36.1 (2.9) 30.3 (1.6)10 19.7 (1.9) 14.8 (3.8) 4.1 (1.5) 11.9 (2.0) 25.9 (1.5)(n) (441) (91) (172) (269) (847)

TABLE 5Associations of HPQ Global Ratings with Lowest 20 Percent of Archival and ESM Work Performance Outcome Measures

ReservationAgent

SupervisorRatings1

ReservationAgent ESM

CustomerService

RepresentativeESM

ExecutiveLeadership

Scores1

Railroad EngineerPerformance

Actions

Objective ratingsLow work performance2 OR (95% CI) OR (95% CI) OR (95% CI) OR (95% CI) OR (95% CI)

0–7 3.2* (1.3–7.5) 6.4* (1.7–24.0) 7.3* (1.6–33.0) 7.0* (1.3–37.9) 12.3* (1.3–112.3)8 2.4* (1.1–5.2) 1.6 (0.4–6.1) 2.8 (0.6–13.2) 5.4* (1.2–24.2) 1.0 —9 1.0 (0.4–2.3) 2.2 (0.6–8.2) 1.6 (0.3–8.0) 2.7 (0.6–12.6) 1.0 —10 1.0 — 1.0 — 1.0 — 1.0 — 1.0 —

�23 14.3* 21.7* 57.5* 8.9* 4.91*

(n) (441) (816) (1,926) (269) (847)

* Significant at the .05 level, two-sided test1 Based on 30-day job performance rating2 Model includes sex, age, day of study, and beep number as controls

JOEM • Volume 45, Number 2, February 2003 165

Page 11: Kessler Intro Who HPQ 2003 - BDMS CME Online

of the archival measures, this mea-sure has the narrowest range of rat-ings as well (79 to 100 on the 0 to100 scale). It is conceivable that thisrestricted range introduced impreci-sion into the definition of low per-formance, resulting in a dampeningof the OR associated with low HPQratings for this outcome. The ORsassociated with HPQ ratings of 8across the remaining outcomes areall lower than those associated withratings of 0 to 7. The ORs associatedwith HPQ ratings of 9 for theseequations are generally lower thanthose associated with ratings of 8.The ROC curves for the strength ofthe HPQ ratings in predicting thearchival and ESM outcomes areshown in Fig. 1. Areas under theROC curve, which can be interpretedas the proportion of times randomlyselected workers with low work per-formance could be distinguishedfrom other workers based on differ-ential HPQ ratings, range between0.63 and 0.69 across the samples.

The results of logistic regressionanalyses linking HPQ ratings withthe four archival or ESM measuresof high work performance are pre-sented in Table 6. There is a statisti-cally significant relationship be-tween HPQ ratings and the odds ofhigh archival or ESM work perfor-mance in the equations for reserva-tion agents and customer service rep-resentatives, but not in the oneequation for executives. The equa-tions in which the outcomes are theESM scores show consistent mono-tonicity of odds. The equation inwhich the reservation agent supervi-sor ratings are the outcome, in com-parison, shows a significant distinc-tion between 0–7 and 8 to 10 (�2

1 �6.4, P � 0.015), but no meaningfulvariation in odds among respondentswith HPQ ratings of 8, 9, or 10 (�2

2

� 2.5, P � 0.701). This restrictedrange of ORs in predicting reserva-tion agent supervisor ratings amongrespondents with HPQ ratings in therange 8 to 10 is similar to the pattern

seen in Table 5. The ROC curves forthe strength of the HPQ ratings scalein the three statistically significantequations are shown in Fig. 2. Areasunder the ROC curve range between0.59 and 0.72 across the samples.

We also evaluated the effects ofthe component questions about workperformance that were administeredin the survey before the global0-to-10 work performance rating. Asnoted earlier in the report, these in-cluded questions about quantity ofwork, quality of work, interpersonalaspects of work, work successes andfailures, and work-related accidents-injuries. Factor analysis showed thatthese measures, like other recentlydeveloped multi-item inventories ofself-reported work performance,34

form meaningful factors with goodinternal consistency reliabilities.

We found that these factors aresignificantly related to the archivaland ESM work performance mea-sures when considered one at a time.However, multivariate analyses

Fig. 1. Receiver operator characteristic curves for HPQ Global Ratings predicting archival and ESM measures of low work performance.

166 The WHO HPQ • Kessler et al

Page 12: Kessler Intro Who HPQ 2003 - BDMS CME Online

showed that they were generally notsignificant predictors of archival orESM measures of work performancein a prediction equation that con-trolled the effects of the global0-to-10 work performance rating.This means that the global ratingout-performances the factor scales

and, with one exception noted in thenext paragraph, captures the effectsof these scales. Why? The mostlikely reason is that the factor scalestap generic aspects of work thatmight vary in their importance foroverall work performance across oc-cupations and that doubtlessly fail to

tap all relevant aspects of work foran assessment of performance onthese jobs. The 0-to-10 self-rating, incomparison, asks the respondent, im-plicitly to use their knowledge of therelevant considerations in evaluatingwork performance to arrive at aglobal self-rating based on an assess-

TABLE 6Associations of HPQ Global Ratings with Highest 20 Percent of Archival and ESM Work Performance Outcome Measures

Reservation AgentSupervisorRatings1

Reservation AgentESM

Customer ServiceRepresentative ESM

ExecutiveLeadership

Scores1

Objective ratingsLow work

performance2 OR (95% CI) OR (95% CI) OR (95% CI) OR (95% CI)

0–7 1.0 — 1.0 — 1.0 — 1.0 —8 5.7* (1.6–20.1) 3.7* (1.3–10.7) 2.5* (2.8–10.4) 1.0 (0.3–3.3)9 3.8* (1.1–13.1) 4.4* (1.5–13.1) 5.5* (2.8–10.8) 1.4 (0.4–4.6)10 5.4* (1.6–19.4) 6.4* (1.8–22.2) 45.8* (11.4–184.7) 1.0 (0.3–4.2)

�23 8.92* 38.8* 28.7* 1.09

(n) (441) (816) (1,926) (269)

* Significant at the .05 level, two-sided test1 Based on 30-day job performance rating2 Model includes sex, age, day of study, and beep number as controls

Fig. 2. Receiver operator characteristic curves for HPQ global ratings predicting archival and ESM measures of high work performance.

JOEM • Volume 45, Number 2, February 2003 167

Page 13: Kessler Intro Who HPQ 2003 - BDMS CME Online

ment of all these considerations. Ap-parently workers are able to do abetter job of this than we were ableto do in developing our factor scales.

The only exception to this generalstatement was in the case of railroadengineers, where the componentquestion about work failure was asignificant predictor of the archivaloutcome measure. This was presum-ably the case because both the pre-dictor and the archival outcome mea-sure were highly skeweddichotomies and the archival mea-sure was heavily influenced by thediscovery of major performance er-rors that are presumably tapped moredirectly in the dichotomous questionabout failure than in the 0-to-10 rat-ing.

The Associations of HPQAbsenteeism Reports withPayroll Records

Linear correlations and compari-sons of means between HPQ self-reports about absenteeism and em-ployer payroll records are reported inTable 7. In the case of reservationagents, the comparisons are for hoursworked and hours missed over bothone-week and four-week recall peri-ods. These correlations are substan-tial in magnitude and higher for one-week (0.81– 0.87) than 4-week(0.71–079) recall. Self-reports con-sistently overestimate hours worked

and underestimate hours missed. Al-though these biases are fairly modestin absolute terms (1.5 to 3 hours perweek), they represent rather substan-tial proportional underestimations ofhours missed (19–44%). In the caseof the railroad engineers, the com-parisons are for days missed. Thesecorrelations are also substantial inmagnitude (0.61– 0.66), althoughsomewhat smaller than the correla-tions for reservation agents. As withreservation agents, engineers under-estimate absence by 0.3 to 0.4 daysper week, which represent 21 to 26%underestimations of absence com-pared to payroll records. Unlike res-ervation agents, though, 4-week re-call is as accurate as 1-week recallfor railroad engineers.

It is interesting to note that four-week estimates of absence are morethan four times as large as 1-weekestimates both for reservation agentsand for railroad engineers. Similarly,4-week estimates of hours workedare less than four times 1-week esti-mates. This is not caused by recallbias, as proven by the fact that thesame patterns exist in employer pay-roll records. The reason is that re-spondents were allowed to postponethe start date of their ESM datacollection if it was inconvenient forthem to begin on the date selected bythe research team. Debriefingshowed that these postponements

were often because of short-term ill-ness. This means that the ESMweeks are downwardly biased in es-timating the prevalence of absentee-ism. The fact that this shows up notonly in the means of the payrollrecord data but also in the means ofthe self-reports is an additional indi-rect indicator of the accuracy of theself-report data.

The results of logistic regressionanalyses of moment-in-time data onbeing at work versus not at work in theESM person-time samples are reportedin Table 8. The dependent variable is adichotomy for whether the respondentwas at work on not at the randommoment-in-time, whereas the predic-tors are standardized HPQ self-reportsobtained in the post-ESM debriefinginterview of hours worked during theESM week and days missed during theweek. The results are consistent in theseparate samples of reservation agentsand customer service representativesin showing statistically significant as-sociations between HPQ self-reportsand the moment-in-time ESM data. Aone standard deviation increase in theHPQ self-reported measure of hoursworked during the ESM week is asso-ciated with a doubling of the relative-odds that a respondent will be at workduring any randomly selected mo-ment-in-time during that week. A onestandard deviation increase in the HPQself-report measure of hours worked

TABLE 7Associations of self-reported work absence with payroll work absence among reservation agents and railroad engineers

PearsonCorrelation

Means

Z-testSelf-Report Payroll

I. Reservation AgentsA. One-week recall (n � 414)

Hours worked .87 28.7 25.7 2.9*Hours missed .81 3.8 6.8 3.3*

B. Four-week recall (n � 414)Hours worked .79 113.4 107.0 1.8Hours missed .71 27.3 33.7 2.1*

II. Railroad EngineersA. One-week recall (n � 847)

Days missed .61 1.1 1.5 3.6*B. Four-week recall (n � 847)

Days missed .66 4.8 6.1 3.4*

* Self-report significantly different from payroll at the .05 level, two-sided test.

168 The WHO HPQ • Kessler et al

Page 14: Kessler Intro Who HPQ 2003 - BDMS CME Online

during the ESM week, in comparison,is associated with a halving of therelative-odds of being at work duringany randomly selected moment-in-time during that week.

DiscussionThe results reported here docu-

ment that the HPQ generates mean-ingful measures of work perfor-mance and absenteeism. The onlynegative result is the failure of theHPQ to predict high work perfor-mance among automobile execu-tives. As these executives are theonly white-collar workers in thesample, this failure might reflect ageneral weakness of the HPQ inpredicting high performanceamong white-collar workers. Rep-lication of the calibration study inother white-collar occupations isneeded to evaluate this interpreta-tion. Before that time, though, theHPQ should be seen as useful forwhite-collar workers largely in as-sessing low performance ratherthan high performance.

This usefulness of the HPQ inevaluating work performance is as aglobal measure, as no componentmeasures are included in the scale.The HPQ can be used to assess theoverall effects of allergies, migraine,and other illnesses on overall workperformance in an entire workforceand, comparatively, across different

types of occupations. It cannot tellus, though, what aspects of perfor-mance are affected by these illnesses(eg, motor skills, concentration etc.).As noted in the introduction, thedecision to focus on global perfor-mance rather than on components isbased on our interest to monetize theworkplace costs of illness and thecost savings of health care interven-tions. The estimation of these mone-tary effects is much more easilyachieved by assessing global workperformance rather than selectedcomponents of performance. Mone-tizing component effects requires theresearcher to determine the impor-tance for overall work performanceof, say, a decrement in ability toconcentrate to a ditch digger or of adecrement in ability to lift heavyobjects to a lawyer. We decided thatthese evaluations were better left tothe worker-respondents themselvesin arriving at global assessments oftheir overall work performance.Monetizing component effects alsorequires the researcher to assumethat the components measured fullycapture all relevant aspects of workthat go into the creation of workperformance. Given the enormousvariety of work functions known toexist in the labor force, we wereunwilling to make this assumption,preferring to allow respondentsthemselves to consider all functions

that they consider relevant to thespecific requirements of their jobs inmaking global evaluations of theirperformance.

Sensitivity of the HPQ MeasuresIn light of the fact that the HPQ

work performance measure is rela-tively coarse, a question can beraised whether it is sensitive enoughto detect effects of illnesses on workperformance and of health interven-tions with moderate effects on thereduction in impaired work perfor-mance. It goes beyond the scope ofthe present report to present substan-tive results. However, in light of thisimportant concern it is worth notingthat substantive analyses of the datapresented here, which will be re-ported in separate publications, showthat the HPQ measure of work per-formance is sensitive to a variety ofillnesses as well as to standard disor-der-specific measures of illness se-verity within subsamples of respon-dents who suffer from specificdisorders. There are also statisticallyand substantively significant associ-ations of HPQ work absence andaccident-injury measures with infor-mation collected in the surveys aboutthe prevalences and severities of dis-orders.

TABLE 8Associations of self-reported hours/days worked with odds of being at work at randomly selected times in the ESMsamples1

HPQ Self-Reports2

Reservation AgentsCustomer ServiceRepresentatives

OR (95% CI) OR (95% CI)

Hours worked 2.0* (1.6–2.3) 2.0* (1.6–2.5)Days missed 0.5* (0.4–0.7) 0.5* (0.4–0.6)(np)3 (105) (181)(nb)3 (3,675) (6,335)

* Significant at the .05 level, two sided test.1 Results are based on a two-level (person and 35 random moments of time within persons) mixed regression model that controlled both

for within-person variables (number of days in the study at the time of the beep, ranging between 1–7; sequence of the beep within the day,ranging between 1–5) and for between-person variables (age, sex). Hours worked and days missed were included in separate models and weretreated as between-person variables.

2 The work measures were standardized to a mean of zero and variance of one.3 np � number of people; nb � number of beeps.

JOEM • Volume 45, Number 2, February 2003 169

Page 15: Kessler Intro Who HPQ 2003 - BDMS CME Online

Calibrating HPQ Global RatingsCalibration rules were developed

to convert ratings on the 0-to-10HPQ global rating scale into pre-dicted probabilities of high and lowarchival and ESM work performanceusing the results reported above. Thiswas done bearing in mind that thepredicted probabilities of high per-formance should not be evaluated forwhite-collar workers. These calibra-tion rules were based on methodsdeveloped to promote the use ofdiagnostic screening scales in clini-cal decision-making.51 These meth-ods allow scores on screening scaleslike the HPQ to be interpreted in newsamples by using the results of pre-diction equations developed in cali-bration samples to assign probabili-ty-of-illness scores (or, in the presentcase, probabilities of high and lowwork performance) to individualcases in the new samples.

The difficulty in developing theserules is that we cannot assume thatthe probabilities of high and lowperformance associated with a givenHPQ score in the calibration sample(positive and negative predictive val-ues) will be the same in new sam-ples. This is true, importantly, even ifthe conditional distributions of HPQratings among people with true highand low performance (sensitivity andspecificity) are constant across sam-ples, because any deviation in theproportions of workers in the newsamples with actual high and lowperformance from the 20% arbi-trarily assumed in the HPQ calibra-tion samples will lead to changes inthe positive and negative predictivevalues at a given level of the HPQrating.52 As a result, it is not appro-priate to specify a single thresholdfor the outcomes of interest (in thiscase, high and low work perfor-mance) for all populations based ongiven HPQ ratings.

This problem can be addressed inthree ways, all of which are imple-mented in software developed foruse with HPQ survey data. All threeapproaches rely on the method of

stratum-specific likelihood-ratios(SSLRs). An SSLR is an odds-ratiothat compares respondents with aspecific score on a screening scale(in the case of the HPQ global rating,a rating of either 0–7,8,9, or 10) withthose having all other ratings on thescale in terms of their odds of havinga dichotomous outcome (in the caseof the HPQ calibration, either lowperformance versus others or highperformance versus others).51 Theassumption that the sensitivities andspecificities of the relationship be-tween HPQ ratings and true perfor-mance categorization are the same innew samples as in the calibrationsamples is equivalent to the assump-tion that SSLR’s are constant acrosssamples. Based on this assumption,the SSLRs estimated in the calibra-tion samples can be used in conjunc-tion with information about the trueprevalence (converted to an odds) ofthe dichotomous outcome in the newsample to calculate individual-levelpredicted probabilities of the dichot-omous outcome. This can be done byshowing, based on Bayes’ theorem,that

POO � SSLR � ROO (1)

where POO � the population oddsof the dichotomous outcome andROO � the individual respondent’sodds of the outcome. The individu-al’s probability of the dichotomousoutcome, p, can easily be derivedfrom ROO by using the transforma-tion

ROO � p/�1–p� (2)

There are three ways to use theresults in Eqs. (1) and (2) to assignindividual-level predicted probabili-ties of high and low work perfor-mance based on individual HPQ rat-ings. As noted above, all three areimplemented in the software devel-oped for use in HPQ surveys.

The first way to assign individual-level probabilities is to fix the aggre-gate prevalences of high and lowwork performance to 20% in newsamples a priori in exactly the sameway as in the calibration samples.

High and low performance, in thisapproach, are arbitrarily definedfixed percentiles. The second way isto allow the estimated prevalences ofhigh and low work performance tobe fixed at different values based onexternal information and institutionalknowledge. For example, managersin a particular corporation mightconclude that the prevalences of highand low work performance, respec-tively, are 30 and 10% among theirwhite-collar workers, 20 and 20%among their pink-collar workers, and10 and 30% among their blue-collarworkers. These assumptions can beused to convert HPQ global ratingsinto individual-level predicted prob-abilities of high and low perfor-mance separately in each occupa-tional sub-sample by using thetransformation in Eqs. (1) and (2).The third way to assign individual-level probabilities is to estimate theprevalences of high and low workperformance empirically from thedistribution of HPQ ratings in thenew sample. This can be done byusing maximum-likelihood to com-pare empirical distributions on thisscale with the theoretical distribu-tions generated by the sensitivitiesand specificities in the calibrationsample applied to all logically possi-ble combinations of high and lowwork performance. The maximum-likelihood estimates of the preva-lences of high and low performancebased on this approach are thoseassociated with the theoretical distri-bution of HPQ global ratings mostsimilar to the empirical distributionin the sample. Once these prevalenceestimates are identified, they can beconverted to odds and used in Eq. (2)to generate individual-level probabil-ities from individual-level HPQ rat-ings.

Monetizing Absenteeism andWork Performance Ratings

It is important to remember thatabsenteeism and low work perfor-mance have quite different costsacross occupations and industries.

170 The WHO HPQ • Kessler et al

Page 16: Kessler Intro Who HPQ 2003 - BDMS CME Online

These differences are not necessarilyproportional to salaries. The un-scheduled absence of an airline res-ervation agent, for example, mightlead to customers spending some-what more time waiting before theyspeak to an agent and to agents onduty having a somewhat more hecticday than usual. But the costs of theseinconveniences to the employer areprobably minimal unless the delaysare so long and persistent that cus-tomers go to a different airline topurchase their tickets. The unsched-uled absence of an airline steward-ess, in comparison, can cause a flightdeparture delay, due to FAA staffingrequirements for number of person-nel needed for a flight to depart, thatcosts the airline at much as $5000per hour in additional gate fees. Sim-ilarly, the low performance of asalesman, if it leads to the loss of amajor contract, can cost a corpora-tion millions of dollars even thoughthe salesman’s salary is only a frac-tion of that amount. Because of situ-ations such as these, even when wehave good estimates of absenteeismand work performance, an additionalstep is required to estimate the mon-etary costs of poor performance andabsenteeism to the employer. A num-ber of approaches have been pro-posed to make these estimates,6 sev-eral of which have beenimplemented in software developedfor use in analyzing HPQ surveys.Although it is beyond the scope ofthis report to discuss these ap-proaches here, it is important to notethat this additional step is needed tomonetize the HPQ results.

Future DirectionsNationally representative general

population HPQ surveys are cur-rently being performed in 28 coun-tries around the world as part of alarger WHO initiative aimed at esti-mating the societal costs of mentaland physical illness.53 We anticipatethat over 200,000 respondents willcomplete these HPQ surveys oncethey are finished. In addition, bothpaper-pencil and internet versions of

the HPQ have been developed andare being used to carry out ongoingannual surveys of the employees of anumber of large corporations in theU.S., either as part of existing HealthRisk Appraisal surveys or as stand-alone surveys. A number of thesesurveys are being carried out in col-laboration with the National Busi-ness Coalition on Health. We willsoon be distributing an HPQ surveytoolkit to all members of the Na-tional Business Coalition on Healththroughout the United States. Anumber of other collaborations arealso in development in the UnitedStates, Canada, and Europe.

The dissemination of HPQ surveysis the first step in a larger program ofresearch aimed at pinpointing healthproblems that are associated withhigh indirect workplace costs, devel-oping and evaluating interventions toreduce these costs, and establishingquality assurance procedures to mon-itor the success of efforts to dissem-inate and maintain these interven-tions. In order to implement thisprogram of research, it is necessaryto begin by linking HPQ absentee-ism, work performance, and work-related accident-injury reports to in-formation about specific healthproblems. This is done in the HPQsurveys by asking respondents ifthey suffer from a number of com-mon chronic conditions and, if so,whether they are currently under pro-fessional treatment for these condi-tions. The chronic conditions check-list in the HPQ interview schedule isbased on the checklist used in theU.S. National Health Interview Sur-vey (NHIS). Data from the NHIS anda number of other nationally repre-sentative general population surveyswere analyzed to select the condi-tions in the HPQ checklist. The cri-teria for selection were that the con-ditions are commonly occurringamong working people and are asso-ciated either with excess work ab-sence, low work performance, or el-evated rates of work-relatedaccidents-injuries.54–56 We also in-cluded in the HPQ surveys an acute

symptoms checklist developed byKhroenke et al57 to capture the mostcommon complaints of acute-carepatients in primary care treatment.

It is noteworthy that the HPQ sur-vey distinguishes between healthproblems that are being treated andthose that are not being treated. Thisdistinction is important because mostcommon health problems that influ-ence workplace functioning varywidely in severity. Some people withseasonal allergies, for example, havevery mild symptoms while other sea-sonal allergy sufferers have very se-vere symptoms. People with severesymptoms are more likely to be intreatment than those with mild symp-toms. This makes it is impossible todetermine whether low rates of treat-ment should be considered a problemfrom the perspective of employers inthe absence of separate analyses todetermine whether untreated casesare associated with impairments inwork performance. This is performedin standardized analyses of HPQ databy distinguishing the separate effectsof treated conditions and untreatedconditions on workplace outcomes.A single yes or no question abouttreatment of each health problem isincluded in the HPQ for this purpose.More extensive questions were con-sidered for inclusion in the surveys,but subsequently rejected based onthe realization that detailed informa-tion about the treatment of specifichealth problems could be obtainedby employers from health claimsrecords. The HPQ treatment questionasks about “professional” treatment,defined as treatment by a doctor,nurse, or other health professional, toexclude self-treatment and comple-mentary and alternative medicaltreatment not provided by a healthprofessional. These exclusions areimportant in light of the growingimportance of self-treatment andcomplementary and alternative med-ical treatment.5,58

The fact that treatment is stronglyinfluenced by illness severity meansthat cross-sectional HPQ surveyscannot be used to help employers

JOEM • Volume 45, Number 2, February 2003 171

Page 17: Kessler Intro Who HPQ 2003 - BDMS CME Online

estimate the likely return on theirinvestment (ROI) because of ex-panding treatment for a particularcondition. Experimental or quasi-experimental studies are required tomake such estimates. Cross-sectionalHPQ surveys are better suited toaddress the prior questions: 1) Whichof the health problems assessed inthe HPQ survey have the greatestindirect costs in my workforce? 2)Are these costs associated with lowrates of treatment (ie, high workplacecosts among untreated workers withthe conditions), inadequate treatment(ie, high workplace costs amongtreated workers with the conditions),or both? 3) How do the indirectworkplace costs of target illnesses inmy workforce compare with those inother benchmark populations?

Answers to these questions canhelp employers pinpoint health prob-lems that have particularly high indi-rect workplace costs. Systematic re-views of the treatment effectivenessliterature can then be used to evalu-ate the likelihood that enhanced out-reach and/or treatment efforts aimedat these conditions would yield alarge enough reduction in workplacecosts to have a positive return oninvestment. Ongoing HPQ monitor-ing surveys can then be used tocalculate ROI of new interventionsbased on such considerations. Thiscan be done in a single corporationwith a universal intervention using abefore-after interrupted time seriesdesign or a quasi-experimental case-control design that compareschanges in the HPQ ratings amongworkers with the target conditions incorporation that do, versus do not,implement the intervention. In a cor-poration that has multiple sites andthat can assign new treatment pro-grams to a subset of these sites, amore powerful before-after case ver-sus control test market design can beused to evaluate the ROI of theintervention.

A large experimental treatment ef-fectiveness trial is currently under-way in conjunction with ongoingHPQ surveys that illustrates some of

these ideas about the evaluation oftreatment interventions. This trial isevaluating the effects of detectingand treating working people withmajor depression.11,21,59 The inter-vention features outreach and best-practices treatment provided byUnited Behavioral Health (UBH),one of the largest behavioral healthcarve-out companies in the country.Baseline HPQ surveys are beingused to screen for major depressionamong workers with UBH coveragein a number of large corporations.UBH care managers are implement-ing an outreach and treatment pro-gram to a random sub-sample ofthese workers using an intent-to-treatexperimental design. Expanded fol-low-up HPQ surveys are being usedto evaluate the return on investment(ROI) of the intervention over a2-year follow-up period. Our hope isthat this experiment will serve as amodel for future interventions andevaluations using the HPQ.

AcknowledgmentsThe authors gratefully acknowl-

edge the help of questionnaire word-ing experts Stephanie Chardoul, LouMagilavy, Beth Ellen Pennell, KentPeterson, Tom Wilkinson, and Deb-bie Zivan in developing and revisingthe HPQ, the assistance of Bill Whit-mer and his colleagues at the HealthEnhancement Research Organization(HERO) in recruiting companies toparticipate in the calibration survey,the staff of the participating compa-nies in facilitating the sample selec-tion and abstraction of objective per-formance measures and absenteeismrecords, and Joe Myer, Tom Wilkin-son, and their staff at DataStat, Inc.in carrying out the calibration sur-vey. The complete text of the HPQ isposted at the http://www.hcp.med-.harvard.edu/hpq.

Supported by The World HealthOrganization, the John D. and Cathe-rine T. MacArthur Foundation, andby unrestricted educational grantsFrom Glaxo-Wellcome, Pfizer,Searle and Schering-Plough.

References1. Sloan FA. Valuing Health Care: Costs,

Benefits, and Effectiveness of Pharma-ceuticals and Other Medical Technolo-gies. New York: Cambridge UniversityPress; 1996.

2. Gold MR, Siegel JE, Russell LB, et al.Cost-Effectiveness in Health and Medi-cine. Oxford: Oxford University Press;1996.

3. Patrick DL, Erickson P. Health Statusand Health Policy: Quality of Life inHealth Care Evaluation and ResourcedAllocation. New York: Oxford UniverityPress; 1993.

4. Tarlov AR, Ware JE Jr., Greenfield S, etal. The Medical Outcomes Study. Anapplication of methods for monitoring theresults of medical care. JAMA. 1989;262:925–930.

5. Kessler RC, Davis RB, Foster DF, et al.Long-term trends in the use of comple-mentary and alternative medical therapiesin the United States. Ann Intern Med.2001;135:262–268.

6. Pauly MV, Nicholson S, Xu J, et al. Ageneral model of the impact of absentee-ism on employers and employees. HealthEcon. 2002;11:221–231.

7. Greenberg PE, Kessler RC, Nells TL, etal. Depression in the workplace: an eco-nomic perspective. In: Feighner JP,Boyer WF, eds. Selective Serotonin Re-uptake Inhibitors: Advances in Basic Re-search and Clinical Practice. New York:John Wiley & Sons Ltd.; 1996.

8. Kessler RC, Zhao S, Katz SJ, et al.Past-year use of outpatient services forpsychiatric problems in the National Co-morbidity Survey. Am J Psychiatry.1999;156:115–123.

9. Coulehan JL, Schulberg HC, Block MR,et al. Treating depressed primary carepatients improves their physical, mental,and social functioning. Arch Intern Med.1997;157:1113–1120.

10. Wells KB, Sherbourne C, SchoenbaumM, et al. Impact of disseminating qualityimprovement programs for depression inmanaged primary care: a randomizedcontrolled trial. JAMA. 2000;283:212–220.

11. Kessler RC, Barber CB, Birnbaum HG,et al. Depression in the workplace: ef-fects on short-term work disability.Health Aff. 1999;18:163–171.

12. Boonen A, van der Heijde D, Landewe R,et al. Work status and productivity costsdue to ankylosing spondylitis: compari-son of three European countries. AnnRheum Dis. 2002;61:429–437.

13. Reginster JY, Khaltaev NG. Introductionand WHO perspective on the global bur-

172 The WHO HPQ • Kessler et al

Page 18: Kessler Intro Who HPQ 2003 - BDMS CME Online

den of musculoskeletal conditions. Rheu-matology. 2002;41:1–2.

14. Kessler RC, Almeida DM, Berglund P, etal. Pollen and mold exposure impairs thework performance of employees withallergic rhinitis. Ann Allergy Asthma Im-munol. 2001;87:289–295.

15. Rosenheck RA, Druss B, Stolar M, et al.Effect of declining mental health serviceuse on employees of a large corporation.Health Aff. 1999;18:193–203.

16. Lynch W, Riedel JE. Measuring Em-ployee Productivity: A Guide to Self-Asessment Tools. Scottsdale, AZ: TheInstitute for Health and ProductivityManagement; 2001.

17. Rehm J, Ustun TB, Saxena S, et al. Onthe development and psychometric test-ing of the WHO screening instrument toassess disablement in the general popula-tion. Int J Methods Psychatric Res. 1999;8:110–123.

18. World Health Organization. Interna-tional Classification of Functioning, Dis-ability and Health. Geneva: WorldHealth Organization; 2001.

19. Converse JM, Presser S. Survey Ques-tions: Handcrafting the StandardizedQuestionnaire. In: Sage University PaperSeries, 63:Quantitative Applications inthe Social Sciences. Beverly Hills: Sage;1986.

20. Sudman S, Bradburn NM, Schwartz N.Thinking about Answers: The Applica-tions of Cognitive Processes to SurveyMethodology. San Francisco, CA: Jos-sey-Bass; 1996.

21. Wang PS, Beck AL, McKenas DK, et al.Effects of efforts to increase responserates on a workplace chronic conditionscreening survey. Med Care. 2002;40:752–760.

22. Blum TC, Roman PM. Alcohol consump-tion and work performance. J Stud Alco-hol. 1993;54:61–70.

23. Harbour JL. The Basics of PerformanceMeasurement. New York: Productivity,Inc.; 1997.

24. Grote D. The Complete Guide to Perfor-mance Appraisal. New York: AMA-COM; 1996:400.

25. U. S. Department of Labor Employmentand Training Administration. Testing andAssessment: An Employer’s Guide toGood Practices. Washington, DC: U. S.Government Printing Office; 1999.

26. Matheson LN, Kaskutas V, McCowan S,et al. Development of a database of func-tional assessment measures related towork disability. J Occup Rehabil. 2001;11:177–199.

27. Holloway J, Lewis J, Mallory G. Perfor-mance Measurement and Evaluation.Thousand Oaks, Ca: Sage; 1995.

28. Pritchard RD, Holling H, Lammers F, etal. Improving Organizational Perfor-mance with the Productivity Measure-ment and Enhancement System: An Inter-national Collaboration. Huntington, NY:Nova Science; 2002.

29. Whetzel DL, Wheaton GR. Applied Mea-surement Methods in Industrial Psychol-ogy. Cleveland: Davis-Black; 1997.

30. Neal A, Griffen MA, Paterson J, et al.Development of measures of situationalawareness, task performance, and contex-tual performance in air traffic control. In:Lowe AR, Hayward BJ, eds. AviationResource Management. Aldershot, UK:Ashgate; 2000:241–267.

31. Sawyer JE, Latham WR, Pritchard RD, etal. Analysis of work group productivityin an applied setting: Application of atime series panel design. Personnel Psy-chol. 1999;52:927–967.

32. Warr PB, Conner MT. The Measurementof Personal Effectiveness for Review andGuidance. London: Department of Edu-cation and Employment; 1999.

33. Endicott J, Nee J. Endicott Work Produc-tivity Scale (EWPS): a new measure toassess treatment effects. Psychopharma-col Bull. 1997;33:13–16.

34. Koopman C, Pelletier KR, Murray JF, etal. Stanford presenteeism scale: healthstatus and employee productivity. J Oc-cup Environ Med. 2002;44:14–20.

35. Bergner M, Bobbitt RA, Carter WB, et al.The Sickness Impact Profile: develop-ment and final revisions of a health statusmeasure. Med Care. 1981;19:787–805.

36. Lerner D, Amick RC, Rogers WH, et al.The Work Limitations Questionnaire.Med Care. 2001;39:72–85.

37. U. S. Department of Labor. Dictionary ofOccupational Titles. Washington, DC:Labor Dept., Employment and TrainingAdministration, United States Employ-ment Service; 1991:1445.

38. Cain P. An assessment of the Dictionaryof Occupational Titles as a source ofoccupational information. In: Miller AR,Treiman DJ, Cain PS, et al. (Eds.), Work,Jobs, and Occupations: A Critical Re-view of the Dictionary of OccupationalTitles. Washington: National AcademyPress; 1980:148–195.

39. Cullum CM, Saine K, Chan LD et al.Performance-based instrument to assessfunctional capacity in dementia: TheTexas Functional Living Scale. Neuro-psychiatry Neuropsychol Behav Neurol.2001;14:103–108.

40. Whetstone LM, Fozard JL, Metter EJ, etal. The physical functioning inventory: aprocedure for assessing physical functionin adults. J Aging Health. 2001;13:467–493.

41. Means B, Loftus EF. When personalhistory repeats itself: Decomposingmemories for recurring events. Appl Cog-nit Psychol. 1991;5:297–318.

42. Menon A. Judgements of behavioral fre-quencies: Memory search and retrievalstrategies. In: Schwartz N, Sudman S,eds. Autobiographical Memory and theValidity of Retrospective Reports. NewYork: Springer-Verlag; 1994:161–172.

43. Oksenberg L, Vinokur A, Cannell CF.Effects of commitment to being a goodrespondent on interview performance. In:Cannell CF, Oksenberg L, Converse JM,eds. Experiments in Interviewing Tech-niques. DHEW Publication No. (HRA)78–3204. Washington: Department ofHealth, Education, and Welfare; 1979:74–108.

44. Schwarz N. Self-reports: how the ques-tions shape the answers. Am Psychol.1999;54:93–105.

45. Revecki DA, Irwin D, Reblando J, et al.The accuracy of self-reported disabilitydays. Med Care. 1994;32:401–404.

46. Csikszentmihalyi M, Larson R. Validityand reliability of the Experience Sam-pling Method. In: deVries M, ed. TheExperience of Psychopathogy: Cam-bridge University Press; 1992:43–57.

47. Stone AA, Kessler RC, HaythornthwaiteJA. Measuring daily events and experi-ences: decisions for the researcher. JPers. 1991;59:575–607.

48. Campbell D. Campbell Leadership Index.Greensboro, NC: Center for CreativeLeadership; 1991.

49. Hanley JA, McNeil BJ. A method ofcomparing the areas under receiver oper-ating characteristic curves derived fromthe same cases. Radiology. 1983;148:829–843.

50. Peirce JC, Cornell RG. Integrating stra-tum-specific likelihood ratios with theanalysis of ROC curves. Med Decis Mak-ing. 1993;13:141–151.

51. Guyatt G, Rennie D. User’s Guide to theMedical Literature: A Manual for Evi-dence-Based Clinical Practice. Chicago:AMA Press; 2001.

52. Rothman KJ. Epidemiology: An Intro-duction. New York: Oxford UniversityPress; 2002:223.

53. Kessler RC, Ustun TB. The World HealthOrganization World Mental Health 2000Initiative. Hosp Manag Int. 2000;195–196.

54. Kessler RC, Frank RG. The impact ofpsychiatric disorders on work loss days.Psychol Med. 1997;27:861–873.

55. Kessler RC, Greenberg PE, MickelsonKD, et al. The effects of chronic medicalconditions on work loss and work cut-

JOEM • Volume 45, Number 2, February 2003 173

Page 19: Kessler Intro Who HPQ 2003 - BDMS CME Online

back. J Occup Environ Med. 2001;43:218–225.

56. Kessler RC, Mickelson KD, BarberCB, et al. The association betweenchronic medical conditions and workimpairment. In: Rossi AS, ed. Caringand Doing for Others: Social Respon-sibility in the Domains of Family,Work, and Community. Chicago, IL:

University of Chicago Press; 2001:403– 426.

57. Kroenke K, Spitzer RL, Williams JB.The PHQ-15: validity of a new measurefor evaluating the severity of somaticsymptoms. Psychosom Med. 2002;64:258–266.

58. Eisenberg DM, Kessler RC, Van RompayMI, et al. Perceptions about complemen-

tary therapies relative to conventionaltherapies among adults who use both:Results from a national survey. Ann In-tern Med. 2001;135:344–351.

59. Simon GE, Barber C, Birnbaum HG, etal. Depression and work productivity: thecomparative costs of treatment versusnontreatment. J Occup Environ Med.2001;43:2–9.

174 The WHO HPQ • Kessler et al