

Computer Methods and Programs in Biomedicine 22 (1986) 27-33. Elsevier

CPB 00746

A program for training and feedback about probability estimation for physicians

Dennis G. Fryback

Departments of Industrial Engineering and Preventive Medicine, University of Wisconsin, 1513 University Avenue, Madison, WI 53706, U.S.A.

Medical decisions are rarely made under conditions of complete certainty. In the past decade there has been a rapid growth of interest in formal methods for optimizing medical decisions under uncertainty [1,2]. Application of decision-analytic methods requires physicians to make probability estimates about clinical events for which extensive data are not available. This paper describes a computer program to train physicians to be better probability estimators: to make probability estimates that are numerically meaningful for use in formal decision analyses. It is designed to be a stand-alone application requiring about 2 hours of physician time. Use requires an IBM-PC or compatible microcomputer with graphics adaptor and monitor, and 8087 coprocessor.

Keywords: Probability estimation; Training; Feedback

1. Introduction

Medical decision making requires physicians to weigh uncertainties about the clinical situation and therapeutic alternatives against possible costs and benefits of the various patient management strategies. Formal optimization methods for making such decisions require quantification of the uncertainties and the values involved [e.g. 3].

The advent of computerized clinical records systems will accelerate our ability to make actuarial estimates of the probabilities underlying many clinical decision problems. In spite of such systems there will always be a subjective element to estimation of probabilities to be used in medical decision making. But the same computerized records systems that facilitate statistical estimation of probabilities can also be used as a data collection and feedback mechanism to train individual physicians to make as accurate and valid probability assessments as possible. Although this form of decision support will not be realized in the near future, there may be interim methods for improving physicians' abilities to quantify their subjective probabilities as inputs to clinical decision analyses. The purpose of this paper is to present the design of a computer-based training program to improve physicians' probability estimates.

2. Background: calibration of probability estimates

It is not easy to define what is a 'good' numerical estimate of a probability for an event for which there is no actuarial estimate [4]. If for a particular patient the physician estimates the probability of disease to be 0.10 and the patient turns out to have the disease, was the estimate good or bad? We cannot say.

0169-2607/86/$03.50 © 1986 Elsevier Science Publishers B.V. (Biomedical Division)

However, if we look at a group of patients, rather than at a particular instance, we can judge how good the physician is as a source of probability estimates. For example, in the set of instances where the probability of disease was estimated to be 0.10 we should find that about 10% of these patients did in fact have the disease and 90% did not. A similar correspondence should hold for other numerical levels of probability estimates. If a physician's probability estimates tend to be matched by outcome percentages in this fashion, that physician is said to be 'well calibrated'. A graph showing the correspondence between the estimated probability levels and outcome percentages is termed a calibration plot.
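As an illustration of the calibration check just described, here is a minimal Python sketch (not part of the original program, which was written in C; all names are hypothetical) that groups estimates by probability level and compares each level with its observed relative frequency:

```python
from collections import defaultdict

def calibration_table(estimates, outcomes):
    """Group estimates by probability level; for each level report how many
    estimates were made and the observed relative frequency of the event."""
    by_level = defaultdict(list)
    for p, happened in zip(estimates, outcomes):
        by_level[p].append(happened)
    return {p: (len(hits), sum(hits) / len(hits))
            for p, hits in sorted(by_level.items())}

# Ten patients assessed at 0.10, of whom exactly one had the disease:
table = calibration_table([0.1] * 10, [1] + [0] * 9)
# at the 0.10 level the observed frequency is 0.10 -- well calibrated
```

A well-calibrated estimator produces a table in which the observed frequency at each level is close to the level itself.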

Lichtenstein, Fischhoff and Phillips [5] summarize the cognitive psychology literature concerning human ability to make calibrated probability estimates or to be trained to do so. In laboratory studies subjects show modest calibration and modest susceptibility to training if given individualized feedback concerning their own calibration. Results from observing experts in the field are mixed. Weather forecasters issuing precipitation probabilities are superbly calibrated. Handicappers setting odds for horse races seem excellently calibrated for estimates in the range from 0 to 0.4, but are not so well calibrated above 0.4. Calibration of physicians seems to be modest to poor. Lichtenstein et al. offer several possible reasons for this, among which are the sporadic nature of final outcome resolution of many of the uncertainties in medical decision making (e.g. many patients are lost to timely follow-up) and the lack of systematic record keeping and feedback about clinical judgments.

Besides the calibration of probability estimates, it is desirable that they be as informative as possible. After all, it is possible to be perfectly calibrated by always assessing the probability to be the population base rate for the event in question. If an average 4% of patients with head trauma in a particular setting have a fracture, then estimating the probability always to be 4% for each patient would result in perfect calibration, but not be very informative about particular patients. We should observe that a set of probabilities that is calibrated will be informative if the probabilities exhibit variance and are correlated with outcomes of the events to which they refer. Technically they should imply an area that tends to 1.0 under an ROC curve for discriminating event outcomes (see discussion below).

The computer program described below trains the user to attend to both calibration and informativeness of probability estimates.

3. System description

3.1. Underlying model and general design of the program

The program is written in C-86 to run on an IBM-PC with graphics adaptor and 8087 mathematics coprocessor.

The purpose of the program is to provide training to improve the calibration of an individual physician's probability estimates while maintaining the information content of the probabilities. Its design presumes a model for subjective probability estimation analogous to psychophysical scaling.

The physician is presumed to integrate internally the pertinent information about the event for which the probability is being estimated. This process results in an internal, subjective response whose magnitude is in some fashion proportional to the degree of subjective certainty. Through some introspective process the physician perceives this internal response. Making a probability estimate is the act of consciously assigning a number to represent the magnitude of the internal response of subjective certainty.

The purpose of feedback about these probability estimates is to assist the physician in learning to make numerical estimates that are representative of the internal response and that externally can be shown to be calibrated.

This model for psychophysical scaling and feedback training could equally well apply to estimating weights of lifted objects. The subject lifts the object in his/her hand, producing an internal sensation of weight, and then makes a numerical estimate of the weight in, say, grams. Feedback about the actual weight is then given, and the process repeated with another object. Within a reasonable range of weight, the person will become quite well calibrated in the sense that the estimates will tend to match actual weights.

The probability application of this model is complicated by the fact that the 'true' probability for a single event is not known, so that we are unable to give feedback about a single event; we can only inform the physician about the calibration for a set of estimates. How the feedback program handles this complication is explained later.

The probability estimation feedback program proceeds in a similar fashion to the lifted weights example above. A stimulus is presented that evokes an internal subjective certainty response. The user enters a numerical probability to represent this internal response. Then feedback is given about calibration and informativeness of the user's probability estimates, and the cycle repeated. The purpose of the repetition is to give as much practice as possible in the process of assigning numbers to represent varying degrees of internal certainty. The purpose of the feedback is to make immediately visible to the user properties of his/her numerical probability assessments that normally are not monitored or quantified without considerable data collection and computational effort.

3.2. Specific design features

3.2.1. Stimuli to evoke certainty levels
The program uses a series of 159 multiple choice, general medical knowledge questions. The questions were adapted from preparation materials for the Board Examination in General Internal Medicine. They were selected to vary in difficulty, and edited to length requirements necessitated by having to appear in a limited region of a computer screen. The number of answer choices varies from 2 to 5, with 4- and 5-answer questions predominating. An example question is:

A 27-year-old man who had had home intravenous hyperalimentation (TPN) for two years because of regional enteritis consults you because of back pain. X-rays of his thoracic and lumbar spine showed diffuse osteopenia. On further evaluation you would expect to find which of the following?
1. Calcium 7.8 mg/dl
2. Phosphorus 2.1 mg/dl
3. Alkaline phosphatase 150 IU/l
4. Magnesium 1.1 mg/dl
5. 25-Hydroxycholecalciferol 35 ng/ml (15 to 40 ng/ml)

3.2.2. Probability estimation and feedback process
The user is instructed to enter a percentage probability estimate for each answer in turn, starting with answer number 1. The probability assessments are to represent the likelihood that the Board Examiners considered the particular answer to be the one correct answer for the question.

The sole intent of the question is to stimulate varying levels of subjective certainty - here, about which answer is the correct one. The user responds to each possible answer, entering a percentage probability that the answer is the correct one for the question. After estimates are entered for every answer to a question, the program highlights the correct answer and updates running statistics about calibration and informativeness of the user's estimates in general. These running statistics are shown as time series graphs and numerical summaries on the lower half of the screen.

As the user works through the series of questions he/she is encouraged to experiment by making estimates that over- and understate confidence in order to see the effect on calibration and informativeness of the probability estimates. The program does not require the percentage probability estimates to add to 100%, although as an assistance to the user the program displays the difference between 100% and the cumulative percentage for all answers to that point in the multiple choice question, so no mental arithmetic need be done if the user wishes to make his estimates add to 100%.

During use it is possible to select for continual display any two of eight possible graphical presentations of measures of calibration and informativeness of the probability estimates. Each display requires slightly less than one-quarter of the screen area on the IBM monitor and is displayed in three-tone (black, grey, white) pictures using the high-resolution graphics display on the IBM graphics monitor.

These measures and displays are described in the sections that follow.


3.3. Measures of calibration

The program computes four measures of calibration. Yates [6] describes computational formulas for each. The same formulas are used by the program, except that they have been rescaled for use with percentage probability estimates.

Bias is the signed deviation between the average probability estimate and the average relative frequency of correct answers. In the present application this reduces to the average amount by which the user's estimates exceed or fall short of a total of 100% per question. In a clinical setting it is the difference between the average estimate and the actual overall base rate. For example, if a physician's average estimate for the probability of strep cultures being positive were 35%, and overall only 25% of the cultures for which he made estimates came back positive, the bias is +10%.

Yates defines 'calibration-in-the-large' to be the squared bias.

'Calibration-in-the-small' is an average squared deviation from relative frequencies for each distinct level of probability estimate. In other words, for each probability estimate level used (e.g. 10%), all question answers for which this estimate was made are examined and the proportion of them that were the correct answer is computed. The difference between the estimate and the actual relative frequency is computed and squared. These squared differences are weighted by the number of times the particular estimate was used, and then summed. This is a measure of deviation from perfect empirical calibration as discussed earlier.
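The three Yates measures just described can be sketched as follows. This is an illustrative reconstruction on the 0-1 probability scale (the program itself works with percentages), and the function name is hypothetical:

```python
from collections import defaultdict

def yates_measures(estimates, outcomes):
    """Bias, calibration-in-the-large (squared bias), and
    calibration-in-the-small, after Yates' decomposition."""
    n = len(estimates)
    bias = sum(estimates) / n - sum(outcomes) / n
    cal_in_large = bias ** 2
    by_level = defaultdict(list)
    for p, happened in zip(estimates, outcomes):
        by_level[p].append(happened)
    # squared (estimate - relative frequency) per level, weighted by usage
    cal_in_small = sum(len(h) * (p - sum(h) / len(h)) ** 2
                       for p, h in by_level.items()) / n
    return bias, cal_in_large, cal_in_small

# The strep-culture example: average estimate 0.35, base rate 0.25
bias, cal_large, cal_small = yates_measures([0.35] * 4, [1, 0, 0, 0])
# bias is +0.10, i.e. a 10-point overestimate as in the text
```

All three values fall to zero for a perfectly calibrated set of estimates.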

A plot of the pairs of probability levels and observed relative frequencies is called an 'empirical' calibration plot. Empirical plots are not displayed since the relative frequency at each probability level may be based on a different number of data points. Since the relative frequency is merely an empirical estimate of a binomial parameter, the estimates represented by the points will have considerably different variances, making the plot difficult to interpret. Instead, a different technique is used to obtain a calibration plot: the program displays a calibration plot using logistic regression. The plot is a graph of the equation

p' = exp(a + b·p) / (1 + exp(a + b·p))

where p' is the regression estimate of the relative frequency that corresponds to p, the subjectively estimated probability. The constants a and b are fitted from the data using the method of maximum likelihood. The logistic function is particularly suited to this interpretation and usage [7].

This logistic regression function is displayed for values of p ranging from 0 to 100%, with the user's percent probability estimate on the abscissa and the logistic regression estimate of the true probability, given the user's estimate, on the ordinate. The positive diagonal on this plot represents perfect calibration of estimates. This plot enables the user to see graphically the adjustment he should make to his numerical assignments throughout the probability scale to become better calibrated.
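A rough sketch of how such a logistic calibration curve might be fitted by maximum likelihood, here via plain gradient ascent on the log-likelihood rather than whatever fitting routine the original C program used; names, step counts, and the 0-1 scale are assumptions of this sketch:

```python
import math

def fit_logistic_calibration(estimates, outcomes, steps=20000, lr=0.5):
    """Fit p' = exp(a + b*p) / (1 + exp(a + b*p)) by maximizing the
    likelihood with plain gradient ascent (illustrative, 0-1 scale)."""
    a = b = 0.0
    n = len(estimates)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for p, y in zip(estimates, outcomes):
            pred = 1.0 / (1.0 + math.exp(-(a + b * p)))
            grad_a += y - pred
            grad_b += (y - pred) * p
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def curve(a, b, p):
    """The fitted calibration curve evaluated at estimate p."""
    return 1.0 / (1.0 + math.exp(-(a + b * p)))

# Well-calibrated data: outcome frequencies match the estimate levels,
# so the fitted curve should hug the positive diagonal.
levels = [0.1] * 10 + [0.5] * 10 + [0.9] * 10
hits = [1] + [0] * 9 + [1] * 5 + [0] * 5 + [1] * 9 + [0]
a, b = fit_logistic_calibration(levels, hits)
```

For a user who over- or understates confidence, the fitted curve departs from the diagonal in the direction of the needed correction.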

The numerical scores for the first three measures are all minimized, equal to zero, under perfect calibration. The user should try to drive the logistic function in the calibration plot toward the positive diagonal.

3.4. Indicators of informativeness

Four different ways are used to indicate how well correct answers can be discriminated from incorrect answers using the probability estimates.

Conditional means are computed for probability estimates given for correct answers and for estimates given for incorrect answers. Ideally, an average estimate of 100% should be given to correct answers, and an average of 0% should be given to incorrect answers.
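The conditional means reduce to two group averages; a minimal sketch with hypothetical names, on the 0-100 percentage scale the program uses:

```python
def conditional_means(estimates_pct, correct):
    """Average percentage estimate given to correct answers and the
    average given to incorrect answers."""
    right = [p for p, c in zip(estimates_pct, correct) if c]
    wrong = [p for p, c in zip(estimates_pct, correct) if not c]
    return sum(right) / len(right), sum(wrong) / len(wrong)

mean_correct, mean_incorrect = conditional_means(
    [90, 80, 20, 10], [True, True, False, False])
# 85.0 for correct vs. 15.0 for incorrect: well separated, hence informative
```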

The probability score computed by the program is the Brier or quadratic probability score [8]. It is the average squared difference between the user's probability estimates and those that a clairvoyant would make (i.e. 100% for correct answers and 0% for incorrect answers). Ideally the probability score should be zero. The program also computes the running probability score for a hypothetical estimator who says 50% for all two-answer questions, 33% for all three-answer questions, 25% for all four-answer questions, etc. This is a guaranteed upper bound on the probability score that the user should try to improve on (by achieving a lower score).

The histogram option displays a frequency histogram of estimates assigned to correct answers and displays this next to a similar histogram for estimates assigned to incorrect answers. The user can thus see simultaneously the entire distributions for which the conditional means are a summary.
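The Brier score and the uniform-guess baseline described above might be sketched as follows; the rescaling of percentages to the 0-1 range is an assumption of this sketch, not something the paper specifies:

```python
def brier_score(estimates_pct, correct):
    """Mean squared difference between the user's estimates and a
    clairvoyant's 100%/0%, rescaled here to the 0-1 range."""
    return sum(((p - (100.0 if c else 0.0)) / 100.0) ** 2
               for p, c in zip(estimates_pct, correct)) / len(estimates_pct)

def uniform_guess_score(n_answers):
    """Score of the hypothetical estimator who spreads 100% evenly over
    the answers of an n-answer question (e.g. 25% each for four)."""
    p = 1.0 / n_answers
    return ((1.0 - p) ** 2 + (n_answers - 1) * p ** 2) / n_answers

score = brier_score([80, 10, 5, 5], [True, False, False, False])
baseline = uniform_guess_score(4)
```

A user with any discriminating ability should achieve a score below the uniform guesser's baseline.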

Finally, the area under the ROC curve describing the user's ability to discriminate correct from incorrect answers is computed using the method of Hanley and McNeil [9]. This area is meaningful in a precise way. Imagine all answers which were correct as one population and all answers which were incorrect as another. Imagine further that one answer is picked at random from each population and the chosen two are displayed (with their associated questions) to the user. The user's task in this imaginary scenario is to indicate which was selected from among the correct answers. If he cannot discriminate correct from incorrect answers at all, the probability of picking the correct answer is 0.5, the chance level; if he can perfectly discriminate, the probability is 1.0. We expect the user to be somewhere in between, according to his imperfect, but non-zero, discriminating ability as revealed in his probability estimates. The area under the ROC curve estimates the probability of the user picking the correct answer in this idealized task. It is thus a direct measure of the informativeness of the set of probabilities.
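The Hanley-McNeil area can be computed directly from its pairwise interpretation (it equals the Mann-Whitney statistic); a sketch with hypothetical names:

```python
def roc_area(estimates, correct):
    """Probability that a randomly drawn correct answer received a higher
    estimate than a randomly drawn incorrect one (ties count half);
    this equals the area under the ROC curve."""
    pos = [p for p, c in zip(estimates, correct) if c]
    neg = [p for p, c in zip(estimates, correct) if not c]
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in pos for y in neg)
    return wins / (len(pos) * len(neg))

area = roc_area([90, 70, 70, 30, 10], [True, True, False, False, False])
# 5.5 of the 6 pairings favour the correct answer, an area of about 0.917
```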

3.5. Time series calculations

All of the eight measures listed above are relatively unstable statistics; they exhibit a good deal of variance in small samples. Thus, to give a statistically accurate estimate of probability estimation performance it is desirable to compute them over a large number of probability estimates. Unpublished Monte Carlo simulations by this writer indicate that 'large number' here means on the order of 200 to 300 estimates. However, as users proceed through a sequence of probability estimates, they are expected to change (learn) due to the experience, so that, by the time 300 estimates are accumulated, it may be that the first 200 are irrelevant to current performance. (This is analogous to the problem of using mortality and cure rate statistics from early experience with bypass surgery to estimate current risk and efficacy of the procedure.)

We are thus in a dilemma. A large number of estimates is needed for a statistically stable picture of performance, yet only the few most recent estimates may be relevant to current abilities. The program resolves this dilemma by using a sliding window of estimates to make all calculations, i.e. only the most recent 50 estimates are used in the computations outlined above. This window of data may be thought of as sliding forward in the data stream over time. Upon the 50th estimate entered in the system, all statistics are computed for the first window. The next estimate, the 51st, generates a new window by dropping the 1st estimate, making the current window comprise estimates number 2 through 51, and so forth.
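The sliding window can be sketched with a bounded queue; the generator below is illustrative and not the paper's implementation:

```python
from collections import deque

def windowed(stream, stat, window=50):
    """Yield stat(most recent `window` items) for each position of the
    sliding window, starting once the first `window` items have arrived."""
    buf = deque(maxlen=window)  # automatically drops the oldest item
    for item in stream:
        buf.append(item)
        if len(buf) == window:
            yield stat(list(buf))

# e.g. a running mean over windows 1..50, 2..51, ..., 51..100
series = list(windowed(range(1, 101), lambda w: sum(w) / len(w)))
```

Each yielded value corresponds to one point on the scrolling time series plot.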

Each of the summary statistics (excluding the pictorial summaries provided by the histograms and the logistic regression) is computed for every window and plotted serially as a time series that scrolls across the 1/4-screen graphical area. By this means the user can observe changes over time and trends in the different measures. These time series are updated after every multiple choice question so that the user can immediately see changes implied by the last few probability estimates.

Fig. 1. Typical screen display. User has entered probability estimates for each answer and the program has highlighted answer 4 as the correct one. Selected displays are the histogram (bottom left) and the logistic regression plot (bottom right). Other display options are shown in the bar above the plots. Choosing 'Timeseries' gives the option menu shown in Fig. 2.

Fig. 2. An alternative display of conditional means (left) and area under the ROC curve (right), each plotted as a time series across windows (see text). User can choose any two of 7 time series or 3 current-window displays. The display can be changed after each question or left in place, at the user's option.

Because the current program sets the window size at 50 estimates, the different measures plotted in time series are very responsive to changes in calibration and resolution exhibited by the probability estimates in the window. This allows the user to experiment by adopting different attitudes of conservatism or extremism in making estimates in order to see their effects. In this manner the probability estimation feedback program becomes almost a real-time analog device to make visible what is normally an implicit, invisible, internal process by which probability estimates are formed and expressed. It is hoped that through this capability the program can serve as a self-administered teaching device for probability estimation training.

Figs. 1 and 2 illustrate screen appearance during use of the program. The current multiple choice question is shown at the top of the screen, with the user's estimates entered next to each answer. The correct answer is highlighted. The highlighted bar across the middle of the screen shows display options. Any two graphs (time series, or logistic regression and histogram for the current window) can be selected for display at the bottom of the screen. These are automatically updated after every question. Between questions the user can review all graphs and change the current selection of graphs for continuous display.

4. Status report

Anecdotal use of the program already shows it to be a creditable teaching instrument for illustrating the meaning of probability estimates and of the several summary measures of calibration and informativeness. But whether it can actually change the ability of physicians to make probability statements remains to be subjected to formal test.

Physicians at several sites around the United States have expressed interest in participating in a test of the training effect of the program. The probabilities of interest are collected when these physicians order laboratory tests: at the time of ordering tests for a clinical series they record their probability estimate that the test result will be positive (show the patient to be abnormal). Later, the test results will be collected and paired with the prospective probability estimates.

The hypothesis to be tested is that a one- to two-hour training session with the probability estimation feedback program will measurably improve the calibration of this type of probability estimate. Because of the relative instability of the summary statistics for calibration, no one physician will have enough data to test this hypothesis as an individual. But the collective data, say 30 estimates per physician before training and 30 after for 10 physicians in a treatment group, and 60 sequential estimates by each of 10 physicians in a control group, should be sufficient to detect a collective improvement in calibration.

The laboratory studies cited by Lichtenstein et al. [5] make the hypothesis of a positive effect of training a viable one within an artificial setting. If it can be demonstrated that this sort of training with an artificial probability estimation environment can in fact transfer to a real-world, professional setting, the cognitive psychology literature will be extended. And, more importantly, the viability of a continual feedback training function embedded in a computerized clinical records system will be established.

Acknowledgements

The assistance of Richard Gregg in the development of the medically oriented questions, and of Anne Johnson for computer layout design and programming, is gratefully acknowledged. This work was performed while the author was on leave from the University of Wisconsin at the Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD 20894.

This article is based on 'A Program for Training and Feedback About Probability Estimation for Physicians' by Dennis G. Fryback, National Library of Medicine, Bethesda, MD, appearing in Ninth Annual Symposium on Computer Applications in Medical Care, Baltimore, MD, November 10-13, 1985, p. 202. © 1985 IEEE.


References

[1] H.V. Fineberg, Medical decision making and the future of medical practice (Editorial), Medical Decision Making, 1 (1981) 4-6.

[2] L.B. Lusted, A society and a journal (Editorial), Medical Decision Making, 1 (1981) 7-9.

[3] S.G. Pauker and J.P. Kassirer, The threshold approach to clinical decision making, N. Engl. J. Med. 302 (1980) 1109-1117.

[4] D.G. Fryback, Decision maker, quantify thyself!, Medical Decision Making, 5 (1985) 51-60.

[5] S. Lichtenstein, B. Fischhoff and L.D. Phillips, Calibration of probabilities: The state of the art to 1980, in Judgment under Uncertainty: Heuristics and Biases, eds. D. Kahneman, P. Slovic and A. Tversky, pp. 307-334 (Cambridge University Press, Cambridge, 1982).

[6] J.F. Yates, External correspondence: Decompositions of the mean probability score, Organizational Behavior and Human Performance, 30 (1982) 132-156.

[7] J.A. Anderson, Logistic discrimination, in Handbook of Statistics, Vol. 2, eds. P.R. Krishnaiah and L.N. Kanal, pp. 169-191 (North-Holland Publishing Company, Amsterdam, 1982).

[8] A.R. Shapiro, The evaluation of clinical predictions: A method and initial application, N. Engl. J. Med. 296 (1977) 1509-1514.

[9] J.A. Hanley and B.J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143 (1982) 29-36.