

Journal of Forecasting, Vol. 7, 185-199 (1988)

The Consistency, Coherence and Calibration of Holistic, Decomposed and Recomposed Judgemental Probability Forecasts

GEORGE WRIGHT Bristol Business School, Bristol BS16 1QY, England

CAROL SAUNDERS and PETER AYTON Decision Analysis Group, Psychology Department, City of London Polytechnic, London E1 7NT

ABSTRACT

In this paper we make an empirical investigation of the relationship between the consistency, coherence and validity of probability judgements in a real-world forecasting context. Our results indicate that these measures of the adequacy of an individual's probability assessments are not as closely related as we had anticipated. Twenty-nine of our thirty-six subjects were better calibrated in point probabilities than in odds, and our subjects were, in general, more coherent using point probabilities than odds forecasts. Contrary to our expectations, we found very little difference in forecasting response and performance between simple and compound holistic forecasts. This result is evidence against the 'divide-and-conquer' rationale underlying most applications of normative decision theory. In addition, our recompositions of marginal and conditional assessments into compound forecasts were no better calibrated or resolved than their holistic counterparts.

These findings convey two implications for forecasting. First, untrained judgemental forecasters should use point probabilities in preference to odds. Second, judgemental forecasts of complex compound probabilities may be as well assessed holistically as they are using methods of decomposition and recomposition. In addition, our study provides a paradigm for further studies of the relationship between consistency, coherence and validity in judgemental probability forecasting.

KEY WORDS Probability Judgements Judgemental Forecasting Calibration Cross-impact Analysis

Judgemental probabilities are prime inputs to many management decision-aiding technologies such as decision analysis, cross-impact analysis and fault-tree analysis. These subjective probabilities are, essentially, probability forecasts of the likelihood of occurrence of future events. Psychological studies of subjective probability estimation have tended to involve

0277-6693/88/030185-15$07.50 © 1988 by John Wiley & Sons, Ltd.

Received April 1987 Revised January 1988

Journal of Forecasting, Vol. 7, Iss. No. 3

artificial laboratory tasks, or simple paper and pencil exercises, which have revealed limitations and biases in human thinking relative to normative probability theory. For a review, see Wright (1984).

In this paper we make an empirical investigation of the relationship between the consistency, coherence and calibration of probability judgements in a forecasting context.

We have argued previously (Ayton and Wright, 1985) that probabilistic thinking should be taught and evaluated with real-world examples rather than by using exclusively abstract examples from the domain of urns and dice. This is because there is evidence suggesting that people do not always apply their competence with probability concepts when expressing uncertainty about possible future scenarios. For instance, according to one axiom of probability theory, additivity, the component probabilities of a set of mutually exclusive and exhaustive events should add to one. Wright and Whalley (1983) have shown that whilst people do use the probability axiom of additivity when assessing probabilities in abstract 'classroom-type' problems such as those using dice, when the problems have a real-world content, such as assessing the probabilities of horses winning a horse race, people tend to show non-additivity. The most common form of non-additivity is supra-additivity, where probabilities sum to more than unity. Wright and Whalley argued that individuating, case-specific information tends to disassociate a specific event from its event set and causes the obtained non-additivity with real-world as opposed to abstract problems. In the horse-race problem, the individuating information was presented in the form of a record of each horse's previous performances.

By consistency in probability forecasts we mean the extent to which a probability assessor shows the same degree of belief in the likely occurrence of an event when assessments obtained by different assessment methods are converted to a common metric. For example, odds of 3 to 2 given against an event happening should correspond to a point probability estimate of 2/(3 + 2) = 0.4 given for the event happening.

Point probability estimates and odds are two commonly used direct methods for probability assessment. However, relatively little attention has been paid in the literature to the question of whether there is a possible response mode bias, i.e., if probability estimates derived by different methods for the same event are inconsistent, which method should be taken as the true index of degree of belief? Several researchers have shown that responses based on odds are more extreme than those expressed as probabilities (Du Charme, 1969; Phillips and Edwards, 1966; Wheeler and Edwards, 1975). But as Wallsten and Budescu (1983) note, 'It is an open question as to whether one response mode gets closer to true opinion, or indeed whether opinion actually varies with response mode' (p. 160).

By coherence in probability forecasts we mean the extent to which a probability assessor’s forecasts conform to the axioms or laws of probability theory. The additivity axiom mentioned earlier is one such probability law. Another axiom is the intersection law which states that the probability of event A and event B both happening is the product of the probability of event A multiplied by the probability of event B given event A has happened. In terms of an abstract example, the probability of drawing two consecutive aces from a pack of fifty-two cards, without replacing the first card drawn, is

P(2 aces) = 4/52 x 3/51 = 12/2652 = 0.00452
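The arithmetic of the intersection law can be verified in a few lines; a minimal sketch in Python:

```python
# Checking the intersection-law arithmetic for two consecutive aces
# drawn without replacement from a 52-card pack.
p_first_ace = 4 / 52           # P(A): four aces among 52 cards
p_second_given_first = 3 / 51  # P(B | A): three aces left among 51 cards

# P(A and B) = P(A) * P(B | A)
p_two_aces = p_first_ace * p_second_given_first
print(round(p_two_aces, 5))  # -> 0.00452
```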

Barclay and Beach (1972) investigated the combinatorial properties of subjective probabilities for the union and intersections of events based on holistic assessment of the events and their constituent marginal and conditional probabilities (see Figure 1 for definitions of


these probabilities). These investigators found that while people did not obey the probability laws in their assessments, the proper rules were a better description of assessment of the complex event probabilities than improper rules involving wrongly multiplying instead of adding, or wrongly adding, or taking a mean instead of multiplying.

So although, strictly speaking, the subjective probabilities did not conform to the probability laws, the same laws provided the best description, suggesting that in this case subjective probability fits a ‘truth plus error’ model which states that although subjects will make occasional ‘irrational’ errors they are essentially coherent.

Bar-Hillel (1973) had subjects choose between gambles, where the outcome of one gamble depended on a single elementary event, and the other depended upon a complex compound event occurring. She found evidence that the subjective probability of a compound event was biased towards the probability of its components, resulting in an overestimation of conjunctive events and an under-estimation of disjunctive events. These probabilities were for essentially abstract situations where no knowledge (excepting that of probability theory) would aid judgement. As we have seen, it may be important to bear this factor in mind when considering whether judgement is coherent.

By calibration in probability forecasts we mean the extent to which an assessed probability is equivalent to percentage correct over a number of assessments of equal probability. For example, if you assign a probability forecast of 0.7 to each of twenty events occurring, fourteen of these events should actually occur. Similarly, each event assessed as certain to occur, 1.0 probability, should occur.

A good overview of calibration research is contained in Lichtenstein, Fischhoff and Phillips (1981). Up until that time, most studies of the calibration of subjective probabilities utilized general knowledge questions of the form: ‘which is longer? (a) Suez Canal; (b) Panama Canal’. The respondent is required to indicate the answer he or she feels to be the correct one and then indicate how sure he or she is of this by assigning a probability in the range 0.5 to 1.0. Since the experimenter already knows the answers to the questions, calibration and related measures can easily be computed. However, it has been argued (Wright and Ayton, 1986a) that the process of making probability judgements about the likelihood of future events is very different from that of judging the veracity of general knowledge propositions. In particular, several variables which cannot be manipulated in studies of calibration using general knowledge items may potentially influence the process of probability forecasting and subsequent calibration. For example, we have investigated the effects of the imminence and time duration of a forecast period (Wright and Ayton, 1987a; in press a) and the influence of the desirability and perceived controllability of the forecast events (Wright and Ayton, in press b).

We have argued recently (Wright and Ayton, 1987b) that consistency, coherence and calibration are logically interrelated in the same manner as reliability and validity in questionnaire design. Wallsten and Budescu (1983) have taken a similar psychometric approach to subjective probability assessment. Using the terminology of psychometrics, consistency can be viewed as being analogous to alternate-form reliability, coherence as a measure of internal-consistency reliability and calibration as a measure of predictive validity. Construed in this manner it can be seen that perfect consistency is a pre-requisite for perfect coherence and perfect coherence is, in turn, a base-line for perfect calibration. Although complete fulfilment of each stage of our analysis of judgemental forecasting is a pre-requisite for complete fulfilment of the subsequent stage, this fulfilment is a necessary rather than sufficient condition. To illustrate the argument, consider a poorly manufactured rule which gives perfectly reliable measures of distance but is invalid since it under-measures by one millimetre in every metre.

In this paper we investigate the empirical relationships between the consistency, coherence


and calibration of probability assessments using untrained assessors making odds and point probability estimates in a forecasting task. Our hypotheses are somewhat of the ‘let’s see what happens if ...’ variety, but are based on our intuition that someone who is a well-calibrated forecaster will be coherent and consistent in his or her forecasts. Or, to put it another way, inconsistent and incoherent forecasters will tend to be poorly calibrated.

Overconfidence is a well-replicated finding in calibration research: generally, of all events assessed as having a 0.XX probability of occurrence, less than XX% actually occur. Since there is no finite upper or lower limit to odds assessment, our hypothesis is that overconfidence will be more prevalent with odds responses as opposed to point probability responses.

We also hypothesise that assessment of ‘simpler’ marginal and conditional probabilities will be better calibrated than holistic assessment of more ‘complex’ compound probabilities such as intersections, unions and disjunctions. In our earlier playing-card example, the intersection of drawing two consecutive aces could be assessed holistically, as could the marginal probability of the first draw being an ace and the conditional probability of a second ace given an ace is drawn first.

Why should the simpler marginal and conditional probabilities be better calibrated than the more complex compound probabilities? Our logic is based on the same 'divide-and-conquer' rationale that underlies use of subjective expected utility theory and Bayes' theorem as normative decision aids. The implicit rationale is that humans have a 'limited processing capacity' and although they are able to provide decomposed inputs to technologies implementing the normative theories, the mathematical manipulations of these inputs are best left to a computer. A similar 'divide-and-conquer' rationale is implicit in cross-impact analysis, which is used as an aid to judgemental forecasting. Cross-impact analysis is based on a modification of the Delphi approach and produces a probability forecast for a 'scenario': a list of events that might occur by a particular time. The scenario can be seen as a complex intersection of many events. Forecasters are required to assess pairwise conditional probabilities to define simple interactions which are then used to compute scenario probabilities.
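As an illustration of this recomposition idea, a scenario probability can be built up by chaining conditional assessments. The events and probabilities below are invented for illustration, not drawn from any cross-impact study:

```python
# Hypothetical recomposition of a scenario probability. The scenario
# "A and B and C" is treated as an intersection and chained through
# conditional assessments; all inputs are invented for illustration.
p_a = 0.6           # P(A): marginal probability of the first event
p_b_given_a = 0.5   # P(B | A): pairwise conditional assessment
p_c_given_ab = 0.4  # P(C | A and B): conditional on the partial scenario

# P(A and B and C) = P(A) * P(B | A) * P(C | A and B)
p_scenario = p_a * p_b_given_a * p_c_given_ab
print(round(p_scenario, 4))  # -> 0.12
```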

Of course, the holistic probability of a compound event, such as an intersection, can also be assessed directly, and in this experiment we compare the calibration of holistic and recomposed compound probabilities. Our expectation is that the recomposed assessments should be better calibrated than the counterpart holistic assessments; otherwise the 'divide-and-conquer' rationale would be undermined.

METHOD

Thirty-six British students attending the City of London Polytechnic were paid £2 each to complete our forecasting questionnaire in January 1985.

The questionnaire was divided into two equivalent parts, each containing an identical set of 100 statements such as:

The pound sterling (a) will (b) will not be below $1.00 in March 1985.

One set of 100 statements required probability assessments; the other required odds assessments.

Twenty questions were devoted to each of five possible future events. The events included were: event I, the pound falls below $1.00; event II, an airline crash occurs in which lives are


lost; event III, the temperature in Paris exceeds 15°C; event IV, Manchester United Football Club loses a football match; event V, a British soldier is killed by violence in Northern Ireland.

For each event, forecasts were elicited for five time periods. As well as March (time period B), amended versions of the same statements dealt with February (time period A), April (time period C), 1-15 March (time period D), and 16-31 March (time period E).

Questions 1-5 asked for an estimation of the likelihood of an event's occurrence in each of the five time intervals, and questions 6-11 asked for probabilities conditional on the event's occurrence or non-occurrence in the prior time interval, i.e., P(A), P(B), P(C), P(D), P(E), then P(B|A), P(C|B), P(E|D) and the corresponding negative conditionals P(B|Ā), P(C|B̄), P(E|D̄). Questions 12-14 were concerned with the intersection probabilities of the event happening at least once in each of two adjacent time periods, i.e., P(A and B), P(B and C), P(D and E). Finally, questions 15-20 involved disjunctions and unions of two adjacent time periods, i.e., P(A or B), P(B or C), P(D or E), P(A and/or B), P(B and/or C), P(D and/or E).

The ordering of the five events, and of the questions within the events, was randomised across subjects, and the ordering of the two sets of 100 questions was counterbalanced. In completing the questionnaire, subjects were required to mark the right answer and to indicate how sure they were of the answer. In one part this was achieved by stating a probability between 0.5 and 1, where 0.5 indicates a ‘don’t know’ response, and a probability of 1 quantifies a certainty response. In the other part of the questionnaire the same questions required subjects to indicate how certain they were by giving odds for or against a particular event happening, e.g., odds of 1 to 1 quantify an equal likelihood of an event’s occurrence as its non-occurrence, and very high odds would indicate high confidence.

After each question the subjects rated how difficult they found it to produce the probability or the odds response. Subjective difficulty was assessed on a 7 point scale with 1 meaning extremely easy and 7 meaning extremely difficult.

Simple holistic forecasts were used as inputs to probability equations from which recomposed compound forecasts were computed; this gave a further nine compound forecasts for each event. Figure 1 sets out the probability equations we used.

Intersection:  P(A and B) = P(A) P(B|A)
Disjunction:   P(A or B) = P(A)[1 - P(B|A)] + [1 - P(A)] P(B|Ā)
Union:         P(A and/or B) = P(A) P(B|A) + P(A)[1 - P(B|A)] + [1 - P(A)] P(B|Ā)

Figure 1. Recomposed compound forecasts based on their simple holistic constituents
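The Figure 1 equations translate directly into code; a minimal sketch (function names are ours, and p_b_given_not_a stands for the negative conditional P(B|Ā); the input values are invented for illustration):

```python
# Recomposed compound forecasts from simple holistic constituents,
# following the three equations of Figure 1.

def intersection(p_a, p_b_given_a):
    # P(A and B) = P(A) * P(B|A)
    return p_a * p_b_given_a

def disjunction(p_a, p_b_given_a, p_b_given_not_a):
    # P(A or B), exclusive: exactly one of A and B occurs
    return p_a * (1 - p_b_given_a) + (1 - p_a) * p_b_given_not_a

def union(p_a, p_b_given_a, p_b_given_not_a):
    # P(A and/or B): at least one of A and B occurs
    return (intersection(p_a, p_b_given_a)
            + disjunction(p_a, p_b_given_a, p_b_given_not_a))

# Invented inputs: P(A) = 0.7, P(B|A) = 0.6, P(B|not-A) = 0.5;
# union = 0.42 + 0.28 + 0.15 = 0.85
print(round(union(0.7, 0.6, 0.5), 4))  # -> 0.85
```

Note that the union is, by construction, the sum of the intersection and the exclusive disjunction.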

After the time allowed for the events' occurrence or non-occurrence had elapsed, the three probability performance measures proposed by Murphy (1973) were calculated. These measures of calibration, over/under-confidence, and resolution were calculated for each subject, each event, each type of question and each response mode.

The measures of judgemental forecasting performance are defined as follows:

Calibration = (1/N) Σ_{t=1}^{T} n_t (r_t - c_t)²

where n_t is the number of times response r_t was used, c_t is the proportion correct for all items assigned probability r_t, N is the total number of responses, and T is the total number of different response categories used. A perfectly calibrated person would score 0 on this measure. The two other probability


performance measures proposed by Murphy (1973) are as follows:

Over/underconfidence = (1/N) Σ_{t=1}^{T} n_t (r_t - c_t)

This measure is equivalent to the difference between the mean of the probability responses and the overall proportion correct. Overconfidence is shown by a positive difference, underconfidence by a negative difference.

Resolution = (1/N) Σ_{t=1}^{T} n_t (c_t - c̄)²

where c̄ is the overall proportion correct. This measure reflects the assessor's ability to discriminate individual occasions on which the event of interest will and will not take place. The higher the resolution score, the better resolved the judgements.

Since the number of categories, T, will affect the size of the calibration and resolution scores, all the responses were grouped into six categories: 0.5 to 0.59, 0.6 to 0.69, 0.7 to 0.79, 0.8 to 0.89, 0.9 to 0.99 and 1.0. The mean response in each category was used as r_t, and the proportion correct across the whole category was used for c_t. This method is in accordance with Lichtenstein and Fischhoff (1980).
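Under these definitions, the three measures can be sketched as follows. This is an illustrative implementation of the binning scheme just described, not the authors' code, and it assumes outcomes are coded 1 for a correct answer and 0 otherwise:

```python
# Murphy's (1973) partition-based measures, computed over six response
# categories: 0.5-0.59, 0.6-0.69, 0.7-0.79, 0.8-0.89, 0.9-0.99, and 1.0.

def murphy_measures(forecasts, outcomes):
    # forecasts: probabilities in [0.5, 1.0]; outcomes: 1 if correct else 0
    bins = [(0.5, 0.6), (0.6, 0.7), (0.7, 0.8),
            (0.8, 0.9), (0.9, 1.0), (1.0, 1.01)]  # last bin catches 1.0 exactly
    n = len(forecasts)
    c_bar = sum(outcomes) / n  # overall proportion correct
    calibration = overconfidence = resolution = 0.0
    for lo, hi in bins:
        idx = [i for i, f in enumerate(forecasts) if lo <= f < hi]
        if not idx:
            continue
        n_t = len(idx)
        r_t = sum(forecasts[i] for i in idx) / n_t  # mean response in the bin
        c_t = sum(outcomes[i] for i in idx) / n_t   # proportion correct in the bin
        calibration += n_t * (r_t - c_t) ** 2
        overconfidence += n_t * (r_t - c_t)
        resolution += n_t * (c_t - c_bar) ** 2
    return calibration / n, overconfidence / n, resolution / n

# Ten forecasts of certainty, only half of which come true:
# calibration 0.25, over/underconfidence +0.5, resolution 0.
print(murphy_measures([1.0] * 10, [1] * 5 + [0] * 5))
```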

RESULTS

Three of the simple conditional forecasts for each event were omitted from the analysis. This was because subjects were required to give both conditional and negative conditional

Table 1. The time periods in which an event occurred at least once

Event     Occurrences (✓) across time periods A-E
I         none
II        ✓ ✓ ✓
III       ✓
IV        ✓ ✓ ✓
V         ✓ ✓ ✓

Time periods: A, February 1985; B, March 1985; C, April 1985; D, 1-15 March 1985; E, 16-31 March 1985.
Events: I, the pound falls below $1.00; II, there is an airline crash in which lives are lost; III, the temperature in Paris exceeds 15°C; IV, Manchester United Football Club loses a football match; V, a British soldier is killed by violence in Northern Ireland.


assessments [e.g. P(B|A) and P(B|Ā)] and, in reality, the conditioning event either happened or did not happen in the time period of the forecast. Table 1 gives a summary of the actual occurrence of the forecast events.

The effect of response mode

Our initial analysis was to measure an individual's consistency by finding the absolute discrepancy between each point probability forecast and the corresponding odds response (converted to a point probability). For each subject we computed a total discrepancy score over the 100 pairs of questions. The mean of these scores was 24.45 with a standard deviation of 7.13, showing that, on average, each subject on each question had a discrepancy of a 0.24 probability between uncertainty expressed as a point probability estimate as opposed to uncertainty expressed as an odds estimate.

A graphical representation of calibration is computed by plotting assessed probability against proportion correct for all statements assigned that probability. The mean probability within the six assessment categories, detailed in the method section, was taken as the assessed probability for a particular category. For odds responses, forecasts were converted to the probability metric and odds of 200 to 1 or greater were counted as a certainty response.
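The conversion and discrepancy scoring described above can be sketched as follows; the function names are ours, and, following the text, odds of 200 to 1 or greater for an event are treated as a certainty response:

```python
# Consistency sketch: convert odds to the probability metric and sum the
# absolute discrepancies against the paired point-probability responses.

def odds_to_probability(odds_for, odds_against):
    # Odds of "odds_for to odds_against" FOR an event give
    # p = odds_for / (odds_for + odds_against).
    if odds_against == 0 or odds_for / odds_against >= 200:
        return 1.0  # 200 to 1 or greater counts as certainty
    return odds_for / (odds_for + odds_against)

def total_discrepancy(point_probs, odds_pairs):
    # Summed absolute discrepancy over paired questions.
    return sum(abs(p - odds_to_probability(f, a))
               for p, (f, a) in zip(point_probs, odds_pairs))

# Odds of 3 to 2 against an event are 2 to 3 for it, i.e. p = 0.4,
# so a paired point estimate of 0.5 contributes a 0.1 discrepancy.
print(round(total_discrepancy([0.5], [(2, 3)]), 2))  # -> 0.1
```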

Figure 2 sets out the calibration curves for odds and probability responses. Each curve is based on 3,600 data points.

Both curves indicate overconfidence, and the odds forecasts of certainty show extreme overconfidence. Interestingly, a breakdown of all odds responses of 200 to 1 or greater revealed no systematic differences in forecasting performance characteristics. Table 2 reports overall measures of probability response and performance computed across all subjects and questions within the odds and probability divisions of our questionnaire.

The table reveals that, overall, the odds version of the questionnaire was only slightly more difficult than the point-probability version. However, calibration was better and overconfidence less for the point estimates. Taking Table 2 and Figure 2 together, it appears that the reason for this result is that about three times as many certainty responses were given to the odds

[Figure 2. Calibration curves for point probability and odds forecasts: subject's response (0.5 to 1.0) plotted against proportion correct. Solid curve, probabilities; broken curve, odds.]


Table 2. Overall group measures for judgemental forecasts expressed in probabilities and in odds

                               Point probability    Odds
                               estimates            estimates
Proportion correct             0.549                0.533
Calibration                    0.037                0.106
Over/underconfidence           0.169                0.282
Resolution                     0.005                0.003
Mean probability response      0.718                0.814
No. of 0.5 responses           464                  384
No. of 1.0 responses           183                  585

Table 3. Mean comparisons of the response and performance measures between the probability and odds forms of the questionnaire

                                 Point probability     Odds
                                 estimates             estimates
                                 x̄       s            x̄       s
Proportion correct               0.548   0.098         0.531   0.107    t = 1.275, NS
Calibration                      0.067   0.042         0.145   0.119    t = 4.92, p < 0.001, two-tailed
Over/underconfidence             0.169   0.094         0.282   0.157    t = -4.486, p < 0.001, one-tailed
Resolution                       0.02    0.014         0.019   0.015    t = 0.348, NS
Mean probability response        0.718   0.069         0.814   0.085    t = -6.115, p < 0.001, one-tailed
No. of 50 per cent responses     12.92   12.06         10.72   12.36    t = 1.035, NS
No. of 100 per cent responses    5.08    8.71          16.25   20.75    t = -2.998, p < 0.01, one-tailed
Subjective difficulty            3.49    0.934         3.33    0.874    t = 0.966, NS

version of the questionnaire and these certainty responses produced the overall poor calibration and overconfidence scores. Twenty-nine of our thirty-six subjects were better calibrated in point-probabilities than odds and twenty-eight were less overconfident in point-probabilities than odds.

We also computed the forecasting response and performance measures on an individual basis and made mean comparisons across our division of the questionnaire. However, since individual calibration measures are based on relatively few data points, this analysis should be treated with caution. Table 3 sets out these results, which tend to confirm the overall group analysis already reported. In addition, the comparisons reveal that odds are subjectively no harder to produce than point estimates, a result which suggests that our respondents would have no preference between response modes, given a free choice.


Table 4. Overall group response and performance measures for holistic and recomposed forecasts

                               Probability                           Odds
                               Simple    Compound   Recomposed      Simple    Compound   Recomposed
                               holistic  holistic   compound        holistic  holistic   compound
Calibration                    0.030     0.036      0.089           0.100     0.112      0.117
Over/underconfidence           0.171     0.168      0.195           0.275     0.288      0.298
Resolution                     0.005     0.005      0.005           0.003     0.004      0.006
Proportion correct             0.549     0.549      0.597           0.529     0.536      0.533
Mean probability response      0.720     0.716      0.792           0.804     0.823      0.832
No. of 0.5 responses           219       245        41              199       185        85
No. of 1.0 responses           85        98         450             260       315        632

Simple versus compound holistic assessments

Table 4 sets out our overall group measures of probability response and performance calculated over all questions and subjects within our odds/probability divisions of the questionnaire. Since individuals made only forty and forty-five responses for the simple and compound questions, respectively, we did not compute individual measures of forecasting performance analogous to those we calculated in the previous section of this paper. Table 4 also contains measures for the recomposed compound probabilities; this analysis is referred to in the next section of this paper.

It is evident from the table that there is very little difference in forecasting response and performance between simple and compound holistic probability assessments within the odds and probability division of our questionnaire. However, subjects did rate the simple probabilities and the simple odds as subjectively easier to assess than the compound forecasts (t = -2.76, n = 36, p < 0.05, two-tailed and t = -5.962, n = 36, p < 0.001, two-tailed, respectively).

Holistic versus Recomposed Compound Probabilities

Figure 3 compares the calibration curves of holistic and recomposed compound probabilities. Each curve is based on 1620 data points.

From the group calibration curves it appears that the recomposed probability forecasts are better calibrated and less overconfident for probabilities below 0.79, whereas the recomposed certainty assessments are extremely overconfident and show a proportion correct of less than 0.5.

Figure 4 compares the calibration curves of holistic and recomposed compound odds assessments. The two curves show much the same form over the whole range of forecasts.

Referring back to Table 4 we can see these results reflected in our mathematical measures of forecasting performance. It is interesting to note that the recomposition of the simple holistic assessments has reduced the number of 0.5 responses given to the compound events by 83% and 54% for the probability and odds responses, respectively. Conversely, the number of 1.0 responses has risen to roughly 460% and 200% of the holistic counts, respectively.
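These changes can be recovered from the Table 4 response counts (compound holistic versus recomposed compound); a quick check, with percentages rounded:

```python
# Changes in response counts after recomposition, from Table 4
# (compound holistic vs. recomposed compound).

def pct_change(before, after):
    return 100 * (after - before) / before

print(round(pct_change(245, 41)))   # 0.5 responses, probability: -83 (an 83% fall)
print(round(pct_change(185, 85)))   # 0.5 responses, odds: -54 (a 54% fall)
print(round(pct_change(98, 450)))   # 1.0 responses, probability: +359 (about 4.6x)
print(round(pct_change(315, 632)))  # 1.0 responses, odds: +101 (about 2.0x)
```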

Individual subject analyses for the point-probability forecasts reveal that, overall, the recomposed probabilities are no better calibrated, no less overconfident and no better resolved than their holistic equivalents.


[Figure 3. Calibration curves for holistic and recomposed compound probability forecasts: assessed probability (0.5 to 1.0) plotted against proportion correct. Dotted curve, holistic; solid curve, recomposed.]

[Figure 4. Calibration curves for holistic and recomposed compound odds forecasts: assessed probability (0.4 to 1.0) plotted against proportion correct. Dotted curve, holistic; solid curve, recomposed.]


Table 5. Intercorrelation matrix of consistency, coherence and forecasting performance

                                          C        PC       OC       PCAL     POV      PRES     OCAL     OOV
Probability coherence (PC)                0.168
Odds coherence (OC)                       0.157    0.69**
Probability calibration (PCAL)            0.217   -0.195   -0.212
Probability over/underconfidence (POV)    0.248   -0.245   -0.332*   0.752**
Probability resolution (PRES)            -0.257   -0.276   -0.211    0.206    0.011
Odds calibration (OCAL)                   0.657**  0.000    0.062    0.559**  0.264   -0.055
Odds over/underconfidence (OOV)           0.672** -0.126   -0.105    0.490**  0.366*  -0.165   0.880**
Odds resolution (ORES)                   -0.217    0.092    0.103    0.246   -0.388*   0.111   -0.251

* p < 0.05; ** p < 0.01

The Relationship between Consistency, Coherence and Forecasting Performance

This analysis involved computing an intercorrelation matrix between individuals' consistency using point-probability and odds response modes, measures of coherence, and the forecasting performance measures for both odds and point-probability assessments.

Two measures of coherence were computed, one for the point-probability responses and one for the odds responses. Each measure was simply the summed total of the discrepancies obtained when an individual's holistic compound assessments were subtracted from their recomposed compound pairings. The mean of individuals' probability coherence measure was 3.54 with a standard deviation of 4.32, whilst the mean of the odds coherence measure was 5.87 with a standard deviation of 7.81.
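Assuming the discrepancies are taken as absolute differences (the text does not state a sign convention), the coherence measure could be computed as follows; the `recompose` helper and both function names are our own illustrative choices:

```python
def recompose(p_a, p_b_given_a):
    """Recomposed compound forecast from a marginal and a
    conditional assessment: P(A and B) = P(A) * P(B|A)."""
    return p_a * p_b_given_a

def coherence_score(holistic, recomposed):
    """Summed discrepancy between each holistic compound assessment
    and its recomposed counterpart. Zero indicates perfect coherence;
    larger values indicate greater incoherence."""
    return sum(abs(h - r) for h, r in zip(holistic, recomposed))
```

A subject who assesses a compound event holistically at 0.7 but gives a marginal of 0.8 and a conditional of 0.7 contributes |0.7 - 0.56| = 0.14 to the score.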

Table 5 sets out the intercorrelation matrix. The pattern of intercorrelations is not as we anticipated. Our measure of consistency is not significantly related to coherence within the odds or probability response modes. However, the obtained correlations with odds calibration and odds over/underconfidence suggest that people who are more consistent are also better calibrated and less overconfident with odds responses.



DISCUSSION AND CONCLUSION


Our results indicate that consistency, coherence and calibration are not closely related. Knowing the extent of an individual’s consistency and coherence will not help us in predicting his or her degree of calibration, overconfidence or resolution with point-probability forecasts. However, people who are more consistent between point probability and odds forecasts tend to show better calibration and less overconfidence with their odds responses. This result, taken together with our finding that twenty-nine of our thirty-six subjects were better calibrated in point probabilities than in odds and that our subjects were, in general, more coherent using point probabilities than odds forecasts, suggests that untrained probability assessors should use point probabilities rather than odds in making judgemental forecasts of the likelihood of future events. This result is significant since our subjects found it no more difficult to answer our questions in the odds format compared to the point probability format.

In line with our predictions, subjects used three times as many certainty responses with the odds version of the questionnaire and these responses were abysmally calibrated. This result extends the findings of DuCharme (1969), Phillips and Edwards (1966) and Wheeler and Edwards (1975), which we discussed earlier, to studies of calibration.

Contrary to our expectations, we found very little difference in forecasting response and performance between simple and compound holistic forecasts within the probability and odds divisions of our questionnaire. This result is evidence against the validity of the 'divide-and-conquer' rationale underlying most applications of normative decision theory. Indeed, our recompositions of marginal and conditional assessments into compound forecasts were no better calibrated or resolved than their holistic counterparts.

Why should this be so? Consider the abstract example of assessing the probability of drawing two consecutive aces from a pack of fifty-two cards, which we detailed in the introduction to this paper. Intuitively it would seem that most people could, with a little thought, accurately assess the probability of drawing an ace on the first draw and the probability of a subsequent second ace. It also seems intuitively reasonable that most people would not easily make an accurate computation of the probability of the intersection of the two events. This intuitive analysis of human information processing capabilities is in contradiction to the major thrust of the present paper. However, such counter-examples are based on the subjective assessment of probabilities where the marginal, conditional, and compound probabilities are, in principle, calculable by the probability assessor but the compound probabilities require more calculation. Arguably, in such instances the recomposed forecasts for such compound events will more accurately reflect the 'true' probabilities than direct holistic assessments of these events. Such a paradigm matches that of Bar-Hillel (1973) which we described in the introduction to this paper. In real-world probability forecasting there is, of course, no precondition that marginal and conditional assessments will be 'easier', and so more accurate, than compound assessments.
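The arithmetic of the card example is straightforward once decomposed; the sketch below simply multiplies the marginal by the conditional:

```python
# P(two consecutive aces, drawing without replacement from 52 cards),
# assessed by decomposition and recomposition.
p_first_ace = 4 / 52           # marginal: 4 aces among 52 cards
p_second_given_first = 3 / 51  # conditional: 3 aces left among 51 cards
p_both = p_first_ace * p_second_given_first  # intersection = 1/221
```

The marginal and conditional are each easy to state, yet the compound value (roughly 0.0045) is one few assessors would produce accurately by holistic judgement.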

From a policy point of view, should a decision analyst who is eliciting probability assessments from a forecaster work on the forecaster's simple or compound intuitive forecasts? Our results cannot speak to this question. It may be the case that apparently more complex judgements are, in practice, when referring to knowledge of particular events, easier to make.

An experiment reported by Tversky and Kahneman (1983) can be used to illustrate this point. Their study, which demonstrated probability incoherence in professional statisticians, required subjects to rate the probabilities of various events, some of which were inclusive of others. Thus, the probability that next year there will be an earthquake in San Francisco causing a dam burst resulting in a flood in which at least 1,000 people drown must be less than the probability that there will, next year, be a flood somewhere in the USA in which at least 1,000 people drown. However, the subjects typically rated the logically less likely event as more likely than the more general event set of which it is a member. Note that, in this instance, the former event contains a plausible causal element for the flood (i.e. an earthquake in San Francisco) which, on the face of it, might not have occurred to the subject as a salient consideration when judging the latter event.

So although in terms of probability theory one event represents a higher-order (more complex) conditional than the other, in terms of the forecaster's understanding of the causality underlying the relative likelihoods, the former may be easier to conceive than the latter, and so easier to judge the likelihood of. However, it remains to be determined whether the probability given to the more specific event should be decreased, the probability of the less specific event increased, or whether both should be adjusted.
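The coherence constraint at issue is easily stated in code; the probability values below are hypothetical ratings of the kind subjects might give, not data from the experiment:

```python
def conjunction_check(p_specific, p_general):
    """A compound (more specific) event can never coherently be rated
    more probable than a general event that contains it."""
    return p_specific <= p_general

# A hypothetical incoherent pattern: earthquake-caused flood rated
# above flood-in-general.
incoherent = not conjunction_check(0.30, 0.25)
```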

Clearly, one of these options must be followed in any given case because, as we indicated earlier, incoherent probabilities cannot represent valid judgements. In the FORECAST program (Wright, Ayton and Whalley, 1985) probability assessments of elementary marginal and conditional probabilities are used to define recomposed probabilities of intersections, unions and disjunctions of events. Holistic forecasts of these events are also elicited by the program but incoherence is interactively resolved; the program does not impose the recomposed forecasts on users.
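As an illustration only (this is our reconstruction of the recomposition arithmetic, not the FORECAST program's actual code), the compound events can be defined from elementary assessments as follows:

```python
def recompose_events(p_a, p_b, p_b_given_a):
    """Recomposed compound probabilities from a pair of marginal
    assessments and a first-order conditional assessment."""
    p_and = p_a * p_b_given_a      # intersection: P(A)P(B|A)
    p_or = p_a + p_b - p_and       # union, by inclusion-exclusion
    p_xor = p_a + p_b - 2 * p_and  # exclusive disjunction
    return {"A and B": p_and, "A or B": p_or, "A xor B": p_xor}
```

A holistic forecast of, say, the union can then be compared against the recomposed value, and any discrepancy fed back to the forecaster for interactive resolution.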

What are the implications of the findings of the present study? They are quite clear cut. First, untrained judgemental forecasters should use point probabilities in preference to odds. Second, judgemental forecasts of complex compound probabilities are as well assessed holistically as they are using the methods of decomposition and recomposition utilized in techniques like cross-impact analysis. As we noted earlier, applications of decision analysis and Bayes' theorem are also predicated on a decompose-then-recompose rationale. How has the validity of this rationale been tested with these other judgement-aiding technologies? The usual indirect approach to questions of validity has been an appeal to the general acceptability of the axiom base of the theories (see Wright, 1984). However, the axiom base of subjective expected utility theory, which underlies applications of decision analysis, is not without its critics (see Slovic and Tversky, 1974). The axiom base of Bayes' theorem is less controversial, but the studies which have shown discrepancies between Bayes' theorem posterior opinion and holistically assessed posteriors have been criticised by Winkler and Murphy (1973) on the grounds that people apply their complex, real-world information processing strategies to overly simple laboratory tasks in inappropriate ways, resulting in 'suboptimality' against the experimenter's criterion. With real-world opinion revision it is usually impossible to check the veracity of prior opinion, likelihood and posterior opinion against a suitable agreed-upon criterion, as it is in laboratory investigations of opinion revision.

As we noted earlier, in cross-impact analysis forecasters assess pairwise conditional probabilities which define complex scenarios or intersections of events. However, as we might anticipate from the results of our present study, the 'raw' conditional probabilities elicited often do not conform to the laws of probability. Incoherence may be extensive. Kirkwood and Pollock (1982) utilized marginal and conditional probabilities in a cross-impact approach and had to omit many judgements altogether in order to resolve incoherence with the laws of probability. In fact, for two of their three experts all conditional probability estimates had to be discarded.

We (Wright and Ayton, 1987b) suggest that a danger of such mathematical manipulation to remove incoherence within a single expert or inconsistency between two or more experts is that the resulting judgements may not conform to the expert's, or even anyone's, understanding of the causal determinacy of the forecast scenario.

McLean (1976) has also criticised the cross-impact approach in that: 'by concentrating attention on the manipulation and refinement of probability estimates, it lost sight of the fact that such estimates can only at best be temporary substitutes for an understanding of the causal structure of socioeconomic processes' (p. 349).

Sarin (1979) proposed a procedure for use in cross-impact analysis that is less of a mathematical manipulation of the expert’s forecasts and allows a limited interaction with the forecaster. Here the expert assesses marginal and first-order conditional probabilities and from these the pairwise intersection probabilities are determined allowing boundaries to be computed for higher order intersections that fully describe the scenarios under consideration.
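The boundary computation can be illustrated with the classical Fréchet bounds, which constrain an intersection probability given only its marginals. This is a simplified sketch of the idea; Sarin's own procedure additionally exploits the assessed first-order conditionals:

```python
def intersection_bounds(p_a, p_b):
    """Fréchet-style bounds on P(A and B) given only the marginals
    P(A) and P(B). Any coherent intersection probability must lie
    within [lower, upper]."""
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    return lower, upper
```

For marginals of 0.7 and 0.6, for example, any coherent intersection probability must lie between 0.3 and 0.6, and analogous bounds can be propagated to higher-order intersections describing full scenarios.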

De Kluyver and Moskowitz (1984) have also produced some improvements to this general approach with what they refer to as interactive goal programming. This explicitly emphasises the importance of allowing forecasters to revise and adjust their judgements so that the final probabilities conform to the probability axioms and their beliefs about the interdependence of the events making up the scenario. Finding the most suitable way of presenting incoherence feedback may well be the key to optimising the judgements resulting from an interactive approach.

From our point of view, De Kluyver and Moskowitz’s approach should be a more valid approach than the earlier techniques of cross-impact analysis that we have just described. This is because it places less implicit emphasis on the relative validity of simple intuitive assessments over intuitive compound assessments. Nevertheless, we know of no studies that have tested the forecasting performance of any of the cross-impact techniques, including De Kluyver and Moskowitz’s more interactive approach. Until such specific studies are conducted, probability forecasts for scenarios provided by cross-impact techniques must be viewed with caution.

Finally, we must return to our finding that, overall, our subjects performed poorly on our forecasting tasks. It must be noted that our five main forecasting topics were focussed on events over which our forecasters would, perhaps, perceive that they had little control and would, furthermore, be relatively indifferent about the possible outcomes. Ecologically, it may be that such events are seldom considered by forecasters (cf. Wright and Ayton, in press b). In addition, we asked untrained subjects to make judgements for events about which they did not profess expertise. We also did not give our subjects an opportunity to resolve inconsistency and incoherence.

From these comments it is clear that several issues remain unresolved and remain to be investigated. What we hope that we have established in this paper is a paradigm for more detailed and realistic studies of judgemental probability assessment in forecasting.

NOTES

1. Recall that all odds responses of 200 to 1 or greater were coded as certainty responses. It is, of course, impossible to express certainty in finite odds.

REFERENCES

Ayton, P. and Wright, G. 'Thinking with probabilities'. Teaching Statistics, 7 (1985), 37-40.

Barclay, S. and Beach, L. R. 'Combinatorial properties of personal probabilities'. Organizational Behavior and Human Performance, 8 (1972), 176-183.

Bar-Hillel, M. 'On the subjective probability of compound events'. Organizational Behavior and Human Performance, 9 (1973), 396-406.



De Kluyver, C. A. and Moskowitz, H. 'Assessing scenario probabilities via interactive goal programming'. Management Science, 30 (1984), 273-278.

DuCharme, W. M. 'A review and analysis of the phenomenon of conservatism in human inference'. Technical Report No. 46-5, Inter-disciplinary program in applied mathematics, Houston, Texas, 1969.

Kirkwood, C. W. and Pollock, S. M. 'Multiple attribute scenarios, bounded probabilities, and threats of nuclear theft'. Futures, 14 (1982), 545-553.

Lichtenstein, S. and Fischhoff, B. 'Training for calibration'. Organizational Behavior and Human Performance, 26 (1980), 149-171.

Lichtenstein, S., Fischhoff, B. and Phillips, L. D. Calibration of subjective probabilities: The state of the art to 1980. In D. Kahneman, P. Slovic, and A. Tversky (eds). Judgment under Uncertainty: Heuristics and Biases. New York: Cambridge University Press, 1981.

McLean, M. 'Does cross-impact analysis have a future?' Futures, 8 (1976), 345-349.

Murphy, A. H. 'A new vector partition of the probability score'. Journal of Applied Meteorology, 12 (1973), 595-600.

Phillips, L. D. and Edwards, W. 'Conservatism in a simple probability inference task'. Journal of Experimental Psychology, 72 (1966), 346-354.

Sarin, R. K. 'An approach to long-term forecasting with an application to solar energy'. Management Science, 25 (1979), 543-554.

Slovic, P. and Tversky, A. 'Who accepts Savage's axiom?' Behavioral Science, 19 (1974), 368-373.

Tversky, A. and Kahneman, D. 'Extensional versus intuitive reasoning: The conjunction fallacy in probability judgement'. Psychological Review, 90 (1983), 293-314.

Wallsten, T. S. and Budescu, D. V. 'Encoding subjective probabilities: A psychological and psychometric review'. Management Science, 29 (1983), 151-173.

Wheeler, G. E. and Edwards, W. 'Misaggregation explains conservative inference about normally distributed populations'. SSRI Research Report 75-11, University of Southern California, Social Science Research Institute, Los Angeles, California, 1975.

Winkler, R. L. and Murphy, A. H. 'Experiments in the laboratory and the real-world'. Organizational Behavior and Human Performance, 10 (1973), 252-270.

Wright, G. Behavioral Decision Theory. Beverly Hills: Sage, 1984.

Wright, G. and Ayton, P. 'Subjective confidence in forecasts: A response to Fischhoff and McGregor'. Journal of Forecasting, 5 (1986), 117-123.

Wright, G. and Ayton, P. 'Task influences on judgmental forecasting'. Scandinavian Journal of Psychology, 28 (1987a), 115-127.

Wright, G. and Ayton, P. 'The psychology of forecasting'. In Wright, G. and Ayton, P. (eds), Judgmental Forecasting. Chichester: Wiley, 1987b.

Wright, G. and Ayton, P. 'Immediate and short-term judgmental forecasting: Situationism, personologism or interactionism'. Personality and Individual Differences, in press a.

Wright, G. and Ayton, P. 'Judgmental probability forecasts for personal and impersonal events'. International Journal of Forecasting, in press b.

Wright, G., Ayton, P. and Whalley, P. 'A general purpose computer aid to judgmental forecasting: Rationale and procedures'. Decision Support Systems, 1 (1985), 333-340.

Wright, G. and Whalley, P. 'The supra-additivity of subjective probability'. In B. P. Stigum and F. Wenstop (eds), Foundations of Risk and Utility Theory with Applications. Dordrecht: Reidel, 1983.

Authors' biographies:

George Wright is reader in business policy at Bristol Business School. He has published widely on the human aspects of decision-making and forecasting. His books include 'Behavioural Decision Theory', Harmondsworth, Penguin, 1984; 'Behavioural Decision Making', New York, Plenum, 1985; and, with Peter Ayton, 'Judgmental Forecasting', Chichester, Wiley, 1987.

Carol Saunders was an undergraduate student in psychology at City of London Polytechnic at the time this study was undertaken.

Peter Ayton is a senior lecturer in psychology at the City of London Polytechnic where he is primarily occupied in the teaching of a new MSc course in decision-making. His current research activities include the study of intuitive statistical concepts, memory, metaphorical comprehension and the information processing strategies of television viewers.