

ARTIFICIAL INTELLIGENCE EXPERT SYSTEMS FOR CLINICAL DIAGNOSIS: ARE THEY WORTH THE EFFORT?

by Barbara Carroll

Carleton University, Ottawa, Ontario, Canada

The modeling of the decision-making processes of human experts has been studied by scientists who call themselves psychologists and by scientists who say they are students of artificial intelligence (AI). The psychological research literature suggests that experts' decision-making processes can be adequately captured by simple mathematical models. On the other hand, those in AI who are preoccupied with human expertise maintain that complex computer models, in the form of expert systems, are required to do justice to those same processes. The resultant paradox of simple versus complex decision-making models is investigated here. The relevant literatures in psychology and AI are reviewed and, based on these findings, a resolution of the paradox is offered.

KEY WORDS: organism, decider subsystem; expert system, artificial intelligence, clinical diagnosis.


BOTH PSYCHOLOGISTS and AI specialists have been interested in the ability of human experts to make good, accurate diagnostic decisions. Both have attempted to model the decision-making processes of experts in a number of fields. And both have used computers in the modeling process. However, despite similar aims, psychologists and those working in mainstream AI have approached the topic of human expertise very differently.

Within psychology, researchers have pursued three major themes. First, they have examined the accuracy of experts' diagnostic decisions. Second, they have focused on the judgment process itself and have been interested in the degree to which a particular regression equation can be used to predict an expert's decisions. And, third, they have considered the possibility of replacing experts by linear regression models and have investigated the decision-making contexts in which these models lead to an increase in predictive accuracy (cf. Dawes & Corrigan, 1974; Hoffman, Slovic & Rorer, 1968).

The author wishes to thank Warren B. Thorngate and Kellogg Wilson for their valuable comments on earlier drafts.


In contrast to the approach taken by psychological researchers, those in AI who are interested in human expertise do not attempt to assess the accuracy of experts' diagnostic decisions. Instead, they assume that human experts are the ideal and that, on the basis of their knowledge, they are able to achieve an outstanding level of decision-making performance (Hayes-Roth, Waterman & Lenat, 1983). Thus, AI workers view human expertise as a highly valuable and scarce resource. This belief provides them with the rationale for expending the large amounts of time, effort, and money necessary to tap as much of the complexity of experts' knowledge and skill as possible, and to model it as an expert system.

When the products of the work on human expertise conducted in psychology and AI are compared, an intriguing paradox emerges. While psychologists are content to model experts' decision-making processes by simple linear equations, those in AI favor the use of intricate and highly sophisticated expert systems. This paradox of simple versus complex decision-making models needs some form of resolution. Expert systems are extremely expensive. They require at least a five-year commitment to develop, and they represent an investment of several million dollars, even before they reach the stage of being tested prior to commercial release (Davis, 1982, 1984; Pople, 1984).


Given this, expert systems can be justified only if they do a significantly better diagnostic job than the simple, non-interactive regression equations proposed by psychologists.

The critical question of how well expert systems do when pitted against the regression equations that model experts' decision-making processes has not as yet received research attention. In fact, at the moment a fair empirical comparison is hardly possible. In 1984 few of the major expert systems were in commercial use. Many were not even at the stage of extensive testing, and most still required work on developing and debugging their knowledge bases (Davis, 1984). Two years later this situation was largely unchanged, at least for the medical expert systems discussed in this paper (Jackson, 1986).

Although the simple versus complex paradox cannot be resolved empirically, at least not at the present time, it can be addressed conceptually, and this is the task undertaken here. The cases for each side will be outlined in some detail and then integrated in an attempted resolution of the paradox.

THE PSYCHOLOGISTS’ CASE FOR SIMPLE MODELS

In their research, psychologists have adopted a strategy reflective of the statistically based paradigm in the discipline. Thus, they analyze judgment tasks into a relatively small set of codable information sources, or cues, that are considered necessary to form an adequate judgment. These cues may be objective, as in the case of laboratory test results, academic examination grades, or scores on the scales of a personality inventory. Or they may be subjective, requiring interpretation by the expert, as in the reading of an X-ray film or the estimation of the severity of disease in a patient. In either case, psychologists have investigated the accuracy of diagnostic decisions based on integration of these cues.

The accuracy of clinical judgment

Although psychologists have investigated the diagnostic accuracy of a wide variety of experts, they have been particularly interested in expert clinicians, including both psychiatric and medical specialists. Perhaps the most striking finding of the studies that have examined the accuracy of clinicians' judgments is that, for a variety of judgment tasks, clinicians do not seem to be highly accurate at all. While there is no doubt that some of them are excellent diagnosticians, Goldberg (1959) found that when diagnosing brain damage on the basis of Bender-Gestalt test results, experienced clinicians made the correct diagnosis only about 65% of the time. The result is robust. It holds for judgments based on the Minnesota Multiphasic Personality Inventory (MMPI), where the input information consists of scores on 11 personality scales considered to be relevant for the diagnosis of psychiatric pathology (Meehl, 1954). It was replicated by Hoffman, Slovic and Rorer (1968), who argued that the judgment dimensions provided by assessment devices such as the MMPI may not, in fact, be those which clinicians naturally employ in the judgment task, and who had radiologists themselves generate the dimensions that they considered to be important in the diagnosis of malignant ulcers. And, finally, it holds when the information on which judgments are based is not quantified but, instead, presented in the form of a verbal description (Oskamp, 1965).

Given that clinicians' judgment and decision-making ability seems to leave a lot to be desired, psychologists have sought ways of improving their performance. Perhaps clinicians simply do not have enough information to do a good job. However, it seems, at least in the case of Hoffman, Slovic and Rorer's (1968) radiologists, that even when there is sufficient information available, only a fraction of it is actually used in forming judgments (see also Slovic, 1969; Ebbesen & Konecni, 1976). Phelps and Shanteau (1978) have demonstrated that experts are capable of using a larger fraction but that, in reality, relationships between sources of information render a number of these sources redundant. Thus, attention to a greater information pool is not necessary to make better judgments. In addition, Connolly (1980) has pointed out that cognitive limits will establish a firm upper bound on the number of information sources that can be attended to at once.



However, overriding all this is the suggestion that increasing the amount of information on which judgments are based does not, in fact, translate into more accurate judgments. It merely implies greater confidence in mediocre judgments (Oskamp, 1965).

Related to this last point is the notion that clinicians are not more accurate in their judgments because they are not aware of their fallibility and therefore are not motivated to do anything about it. This question has been investigated by Goldberg (1959) and Lichtenstein and Fischhoff (1976). Their results indicate that expert clinicians are overconfident in their assessment of their ability, but not to as great an extent as those with less experience and knowledge. Hence, they tend to be more aware of their limitations than their semi- or nonprofessional counterparts. They tend to underrate their ability in simple judgment tasks and overestimate their performance as the tasks become more difficult. In this latter situation, their unjustified faith in the accuracy of their judgments extends equally to both diagnosed and misdiagnosed cases.

Based on the premise that clinicians tend to place greater stock in their judgment ability than is warranted, Einhorn and Hogarth (1978) have investigated some of the reasons why this might be so. They argue that the fault lies in the nature of judgment tasks in the real world. In the real world, judgments are formed so that decisions can be made and then acted upon. Under these conditions, individuals assess the quality of their judgments by observing the consequences of the actions resulting from those judgments. The more frequently these consequences are in accord with their judgments, the better they perceive the judgments to be. Yet there are biases in this type of feedback information. Specifically, it is easy to find confirming instances of a correct decision but typically much more difficult to find disconfirming evidence. A patient who is diagnosed as healthy when, in fact, he or she has some serious pathology is not liable to return and so prove the physician wrong.

Einhorn (1978) and Einhorn and Hogarth (1978) have stressed that this bias is exaggerated by such factors as a tendency to ignore base rate and selection rate information, to overlook the possibility that feedback is contaminated by treatment effects, and to be highly susceptible to partial positive reinforcement. In addition, Connolly (1980) has pointed out that people have a tendency to rationalize what are, at the time, incorrect decisions, so that in retrospect they are perceived to be correct.

Together these factors effectively inflate individuals' perceptions of their judgmental accuracy; they become overconfident. This effect may be particularly pervasive for clinicians and similar groups whose judgment tasks are probabilistic in nature. When, for example, a patient has certain symptoms and blood test results, he or she may have hepatitis with probability .7. In such situations, Einhorn (1980) has stressed that it is genuinely difficult to assess the accuracy of judgments, precisely because clinicians expect to be wrong a portion of the time. Therefore, for them to revise their evaluation of their ability, they must somehow be able to judge whether they have, in fact, been wrong too often. And this is difficult when they are not aware of how the probability estimates are derived, and when they make particular diagnoses infrequently and without immediate feedback.

The message in the above arguments is clear: provide clinicians with more accurate feedback about the quality of their diagnostic decisions. Then, on the basis of this information, they can learn from their mistakes to form better clinical inferences. Of particular interest here is the suggestion by Fryback and Erdman (1979) and Kleinmuntz (1982), among others, that computers can be useful in this regard. They enable assessment and follow-up information on a large number of patients to be stored, and can therefore provide clinicians with accurate statistics, such as the relative frequencies of various diseases.

Computers do, however, have their limits. Most important, they can handle only certain types of information. Therefore, while they help to provide clinicians with accurate feedback, they do not solve the problem of judgment inaccuracy. For the fact remains that accurate feedback cannot always be obtained.


For example, a doctor may diagnose a patient's illness and prescribe the appropriate medication. If the patient fails to improve, the clinician's diagnosis is not necessarily wrong: it may simply be that the patient is resistant to treatment and a different combination of medications is required. Similarly, if a university admissions committee accepts a student who later withdraws, the selection decision is not automatically invalidated. Some personal tragedy may have played a major role in the student's decision and may even have lowered that student's grades prior to quitting.

In summary, the psychological research on clinical judgment suggests that experts are not overwhelmingly accurate in their diagnostic judgments. There is no evidence that providing them with more information will rectify the situation. And, furthermore, because accurate feedback on the effectiveness of experts' decisions is unavailable in many contexts, they are unlikely to learn to improve their diagnostic ability in those contexts.

Faced with these results, many psychologists have tried to improve the decisions experts make by identifying and then attempting to eliminate sources of judgmental error. Thus, psychologists have focused on how experts integrate the sources of information available to them when they form the judgments that underlie their diagnostic decisions, and have sought mathematical models of this integration process which outperform the experts on whom they are based.

Appropriate models of experts' decision-making processes

According to Meehl (1954), in any decision-making context experts can integrate information in two possible ways. Judgments may be nonmechanical and reliant on experience, intuition, and hunches. Alternatively, cue combination may be mechanical. In this case, based on the lens model proposed by Brunswik (1952) and Hammond (1955) and his associates (Hursch, Hammond & Hursch, 1964), psychologists have assumed clinical judgment can be represented by a regression equation.

Given that experts may integrate information clinically or statistically, psychologists have questioned whether one method is superior to the other in terms of its predictive accuracy. When clinical judgment is pitted against actuarial methods, the results are unambiguous: statistical prediction is at least as good as clinical prediction (Dawes & Corrigan, 1974; Dawes, 1976; Meehl, 1965). Although the value of clinical intuition is recognized when, for example, input information involves feelings that are non-codable, the demonstrated superiority of actuarial methods for codable cues has meant that intuitive factors have received little attention from most psychologists working in the area of clinical judgment (cf. Dawes & Corrigan, 1974; Meehl, 1973).

In 1960, Hoffman went beyond the idea that a regression equation could be used to predict clinical or selection outcomes. He suggested that a simple linear regression model could be used as a "paramorphic representation," or approximation, of an expert's underlying decision-making processes. Then, according to the assumptions of this model, the regression Beta-weights indicate the emphasis judges place on individual cues.
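
Stated in present-day notation, the paramorphic representation amounts to an ordinary linear regression of the expert's judgments on the cue values. The display below is a restatement offered here for convenience, not a formula reproduced from Hoffman (1960):

    \hat{J} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k

where x_1, ..., x_k are the coded cue values for a case, \hat{J} is the model's prediction of the expert's judgment of that case, and the fitted weights b_i are read as the relative emphasis the judge places on each cue.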

A number of psychologists have argued that this additive model is too simple and that clinicians process information configurally rather than independently, as assumed by the model. However, using ANOVA techniques, researchers concerned with this issue have generally failed to find evidence of significant configural processing, even when the judgment task has been specifically designed to promote it (Hoffman, Slovic & Rorer, 1968). Thus, it is generally agreed that a wide variety of judgment tasks can be adequately represented by a simple linear model (Hammond, Hursch & Todd, 1964; Slovic, Fischhoff & Lichtenstein, 1977). Anderson (1972) goes as far as equating this linear model with clinicians' actual judgment processes. However, the majority opinion supports Hoffman's (1960) notion that a linear model is simply an approximation of those underlying psychological processes (see Goldberg, 1968; Meehl, 1954; Rorer & Slovic, 1966).


Goldberg (1968), in particular, has argued that clinicians may, in fact, process information in a configural manner but seem not to only because a linear model is robust enough to reproduce most of the resultant judgments with very little error.

Given that experts' decision-making processes can be represented by a simple linear model, researchers have found that, in many contexts, these paramorphic models have greater predictive accuracy than the experts themselves (see Bowman, 1963; Goldberg, 1970; Yntema & Torgerson, 1961). This led Goldberg (1970) to propose a technique known as bootstrapping, wherein clinicians are replaced by the regression equation that models what they do.

Dawes (1971, 1976, 1977, 1979), based on his work on university graduate admissions selection procedures, has been particularly interested in why bootstrapping works and in what decision-making contexts it is liable to be of greatest utility. He, and others, argue that bootstrapping works simply because a regression equation is an abstraction of the process it models and therefore captures the principles experts use without their unreliability (Bowman, 1963; Dawes, 1971). Thus, it eliminates sources of judgment error that arise when experts are tired, bored, or preoccupied with personal problems (Goldberg, 1970).

Dawes (1971) also argues that bootstrapping is most valuable in decision-making contexts where there is error of measurement in the predictors, where criteria are "noisy," or have a degree of error associated with them, and where deviations from optimal weighting do not make much of a practical difference. In these situations, as long as there is a conditionally monotone relationship between each input variable and the criterion variable, experts' linear paramorphic models have substantially greater predictive accuracy than the experts themselves (Dawes & Corrigan, 1974).

The types of decision-making contexts just described are common in psychology. They occur in areas such as psychiatric and medical diagnosis and apply to problems such as selecting applicants to graduate school (Dawes, 1971).

Furthermore, Dawes and Corrigan (1974) have demonstrated that in such contexts it is not necessary to approximate clinicians by their regression models. It is sufficient to replace them by any linear model. As long as the signs of the Beta-weights are correct, and the monotonicity between predictors and criterion is preserved, their magnitudes may be random. Or, more useful in a population that is changing over time, unit weights can be used without loss of accuracy (Dawes & Corrigan, 1974; Einhorn & Hogarth, 1975).
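
Dawes and Corrigan's point about unit weights is easy to reproduce with simulated data. The sketch below is purely illustrative: the use of Python and NumPy, the number of cues, and the amount of criterion noise are assumptions of this example rather than details from the original studies. It fits least-squares weights on one set of cases and compares them with sign-only unit weights on new cases.

    import numpy as np

    rng = np.random.default_rng(0)
    n_train, n_test, k = 200, 200, 5
    true_w = np.array([0.5, 0.4, 0.3, 0.2, 0.1])   # hypothetical cue-criterion weights

    def make_cases(n):
        cues = rng.standard_normal((n, k))
        criterion = cues @ true_w + 2.0 * rng.standard_normal(n)   # "noisy" criterion
        return cues, criterion

    X_tr, y_tr = make_cases(n_train)
    X_te, y_te = make_cases(n_test)

    # Bootstrapping model: least-squares weights fitted to past cases.
    beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n_train), X_tr]), y_tr, rcond=None)
    ols_pred = np.column_stack([np.ones(n_test), X_te]) @ beta

    # Dawes & Corrigan's alternative: unit weights carrying only the fitted signs.
    unit_pred = X_te @ np.sign(beta[1:])

    def validity(pred, crit):
        return np.corrcoef(pred, crit)[0, 1]

    print("Fitted-weight validity:", round(validity(ols_pred, y_te), 3))
    print("Unit-weight validity  :", round(validity(unit_pred, y_te), 3))

With criterion noise of this size the two validities typically come out close to one another, which is the pattern the bootstrapping literature reports for real "noisy" judgment data.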

These types of bootstrapping models, using non-optimal (in the least-squares sense) methods of cue combination, are not only valid for predicting numerical criteria. They are also valid, in contexts such as those just described, when there is no measurable outcome variable and the judgment task is one of establishing preferences (Dawes, 1977).

The implications of clinical judgment research

The results of the research on clinical judgment suggest that experts are not extremely accurate in their diagnostic decisions and that, in many of the decision-making contexts studied by psychologists, bootstrapping, using extremely simple linear models, can provide diagnoses that are both more reliable and more accurate. The major implication of these findings is that, at least in "noisy" decision-making contexts, clinicians can and should be replaced by these linear models so that diagnosis is strictly a mechanical matter (Dawes, 1976; Meehl, 1954). Given this, the value of computers in the diagnostic domain is obvious. They provide quick and reliable diagnoses. But, more important, when Beta rather than unit weights are used, sophisticated computer systems can derive them on the basis of large data bases of patient information. This procedure does not, of course, eliminate the fallibility of that subset of the data base that relies on clinicians' judgments (e.g., X-ray diagnoses). However, it does have two advantages.

First, the data base can be easily updated to include case history and diagnostic information on new patients as they are treated. Therefore, the validity of the regression coefficients can be checked from time to time and adjusted to reflect changes when necessary (Kleinmuntz, 1982).


Second, computers can be programmed to determine the relative frequency of various diseases. Thus, they can calculate the probability of each of a patient's potential diagnoses. Furthermore, they can perform this task accurately, thereby eliminating the biases to which clinicians are subject when they attempt the same calculations (see Lichtenstein & Fischhoff, 1976; Fryback & Erdman, 1979; Einhorn, 1980).
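
As a minimal sketch of the bookkeeping being described, the fragment below tallies relative disease frequencies from stored cases and combines them with per-finding frequencies, in a naive-Bayes fashion, to rank a new patient's candidate diagnoses. The diseases, findings, and counts are invented for illustration and are not taken from any of the systems discussed in this paper.

    from collections import Counter

    # Invented case records: (diagnosis, set of observed findings).
    cases = [
        ("hepatitis", {"jaundice", "fatigue"}),
        ("hepatitis", {"jaundice", "nausea"}),
        ("gallstones", {"abdominal pain", "nausea"}),
        ("gallstones", {"jaundice", "abdominal pain"}),
        ("hepatitis", {"fatigue", "nausea"}),
    ]

    # Relative frequencies (base rates) computed from the stored cases.
    base_rates = Counter(dx for dx, _ in cases)
    total = sum(base_rates.values())

    def posterior(findings):
        """Score each diagnosis by base rate times smoothed per-finding
        likelihoods, then renormalize over the candidates."""
        scores = {}
        for dx, n_dx in base_rates.items():
            p = n_dx / total
            for f in findings:
                n_f = sum(1 for d, fs in cases if d == dx and f in fs)
                p *= (n_f + 1) / (n_dx + 2)          # smoothed P(finding | dx)
            scores[dx] = p
        z = sum(scores.values())
        return {dx: round(s / z, 3) for dx, s in scores.items()}

    print(posterior({"jaundice", "nausea"}))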

From a psychological perspective, then, simple linear regression models and computer technology seem to hold the key to better diagnostic decisions. There is one problem, however. Psychological researchers and, in fact, people in general are simply not prepared to have clinicians' diagnostic role replaced by a machine (Kleinmuntz, 1982). Clinical intuition and experience seem to count for something. Consequently, psychologists have spent some time discussing and trying to redefine the clinician's role.

Fryback and Erdman (1979) credit clinicians with the central role. They make the diagnostic decisions and computers simply provide realistic, unbiased feedback on those decisions. Kleinmuntz (1982) also sees clinicians in the primary role, with computers, as described so far, serving strictly as diagnostic aids that calculate Beta-weights and probabilities and are essentially occupied with number crunching.

For Dawes (1976) the emphasis on experts' decision-making abilities is, in many situations, little more than cognitive conceit, coupled with our unwillingness to accept the fact that there are cognitive limits to our decision-making talents. In "noisy" decision-making contexts, such as the selection of graduate students, he argues that computerized models should make the decisions, with admissions committees overriding these decisions only for social reasons such as ensuring adequate minority group representation.

Meehl (1954, 1973) is also comfortable with allowing the computer to diagnose when the data are objective and their combination is mechanical. He argues that in this realm diagnostic decisions are routine, clinical intuition does not have a role to play, and clinicians can be more effectively employed in those aspects of diagnosis that a computer cannot do.

Thus, clinicians can observe and interview patients, be supportive and empathic, exercise their intuition and test out hunches, and make strategic decisions about the sequence of diagnostic tests and treatment steps.

Finally, several psychologists stress the need for a more integrated relationship between clinicians and computers. Einhorn (1972), for example, points out that computers can use objective, quantified input data, such as MMPI scores or blood test results. Alternatively, they can make use of qualitative information, provided a clinician subjectively interprets and codes that data to ensure monotonicity between predictors and criterion. When this type of clinical information is included as input to the computer, the resultant diagnosis can be said to capture, at least minimally, the clinician's knowledge and experience (see also Dawes, 1977). Hence, clinicians and computers become more interdependent.

Yntema and Torgerson (1961) carry this interdependence one step further when they claim "man and computers could cooperate more efficiently . . . if a man could tell the computers how he wanted a decision made, and then let the machine make the decision for him" (p. 20). This idea is a critical one, especially if, as Johnson, Hassebrock, Duran, and Moller (1982) have suggested, expert clinicians differ from novices in the reasoning strategies they employ to make their diagnostic decisions.

This type of interdependence may be achieved, as Dawes (1979) has suggested, by allowing experts to establish the coefficients of the computerized regression model on the basis of intuition. Alternatively, the simple linear model may be abandoned as dehumanizing in favor of an approach which focuses more directly on experts' decision-making processes per se. Whether or not this is an example of cognitive conceit, it is precisely the approach taken by Kleinmuntz (1968, 1975, 1982) when he had both clinical psychologists and neurologists explain each step of their respective decision-making processes. Moreover, it is this emphasis on how decisions are reached, together with the length of time involved in eliciting and editing 'talk aloud' protocols, that prompted Leal and Pearl (1977) to devise an interactive computer program capable of capturing the ways in which decision makers structure their knowledge in order to make their professional decisions.


Finally, and most important, it is this kind of perspective that spawned expert systems.

EXPERT SYSTEMS: THE CASE FOR COMPLEX MODELS

Expert systems are sophisticated and highly specialized computer programmes that try to capture and reproduce experts' decision-making processes (Naylor, 1983). Their focus is on the knowledge experts possess and the ways in which they use it to solve problems within their specialties. Thus, expert systems are, in technical terms, knowledge-based. Furthermore, their power lies in this knowledge in the sense that the more accurately they reflect experts' knowledge, the better they are considered to be (Davis, 1984).

The way in which this knowledge is explicated and then programmed is quite specific. First of all, expert systems do not represent all types of knowledge and cannot model all decision-making tasks, or all experts. They do not deal with the type of firm, fixed, and formalized knowledge that can be handled within the scope of algorithmic programmes (Buchanan et al., 1983). Instead, they take a heuristic approach, based on the assumption that real-world decision problems are solved only semilogically. Experts do not reason every decision from first principles; they take short cuts. Often they lack sufficient information to allow them to make purely logical decisions. However, over and above this, experts in a number of areas rely heavily on experience, professional judgment, and hunches in making decisions. It is just this type of subjective and rather elusive knowledge that expert systems attempt to capture (Buchanan et al., 1983; Duda & Gaschnig, 1981).

While the type of knowledge on which experts base their performance is one prerequisite for a knowledge-based expert system, there are others.

The decision problem or task that is to be programmed as an expert system must be one that requires specialized knowledge or expertise. Hence, those who are recognized as experts in the area must outperform novices. Furthermore, within the set of experts it must be possible to identify at least one individual who is acknowledged to be particularly proficient in his or her decision-making capacity. In addition, the task domain must be such that experts can explain both the knowledge they require and the methods they use to apply it to particular decision problems in that domain. This will be the case, for instance, for tasks that are routinely taught to students or apprentices.

Other prerequisites for knowledge-based systems stem from consideration of machine limitations and the time and effort involved in constructing and implementing such systems. As yet, expert systems can deal only with very narrow domains of expertise, so that appropriate tasks are ones that have a specific, well-bounded domain of application. Furthermore, given the human and financial costs of expert systems mentioned earlier, the decision-making task to be modeled must be one that is commercially viable (Brachman et al., 1983; Davis, 1984; Duda & Gaschnig, 1981).

Although the list of prerequisites seems prohibitive, a fairly wide variety of expert systems has been developed so far. For instance, PROSPECTOR is a geological expert system that focuses on mineral exploration problems. It estimates the chances of finding certain minerals in a particular location and gives a rough guide to the quantity of each (Duda et al., 1978; see also Naylor, 1983). And DENDRAL infers chemical molecular structures on the basis of spectrographic information (Buchanan & Feigenbaum, 1978). However, perhaps the most popular area of expert system development is that of medical diagnosis. In addition to CADUCEUS (also known as INTERNIST), medical expert systems include MYCIN, which diagnoses and suggests treatment for infectious diseases, and PUFF, which diagnoses pulmonary diseases (Shortliffe, 1976; Kunz et al., 1978, respectively; see also Johnson & Keravnou, 1985).


Expert system construction

The simplest way to understand expert systems, as complex models of experts' decision-making processes, is to observe how they are built. There are two aspects of this construction. The first involves the elicitation of an expert's specialized knowledge and decision heuristics. And the second concerns the computer representation and organization of that expertise.

The knowledge base. The most popular way of representing knowledge in an expert system has been to use production, or situation-action, rules consisting of IF . . . THEN statements (Clancey, 1983; Duda & Gaschnig, 1981). There are alternatives. For example, Aikins (1983) has constructed CENTAUR, with a domain of pulmonary physiology, as an expert system whose knowledge base incorporates prototype representations (see also Jackson, 1986). However, alternatives such as this are technically very complicated and, on these grounds, expert system designers have generally been content to opt for a rule-based approach which assumes experts' rules of thumb are the most important component of their expertise.

The construction of an expert system normally requires an expert in the domain of interest and a knowledge engineer, or specialized programmer, to act as an intermediary between expert and computer. All of the systems built so far have relied on a single expert and, with one exception, there has been no discussion of how the expert involved was selected (Davis, 1984). The exception is the internist on whom CADUCEUS was modeled. He was concerned about capturing the type of generalized expertise traditionally used by internists and simply volunteered his services (Pople, 1984). While the expert's role in the construction team is well-defined, that of the knowledge engineer is more flexible. There are expert systems, such as TEIRESIAS, which focus on the domain of knowledge acquisition and lessen the dependence on knowledge engineers by allowing expertise to be transferred interactively from human expert to machine (Davis & Lenat, 1982; Politakis & Weiss, 1984). However, these presuppose that experts know something about the program's internal representational structures.

And it has been suggested that usually they do not (Duda & Gaschnig, 1981). Thus, the expert-engineer partnership is, in practice, fairly well established.

Within this partnership the expert identifies his or her knowledge and the programmer transforms it into a machine-useable form. The stages of this process have been outlined by Hayes-Roth et al. (1983), Buchanan et al. (1983) and, more technically, by van Melle (1981). It sounds relatively straightforward. The boundaries of the problem domain are identified, the concepts relevant to that domain are elicited, the rules relating subsets of these concepts are specified, and the hierarchical organization of the rules clarified. The rules are then formalized as IF . . . THEN statements and organized hierarchically by a control structure. The resultant program is tested and, with the expert's help, the rules validated. The most reasonable strategy is to start small and construct a skeleton expert system. When this is working satisfactorily, more knowledge (rules) can be added to flesh it out and, in the process, the control structure will usually be modified to reflect new and changed organizational links between rules.

The process is not as easy as it appears, however. The type of knowledge that expert systems try to capture and reproduce is generally not well structured, particularly at the level of detail required for the construction of such a system. Extracting the knowledge may not be too difficult. Typically, arranging it in a systematic fashion proves more of a challenge (Buchanan et al., 1983). Together, these two tasks represent a considerable cognitive load and investment of time for the expert involved. For instance, Pople's (1984) internist contributed about three hours a week for six months to help build the skeleton of CADUCEUS.

The above prescription for expert system construction does more than oversimplify the mechanics of knowledge acquisition. In addition, it fails to do justice to the complexity of the program that models the expert's knowledge. To rectify this, a more detailed account of the machine representation of this knowledge is required.


The emphasis here will be more conceptual than technical. More complete general descriptions are available in van Melle (1981), Davis & Lenat (1982), or Clancey (1983). For a detailed description of the knowledge representation in MYCIN, see Shortliffe (1976).

While an expert system may consist of up to seven components, only two of these are common to all existing systems (for descriptions of the remaining five specialty components see Hayes-Roth et al., 1983). All contain a knowledge base and a control structure, which is alternately referred to as an inference engine or an interpreter. The knowledge base is usually composed entirely of rules: expert systems typically do not store raw data on individual cases. In many of the expert systems constructed so far, these rules are exclusively production rules. MYCIN is the largest of these systems, with a total of about 450 rules (Shortliffe, 1976; van Melle, 1981). Each rule is a chunk of knowledge that is self-contained but which, by itself, cannot be used to form a diagnosis. Diagnoses are made on the basis of sets or chains of rules. Each rule is of the form IF antecedents A, B and C are satisfied THEN the conclusion D can be drawn. But it is even more complicated than this: conclusion D can rarely be drawn with certainty. Instead, it holds with probability p (Davis, 1984).

As an example, consider the following MYCIN rule for identifying the organism responsible for a particular infectious disease:

If 1) the gram stain of the organism is gram negative, and
   2) the morphology is rod, and
   3) the aerobicity of the organism is anaerobic,
Then there is suggestive evidence (.7) that the identity of the organism is Bacteroides (Davis, 1984, p. 34).

Although the content may not mean much outside the field of bacteriology, there are several points that deserve mention. First, the probability (or, in technical terms, the certainty factor) of .7 is estimated by the expert clinician involved in the expert system construction. Moreover, the estimate is not mathematically but, rather, experientially determined.

Second, while expert systems allow the rules to be expressed, as above, in English, this is not the language the computer uses. It works with highly esoteric languages such as LISP or INTERLISP (Naylor, 1983; van Melle, 1981).
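
To make the shape of such a rule concrete, here is one way it might be encoded as a data structure. The sketch is written in Python purely for exposition; MYCIN itself was written in LISP, and the field names used here are invented rather than taken from any published system.

    # A hypothetical encoding of the Bacteroides rule quoted above.
    # Each premise is an (attribute, value) pair; the conclusion carries
    # the expert-supplied certainty factor, not a computed probability.
    rule_bacteroides = {
        "premises": [
            ("gram_stain", "gram_negative"),
            ("morphology", "rod"),
            ("aerobicity", "anaerobic"),
        ],
        "conclusion": ("identity", "bacteroides"),
        "certainty": 0.7,   # estimated experientially by the expert
    }

    def applies(rule, findings):
        """Return the rule's certainty if every premise matches the known
        findings, otherwise None (the rule simply fails)."""
        if all(findings.get(attr) == value for attr, value in rule["premises"]):
            return rule["certainty"]
        return None

    print(applies(rule_bacteroides,
                  {"gram_stain": "gram_negative",
                   "morphology": "rod",
                   "aerobicity": "anaerobic"}))   # -> 0.7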

The third and final point deals with the ways in which the three antecedent conditions are fulfilled, and leads nicely into a discussion of the expert system's control structure. There are two means by which the premises of any production rule may be satisfied. The first, lower-order method is simply to ask the user, in this case typically a physician, for the information directly. Thus, in a consultation interview MYCIN may ask, for instance: Is the organism a rod or coccus? The physician then types in the answer, say rod. Or he or she may type in "Unknown," in which case the rule above fails unless the conclusion can be reached with a lower probability on the basis of conditions (1) and (3) alone.

The second method of satisfying antecedent conditions is of a higher order. It involves employing the system's own knowledge base. Thus, in order to satisfy condition (1) the program effectively asks: What is the gram stain of the organism? However, instead of seeking the answer from the physician, as user, all the rules relating to the gram stain are called up and each is interpreted. In formal terms, the pursuit of the goal of diagnosing the organism is temporarily postponed while the subgoal of determining the gram stain is pursued (Davis, 1984; van Melle, 1981).

The two methods of satisfying the antecedent conditions of rules (rule interpretation) are employed together in an expert system. The system does what it can on the basis of its rule-based knowledge and, when it gets stuck, it asks the user to provide the necessary information. Obviously some overall organization is required. For example, there must be a way of knowing which rules are related to the gram-stain property of an organism so that they can be called up without simply churning through the entire set of rules. And there must be a way of knowing when there are no pertinent rules and the user must provide the information. Finally, there must be a way of keeping track of where one is in the diagnostic process, and of how the various strands are related to each other.


These organizational tasks are the province of the second major component common to all expert systems: the control structure, rule interpreter, or inference engine (Davis, 1984).

The control structure. The control structure consists of meta-rules (van Melle, 1981; Clancey, 1983). As an illustration, consider the following meta-rule from MYCIN:

If 1) the infection is a pelvic abscess, and
   2) there are rules which mention in their premise enterobacteriaceae, and
   3) there are rules which mention in their premise gram-pos-rods,
Then there is suggestive evidence (.4) that the former should be done before the latter (Clancey, 1983, p. 238).

The function of the control structure, then, is to manipulate the rules that comprise the knowledge base. The implication is that control structure and knowledge base are separate components. While such independence is not essential, it is, from a programming perspective, by far the easiest approach (Davis, 1984). In fact, it is just this distinction between knowledge base and control structure that differentiates expert systems from other types of computing programs (Duda & Gaschnig, 1981; Hapgood, 1985).

All expert system control structures manipulate knowledge-base rules according to one of two strategies. Either they employ what is known as forward-chaining or, alternatively, they use backward-chaining. Of the two, backward-chaining, or consequent reasoning, is by far the more popular (for descriptions of forward-chaining procedures see Duda & Gaschnig, 1981; van Melle, 1981). It is the strategy employed by MYCIN and, hence, by most of its derivatives (e.g., PUFF).

Backward-chaining is goal-directed. It assumes that diagnostic problems can be represented by a tree diagram (a goal tree), with a diagnosis (goal) at the apex, branching out into increasingly specific configurations of signs and symptoms necessary for the diagnosis.

The procedure begins with a diagnosis. This is the physician's best guess (cf. condition one of the meta-rule above). It is entered into the computer and the expert system then tries to confirm it. To do so the control structure calls up all rules relating directly to the diagnosis of, say, pelvic abscesses. These will be fairly general rules. Next, the control structure tries to satisfy each of these rules. When it runs across a rule that cannot be satisfied, that rule is considered to be a subgoal and the control structure calls up all the more specific rules that must be fulfilled in order to achieve that subgoal, and so on. Obviously backward-chaining is a "depth-first" technique. It is methodical and in that sense it is an efficient way of arriving at a diagnosis. However, it is costly in terms of the amount of information it requires from the user. If the same knowledge-base rule occurs in two different contexts and if that rule needs a piece of information, then the user will have to provide that piece of information twice (Davis & Lenat, 1982; Duda & Gaschnig, 1981; van Melle, 1981).

To illustrate, consider an example from MYCIN's repertoire. A particular infectious disease may be caused by one of four organisms. MYCIN considers each possibility in turn. In trying to identify organism-1 it will, among other things, ask for the aerobicity of the organism. The physician will type it in. If, in the end, there is insufficient evidence to conclude that the diagnosis is attributable to organism-1, then organism-2 will be considered. Again the aerobicity will be requested and again it must be supplied.
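
The backward-chaining regime just described can be sketched in a few lines. The toy interpreter below is written in Python for this discussion, with invented rules and attribute names; it is not MYCIN's code, but it shows the depth-first pursuit of subgoals and the fall-back of asking the user whenever no rule concludes the attribute being sought.

    # Toy rule base: each rule concludes one (attribute, value) from premises.
    RULES = [
        {"concludes": ("identity", "bacteroides"),
         "premises": [("gram_stain", "gram_negative"), ("aerobicity", "anaerobic")]},
        {"concludes": ("gram_stain", "gram_negative"),
         "premises": [("stain_color", "pink")]},
    ]

    USER_ANSWERS = {"stain_color": "pink", "aerobicity": "anaerobic"}  # stand-in for typed replies
    known = {}   # facts established so far in the consultation

    def ask_user(attr):
        # In a real consultation this would prompt the physician at the terminal.
        answer = USER_ANSWERS.get(attr, "unknown")
        print(f"System asks for {attr}; user answers: {answer}")
        return answer

    def establish(attr):
        """Backward-chaining: try every rule that concludes a value for attr;
        each premise becomes a subgoal pursued recursively (depth-first)."""
        if attr in known:
            return known[attr]
        for rule in RULES:
            concl_attr, concl_value = rule["concludes"]
            if concl_attr != attr:
                continue
            if all(establish(p_attr) == p_value for p_attr, p_value in rule["premises"]):
                known[attr] = concl_value
                return concl_value
        known[attr] = ask_user(attr)   # no pertinent rule: fall back to the user
        return known[attr]

    print("Concluded identity:", establish("identity"))

Note that the `known` table caches answers, so this toy avoids the duplicate questions described in the text; that is a simplification made here, not a claim about how MYCIN behaved.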

The above discussion of expert systems, how they are constructed and how they work, has been a fairly general one. Nevertheless, even at this level it is apparent that expert systems are highly sophisticated (and becoming even more so) and represent complex models of experts' decision-making processes in a number of rather narrowly defined specialty areas. Furthermore, the way in which this specialty knowledge is represented in the computer is not primarily dictated by the nature of the expertise.


Instead, it seems to be largely a function of constraints imposed by existing computer technology and programming techniques. It is still the case that in most expert systems knowledge is represented by production rules, with a separation of the knowledge base and control structure. This is not necessarily the most appropriate or the most flexible knowledge representation. It is simply the one that is most tractable from a programming perspective.

The evaluation of expert systems

The preceding discussion provides some understanding of the type of decision-making model captured in an expert system. Now the question is: How good are expert systems? There are at least two ways of interpreting this question. A good expert system may be one in which there is a close match between the system and the expert on whom it is modeled. Alternatively, a good expert system may be one that outperforms those professionals who would make use of such a system.

Experts and expert systems: How close is the match? Within the field of AI, a number of researchers have recognized that the degree of concordance between the expert and the model of the expert may be modest. In many of the existing expert systems the knowledge base contains only production rules, and Aikins (1983) and Davis (1982), among others, claim that this type of knowledge representation is too limited for many diagnostic problems. When this is the case, an expert system relying on production rules will not model its expert closely and, given that the human expert is assumed to be the ideal, it will therefore not be judged to be as good. Consistent with this, there is a recent trend in AI toward expert systems that combine production rules with more general rule-based knowledge representations (see, e.g., Hayes-Roth et al., 1983). However, even with these modifications, the match between experts and expert systems may not be good. Although experts do use rules to solve their diagnostic problems, Dreyfus (1972) has suggested that they do not use only rules.

Thus, expert systems, as they exist now, will inevitably miss some, perhaps important, aspects of clinicians' decision-making processes. The nature of these limitations is acknowledged by some expert system proponents (see, e.g., Winston & Prendergast, 1984). But there seems to be no discussion of the seriousness of those limitations. There is only an underlying optimism that future technological advances will make it possible to approximate the activities of experts more closely.

Assuming, for the moment, that clinicians' diagnostic decisions are rule-based, the question of how adequately these rules can, in fact, be captured in an expert system arises because of the nature of the decision problems faced by clinicians. In very many instances these tasks are ill-structured. Clinicians are often required to make decisions on the basis of incomplete information. And their job is further complicated by the fact that the set of information they do have is not standard: data are collected in a different order for different patients so that, given limited time, the set of missing data will vary across patients.

If the set of information clinicians use to form diagnoses varies from patient to patient, even when those patients actually have the same disease, then their resultant decision rules would be expected to be extremely complex. Hapgood (1985) has suggested that they fall into two categories. The first contains more general rules that work for routine cases. And the second contains those rules which govern the exceptions and qualifications to the normal rules of the first type. It is these rules that seem most difficult to elicit. And it is precisely these rules that are critically important, for they provide clinicians with diagnostic flexibility.

The difficulty of elucidating this second category of rules is acknowledged by those who build expert systems. Even though one of the prerequisites for an expert system is that the expert be able to explain his or her specialized knowledge and experience, a large proportion of the knowledge engineer's task is still devoted to helping the expert structure that expertise into an interlocking network of rules (Buchanan et al., 1983; Duda & Gaschnig, 1981).


Part of the problem seems to be that clinicians use rules, particularly those in the second category, without being fully aware of what those rules are. The assumption that knowledge engineers operate under is that it is possible, through discussion, to make clinicians conscious of these rules. They can then be captured in an expert system. Some psychologists would challenge this assumption. For instance, Nisbett and Wilson (1977) would argue that clinicians do not necessarily become aware of the rules they use. Instead, they may simply invent rules which seem plausible within the decision-making context. These "invented" rules may, in fact, work. However, at the time of expert system construction, their validity has not been established by prior experience, as is the case for the other "non-invented" rules. Instead, it must be assessed in terms of future diagnostic applications.

Dreyfus (1972) has stressed the importance of the decision-making context. He argues that regardless of whether the rules elicited from clinicians are actual or invented, they are embedded in a certain context. In the interaction between expert and knowledge engineer, the decision rules are removed from this context and reinserted into a context which is dictated by the knowledge representation of the expert system. Dreyfus warns that the two contexts are not synonymous and that, to the extent that they differ, distortions would be expected to occur in the translation of rules from expert to machine.

In summary, then, it appears that expert systems may be limited in their ability to capture experts' decision-making processes. At present most of them are capable of capturing only rules, and these comprise only a portion of clinicians' activities. Furthermore, they may have to be content with accepting the fact that the rules they do capture may well be distortions of those clinicians actually use. Thus, expert systems are generally not superb models of human expertise. This does not, however, automatically imply that they are inaccurate in their diagnostic decisions.

The accuracy of expert system diagnosis. Published empirical evidence on the level of performance attained by expert systems is relatively hard to come by.

Much more common are statements, presumably based on experience, attesting to the systems' capabilities. Thus, Brachman et al. (1983) claim that no existing expert system comes close to the ideal, exceptionally high level of performance supposedly achieved by human experts. However, it seems that the medical expert systems do a reasonably good diagnostic job within their domains. Members of the medical profession consider MYCIN's diagnostic and treatment judgments to be as good as their own (Davis, 1984; Naylor, 1983). And PUFF is able to produce diagnoses that are accepted by clinicians between 90% and 95% of the time, although it is unclear whether the clinicians would actually have made the identical diagnosis themselves (Davis, 1984; van Melle, 1981). Finally, Pople (1984) claims that CADUCEUS does a fairly good diagnostic job in internal medicine.

This somewhat informal evaluation of medical expert systems indicates that, at least in some medical specialties, expert systems can approach but not exceed the diagnostic ability of clinicians. However, Yu et al.'s (1979) results would dispute this. In their study, independent evaluators compared MYCIN's performance in the treatment of meningitis to that of clinicians with varying degrees of expertise. The results favoured MYCIN: its prescriptions were more often rated as acceptable by the evaluators than were those of the clinicians. They were even rated as slightly better than the prescriptions of the best clinician. However, lest the picture look too rosy, it should be emphasized that only 65% of MYCIN's diagnoses were acceptable (versus an average of 55.5% for the clinicians).

Overall, the discussion of expert systems presented here demonstrates that they are extremely complex models of experts' decision-making processes but that the degree of congruence between expert and model is often limited. And, furthermore, while MYCIN may outperform clinicians, it is more often the case that the accuracy of expert systems is bounded by the performance levels attained by the experts on whom they are modelled.


A RESOLUTION OF THE SIMPLE VS. COMPLEX PARADOX

The paradox of simple versus complex models of experts' decision-making processes arises as a result of applying two different research traditions to the same underlying problem. Psychologists base their linear models on sources of information that are codable. Hence, they capture experts' experience only minimally, in the coding of subjective input variables, and they exclude the role of clinical intuition. Those in AI, on the other hand, credit experience and intuition with a central role, and they go to considerable lengths to capture these qualities in the form of decision heuristics. The question is: Are expert systems worth all this effort? Do they do a better job of making expert decisions than the models proposed by psychologists?

Yu et al.'s (1979) results suggest that while MYCIN's performance is not outstanding, it is comparable with the level of judgment accuracy achieved by clinicians in psychological research. However, while expert systems can perform as well as human experts, they typically do so only in an ideal research environment where they are required to make decisions on the basis of refined "textbook" cases that are well within their problem domains. In real-world settings where they must deal with, for example, actual patients who provide case histories in which symptoms are misrepresented and relevant information forgotten, expert systems attain only very modest accuracy levels (cf. Davis, 1984; Meehl, 1954; Pople, 1984). Similarly, at the boundary of their specified problem domains, the quality of expert systems' diagnostic decisions declines much faster and to a much lower level than it does for human experts (Davis, 1984). This is to be expected. Whereas experts have some background knowledge outside their specialty, expert systems do not.

The conclusion to be drawn from this is that when any decision problem is attempted under realistic conditions containing noise or error, expert systems will fare significantly worse than experts. Within a medical context, for instance, when patients misrepresent their symptoms, clinicians provide the expert system with faulty information. And when decision problems test the limits of an expert system's problem domain, some of the rules needed in the diagnostic process will simply be absent from the system's knowledge base, and the system will therefore be unable to request the information required to make an adequate diagnosis. The result of these two sources of error will be that the expert system can confirm a number of diagnoses, each with low probability. This leaves clinicians in a quandary over the most appropriate course of treatment and, in choosing between alternatives, error is introduced into the criterion.

If expert systems can do as well as experts under ideal conditions, but are less accurate in those real-world settings that introduce error into the diagnostic process, then the question is: How does the performance of these same expert systems compare with that of the linear decision models proposed by psychologists? And the answer seems to depend upon the nature of the decision problem.

Psychologists such as Dawes (1971) and Goldberg (1970) claim that the predictive accuracy of linear models is greater than or equal to that of expert judgment because such models eliminate important sources of error in the decision-making process. If this claim is valid, experts can be expected to be about as accurate as their paramorphic models when the decision problem is inherently noise-free.
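
The paramorphic-model idea is easy to make concrete. The sketch below is a minimal illustration using synthetic data and three hypothetical cues (not the variables of any study cited here): a linear regression is fitted to an expert's own judgments, and the fitted model then stands in for the expert, which is the "bootstrapping" procedure discussed in this section.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic decision problem: three coded cues per case (hypothetical).
    n_cases = 200
    cues = rng.normal(size=(n_cases, 3))

    # A simulated expert: judgments follow a linear policy plus random
    # inconsistency, which is what the paramorphic model strips out.
    true_policy = np.array([0.6, 0.3, 0.1])
    expert_judgments = cues @ true_policy + rng.normal(scale=0.5, size=n_cases)

    # Fit the paramorphic (bootstrapped) model: regress the expert's own
    # judgments on the cues by ordinary least squares.
    X = np.column_stack([cues, np.ones(n_cases)])
    weights, *_ = np.linalg.lstsq(X, expert_judgments, rcond=None)

    # The fitted model can now stand in for the expert on new cases.
    new_case = np.array([0.2, -1.0, 0.5, 1.0])
    print("model's judgment for a new case:", new_case @ weights)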

In a medical context, for instance, it is often the case that there is very nearly a one-to-one relationship between diagnosis and treatment outcome. Therefore, it follows that when patients present their symptoms accurately, when physicians have important test results available, and when the diagnostic problem is well within an expert system's domain, expert systems can be as accurate as their experts who, in turn, can be as accurate as their linear models. Thus, expert systems can do as well as the linear models favored by psychologists for noise-free decision problems under ideal conditions. In real-world settings that are noisy, however, linear models have the edge over expert systems, even when the decision problem is error-free in nature.

The situation is considerably different in decision-making contexts that are inherently noisy. In selecting graduate students, for example, there is only a fuzzy relationship between present academic qualifications and future academic success (Dawes, 1971, 1976). Similarly, when there is no measurable criterion and the decision problem is to establish preferences, the analogous relationship is complex and weak (Dawes, 1979). In cases such as these, experts are not and cannot be expected to be very accurate. However, it is in these contexts that bootstrapping works and even linear models with unit weights outperform experts (Dawes, 1977, 1979; Dawes & Corrigan, 1974). Given that expert systems typically are no more accurate than experts, the conclusion that, for noisy decision problems, the performance of expert systems is inferior to that of very simple linear models is justified.
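
Unit-weighted ("improper") linear models are simpler still than a fitted regression: each standardized cue receives a weight of +1 or -1 according to its presumed direction. The hedged sketch below illustrates the idea on invented, synthetic cues and a noisy criterion, not data from any of the studies cited.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical selection problem: three cues, noisy criterion.
    cues = rng.normal(size=(500, 3))
    criterion = cues @ np.array([0.5, 0.4, 0.1]) + rng.normal(scale=2.0, size=500)

    # Unit-weight model: standardize each cue and simply add them up,
    # using only the assumed sign of each cue's relationship.
    z = (cues - cues.mean(axis=0)) / cues.std(axis=0)
    unit_weight_score = z @ np.array([1.0, 1.0, 1.0])

    # In a noisy context the unit-weight composite correlates with the
    # criterion about as well as a fitted model would.
    print("validity of unit-weight model:",
          np.corrcoef(unit_weight_score, criterion)[0, 1])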

If the conceptual conclusions drawn here are substantiated empirically, expert systems are likely to perform as well as linear decision models only for problems and in contexts that are noise-free. This assertion, if correct, seriously undermines the rationale for expert system construction. It does not, however, automatically follow that expert systems should be abandoned in favor of linear models. If there is some way of modifying expert systems to improve their accuracy, they may still be competitive with psychology's simple alternative. The reviews of the psychological and AI literatures just presented suggest a number of possibilities for such improvement.

As already mentioned, expert systems perform very badly when the limits of their knowledge bases are tested. Thus, their accuracy may be improved by adding production rules to expand the scope of their knowledge base. This is the approach taken by knowledge engineers in AI. Rules are modular, so their addition should be a simple matter. In practice it is not. There is no problem with the knowledge base; it is in the control structure that the difficulty arises. The major function of the control structure is to determine the order in which the branches of the decision tree are to be searched and to guide the decision-making process within each of these branches.

When new rules are added, both the order in which the branches are to be searched and the decision-making strategy within one or more branches change. Thus, a great number of the control structure links between production rules must be changed. This process is eased to a degree by systems such as TEIRESIAS. Nevertheless, these systems offer somewhat limited explanations of control structure functioning, and altering the control structure of an expert system is still an exceedingly complicated process (Buchanan, 1983; Davis, 1984).
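
To see why adding rules disturbs the control structure, consider the toy rule interpreter sketched below. It is an illustrative backward-chaining fragment, not the architecture of any system named in this paper, and the rule contents, fact names, and goal names are invented. The order in which rules are tried is itself part of the program, so a newly added rule changes which branches are explored and in what sequence.

    # Each production rule: IF all conditions hold THEN conclude the goal.
    RULES = [
        {"if": ["fever", "stiff_neck"], "then": "possible_meningitis"},
        {"if": ["possible_meningitis", "positive_culture"], "then": "bacterial_meningitis"},
        # Adding a new rule here changes the search order below.
    ]

    def backward_chain(goal, facts, depth=0):
        """Try to establish `goal` from `facts` using the rule base."""
        if goal in facts:
            return True
        # Control structure: rules are tried in list order, so a rule's
        # position in RULES determines the shape of the search.
        for rule in RULES:
            if rule["then"] == goal and all(
                backward_chain(cond, facts, depth + 1) for cond in rule["if"]
            ):
                print("  " * depth + f"established {goal} via {rule['if']}")
                return True
        return False

    facts = {"fever", "stiff_neck", "positive_culture"}
    print(backward_chain("bacterial_meningitis", facts))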

Expanding the knowledge base, however, is not the only possible strategy for improving the decision-making ability of expert systems. At least two others are suggested by the psychological literature. First, the level of performance of an expert system is bounded by the talents of the expert on whom it is modeled. If that expert clinician misapplies rules, or uses rules that are wrong, then the expert system will do likewise. While those in AI assume all experts are highly accurate, psychologists caution that only some experts are actually excellent diagnosticians (see, e.g., Goldberg, 1959). Obviously, then, it is crucial to develop means of identifying one of these outstanding clinicians and having him or her serve as the source of the expert system knowledge base (see Kleinmuntz, 1968, for recognition of this point).

The suggestion is promising and perhaps suffers from only one major drawback. As already mentioned, the construction of an expert system requires a considerable commitment on the part of the expert clinician (Pople, 1984). Good diagnosticians are usually also extremely busy consultants and may well be unwilling to invest the time and effort required. To the extent that this is so, knowledge engineers will have to satisfice and model their expert systems on clinicians with more modest diagnostic skills.

The second way in which psychologists can contribute to the improvement of expert systems has to do with the certainty factor associated with each production rule in the knowledge base. These are judgments of the plausibility of a particular rule (Hayes-Roth et al., 1983). They are not, strictly speaking, probabilities, but they are linear transformations of them (Davis & Lenat, 1982). Furthermore, they are provided by the specialist or clinician on whom the expert system is based and reflect his or her experience. Therefore, the psychological results which suggest that clinicians are not well calibrated, or not very good at estimating probabilities, are pertinent. In fact, these findings may have particular relevance because it seems likely that knowledge engineers would not automatically be alerted to this tendency: clinicians generally announce their probability judgments with a larger dose of confidence than is warranted (Fryback & Erdman, 1979; Lichtenstein & Fischoff, 1976).

The fact that medical specialists are not well calibrated may not be too serious for expert systems. It may be more crucial for the patients awaiting diagnosis if, as Fryback and Erdman (1979) have suggested, this lack of calibration is manifested as a tendency to play it safe and overestimate the probability of disease. In order to offset this tendency psychologists have argued for a computing system that stores patient information and calculates empirically based probabilities. If there is some way of transforming these probabilities into production rule certainty factors, then expert systems should be less reflective of experts' biases and, on these grounds, they may become more accurate decision makers.
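
One way to picture such a transformation is sketched below. The mapping from an empirically estimated probability to a rule's certainty factor is illustrative only (a simple rescaling around an assumed prior, in the spirit of the "linear transformation" noted above); the combination step follows the commonly described MYCIN-style formula for merging two supporting pieces of evidence. The probability values themselves are invented.

    def probability_to_cf(p, prior=0.5):
        """Illustrative rescaling of an empirical probability into a
        certainty factor in [-1, 1], centred on an assumed prior."""
        if p >= prior:
            return (p - prior) / (1.0 - prior)   # evidence for the hypothesis
        return (p - prior) / prior               # evidence against it

    def combine_positive_cfs(cf1, cf2):
        """MYCIN-style combination of two supporting certainty factors."""
        return cf1 + cf2 * (1.0 - cf1)

    # Empirically estimated probabilities from stored patient records,
    # replacing a clinician's possibly overconfident estimates.
    cf_rule_a = probability_to_cf(0.70)
    cf_rule_b = probability_to_cf(0.60)
    print(round(combine_positive_cfs(cf_rule_a, cf_rule_b), 3))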

Knowledge engineers are sensitive to this idea. From their perspective they, too, understand the utility of storing patient data so that the expert system can learn from examples and so can modify its own rules, refine its certainty factors, and so on (Duda & Gaschnig, 1981). The problem they face is that it is simply not feasible, as yet, to build an expert system that incorporates such a data base. First, it is, in itself, an extremely challenging task. And second, the vast majority of expert systems are neither complete nor in commercial use. Consequently, all available time and effort is channeled into improving expert systems in their existing form, leaving very little of either commodity for the expansion of the concept of an expert system (Davis, 1982, 1984).

The foregoing arguments suggest that while there are ways of improving the performance of expert systems in principle, in practice these strategies are either technologically infeasible or place unrealistic demands on busy professionals. Therefore, at least within the foreseeable future, expert systems are unlikely to match the performance of simple linear models for some decision problems and under most conditions. Is there, then, any case to be made for the construction of expert systems?

Although linear models are often superior to expert systems, there are some specialized contexts in which expert systems can hold their own. For example, in those medical subspecialties where decisions are typically made on the basis of objective information (laboratory tests, unambiguous clinical signs) rather than subjective patient reports, expert systems can perform as well as highly trained specialists. In the absence of such specialists, however, expert systems are likely to have the advantage over an attending physician or family practitioner. Hence, they may be useful in smaller medical institutions which do not have consultants in the more esoteric subspecialties on staff.

Of course, even in these situations, linear models will probably do as well as expert systems. However, the latter may be preferred over the former because, being based on particular individuals' experience, their diagnostic decisions seem more "human." Such a preference, if it actually exists, is understandable. It is also potentially dangerous. To the extent that an expert system's decisions are "humanized," the distance between the computer model and a physician user is minimized. Thus, when physicians rely on expert system diagnosis there is the concern that they will effectively learn to think like the expert system. If so, they may not be overly sensitive to evidence that questions the provided diagnosis and are therefore unlikely to override the expert system's diagnosis or treatment recommendations. In that case, diagnostic errors will most probably be machine errors.

When regression equation diagnosis is preferred over the expert system option, intuition suggests that it will be easier to override suspect diagnoses. Clinicians will often possess, or interpret, details of a patient's history that are not available to the computer but which render the computer's diagnosis highly suspect (see, also, Dawes & Corrigan, 1974; Kleinmuntz, 1982). In addition, a patient's overall history and pattern of symptoms may be strongly reminiscent of a past case. When this happens, Gilovich (1981) has suggested that clinicians will be inclined to base their diagnosis on whatever pathology was diagnosed in that past case. If the computer diagnosis is at odds with this, then it may be overridden. To the extent that clinicians actually do override the mechanical diagnoses of a regression equation, the use of linear models will lead to an increase in diagnostic errors that are errors in human judgment. Perhaps, in the end, the choice between complex expert systems and simple linear models, in contexts where they do equally well, is really a choice between living with the consequences of machine versus human error. And that choice is a matter of taste as well as of ethical and moral debate.

The preceding discussion of the simple versus complex paradox suggests a resolution in favor of psychologists' simple linear models. They can outperform expert systems in all but nearly perfect decision-making contexts, and there seems little hope that technological advances in the near future will alter this ranking. Furthermore, even in those situations where expert systems and linear equations are closely matched, it is not at all certain that the machine errors of expert systems will be more acceptable than the human errors that are likely to accompany linear equations.

There is, however, one argument that could be used to challenge this resolution. When experts make decisions their task is essentially one of pattern recognition. For example, a patient presents a certain pattern of symptoms and a physician must decide whether or not that patient belongs to the class of people with stomach cancer. The problem of pattern recognition has received a great deal of attention in AI over the past 20 years or so, and there is a substantial body of literature in this area which deals with pattern recognition algorithms (perceptrons) that are based on linear regression equations (see, e.g., Nilsson, 1965; Sebestyen, 1962; Watanabe, 1985). In these algorithms beta weights are assigned to local features of patterns and can be modified in light of cases of misclassification. Thus, by considering dimensions of the decision-making problem to be local features and misclassification to be synonymous with a wrong decision, perceptrons are very close to the kinds of decision-making algorithms proposed by psychologists such as Dawes (1979).
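
The weight-adjustment scheme referred to here is the classic perceptron learning rule: whenever a case is misclassified, each feature weight is nudged in the direction that would have produced the correct class. The sketch below illustrates the rule on invented two-feature data; the feature values and labels are hypothetical and chosen only to be linearly separable.

    import numpy as np

    # Invented, linearly separable training cases: two local features each,
    # with class labels +1 or -1.
    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
    y = np.array([1, 1, -1, -1])

    weights = np.zeros(2)
    bias = 0.0

    # Perceptron learning rule: adjust weights only on misclassified cases.
    for _ in range(20):                      # a few passes over the data
        for features, label in zip(X, y):
            prediction = 1 if features @ weights + bias > 0 else -1
            if prediction != label:          # misclassification drives learning
                weights += label * features
                bias += label

    print("learned weights:", weights, "bias:", bias)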

Given this similarity, it can be argued that Minsky and Papert's (1969) critique of perceptrons also applies to psychologists' simple decision-making models. In their book Minsky and Papert identify some limitations of perceptron algorithms. A detailed analysis of these limitations is beyond the scope of this paper. However, in brief, they argue that perceptrons cannot recognize patterns when classification is based on global properties such as connectedness or openness. In these cases perceptrons fail because there is no finite set of local properties such that a linear combination of those properties defines membership in the class of, for example, objects that are connected. And, in these cases, pattern recognition, at least in part, depends on configural processing.

The types of expert decisions studied by psychologists are typically of this "global" type. For example, consider the class of people with stomach ulcers or the class of students who will be successful in graduate school. In both these cases the classification criterion is a global one. Hence, it could be argued that the linear models used by psychologists to approximate experts' decision-making processes in such situations are not valid and that, by default, the construction of expert systems is justified.

There are at least two arguments against this conclusion. First, Minsky and Papert's (1969) critique has not stopped work on perceptrons in AI. Although some, for example, Nilsson (1971), seem to have taken the criticism seriously, others, such as Watanabe (1985), have been content to make minor modifications so that perceptron algorithms are "good enough," and perceptrons are still popular pattern recognition algorithms.

The second argument against the conclusion that the regression decision-making algorithms proposed by psychologists are invalid is that, while Minsky and Papert's (1969) conclusion is warranted, its applicability to the area of expert decision making may be limited. Minsky and Papert's argument is a mathematical one and so tacitly assumes ideal conditions. Therefore the conclusion that linear models are not appropriate decision-making algorithms seems justified for decision problems that are virtually noise-free. However, very often the decision problems studied by psychologists are not of this type. In the case of graduate student selection and in many types of medical diagnosis, input information is noisy and/or has a stochastic relationship to a valid decision. In these cases to adopt Minsky and Papert's conclusions is to adopt configural decision-making methods. However, it is in just these cases that bootstrapping is most effective and linear regression algorithms are particularly robust with respect to configural strategies (see, e.g., Dawes, 1976, 1979; Dawes & Corrigan, 1974). Thus, in noisy decision-making contexts the empirical evidence suggests that Minsky and Papert's critique is not applicable. The use of regression equations as a model of experts' decision-making processes is therefore not invalidated in such contexts, and a resolution of the simple versus complex paradox which advocates the use of expert systems by default is not justified.

The resolution of the complex versus simple paradox is not an acclamation of the merits of psychologists over knowledge engineers. Nor does it imply that linear models are a closer representation of experts' underlying decision-making processes than expert systems. It simply means that linear models are extremely robust. Given the degree of error that exists in the real world, in both the nature of the decision problem and the measurement of predictor and criterion variables, linear models simply do a better predictive job than other algorithms.


REFERENCES

Aikins, J. S. Prototypical knowledge for expert systems. Artificial Intelligence, 1983, 20, 163-210.
Anderson, N. H. Looking for configurality in clinical judgment. Psychological Bulletin, 1972, 78, 93-102.
Bowman, E. H. Consistency and optimality in management decision making. Management Science, 1963, 9, 310-321.
Brachman, R. J., et al. What are expert systems? In F. Hayes-Roth, D. A. Waterman & D. B. Lenat (Eds.), Building expert systems. Vol. 1. Reading, MA: Addison-Wesley, 1983, 31-57.
Brunswik, E. The conceptual framework of psychology. Chicago: University of Chicago Press, 1952.
Buchanan, B. G., et al. Constructing expert systems. In F. Hayes-Roth, D. A. Waterman & D. B. Lenat (Eds.), Building expert systems. Vol. 1. Reading, MA: Addison-Wesley, 1983, 127-167.
Buchanan, B. G. & Feigenbaum, E. A. DENDRAL and meta-DENDRAL: Their applications dimension. Artificial Intelligence, 1978, 11(1), 5-24.
Clancey, W. J. The epistemology of a rule-based expert system: A framework for explanation. Artificial Intelligence, 1983, 20, 215-251.
Connolly, T. Uncertainty, action and competence: Some alternatives to omniscience in complex problem-solving. In S. Fiddle (Ed.), Uncertainty: Social and behavioural dimensions. New York: Praeger, 1980, 69-91.
Davis, R. Expert systems: Where are we? And where do we go from here? The AI Magazine, 1982, Spring, 3-22.
Davis, R. Amplifying expertise with expert systems. In P. H. Winston & K. A. Prendergast (Eds.), The AI business: The commercial uses of artificial intelligence. Cambridge, MA: The MIT Press, 1984, 17-40.
Davis, R. & Lenat, D. B. Knowledge-based systems in artificial intelligence. New York: McGraw-Hill, 1982.
Dawes, R. M. A case study of graduate admissions: Applications of three principles of human decision making. American Psychologist, 1971, 26, 180-188.
Dawes, R. M. Shallow psychology. In J. S. Carroll & J. W. Payne (Eds.), Cognition and social behavior. Hillsdale, NJ: Lawrence Erlbaum Associates, 1976, 3-11.
Dawes, R. M. Predictive models as a guide to preference. IEEE Transactions on Systems, Man, and Cybernetics, 1977, 7(5), 355-357.
Dawes, R. M. The robust beauty of improper linear models in decision making. American Psychologist, 1979, 34(7), 571-582.
Dawes, R. M. & Corrigan, B. Linear models in decision making. Psychological Bulletin, 1974, 81(2), 95-106.
Dreyfus, H. L. What computers can't do. New York: Harper & Row, 1972.
Duda, R. O., et al. Semantic network representations in rule-based inference systems. In D. A. Waterman & F. Hayes-Roth (Eds.), Pattern-directed inference systems. New York: Academic Press, 1978, 203-221.
Duda, R. O. & Gaschnig, J. G. Knowledge-based expert systems come of age. Byte, September 1981, 238-281.
Ebbesen, E. & Konecni, V. Decision-making and information integration in the courts: The setting of bail. Journal of Personality and Social Psychology, 1975, 32, 805-821.
Einhorn, H. J. Decision errors and fallible judgment: Implications for social policy. In K. R. Hammond (Ed.), Judgment and decision in public policy formation. Denver, CO: Westview Press, 1978, 142-169.
Einhorn, H. J. Overconfidence in judgment. New Directions for Methodology of Social and Behavioral Science, 1980, 4, 1-16.
Einhorn, H. J. & Hogarth, R. M. Unit weighting schemes for decision making. Organizational Behavior and Human Performance, 1975, 13, 171-193.
Einhorn, H. J. & Hogarth, R. M. Confidence in judgment: Persistence of the illusion of validity. Psychological Review, 1978, 85(5), 374-416.
Fryback, D. G. & Erdman, H. Prospects for calibrating physicians' probabilistic judgments: Design of a feedback system. IEEE, 1979, 1, 340-345.
Gilovich, T. Seeing the past in the present: The effect of associations to familiar events on judgment and decisions. Journal of Personality and Social Psychology, 1981, 40, 797-808.
Goldberg, L. R. The effectiveness of clinicians' judgments: The diagnosis of organic brain damage from the Bender-Gestalt test. Journal of Consulting Psychology, 1959, 23(1), 25-33.
Goldberg, L. R. Simple models or simple processes? Some research on clinical judgments. American Psychologist, 1968, 23(7), 483-496.
Goldberg, L. R. Man versus model of man: A rationale, plus some evidence, for a method of improving clinical inference. Psychological Bulletin, 1970, 73(6), 422-432.
Hammond, K. R. Probabilistic functioning and the clinical method. Psychological Review, 1955, 62, 255-262.
Hammond, K. R., Hursch, C. J. & Todd, F. J. Analyzing the components of clinical inference. Psychological Review, 1964, 71, 438-456.
Hapgood, F. Computers: Experts to a point. Atlantic Monthly, February 1985, W-92.
Hayes-Roth, F., Waterman, D. A. & Lenat, D. B. An overview of expert systems. In F. Hayes-Roth, D. A. Waterman & D. B. Lenat (Eds.), Building expert systems. Vol. 1. Reading, MA: Addison-Wesley, 1983, 3-29.
Hoffman, P. J. The paramorphic representation of clinical judgment. Psychological Bulletin, 1960, 57(2), 116-131.
Hoffman, P. J., Slovic, P. & Rorer, L. G. An ANOVA model for the assessment of configural cue utilization in clinical judgment. Psychological Bulletin, 1968, 69, 338-349.
Hursch, C. J., Hammond, K. R. & Hursch, J. Some methodological issues in multiple-cue probability studies. Psychological Review, 1964, 71, 42-60.
Jackson, P. Introduction to expert systems. Wokingham, England: Addison-Wesley, 1986.
Johnson, L. & Keravnou, I. E. Expert system technology. New Hampshire: Abacus Press, 1985.
Johnson, P. E., Hassebrock, F., Duran, A. S. & Moller, J. H. Multi-method study of clinical judgment. Organizational Behavior and Human Performance, 1982, 30, 201-230.
Kleinmuntz, B. The processing of clinical information by man and machine. In B. Kleinmuntz (Ed.), Formal representation of human judgment. New York: John Wiley, 1968, 149-186.
Kleinmuntz, B. The computer as clinician. American Psychologist, 1975, 30, 379-386.
Kleinmuntz, B. Computational and non-computational clinical information processing by computer. Behavioral Science, 1982, 27, 164-175.
Kunz, J. C., et al. A physiological rule-based system for interpreting pulmonary function test results. Heuristic Programming Project, Computer Science Dept., Stanford University, HPP-78-19 (working paper), 1978.
Leal, A. & Pearl, J. An interactive programme for conversational elicitation of decision structures. IEEE Transactions on Systems, Man, and Cybernetics, 1977, 7(5), 368-376.
Lichtenstein, S. C. & Fischoff, B. Do those who know more also know more about what they don't know? ORI Bulletin, 1976, 16(1).
Meehl, P. E. Clinical vs. statistical prediction. Minneapolis: University of Minnesota Press, 1954.
Meehl, P. E. Seer over sign: The first good example. Journal of Experimental Research in Personality, 1965, 1, 27-32.
Meehl, P. E. Psychodiagnosis: Selected papers. New York: W. W. Norton & Co., 1973.
Minsky, M. & Papert, S. Perceptrons: An introduction to computational geometry. Cambridge, MA: MIT Press, 1969.
Naylor, C. M. Build your own expert system. Cheshire, U.K.: Sigma Technical Press, 1983.
Nilsson, N. J. Learning machines: Foundations of trainable pattern-classifying systems. New York: McGraw-Hill, 1965.
Nilsson, N. J. Problem solving methods in artificial intelligence. New York: McGraw-Hill, 1971.
Nisbett, R. E. & Wilson, T. D. Telling more than we can know: Verbal reports on mental processes. Psychological Review, 1977, 84(3), 231-259.
Oskamp, S. Overconfidence in case-study judgments. Journal of Consulting Psychology, 1965, 29(3), 261-265.
Phelps, R. H. & Shanteau, J. Livestock judges: How much information can an expert use? Organizational Behavior and Human Performance, 1978, 21, 209-219.
Politakis, P. & Weiss, S. M. Using empirical analysis to refine expert system knowledge bases. Artificial Intelligence, 1984, 22, 23-48.
Pople, H. E. CADUCEUS: An experimental expert system for medical diagnosis. In P. H. Winston & K. A. Prendergast (Eds.), The AI business: The commercial uses of artificial intelligence. Cambridge, MA: The MIT Press, 1984, 67-80.
Rorer, L. G. & Slovic, P. The measurement of changes in judgmental strategy. American Psychologist, 1966, 21, 641-642.
Sebestyen, G. S. Decision making processes in pattern recognition. New York: The Macmillan Co., 1962.
Shortliffe, E. H. Computer-based medical consultations: MYCIN. New York: Elsevier, 1976.
Slovic, P. Analyzing the expert judge: A descriptive study of a stockbroker's decision processes. Journal of Applied Psychology, 1969, 53(4), 255-263.
Slovic, P., Fischoff, B. & Lichtenstein, S. Behavioral decision theory. Annual Review of Psychology, 1977, 28, 1-39.
van Melle, W. J. System aids in constructing consultation programs. Ann Arbor, MI: UMI Research Press, 1981.
Watanabe, S. Pattern recognition: Human and mechanical. New York: John Wiley & Sons, 1985.
Winston, P. H. & Prendergast, K. A. (Eds.) The AI business: The commercial uses of artificial intelligence. Cambridge, MA: MIT Press, 1984.
Yntema, D. B. & Torgerson, W. S. Man-computer cooperation in decisions requiring common sense. IRE Transactions of the Professional Group on Human Factors in Electronics, 1961, 2(1), 20-26.
Yu, V. L., et al. Antimicrobial selection by a computer: A blinded evaluation by infectious disease experts. Journal of the American Medical Association, 1979, 242(12), 1279-1282.

(Manuscript received July 1, 1985.)
