
Strategies for Building Propositional Expert Systems Robert M. Colomb* and Charles Y. C. Chung CSIRO Division of Information Technology, Box 1599, North Ryde NSW 2113, Australia

The core of this article is a proof that stratified Horn clause propositional systems are equivalent to and can be efficiently transformed into decision tables by a process closely related to assumption-based truth maintenance. The transformed systems execute much faster and in a bounded time, leading to the possibility of executing real-time expert systems in microseconds on fine-grained parallel computers. One consequence is to simplify the consistency and completeness analysis for such systems, in particular the problem of ambiguity. A deeper consequence is that it makes sense to view these systems as stochastic processes. This, and an analysis of the problem of maintenance of these systems, leads to the conclusion that by and large rule induction approaches are better than rule construction approaches for building them. © 1995 John Wiley & Sons, Inc.

I. INTRODUCTION

Expert systems which rely on the propositional calculus are very common, and the standard implementations are very computationally expensive, both in time and memory. For example, the COLOSSUS¹ system has 6500 rules and runs on a very large mainframe computer. It requires 20 megabytes of real memory per user, and has a response time measured in minutes.

The core of this article is a proof that a propositional expert system can be mechanically transformed into a decision table, using a tractable algorithm which is a specialization of a number of results from deductive database theory, and which is closely related to assumption-based truth maintenance. Decision tables are very simple computational structures which can be executed very quickly and require little memory. With very simple hardware assist, it is possible to build systems with hundreds of rules with execution times of a few tens of microseconds on fine-grained parallel computers, which could greatly expand the useful domain of expert system technology, especially in real-time applications.

*Author to whom correspondence should be addressed. Present address: Department of Computer Science, The University of Queensland, Queensland, Australia 4072. Fax: +61 7 365 1999, Email: [email protected]

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, VOL. 10, 295-328 (1995) © 1995 John Wiley & Sons, Inc. CCC 0884-8173/95/030295-34


The decision table representation greatly simplifies the problem of checking rules for completeness and consistency. In particular, it makes convenient a systematic analysis of an expert system for situations where more than one conclusion is reached, and suggests a simple and systematic method for reducing such ambiguity.

More deeply, the equivalence between propositional expert systems and decision tables suggests that the real-world system modeled by an expert system may be profitably viewed as a stochastic process. The expert system model can be constructed in a number of ways, broadly by rule construction or rule induction. The stochastic process view helps to understand the relationships between them, and shows that inductive techniques are preferred in most cases.

It also becomes clear that either of the methods of constructing systems is equivalent to case-based reasoning, which leads to a proposed method for building an expert system which is a generalization of the expert-assisted inductive method called ripple-down rules.

The methods are grounded throughout by an extended analysis of the Garvan ES1 thyroid assay system,²,³ which was in routine use between 1984 and 1990, and is presently being re-implemented using methods related to those in this article. The system was used in a pathology laboratory to make clinical interpretations of thyroid hormone function, about 10 000 cases per year. It has about 700 rules.

The article begins by proving that propositional expert systems can be transformed into decision tables, giving first a simple general proof, then a practicable algorithm derived from deductive database theory. This algorithm is first presented for a special case, then generalized to a wide variety of situations encountered in practice. Section III explores the relationship between the transformation algorithm and assumption-based truth maintenance, with applications to expert systems which must operate in real time. The ATMS relationship allows cyclic systems, and also incremental knowledge base maintenance. A practical problem in expert systems is the presence of ambiguity, which occurs if an input can give rise to more than one output. Section IV shows that the decision table representation of an expert system makes ambiguity much easier to discover and presents algorithms which assist in reducing ambiguity. Section V notes that the decision table representation allows us to integrate a number of approaches to building expert systems, in particular approaches based on rule induction with approaches based on rule construction, and shows that rule induction is a more powerful technique. Section VI sketches an expert system shell based on the ideas in the article, and Sec. VII presents conclusions.

II. PROPOSITIONAL HORN CLAUSES TO DECISION TABLES

A. Definitions

A propositional expert system is a set of propositions in Horn clause form. Each clause consists of an antecedent, which is a conjunction of literals, and a consequent, consisting of a single elementary proposition.


A literal is an elementary proposition or its negation. A literal is positive if it is an elementary proposition, and negative if the negation of an elementary proposition. A clause will be called a rule in the following. Note that a proposition with disjunctions in its antecedent or conjunctions in its consequent can be easily transformed into Horn clause form.

Elementary propositions can be classified into three mutually exclusive groups: facts, which appear only in antecedents; conclusions, which appear only as consequents; and assertions, which appear both in antecedents and as consequents. An expert system will be identified with its set of rules.

It is convenient to consider a fact as derived from an assignment to a variable of one of a small number (greater than one) of possible values. We thus obtain a set X of variables xi, each of which has a set of possible values Vi, and every fact is a proposition of the form xi = v for v in Vi. In this formulation, the set of facts derived from a variable has exactly one member with the value true, and all of the facts in the set are said to be determined. If the value of a variable is unknown, then all of the facts derived from that variable are said to be undetermined. An input I is a conjunction of determined facts. The input is incomplete if there are some facts which are undetermined.

In applying an expert system R to an input I, we obtain a proposition P consisting of the conjunction of I with the conjunction of the rules in R (expressed as clauses), and consider the conclusions C. A conclusion is determined by the input if it is a logical consequence of (entailed by) P.

A decision table is a table with one column for each variable and an additional column for an action. Each row of the table contains a set of values of each variable and an associated action. A cell containing all values of the associated variable is equivalent to a don’t care condition. A decision table is executed by presenting it with a row of values, one for each variable. If the input value is a subset of the values in a row for all variables, then the action associated with that row is executed. This is called the row firing. More than one row can fire for a given input.
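This execution rule can be made concrete with a small Python sketch; the dictionary encoding, the variable names, and the toy table below are illustrative inventions, not taken from the article.

# Minimal sketch of decision-table execution as defined above. A row maps each
# variable to the set of values it accepts (a set holding every possible value is a
# don't care); an input assigns one value to each variable.

def fire(table, inputs):
    """Return the actions of every row whose cells all admit the input values."""
    fired = []
    for cells, action in table:
        if all(inputs[var] in allowed for var, allowed in cells.items()):
            fired.append(action)          # more than one row may fire
    return fired

table = [
    ({"flies": {"yes"}, "legs": {"2", "4"}}, "bird"),
    ({"flies": {"yes", "no"}, "legs": {"4"}}, "animal"),   # flies is don't care
]
print(fire(table, {"flies": "yes", "legs": "4"}))   # ['bird', 'animal'] - both rows fire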

B. Equivalence Between Propositional Expert Systems and Decision Tables

THEOREM. A propositional expert system is equivalent to a decision table.

Proof. A decision table is a propositional expert system. A cell is a disjunction of propositions, each of which is a fact. A row is a conjunction of propositions, one for each cell. If the row is transformed to disjunctive normal form, it becomes a disjunction of conjunctions of facts, each of which is the antecedent of a rule. The action of a row is a consequent. A row is therefore equivalent to a set of rules all with the same consequent. The decision table is a collection of rows, therefore a collection of rules, and a collection of rules is a propositional expert system by definition. Note that the propositional system resulting from a decision table has only facts and conclusions, but no assertions. Such a system will be called a flat expert system, and a flat expert system is clearly equivalent to a decision table.


A propositional expert system is a decision table. Consider an expert system R with an input I. Associate with I the subset of conclusions entailed by the propositional system P given by the conjunction of I with the conjunction of the rules in R. Call this subset R(I). R(I) can always be computed since the propositional calculus is decidable. R can thereby be seen as a function mapping the set of inputs into the set of subsets of conclusions. The number of possible inputs is finite. The function can in principle be expressed in an extensional form by writing down each possible input I as a row in a table and appending to it as its action the subset of conclusions R(I). This gives a truth table representation of the propositional system, which is by definition a decision table.

This proof is constructive, but unfortunately does not lead to a practicable algorithm, since the number of inputs is exponential in the number of variables. It is, however, general. In particular, it is independent of the inference engine used and of the details of any rules involving assertions.

C. A Practicable Transformation Algorithm

1. Basic Algorithm

This section presents a practicable transformation algorithm, first by making some restrictive assumptions. It is then shown how the algorithm can be modified to remove many of the restrictions. The restrictive assumptions are:

• the system is acyclic,
• the system uses a forward chaining inference engine,
• no assertion is negated in either an antecedent or consequent of a rule.

We consider the rules ri in the expert system R as nodes in a graph, designated the clause inference graph. An arc is drawn between ri and rj if there is an assertion a which appears as the consequent of ri and in the antecedent of rj. We assume that this graph is acyclic: no proposition is a consequent of a rule of which it is an antecedent, or more generally, no proposition is a consequent of a rule which has an antecedent necessarily logically dependent on it. The clause inference graph is essentially the dependency graph of Ref. 4. If the clause inference graph has arcs labeled with the assertions, we can construct its dual, having the nodes as assertions and an arc from assertion ai to assertion aj if there is a rule having ai as one of its antecedents and aj as its consequent. This dual is the dependency graph for stratified programs.

We partition the set of rules into B, those all of whose antecedents are facts; K, those whose consequents are conclusions; and M, the others. (B and K are assumed disjoint. Any rules in B ∩ K are flat by definition, so can be removed and added to the flat expert system produced by the algorithm after its completion.) A node r can be labeled by the maximum length of a path between it and a member of B, as shown in Figure 1.


Figure 1. Graphical representation of rules labeled by distance from base.

B is the set of nodes with label 0, and M can be partitioned into M1, M2, etc., each Mi consisting of the nodes in M with label i. The largest label of any node in K is defined as the maximum depth of reasoning of the system R. As shown in the figure, we also label each arc with the assertion from which the arc is derived and each assertion with the maximum label of a node with that assertion as a consequent. This labeling can be done in the course of verifying that the graph has no cycles, using an algorithm which successively removes nodes with no input arcs.
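This labeling can be sketched in Python as follows; the (antecedent set, consequent) encoding and all names are invented here, and the sample rules are those of the worked example in the next subsection.

# Sketch of the forward labeling: a rule's label is the maximum path length from the
# base rules B (label 0). An antecedent is treated as a fact exactly when it never
# appears as a consequent.

def label_rules(rules):
    producers = {}                                  # assertion -> rules concluding it
    for i, (_, cons) in enumerate(rules):
        producers.setdefault(cons, []).append(i)
    labels, assertion_label = {}, {}                # rule -> label, assertion -> max label
    remaining = set(range(len(rules)))
    while remaining:
        # a rule is ready once every rule asserting one of its antecedents is labeled
        ready = [i for i in remaining
                 if all(a not in producers or all(j in labels for j in producers[a])
                        for a in rules[i][0])]
        if not ready:
            raise ValueError("cyclic rule set")     # cf. the acyclicity check above
        for i in ready:
            ante, cons = rules[i]
            labels[i] = 1 + max((assertion_label[a] for a in ante if a in producers),
                                default=-1)
            assertion_label[cons] = max(assertion_label.get(cons, 0), labels[i])
            remaining.discard(i)
    return labels

# rules r1..r5 of the example below, indexed 0..4
rules = [({"f1", "f2"}, "a1"), ({"f3", "f4"}, "a2"),
         ({"f1", "a2"}, "c1"), ({"a1", "a2"}, "c2"), ({"f5", "a1"}, "a2")]
print(label_rules(rules))      # r1, r2 -> 0; r5 -> 1; r3, r4 -> 2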

LEMMA 1. Every assertion in an antecedent of a rule of label i must have a label less than i. Proof follows directly from the definitions.

LEMMA 2. For every label up to the maximum depth of reasoning, there is at least one rule with that label.

Proof. If a level i were missing, then in the acyclic verification process when step i were reached there would be no nodes without input arcs and the graph would therefore be cyclic.

This labeling produces a stratification of the program in the sense of Apt et al.⁵ The set of rules with the same label will be designated the stratum with that label as its index.


The algorithm proceeds by replacing each assertion a in the antecedents of rules in stratum i with the disjunction of the antecedents of rules in strata less than i of which assertion a is a consequent, beginning with stratum 1. This process in effect collapses the rules onto the conclusions by successively replacing intermediate assertions with expressions which imply them. The resulting propositions have only facts in their antecedents and only conclusions as consequents, can be expressed in disjunctive normal form, thus form a flat expert system, and therefore are equivalent to a decision table.

Our algorithm can be seen as a specialization of results in the theory of deductive databases. Each step is an application of a specialization to the propositional case of a deductive database technique called unfolding.⁶,⁷ What we refer to as collapsing one stratum into another is the unfolding of each of the rules in the second stratum with respect to the rules in the first. The algorithm as a whole can be seen as a modification of a specialization to the propositional case of an algorithm given by Reiter.⁸ Reiter's algorithm uses a resolution-based theorem prover to reduce a deductive database to a conjunction of clauses all of which are extensional database queries. In deductive databases, the extensional database is the set of facts, or ground unit clauses. The unfolding employed in our algorithm is in the propositional case the same as the disjunction of the resolution of all the alternatives, and the extensional database in our context is the set of facts. Reiter's algorithm applies to general clausal systems, so that our result can be extended to non-Horn propositional systems.

Finally, it should be noted that the decision table resulting from the transformation is equivalent to the Horn clause system in the sense that all conclusions derivable from one are derivable from the other. The Horn clause system has in addition the set of assertions which are not present in the decision table representation. This issue is discussed in Sec. II-D.

Example. R is the set of rules

r1. f1 & f2 → a1
r2. f3 & f4 → a2
r3. f1 & a2 → c1
r4. a1 & a2 → c2
r5. f5 & a1 → a2

Facts are {f1, f2, f3, f4, f5}; Assertions are {a1, a2}; Conclusions are {c1, c2}.
B is {r1, r2} (stratum 0); M is {r5} (stratum 1); K is {r3, r4} (stratum 2).
Maximum depth of reasoning is 2.

Step 1: Collapse stratum 0 into stratum 1. Rule r5 becomes

f1 & f2 & f5 → a2

Step 2: Collapse strata 0 and 1 into stratum 2. Rules r3 and r4 become

f1 & f3 & f4 → c1; f1 & f2 & f5 → c1
f1 & f2 & f3 & f4 → c2; f1 & f2 & f5 → c2
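The collapse itself can be sketched in a few lines of Python, assuming the rules are supplied in stratum order; the encoding is the same illustrative one used earlier, and the output reproduces the flat rules just obtained.

# Each assertion is replaced by the set of fact-only antecedent sets that imply it
# (its disjuncts in disjunctive normal form), working up the strata.

from itertools import product

def collapse(rules, facts):
    derivations = {f: [frozenset([f])] for f in facts}     # a fact implies itself
    for ante, cons in rules:                               # rules in stratum order
        for choice in product(*(derivations[a] for a in ante)):
            derivations.setdefault(cons, []).append(frozenset().union(*choice))
    return derivations

facts = {"f1", "f2", "f3", "f4", "f5"}
rules = [({"f1", "f2"}, "a1"), ({"f3", "f4"}, "a2"),       # stratum 0
         ({"f5", "a1"}, "a2"),                             # stratum 1
         ({"f1", "a2"}, "c1"), ({"a1", "a2"}, "c2")]       # stratum 2

flat = collapse(rules, facts)
print(flat["c1"])   # two disjuncts: {f1, f3, f4} and {f1, f2, f5}
print(flat["c2"])   # two disjuncts: {f1, f2, f3, f4} and {f1, f2, f5}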


2. Cyclic Systems

The algorithm above constructs derivations for assertions whose antecedents consist solely of facts. A cycle in the Horn clause system is a group of rules each having an assertion as a consequent. Some of these assertions may have alternative derivations consisting solely of facts.

Where none of the assertions has an alternative derivation consisting solely of facts, a cycle contributes nothing to resolution-based theorem proving. A cycle is a path through the clause inference graph, and can be represented as a conjunction of clauses:

(p1 + -a1 + -pn) & (p2 + -a2 + -p1) & . . . & (pn + -an + -pn-1)   (1)

By resolving successive pairs of clauses on the propositions p1, . . . , pn-1, (1) becomes

(-a1 + -pn + -a2 + . . . + pn + -an)   (2)

which is a tautology, since

pn + -pn = true   (3)

In first-order systems, cycles occur from recursion. In propositional systems, cycles stem from indeterminacy. For example, consider the cyclic system

a & p → q, b & q → p   (4)

which can be simplified to the sum of its prime implicants

-a + -b + p&q + -p&-q

so that if both a and b are true, we can conclude that p = q.

Where at least one of the assertions in a cycle has an alternative derivation from facts, then these equivalences may lead to derivations of other assertions in the cycle from facts. In (1), if all of the ai are derived from facts, then if say p1 has a derivation from facts, this propagates to p2, . . . , pn. It also propagates back to p1, but the derivation from around the cycle is subsumed by the original derivation for p1 coming from outside the cycle.
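The conclusion drawn from (4) is easy to confirm by enumerating truth assignments; the following Python fragment is only a check of that statement, not part of the algorithm.

# With a and b both true, the only models of {a & p -> q, b & q -> p} have p = q.
from itertools import product

def models(a, b):
    for p, q in product([False, True], repeat=2):
        rule1 = (not (a and p)) or q        # a & p -> q
        rule2 = (not (b and q)) or p        # b & q -> p
        if rule1 and rule2:
            yield p, q

print(list(models(True, True)))   # [(False, False), (True, True)], i.e. p = q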


It should be apparent that the algorithm can be modified to take cycles into account. However, the introduction of cycles clutters the notion of levels and the labeling scheme. In Sec. 111, it is noted that the above algorithm is closely related to assumption-based truth maintenance. The ATMS algorithm is incremental, so that it is much easier to see the effect of a cyclic system. The adaptation to cyclic systems will therefore be deferred until Sec. 111.

3. Analysis of Algorithm

The algorithm for conversion of an acyclic propositional Horn clause system into a decision table can be divided into two parts: construction of a labeled clause inference graph, and production of the decision table from the labeled graph. We begin with a table of rules. An auxiliary data structure is required: an alternatives count, which will contain for each assertion the number of rules having that assertion as consequent. The dimensions of the problem will be:

r  the number of rules
a  the average number of assertions per rule
d  the maximum depth of reasoning

To construct the clause inference graph, we first count the number of alternatives for each assertion. This requires one step per rule, therefore is O(r). Facts will be taken as having zero alternatives. We then proceed to identify the rules with label 1, which are those rules all of whose antecedents have zero alternatives. When a rule is labeled, we decrement the number of alternatives for its consequent assertion, and also record a pointer to the rule in a data structure associated with the assertion. This step requires examination of each antecedent in each rule, and is therefore O(ar). At the end of the step, additional assertions have a zero alternative count (follows from Lemma 2). The graph can be completely constructed and labeled in one step for each possible label, bounded by the maximum depth of reasoning. Construction of the labeled dependency graph is therefore O(ard).

Production of the decision table is done by repeated traversal of sections of the clause inference graph. The cost of a single traversal is highly dependent on the details of data representation, but requires at most examination of each antecedent in each rule, therefore is at most O(ar). One traversal is required for each row in the resulting decision table. It is therefore necessary to estimate the number of rows.

There will certainly be one row in the decision table for each rule whose consequent is a conclusion. Additional rows will arise from alternative paths for satisfying the antecedents of one of these terminal rules. An alternative arises if one of its antecedents is the consequent of more than one rule. It follows that the number of rows is equal to the number of terminal rules if the alternative count is 1 for each assertion, and that the number of rows increases in a multiplicative way as the alternative counts become greater than 1.


The problem is the same as converting an arbitrary boolean expression into disjunctive normal form. For example

(a + b & (c + d)) & (e + f)

converts to

a & e + b & c & e + b & d & e + a & f + b & c & f + b & d & f

We can compute the number of rows during the construction and labeling of the clause inference graph. We need an additional data structure, the full alternative count (cumulating the alternative count), having one entry for each assertion and also for each conclusion. For a particular assertion, the alternative count is the number of rules having that assertion as consequent, while the full alternative count will be the number of disjuncts in the disjunctive normal form expression which implies that assertion starting from facts only. A fact has a full alternative count of 1.

When a rule is labeled, the full alternative count associated with its consequent is increased by the product of the full alternative counts of its antecedents. The total number of rows in the decision table is the total of the full alternative counts of all the conclusions.
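A Python sketch of this computation, applied to the earlier example (the rule encoding is the illustrative one used earlier, and rules must be supplied in labeling order):

# The full alternative count of a consequent grows by the product of the full
# alternative counts of the antecedents each time a rule for it is labeled;
# facts have a full alternative count of 1.
from math import prod

def count_rows(rules, facts, conclusions):
    full = {f: 1 for f in facts}
    for ante, cons in rules:                       # processed in labeling order
        full[cons] = full.get(cons, 0) + prod(full[a] for a in ante)
    return sum(full[c] for c in conclusions)

facts = {"f1", "f2", "f3", "f4", "f5"}
rules = [({"f1", "f2"}, "a1"), ({"f3", "f4"}, "a2"),
         ({"f5", "a1"}, "a2"),
         ({"f1", "a2"}, "c1"), ({"a1", "a2"}, "c2")]
print(count_rows(rules, facts, {"c1", "c2"}))      # 4, matching the earlier example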

If N is the number of rows, then the production of the decision table is at most O(Nar). This step will tend to dominate the computation time.

The practicability of the method is therefore dependent on the number of rows in the decision table produced. We describe below that the test case used to validate the method produced a reasonable number of rows, and also discuss how an explosion in the number of rows stems from poor software engineering practice. The algorithm described in this section has the property that the number of rows can be computed at low cost, so that unfavorable situations can be easily detected.

4. Backward Chaining

The basic algorithm given above is based on a forward chaining inference engine, which starts with known facts and derives conclusions as the antecedents of rules become known. If the rules labeled i are identified as stratum i, this corresponds to a traversal of the dependency graph starting from layer 1. This traversal will tend to be stratum by stratum, necessarily so if negation by failure is employed. We will call the labels and strata obtained in the forward chaining approach forward labels and strata, respectively. Note that forward chaining is exactly the semi-naive algorithm used to compute the perfect model in the deductive database literature,⁹ and that the perfect model for a propositional deductive database is exactly the assertions and conclusions which are entailed by the input and the rules.

An alternative approach to inference is to start from the rules with conclusions as consequents and work backwards, called backward chaining.


If the inference engine is backward chaining, we note that a graph can be verified acyclic by successively removing nodes with no output arc. The clause inference graph can therefore be labeled with minimum distance from K rather than with maximum distance from B, and the algorithm modified accordingly. The maximum depth of reasoning d is clearly unchanged. These labels will be called backward labels.

Note that if a node has backward label i it must have forward label no more than d - i. This follows from the observation that step d of the backward algorithm removes nodes with no input arc. The backward collapse therefore has the same result as the forward collapse, since it can be performed by the forward collapse algorithm on the graph with the backward labeling.

Forward and backward chaining have exactly the same result, obtained in the forward chaining case by collapsing the clause inference graph from the facts onto the conclusions through the assertions. In the backward chaining case, the assertions are subgoals, and the algorithm collapses the graph from the conclusions onto the facts through the subgoals. The two strategies can be viewed as alternative ways of constructing the function mapping the set of assignments into the set of subsets of conclusions.

5. Negated Assertions

In constructing the clause inference graph, we follow the stratification principles of Apt et al.⁵ by placing a rule with a negated assertion in its antecedent after all strata containing rules with that assertion as consequent. Since no rule has a negated assertion as a consequent, the inferencing must rely on stratified negation-as-failure.

Let r be such a rule and let i be the maximum forward label of any negated assertion in its antecedent. Rule r is labeled with the maximum of i and the label of any unnegated assertion in its antecedent. When rule r is reached in the forward collapse algorithm, any negated assertion is replaced by the negation of the expression implying that assertion.

Although Horn clause systems cannot entail negative literals, in the propositional case if we wish to prove a negative literal we can give it a positive name, and add the integrity constraint that exactly one of the two propositions must be true. In a system where negated assertions are allowed as consequents, negation-as-failure is not needed. An assertion and its negation are in most respects separate propositions and can be treated as such, with two exceptions. First, the negation of an assertion cannot label any arc leading from the base set to a rule for which that assertion is a consequent. (Otherwise the system is not stratified and hence inconsistent.) Second, any input which would imply both the assertion and its negation is forbidden. The algorithm in its course identifies an expression in facts which implies each proposition. If we have for assertion a

E1 → a; E2 → -a

then a valid input must be consistent with -(E1 & E2).


6. Test Case

The algorithm has been successfully applied to the Garvan ES1 thyroid assay system, which has 661 rules and an average depth of reasoning of about 4. Some of the rules have negated assertions in their premises, but no rule asserts a negation. The system normally ran on a PDP-11 with a specialized inference engine. For purposes of comparison, it was translated into OPS-5 and run on a MicroVAX II, where it operates at about 18 rule firings per second, taking about 220 milliseconds to generate a conclusion. It was transformed into a decision table with 5286 rows. There are 34 variables with a total of 93 possible values, so the decision table requires 5286 × 93 bits, or about 62k bytes of memory.

There are a number of ways to process a decision table. There are standard algorithms to convert decision tables to procedural programs.¹⁰ Under certain conditions, the table can be converted into an efficient decision tree using methods like ID3.¹¹ (This approach requires that the table be unambiguous, discussed below.) A balanced decision tree with N leaves identifies a particular leaf in log2(N) decisions, so that a table with 4096 rows would be computed in 12 decisions, each of which is a simple if . . . then statement. This approach would clearly be extremely fast on standard hardware.

Execution results presented here are from a method using an inexpensive fine-grained parallel computer¹² acting as a co-processor on a Sun 3/160. It is capable of processing a decision table at a rate of about 100 million bits per second, and can compute a decision from the transformed Garvan ES1 in about 2 milliseconds. The processor used has a programming model similar to the MasPar, the Distributed Array Processor, and the Connection Machine, all of which are commercially available fine-grained parallel machines. The MasPar, for example, would be able to execute the system in about 20 microseconds.
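The kind of bit-level processing involved can be suggested by an ordinary Python sketch that uses integers as bit vectors over the (variable, value) columns; the column layout and example values are invented and merely stand in for the hardware-assisted implementation.

# Each row is one mask over all columns; a row fires when the input bits are covered
# by the row's bits, i.e. a single AND-and-compare per row, bounded in advance.

columns = {("flies", "yes"): 0, ("flies", "no"): 1,
           ("legs", "2"): 2, ("legs", "4"): 3}

def mask(pairs):
    m = 0
    for var, values in pairs.items():
        for v in values:
            m |= 1 << columns[(var, v)]
    return m

rows = [(mask({"flies": {"yes"}, "legs": {"2"}}), "bird"),
        (mask({"flies": {"yes", "no"}, "legs": {"4"}}), "animal")]

def fire(assignment):
    inp = mask({var: {val} for var, val in assignment.items()})
    return [action for row_mask, action in rows if row_mask & inp == inp]

print(fire({"flies": "yes", "legs": "4"}))   # ['animal']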

We can conclude from this that it is possible to transform a general propositional expert system into a form that is capable of execution in a time sufficiently short that it opens many possibilities for the use of expert systems in real-time applications.

D. Generalizations

1. Explanation Capability

An important feature of expert systems is the ability to explain a conclusion or the reasons for asking a particular question. Most approaches to explanation follow the chain of assertions between the base facts and the conclusion, so are derived from the trace of the traversal of the clause inference graph.

In practice, many expert systems do not use explanations in their normal execution. (Garvan ES1, for example, is a batch program.) Jansen and Compton¹³ make a strong case for the separation of the normal execution environment where explanations are not available from a maintenance environment where a very complete explanation environment is provided.


In any case, it is possible to adapt the main results to give an efficient computation structure which permits a complete explanation capability. Rather than a decision table, this approach relies on executing the rules in sequence in a single pass.

Recall that the algorithm labels each rule with the maximum number of inference steps between it and the base facts. From Lemma 1, all antecedents of rules in stratum i are determined by rules in strata less than i. In addition, rules in the same stratum can be evaluated in any order. Clearly, if the rules are sorted by stratum, it is possible to execute them in a single pass. It is only necessary to keep a table of the value of each of the assertions (initialized to false if negation-as-failure is used) which is updated by any rule firing whose consequent makes that assertion. Since no assertion found in the antecedent of a rule can be changed by any subsequent rule, a complete explanation capability is available. For example, if the question is "why did a particular rule not fire?", the table of assertions will contain the values of all the antecedents of that rule at the time it was considered. A further explanation can be obtained in a similar way by examining the rules in earlier layers which have a particular assertion as consequent.
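A Python sketch of this single-pass evaluation, in the illustrative rule encoding used earlier; the table of assertion values doubles as the explanation trace.

# Rules sorted by stratum are examined once each; assertions start false
# (negation-as-failure) and are set true by any rule that fires.

def single_pass(rules_in_stratum_order, input_facts):
    state = dict(input_facts)
    trace = []                                    # (consequent, antecedent values seen)
    for ante, cons in rules_in_stratum_order:
        seen = {a: state.get(a, False) for a in ante}
        trace.append((cons, seen))
        if all(seen.values()):
            state[cons] = True
    return state, trace

rules = [({"f1", "f2"}, "a1"), ({"f3", "f4"}, "a2"),
         ({"f5", "a1"}, "a2"),
         ({"f1", "a2"}, "c1"), ({"a1", "a2"}, "c2")]
state, trace = single_pass(rules, {"f1": True, "f2": True, "f5": True})
print(state.get("c1"), state.get("c2"))   # True True
print(trace[1])   # why r2 did not fire: f3 and f4 were both false when it was considered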

Note that this approach is essentially the propositional specialization of the deductive database construction of the perfect model using the semi-naive algorithm.⁹

One way to represent this system is as a decision table with one row per rule and one column for each possible value of each fact, augmented by one column for each possible value for each assertion. The Garvan ES1 system can be represented as a decision table with 661 rows. There are 52 assertions, so 104 columns are needed besides the 93 required for the 34 fact variables, so that 661 × (104 + 93) bits, or about 16k bytes, are needed for its storage, considerably less than the 62k bytes needed for the fully expanded decision table.

The Knowledge Dictionary¹³ has been re-implemented with an inference engine employing the method of this section.¹⁴ Note that the Garvan ES1 inference engine³ takes essentially this approach to get fast execution on a PDP-11 with limited memory. Their approach can now be seen to be quite general.

A further approach to explanations is given in Sec. VI, which sketches an expert system shell based on the results from this research.

2. Expensive Facts

The previous results make the implicit assumption that all facts are available at the beginning of inference, and all facts have equal cost. In practice, some facts may have a higher cost, perhaps because they require database access or questioning the user. In this case, it is usual to first make use of the inexpensive facts available at the beginning, obtaining the expensive facts only if necessary. There will usually be rules whose consequent is not an assertion, but a command to assign values to a group of variables.

The set of facts can be labeled in the same way as the assertions, with facts available immediately labeled zero.


In the decision table representation, the column headings can be sorted in increasing order of label. If the table is processed left to right, by the time a column labeled one or more is reached, the conditions under which that column is needed can already be evaluated. In the single-pass representation, the rule whose consequent is to obtain these facts will be in its correct place in the sequence.

Note that in this case the choice of forward or backward chaining affects the sequence in which expensive facts are obtained, since the clause inference graph is traversed in a different order. This order is preserved in the sequence of column headings in the resulting decision table or in the sequence of rules in the single pass version.

3. Inexact Reasoning

Some expert systems use one or another form of inexact reasoning. The result can be adapted to this situation, although insufficient research has been conducted to determine the practicability of the method.

First, an uncertainty measure can be appended to each proposition. An assignment of values to variables would also assign an uncertainty measure. The subset of conclusions would also have uncertainty measures. The main theorem still holds.

Second, in the forward chaining algorithm, the uncertainty measure can be propagated as a tree of function composition. For example, if u(x) is the uncertainty of proposition x, we might have

a & b → c    u(c) = f(u(a), u(b))
c & d → e    u(e) = f(u(c), u(d))

then we would have

u(e) = f(f(u(a), u(b)), u(d))

If the uncertainty propagation function is associative, it is not necessary to record the tree of inferences by which the assertions are eliminated, and the uncertainty of a conclusion can be computed directly from the uncertainties of the base facts in its antecedent.
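A small Python illustration of this point, using min purely as a stand-in for an associative combination function f:

f = min                                     # stand-in for an associative combination
u = {"a": 0.9, "b": 0.7, "d": 0.8}

# following the inference tree: u(e) = f(f(u(a), u(b)), u(d))
via_tree = f(f(u["a"], u["b"]), u["d"])
# combining directly over the base facts of the collapsed antecedent {a, b, d}
direct = f(u["a"], f(u["b"], u["d"]))

print(via_tree, direct)                     # identical: 0.7 0.7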

In particular, the commonly employed Bayesian measure of uncertainty is a priori independent of the intermediate assertions, since the joint probability of conclusions and base facts is known in principle independently of the reasoning system.

4. Inheritance

A simple form of inheritance can be subsumed into the formalism. Consider a type hierarchy consisting of a set of types T = {ti} and a partial ordering > such that if ti > tj then tj is a subtype of ti.

Let B be the subset of T such that if t is in B then there is no ti in T where t > ti.


B is the set of types with no subtypes, and may be considered a set of base types.

Associated with each ti is a proposition pi which may be considered a set of properties. For each t in B we define a proposition which is the conjunction of properties inherited from its supertypes

q(t) = p(t) & ∧{p(u) | u in T, u > t}.

It is assumed that for each t, -(q(t) → false), i.e., the inherited properties are consistent.

Let o be an object. Associated with o is a proposition p(o) which may be considered properties of the object. Also associated with o is a predicate isa: B → {true, false}. Let I(o) be the inverse image of true. I(o) is the possibly empty set of base types of which o is an instance.

There needs to be some method of resolving conflicts between p(o) and q(ti) [and q(tj) if the object is of more than one base type]. We will assume a selection procedure such that if type ti is selected before type tj then the properties derived from ti override the properties derived from tj. The properties p(o) override those derived from types.

For a proposition p, let X(p) be the set of variables which are assigned definite values in a fact in p. We now build a proposition q*(o) by the following procedure:

n := card(I(o))
q0*(o) = p(o)
for i = 1 to n
    qi*(o) = qi-1*(o) & (conjunction of the facts xj in q(ti) with xj not in X(qi-1*(o)))
q*(o) = qn*(o)

Therefore each object o has a set of properties which are represented by the proposition q*(o). This set of properties can be used as the input I for an expert system.
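A Python sketch of the construction of q*(o), modeling propositions as dictionaries from variables to values so that X(p) is simply the set of assigned variables; the types and property values are invented for illustration.

def q_star(p_o, selected_types, q):
    """Combine object properties with inherited ones; earlier sources override later."""
    result = dict(p_o)                        # p(o) overrides everything
    for t in selected_types:                  # a type selected earlier overrides a later one
        for var, value in q[t].items():
            if var not in result:             # only facts whose variable is still unassigned
                result[var] = value
    return result

q = {"t1": {"colour": "grey", "diet": "insects"},
     "t2": {"diet": "fruit", "wings": "yes"}}
print(q_star({"colour": "brown"}, ["t1", "t2"], q))
# {'colour': 'brown', 'diet': 'insects', 'wings': 'yes'}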

III. COMPILED TRUTH MAINTENANCE AND REAL TIME

A. Relationship with ATMS

Assumption-based truth maintenance (ATMS)¹⁵ has been used by a number of researchers as an architecture for real-time expert systems.¹⁶,¹⁷ There is a deep relationship between ATMS and the results of this article.

An ATMS sits outside of a problem-solving system, keeping track of which conclusions can be believed in the current state of knowledge. The ATMS manages a set of nodes, which in the language of this article are the elementary propositions. Associated with each node is a justification, which is a set of environments in which that node can be believed. An environment consists of a set of mutually consistent propositions drawn from the set of facts. A fact may be an assumption, which is a node true in the environment consisting of itself, or a premise, which needs no justification.


The set must as well be consistent with all of a set of nogoods, each of which is the negation of a conjunction of propositions. A node is believed if its justification is not empty. The ATMS is designed to dynamically accept assumptions, premises, nogoods, and Horn clause rules which define justifications for assertions and conclusions.

An expert system is a special case of the class of problem addressed by ATMS. In representing an expert system in an ATMS, the set of assumptions (the possible facts) is fixed, as is the set of justifications (the rules). The ATMS is an incremental system, and can accept justifications and assumptions in any convenient order. In particular, they can all be presented before any input from the external world. There may as well be a set of nogoods, which describe mutually exclusive sets of assumptions. (For example, there may be three propositions V < 0; V = 0; V > 0, no two of which can be true at the same time.) These nogoods are fixed in the expert system application, and can also be presented to the ATMS before any input from the external world. When all of the assumptions, justifications, and nogoods have been presented to the ATMS, the only thing remaining is to present the real-world inputs. Note that this first phase, which we will call the setup phase, occurs before the expert system is deployed. The inputs will be presented during the execution phase, which can occur an indefinite number of times after the system is deployed.

In the terminology used in this article, an input from the external world is an assignment of values to variables. This assignment produces premises in the ATMS language, which can be presented to the ATMS in any order. A premise replaces its corresponding assumption, and can cause removal of other assumptions which are members of the same nogoods as the premise. Removal of an assumption can be performed by introducing an additional nogood consisting only of that assumption. For example, given a fact f the ATMS would already have a nogood {f, -f}. If the input -f is a premise, then {f} is an additional nogood and the assumption -f would be replaced with the new premise. An additional nogood imposes additional constraints on environments, so may cause justifications for some nodes to become empty. Since in an expert system application no additional justifications are presented once the premises begin to be considered, the consideration of inputs progressively reduces the nodes which can be believed. Finally, when all the inputs have been presented, the nodes which can still be believed are the conclusions of the expert system. More precisely, the nodes which are still believed are the conclusions of the system which are consistent with the remaining assumptions. In an expert system application, one would normally expect that the set of assumptions remaining would be empty, or at least that all conclusions believed have at least one environment consisting entirely of premises. In other words, any conclusion is based on inputs from the external world.

In an acyclic Horn clause expert system, the setup phase of the ATMS is exactly the same process as the forward traversal of the clause inference graph to produce the decision table. A row of the decision table is the conjunction of assumptions forming one of the environments in the conclusion’s justification. The set of rows having as action a particular conclusion is therefore the justification of that conclusion.


The transformation of an expert system into a decision table can thus be seen as a sort of compiled ATMS. This enables us to use the ATMS algorithm to perform the transformation. In practice, the Garvan ES1 system, consisting for this purpose of 95 assumptions, 749 rules, and 111 nogood sets, is compiled in 60 sec by an implementation of the ATMS algorithm written in C on an Apollo DN 10000. (The number of rules differs from the 661 mentioned elsewhere in the article, but the two sets of rules are equivalent. The difference comes from a different treatment of variables which could take more than two possible values.)

If the Horn clause system is cyclic, then when a cyclic set of informants is added the ATMS will update the justifications of any consequent node in the cycle and as well any node depending on them. An informant in the cycle can cause the update of justifications only if there are justifications for all of its antecedents, which are derived from informants outside the loop. If one such informant exists, then the entire cycle of informants may be activated. The process will stop when the first informant is reached, since any new environment derived from the cycle will be subsumed by an environment from outside the cycle contained in the justification which started the process.

Finally, the ATMS representation gives us a simple way to implement updates to the knowledge base. Recall that the setup phase is executed before any external input is presented to the system. Since the ATMS formalism is incremental, any new rules are simply new informants. Deletions of rules require deletion of informants. Deletions of informants are part of the ATMS formalism,¹⁵ but are discouraged because they can have complex cascaded effects. In our use of ATMS, deletion of a rule would be done as it were at compile time, so that very rapid response is not an issue. Our approach therefore supports maintenance of a rule base. See also Secs. V and VI for discussion of expert system maintenance.

We have shown that the transformation of a set of propositional Horn clauses into a decision table is a special case of the problem addressed by an ATMS, and in particular have gained an algorithm to perform the transformation in cases where the set of Horn clauses is cyclic, and an incremental maintenance environment.

B. Advantages for Real Time

A real-time application is one in which the computer system interacts with a real-world process, and which must conform to the dynamics of that process. An important aspect is that the system may be required to respond to an input within a designated time interval. This time interval may be very short. This constraint places a premium on algorithms which can be guaranteed to complete in a bounded number of steps. This enables a choice of an implementation platform which will give the desired response time. It is also an advantage if the algorithm has a maximum response time not much more than the average response time, since the utilization of the implementation platform increases.


Finally, the fewer the number of steps, the less severe the technology requirements on the implementation platform.

The decision table representation of a set of rules satisfies all of these criteria. It has been shown above that a decision table is capable of fast implementation. The computation time is also bounded. If the decision table is converted into a decision tree, the maximum number of decisions is known. If the tree is balanced, the number of decisions for each conclusion is about the same. On the other hand, if the decision table is processed directly by a fine-grained parallel processor, the maximum number of column operations needed is known, and the number of column operations needed to establish a particular conclusion depends entirely on the number of measurements relevant to that conclusion.

In addition, the amount of state information required is very small in either implementation. This makes context switching inexpensive and therefore allows high-priority input to interrupt lower-priority processing.

IV. AMBIGUITY

A. The Problem

Propositional production systems can be checked for consistency, redundancy,¹⁸,¹⁹ and nontermination.²⁰ Building on the results of this article, the decision-table oriented consistency analysis such as that advocated by Cragun and Steudel¹⁹ can be applied to any propositional system. The decision table view can thus be seen to simplify checking of the logic of propositional expert systems.

Ambiguity is of particular concern in the present work. Algorithms such as ID3¹¹ for conversion of a decision table into a decision tree require that no input be possible which can lead to more than one conclusion. If ambiguity is present, the knowledge engineer must make a correction. That this can be a serious problem is illustrated by a detailed consideration of the Garvan ES1 system.

Recall that the 661 rules in the Horn clause representation of Garvan ES1 have been transformed into a decision table with 5268 rows. A decision tree executes a number of nodes logarithmic in the number of rows in the decision table, so that a table with 5268 rows would be executed with about 13 decisions. This would be a very fast program in C or COBOL.

The condition for unambiguity is equivalent to the requirement that every pair of rows with different decisions in the table are mutually exclusive. Formally, if one of the pair of rows is the conjunction of propositions pi and the other the conjunction of propositions qj, then the two rows are mutually exclusive if

p1 & p2 & . . . & pn & q1 & q2 & . . . & qm = falsity

It is assumed that

p1 & p2 & . . . & pn ≠ falsity


q1 & q2 & . . . & qm ≠ falsity

A sufficient condition for two rows to be mutually exclusive is that

pi = -qj for some i and j

A very simple example of this is the following decision table

If it flies, it is a bird

If it has four legs, it is an animal

Suppose a flying fox is encountered, which has four legs. Both rows of this decision table will fire. One possible way for the knowledge engineer to alter the table to give a correct response is to change the first row to

If it has two legs and flies, it is a bird.
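The check for this sufficient condition is mechanical; the Python sketch below (with an invented row encoding) lists the pairs of rows with different actions that are not mutually exclusive, and reports the flying-fox pair of the example above.

# A row is a dictionary from variable to its set of allowed values; variables not
# mentioned are don't cares. Two rows are mutually exclusive if some shared variable
# has disjoint allowed sets.
from itertools import combinations

def mutually_exclusive(row1, row2):
    return any(row1[v].isdisjoint(row2[v]) for v in row1 if v in row2)

def ambiguous_pairs(table):
    return [(r1, r2) for (r1, a1), (r2, a2) in combinations(table, 2)
            if a1 != a2 and not mutually_exclusive(r1, r2)]

table = [({"flies": {"yes"}}, "bird"),
         ({"legs": {"4"}}, "animal")]
print(ambiguous_pairs(table))   # the two rows share no constrained variable, so both can fire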

In Garvan ES1, of the 13.4 million such pairs generated from the 5268 row decision table, a tiny fraction, 0.3%, were not mutually exclusive. In order to remove the ambiguity, the knowledge engineers responsible for the system must amend the rules, as in the example above.

Unfortunately, that tiny fraction represents 42 345 pairs of rows, which is overwhelming to the knowledge engineers. This section describes the development of computerized support for the knowledge engineers, taking advantage of a body of 9805 correctly analyzed cases, which was able to completely eliminate the ambiguity from Garvan ES1. Furthermore, some insights have been gained into the origin of ambiguity and redundancy in these systems.

B. Reduction of Ambiguity

The algorithms described are reported in the context of Garvan ES1. That system forms propositions by assignment of values to 34 variables, which have a total of 93 possible values. There are 125 possible clinical interpretations, which are intended to be mutually exclusive. Maintenance of rules in the system is controlled by a set of 291 cornerstone cases, which are the cases which have triggered changes in the rule base. Any change in the rules is verified by checking that these cases are correctly interpreted. There is also available a file of 9805 correctly interpreted cases, which are the result of one year’s operation of the system, and which includes the cornerstone cases.

The small degree of ambiguity found has not been a problem in the routine use and maintenance of the system. None of the 291 cornerstone cases gives more than one interpretation, although 123 of the 9805 do.

The problem at hand is to reduce the 42 345 ambiguous pairs of rows to something more manageable, using as a resource the file of 9805 correctly interpreted cases.

The first approach taken is to specialize one of an ambiguous pair of rows by addition of the negation of a proposition taken from the other. For example, one row might be


antithyroid & ft4-missing & fti-normal & t3-low & tsh-low & tt4-normal   (5)

while the other might be

surgery & ft4-missing & fti-normal & t3-low & tsh-low & tt4-normal   (6)

The proposition surgery from (6) does not appear in (5). It may be taken as a defining condition for (6)’s conclusion. If (5) is altered to

antithyroid & ft4-missing & fti-normal & t3-low & tsh-low & tt4-normal & -surgery   (7)

then the two rows are mutually exclusive. This procedure does not change the conclusion reached from a particular input. However, it is possible that an input which formerly gave a conclusion will give none, so that the changed system must be tested.

A difficulty with this approach is that the input propositions are not necessarily independent, so that addition of a proposition to a row may prevent that row from firing at all.

There are three classes of dependency:

• Logical: there are propositions like age0-14 (age of patient 14 years or less) and age-70 (age of patient 70 years or more) which logically cannot co-occur but which are not represented as mutually exclusive in the rule set. If we add age0-14 to a row containing age-70 the row is logically inconsistent. This situation is fairly easily handled by expressing the propositions in a logically independent way, and this could with some difficulty be done transparently to the user.

• Physical: propositions like male and pregnant are logically consistent but physically impossible. If male is added to a row already containing pregnant, the row remains logically consistent although a case which could cause that row to fire can never occur.

• Statistical: propositions like age0-14 and pregnant are logically consistent and physically possible but statistically unlikely in the population generating cases. Adding age0-14 to a row containing pregnant would result in a row which would very rarely fire.

Physical and statistical dependency could in principle be modeled in the system with additional rules, but would require additional effort from the domain experts. In the opinion of the knowledge engineer, this additional information would be difficult to obtain in the case of Garvan ES1.

A useful resource, however, is the file of 9805 correctly interpreted cases (of which 3877 are distinct). If all possible values of two variables co-occur in these cases, then the two variables must be independent. Since the measurement is statistical, it is not conclusive: additional pairs of variables may be discovered to be independent in the future.

A co-occurrence matrix of the 94 possible values of the 34 variables was constructed from the 3877 distinct cases. It was observed that no variable was independent of all other variables, but that there were many pairs of independent variables.

The procedure outlined in (5), (6), and (7) above was carried out, with the proposition added being one which was independent of all the propositions already in the first row.


In addition, to avoid overspecializing a row we associated with each row the cases which cause that row to fire. When considering a particular specialization of a row, we checked that the specialization would still permit all these cases to fire, rejecting that specialization if it would not.

Formally, we have two rows of the table, p and q, where p is the conjunction of propositions pi and q the conjunction of propositions qj. The two rows are not mutually exclusive. A proposition is the assignment of a value to a variable, so that associated with each proposition is a variable vk. We have computed a boolean cross-reference matrix X, with entries xkl = true if every value of vk co-occurs with every value of vl in the set of cases. Let Kp be the cases which satisfy row p and for which the conclusion of row p is the correct one. Similarly, let Kq be the cases which satisfy row q and for which the conclusion of row q is the correct one. We assume that although some of the cases satisfy the incorrect row, all the cases in each set are distinguishable from all the cases in the other. The algorithm is to alter the rows p and q in such a way that none of the cases in Kp are consistent with row q and none of the cases in Kq are consistent with row p, but that all the cases in Kx satisfy row x for x = p, q.

The algorithm is:

1.1 choose a proposition pi not chosen before, with vk as its associated variable, such that vk is distinct from all the variables vl associated with the qj, and for which xkl is true for every l associated with one of the qj. If there is no such proposition, then proceed to step 2.1

1.2 replace q with q & ¬pi, and check that the result is consistent with all the cases in Kq.

1.3 if step 1.2 succeeds, then the alteration to row q is the desired correction. Otherwise return to step 1.1.

2.1-2.3 these steps are the same as the previous steps with p and q reversed. If step 2.1 fails, then the ambiguity cannot be removed.

This algorithm preserves the correct behavior of the decision table, since the altered rows from steps 1.2 and 2.2 satisfy all the cases which the original rows satisfied. In addition, the ambiguity is removed, since if either step 1.2 or step 2.2 is executed then the resulting rows are inconsistent with each other.
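
The pairwise specialization can be sketched as follows. This is a schematic Python rendering, assuming rows are dictionaries of variable-to-value constraints, cases are dictionaries giving a value for every variable, and independent(uk, ul) plays the role of the cross-reference entry xkl; the names are illustrative, not those of the original implementation.

    def find_specialization(p, q, Kq, independent):
        """Steps 1.1-1.3: find a proposition of p whose negation can be added to q
        while every case in Kq still fires q; return (variable, value) or None."""
        for var, val in p.items():                           # step 1.1: candidate proposition pi
            if var in q:                                     # its variable must not occur in q
                continue
            if not all(independent(var, u) for u in q):      # xkl must hold for every variable of q
                continue
            if all(case.get(var) != val for case in Kq):     # step 1.2: Kq consistent with q & not pi
                return (var, val)                            # step 1.3: the desired correction
        return None

    def remove_ambiguity(p, q, Kp, Kq, independent):
        """Step 1 specializes row q against p; step 2 reverses the roles."""
        fix = find_specialization(p, q, Kq, independent)
        if fix is not None:
            return ("q", fix)                                # add the negation of fix to row q
        fix = find_specialization(q, p, Kp, independent)     # steps 2.1-2.3
        if fix is not None:
            return ("p", fix)
        return None                                          # the ambiguity cannot be removed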

This process left a residual of 1590 of the 42 345 pairs of ambiguous rows, which is a significant reduction in the number of items to be examined by the domain experts. Although the problem has been reduced from overwhelming to merely arduous, a further reduction is desirable.

Another way to use the co-occurrence matrix is to consider the pairs of propositions which are never true together. If a row in the decision table included one of these pairs, then it could never fire, and could be eliminated. We found, however, that no rows contained any such pairs of propositions.

A complementary approach was to consider which of the 5268 rows in the decision table were actually used in the 3877 distinct correctly interpreted cases.


It turned out that only 751 rows were used. The major portion, 86%, of the rows were not used, and were therefore redundant. If these rows are removed, then the scale of the problem is reduced significantly, although the possibility exists that a new case could occur which would require one of the removed rows.

The 751 remaining rows have 275 486 pairs with different conclusions. Of these, 3352, or 1.2%, are ambiguous. Applying the ambiguity reduction procedure outlined above results in 74 pairs of ambiguous rows, well within human capability.

By directing attention to the cases rather than to the rules, a further ambiguity reduction process was developed, which was able to correct all ambiguous pairs of rows in Garvan ES1.

Consider the cases which correctly cause each of the two rows to fire. A row in the decision table can be viewed as a constraint on the possible values of variables. If a case satisfies the constraint, then the row fires. Some of the variables may be constrained to take on a single value, while some may be allowed to take multiple values. A "don't care" variable is unconstrained: it may take on any value. The previous specialization algorithm may be viewed as adding a further constraint to one of the rows such that the two become mutually exclusive, but that all cases previously satisfying the constraints continue to do so. The choice of variable was, however, limited to those which were constrained to take on a single value in the other row. We can widen the choice, and therefore perform better.

Step 1. We select a variable which is allowed more than one value in at least one of the two rows. If the set of values actually occurring in the cases satisfying one of the rows is disjoint from the set of values actually occurring in the cases satisfying the other, then strengthening the constraint in the relevant rows will have the desired result. If no such variable can be found, then proceed to step 2.

Step 2. We select a variable which is allowed more than one value in both of the two rows, for which the set of values actually occurring in the set of cases for one row is not identical to the set of values actually occurring in the set of cases for the other. The intersection of the two sets of values is not empty, otherwise the variable would have satisfied the conditions for step 1. We will call this intersection the set of common values. We replace the constraint in each row with the set of common values. At least one row has cases for which the value is not in the set of common values. These are accommodated by the formation of a new row with the constraint being the values occurring in those cases. Both rows may be thus split. The new rows are mutually exclusive with each other and with the further constrained original rows. The cases are distributed to the new rows whose constraints they satisfy. The result of this step is to reduce the number of cases associated with the ambiguous rows. The algorithm then returns to step 1. If no variable can be found meeting the condition of step 2, then the ambiguity cannot be resolved.


For example, suppose we have two rows p and q which are not mutually exclusive. Further, we have two variables color and shape which are not constrained by either of the rows. If the cases associated with the rows are

row p: (red, square), (blue, circle)
row q: (green, square), (green, circle)

Then if row p is specialized with the constraint color = red or blue and row q is specialized with the constraint color = green according to step 1, then the two modified rows are mutually exclusive. If the cases are

row p: (red, square), (blue, circle)
row q: (blue, square), (green, circle)

then step 1 fails. By step 2, we can construct two new rows, p' with the constraint color = red added to p, and q' with the constraint color = green added to q. In addition, we add the constraint color = blue to both p and q. The cases are allocated

row p': (red, square)
row p: (blue, circle)
row q: (blue, square)
row q': (green, circle)

Step 1 can now be applied to further specialize p with shape = circle and q with shape = square. This makes the rows mutually exclusive.
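
This case-directed specialization can also be sketched in Python. The sketch below assumes each row is a dictionary mapping a variable to the set of values it allows (an absent variable is "don't care"), that Kp and Kq are lists of case dictionaries, and that the caller alternates step 1 and step 2 as described above; all names are illustrative.

    def values_of(cases, var):
        return {case[var] for case in cases}

    def step1(row_p, row_q, Kp, Kq, variables):
        """Constrain both rows on a variable whose observed values are disjoint."""
        for var in variables:
            vp, vq = values_of(Kp, var), values_of(Kq, var)
            if vp and vq and vp.isdisjoint(vq):
                row_p[var], row_q[var] = vp, vq
                return True
        return False

    def step2(row_p, row_q, Kp, Kq, variables):
        """Restrict both rows to the common values of a suitable variable and
        split off new rows for the cases falling outside those values."""
        for var in variables:
            vp, vq = values_of(Kp, var), values_of(Kq, var)
            common = vp & vq
            if vp != vq and common:
                new_rows = []
                for row, K in ((row_p, Kp), (row_q, Kq)):
                    outside = [c for c in K if c[var] not in common]
                    if outside:
                        new_row = dict(row)
                        new_row[var] = values_of(outside, var)
                        new_rows.append((new_row, outside))
                        K[:] = [c for c in K if c[var] in common]
                    row[var] = set(common)
                return new_rows      # the caller then returns to step 1
        return None                  # the ambiguity cannot be resolved this way

On the example above, step 2 splits off p' (color = red) and q' (color = green), constrains p and q to color = blue, and step 1 then separates p and q on shape.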

If this last algorithm fails, then the cases associated with the two rows have the same set of values for each of their variables. They are not necessarily identical. It is possible that they may be distinguished by pairs of values. For example, one row may have the cases (red, circle) and (blue, square) while the other may have the cases (red, square) and (blue, circle).

This algorithm begins to shade into the techniques of machine learning21 which in this context would be used to make the selection of variables and could possibly address the situation where the present algorithm fails. Note that the aims of the present procedure differ from machine learning. The latter is attempting to produce optimal classifications of a set of cases, while the present work is simply attempting to patch a set of rules to eliminate unlikely input cases.

Another way to deal with ambiguity in a rule base is to assume that the knowledge is correct and to consider its procedural semantics with respect to a particular inference engine.22 In this approach, the inference engine fires one rule at a time, and will resolve ambiguity by selecting a single rule to fire, according to some meta-rule. (This is called conflict resolution.) The declarative resolution of ambiguity consists in "generating exclusion clauses which describe the conditions under which those instantiations which appear to lose out during conflict resolution would actually be able to fire". These exclusion clauses are added to the conditions of the losing rules. The approach is based on a process of abstract interpretation, which is particularly simple in the propositional case.


It can remove ambiguity derived from deterministic conflict resolution strategies such as specificity.

The abstract interpretation approach has several differences from ours:

• It is local, whereas we take a global view of the knowledge base by flattening the rules. Ambiguity does not ultimately matter unless an input can entail two different conclusions.

• Conflict resolution is arbitrary in some cases.23 The knowledge engineer may consider ambiguity to be an error.

Our approach is more general with respect to ambiguity, although restricted to the propositional case. On the other hand, our transformation algorithm is essentially the same as the propositional case of abstract interpretation. It would be possible to incorporate the ambiguity removal into the transformation process.

C. Source of Redundancy

We have seen that a small percentage of ambiguity can take a major effort to correct. In addition, it was noted that the Garvan ES1 system was highly redundant and much of the ambiguity occurred in the redundant part. It is useful to consider the source of the redundancy.

The transformation from the rule set to a decision table consists essentially of replacement of intermediate assertions with expressions which imply them. It thus tends to reduce the number of rules. It is possible to imagine a rule set with 390 rules asserting final conclusions which can be transformed into a decision table with 390 rows. Garvan ES1 has 390 rules asserting final conclusions, but transforms into a decision table with 5268 rows. A sketch of the source of this increase was given in the analysis of the algorithm, above. More detail is presented here.

One source of increase in the number of rows is intermediate assertions which are the consequent of more than one rule. Replacement of such an assertion in the antecedent of a rule results in a disjunction, which converts to more than one row when the final decision table is constructed by conversion of the expression into clausal form. For example,

a & b ∨ c & d → e

becomes

a & b → e
c & d → e

Garvan ES1 has 39 intermediate assertions generated in more than one rule.

A second source, which is probably greater in impact, is intermediate assertions which are negated in the antecedent of a rule. The negation of a conjunction is a disjunction of negations. For example,

¬(a & b & c) → e

becomes

¬a → e
¬b → e
¬c → e

Garvan ES1 has 25 assertions which are negated in the antecedent of at least one rule.

These two sources of expansion interact explosively. For example,

¬(a & b & c ∨ d & e & f) → g

becomes

¬a & ¬d → g
¬a & ¬e → g
¬a & ¬f → g
¬b & ¬d → g
¬b & ¬e → g
¬b & ¬f → g
¬c & ¬d → g
¬c & ¬e → g
¬c & ¬f → g

Garvan ES1 has 17 assertions derived in more than one way which also appear negated in the antecedent of at least one rule.
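
This multiplicative expansion is easy to reproduce mechanically. The short Python sketch below is illustrative only; it assumes an intermediate assertion derived by the two antecedents of the example above and simply distributes the negation over them.

    from itertools import product

    def expand_negated_intermediate(derivations):
        """derivations: one list of antecedent literals per rule deriving the
        intermediate assertion. Negating the assertion and distributing gives
        one row per choice of a negated literal from each derivation."""
        return [tuple("not " + lit for lit in choice) for choice in product(*derivations)]

    rows = expand_negated_intermediate([["a", "b", "c"], ["d", "e", "f"]])
    print(len(rows))  # 9 rows, each implying g, as listed above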

It should be emphasized that the decision table representation is exactly equivalent in behavior to the original rule set. Any input to the decision table which generates more than one conclusion will generate the same conclusions in the original rule set. The decision table representation simply makes these possibilities more evident.

It would appear, then, that in a rule set where it was important to control ambiguity and redundancy, it would be important to control intermediate assertions generated in more than one way, negated, and especially both. These assertions, particularly the latter two classes, might be called fecund assertions, since they produce a large number of offspring. A knowledge maintenance tool13,23,24 should be able to identify such assertions.

When the knowledge engineer wishes to remove a fecund assertion it might be useful for the tool to present to the engineer the expressions which imply the assertion. These expressions might be useful in finding an alternative way to express the condition. The same sort of facility could be used by the tool to automatically check antecedents containing such assertions for inconsistencies.

V. RULE CONSTRUCTION VERSUS RULE INDUCTION

A. Expert System as a Stochastic Process

There are two main strategies for building expert systems: construction of a set of rules by, for example, interviewing a domain expert; and induction of rules from a sample, possibly statistical, of correct cases. Since any propositional expert system is equivalent to a decision table, the relationship between the two approaches becomes clearer.

Table I. Histogram of number of rows fired by a given number of cases.

    Number of cases    Number of rows
    1                  353
    2                  113
    3                   58
    4                   39
    5-9                 17
    10-24               55
    25-49               40
    50-99                9
    100+                 1

We can imagine that there is a "real" expert system characterizing the real-world process which generates input to the built expert system, and which determines correct output from the built system. This system can always be viewed as a decision table. The decision table can be represented either in its full detail or equivalently reduced with maximum use made of "don't care" conditions. This table in its full detail is an enumeration of the possible combinations of variable values generated from the real-world process, together with the output associated with each. We can call this the characteristic table, either extended or reduced. The characteristic table completely determines the cases which can be observed.

If we aim to construct an expert system by induction from a set of observed cases, then the characteristic table is of course unknown. However, the process generating cases can be considered as a process selecting rows from the characteristic table, according to some probability distribution, and can therefore be considered as a stochastic process.

Looking at the Garvan ES1 system from this point of view, we see that although the rule base is equivalent to a reduced decision table with 5268 rows, only 751 rows are fired by the 9805 test cases. Table I shows a histogram of the number of rows which fire on different numbers of cases. Note that nearly half of the rows fire on only one case.

The remaining 4517 rows of the decision table generated from the 661 rules are not used in the sample of cases.

It is plausible that a large number of the 4517 unused rows are outside the domain of the characteristic expert system. In practice, this does not affect the behavior of the system, since forbidden combinations of variables do not occur, so that those rows are never exercised. Garvan ES1 was built using the rule construction approach.
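
Viewed this way, the row-usage profile of Table I is just a frequency count over the case file. A minimal Python sketch follows; fired_row is an assumed helper returning the identifier of the row a correctly interpreted case fires (for this sketch each case is taken to fire a single row), and the bucket boundaries are those of Table I.

    from collections import Counter

    def row_usage_histogram(cases, fired_row):
        per_row = Counter(fired_row(case) for case in cases)    # cases per used row
        buckets = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 9),
                   (10, 24), (25, 49), (50, 99), (100, None)]
        histogram = Counter()
        for count in per_row.values():
            for lo, hi in buckets:
                if count >= lo and (hi is None or count <= hi):
                    histogram[(lo, hi)] += 1
                    break
        return per_row, histogram    # len(per_row) is the number of rows ever used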

The inductive approach works from a sample of cases. Assuming no errors, the sample will be a subset of the rows of the extended characteristic decision table. As the sample size increases, the subset will tend towards completeness. If there are low probability regions of the probability distribution, complete convergence will take a large number of samples. Since forbidden combinations of variable values will by definition not appear in the sample, convergence will always be monotonic.

Statistical rule induction techniques such as ID321 construct from the sample a decision tree which generates the same conclusions as the sample. This decision tree is in some sense minimal. There may be variables whose values occurring in the sample are not used in the decision tree, but those values cannot be reduced to "don't care" conditions. On the other hand, there may be variables which are "don't care" conditions in the reduced characteristic table, but all the possible combinations have not yet appeared in the sample. The decision table equivalent to the generated decision tree is therefore an approximation to the reduced characteristic table, but will in some rows be more general and in some, less.

Another form of rule induction, called ripple-down rules,25 builds the decision tree from a small number of cases, using a domain expert to nominate the variables which best distinguish the conclusions. This approach also produces a tree generating the same conclusions as the sample. Since it uses more information than the statistical approach, it can be expected to converge on fewer samples than purely statistical methods.

Preliminary indications are that using ripple-down rules, Garvan ES1 can be re-implemented as a decision tree with fewer than 1000 leaf nodes.

B. Induction Preferred to Construction

The analysis in the previous section has indicated that the rule construction approach can produce an expert system with a large number of rows in its decision table which are never fired by an input. Inductive approaches appear to produce tables with many fewer unfired rows. We have also seen that a system with fecund intermediate assertions can have a large number of rows in its equivalent decision table. This section looks a little deeper into the phenomenon.

Any characteristic decision table can be constructed using induction techniques. The rule construction approach, if it involves nontrivial intermediate assertions, must generate a decision table with symmetry. For example, an intermediate assertion used in the inference base of more than one conclusion will result in rows (possibly a large number) with a common conjunction of propositions. Decision tables exist with no nontrivial common conjunctions of propositions, e.g.,

f1 & f2 → c1
¬f1 & f3 → c2
f1 & ¬f3 → c3
¬f1 & f2 → c4

It should be clear, therefore, that there exist characteristic decision tables which are not the product of a rule system involving nontrivial intermediate assertions. It follows that the rule construction approach should be used only on systems where there is good reason to believe that the underlying problem has a great deal of symmetry, and also that the symmetry follows naturally from the problem domain as perceived by the domain experts. This would typically be a system of nested classifications or a system based on a set of regulations. Other things being equal, a rule induction approach to knowledge acquisition should be used.

Another way to see this problem is to consider the fine-tuning of a decision table as might occur during maintenance.26 A decision table can be regarded as a system for making classifications. Since propositional expert systems are equivalent to decision tables, a problem implemented in a propositional expert system is formally equivalent to a classification problem. Maintenance of the decision table is the correction of a misclassification. For example, take the following original rule set

f1 & f2 → a  (8)
¬a → c1  (9)
f1 & f3 → c2  (10)

which is transformed into the decision table

¬f1 → c1  {from (8) and (9)}  (11)
¬f2 → c1  {from (8) and (9)}  (12)
f1 & f3 → c2  {rule (10)}  (13)

An input

f1 & ¬f2 & f3

will cause both rows (12) and (13) to fire, so will yield both conclusions c1 and c2. For this input, let us assume that the expert determines the correct conclusion to be c2. Row (12) therefore fired in error. The expert may decide that the way to correct the system is to add ¬f3 to row (12), resulting in the decision table

¬f1 → c1  {(11)}  (15)
¬f2 & ¬f3 → c1  {from (12)}  (16)
f1 & f3 → c2  {(13)}  (17)

The decision table has been corrected to remove a misclassification. The change made was local to the row which fired incorrectly, and consisted of adding a single proposition.

Consider now making an equivalent change to the original rule set (8-10). Observe that there is no way to add a single proposition to the original rule set to make it equivalent to the final decision table (15-17). A conjunct added to rule (8) would add another row to the decision table (11-13). A conjunct added to rule (9) would modify both rows (11) and (12). Rule (10) is irrelevant to the problem of modifying row (12).

The local change made to get row (16) can be seen as "breaking the symmetry" imposed by the intermediate assertion in the rules (8-10). In order to correct the original rules, we must remove the intermediate assertion, thereby simplifying the inferential structure. This is not a local change to the rule set, and it could have many ramifications.

This example suggests how the large number of unused rows in Garvan ES1 could have arisen. Since removing an intermediate assertion may have many ramifications, the engineer prefers to proceed by local changes, typically adding propositions to particular rules. Say we first add the proposition ¬f3 to rule (9), giving the system

f1 & f2 → a  (18)
¬a & ¬f3 → c1  (19)
f1 & f3 → c2  (20)

which, as we have seen, causes another problem, since now ¬f1 will no longer give the conclusion c1 as it should. This new problem can be corrected by adding an additional rule

¬f1 → c1  (21)

The rule set (18-21) has the correct behavior, but has a row in its equivalent decision table which is inoperative [although in this case subsumed by the row resulting from rule (21)]. The presence of symmetry in the decision table caused by the intermediate assertions in the rule base has led the knowledge engineer to increase the complexity of the rules to cope with a change to the underlying decision table which is symmetry-breaking.

Garvan ES1 nearly doubled the number of its rules in going from 96% to 99.7% accuracy over a period of 4 years.26 It is plausible that the complex symmetries have forced the knowledge engineers to build a rule set with a large number of inoperative rows in its equivalent decision table in order to approximate the correct behavior.

Rule construction approaches tend to yield systems with a large amount of symmetry, while in systems built with rule induction methods symmetries appear in the decision tables only if they are present in the data. As we have seen, the presence of symmetry complicates maintenance in the former class of system, while in the latter maintenance always remains simple. This is the main reason why a rule induction approach is recommended unless the problem has essential symmetry: that is, its structure of intermediate assertions has meaning to the domain expert and changes will tend to respect that symmetry. For example, in a system of regulations, the classification of a person as a felon might be an intermediate assertion, and felons might have certain restrictions on rights. A change would be likely to either alter the characteristics necessary for a person to be classified a felon, or to alter the restrictions on rights. The complete removal of the classification felon would be a very major change, and would be unusual.


C. Application to Ambiguity Reduction

The previous result casts light on problems encountered in the automatic fine-tuning of a decision table to reduce ambiguity. A decision table has been built which is ambiguous. The approach taken to remove the ambiguity is to add propositions to some of the rows to make them mutually exclusive, taking into account forbidden conjunctions. Its effect is the same as the maintenance operations discussed above.

Since the changes to the table affect the behavior of the system, the knowledge engineer would ideally like the changes made to be visible in the rule set. It may not be possible to make them visible. We have seen that fine-tuning may easily involve breaking of symmetry, so that it may be that the fine-tuning of the decision table done by the ambiguity-reduction process may not be visible through its effect on the original rule set, because there is no local modification to the rules which yields the result.

If, therefore, a programmer is maintaining a rule set in some kind of rule-maintenance environment which is then compiled into a decision table with its ambiguity removed by the automatic process, the ambiguity removal may introduce unexpected errors into the performance of the system. Such an error will always be a case which is expected to reach a particular conclusion which in fact reaches none at all.

The maintenance environment should check whether the failure is in the rule set or caused by ambiguity reduction. If the latter, the problem can be corrected by re-running the ambiguity reduction process with the new case included in its set of reference cases. Note that the ambiguity reduction procedure is global in scope, so that its re-execution may have nonlocal effects.

D. Case-Based Reasoning

The analysis above has shown that any propositional expert system can be characterized by the set of possible cases that can occur. In summary, we have discussed building an estimator T of a characteristic decision table by induction from a sample S of its stochastic process. The transformation of S into T possibly loses information. The transformation of S between its extended and reduced form does not lose information, nor does the transformation of T between its decision tree and decision table forms, since these transformations are in both cases invertible.

The rows of the decision table form of T (the leaf nodes of its decision tree form) partition the rows of the extended representation of S into equivalence classes: two rows of S belong to the same class if they cause the same row of T to fire. The transformation is lossy precisely to the extent to which the rows of T differ from the reduced form of the decision tables given by corresponding equivalence classes of the rows of S.

We can therefore construct a reduced partition of S, denoted {Si}, where Si is the maximal reduction of the set of rows of S fired by row i of T (the row is denoted Ti). The reduction by combination of different values of a proposition into a "don't care" condition can never eliminate a proposition present in Ti, since all the members of Si have the same value for that proposition. We therefore have that the transformation from S to T loses information if any Si has more than one row. (It also loses information when there is only one row if there are propositions in the row not in Ti.)

In using the constructed expert system, we apply T to a new sample s from the stochastic process, which causes a particular row Ti to fire. We can view the result that Ti fired as a statement that s is similar to the members of Si, where the similarity measure is that s and the members of Si have the same value for each of the propositions in Ti. In this way we are viewing the expert system as a comparison of a new case against a library of stored cases.
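
This reading of the table as a case library can be sketched directly. The fragment below is illustrative: rows are assumed to be dictionaries of the propositions in Ti, the case library maps each row identifier to its reference cases Si, and a new sample is declared similar to the members of Si exactly when it fires Ti.

    def fires(row, sample):
        # the sample agrees with the row on every constrained proposition
        return all(sample.get(var) == val for var, val in row.items())

    def similar_cases(table, case_library, sample):
        """table: {row_id: row}; case_library: {row_id: reference cases Si}.
        Return, for every row the sample fires, the stored cases it is similar to."""
        return {row_id: case_library.get(row_id, [])
                for row_id, row in table.items() if fires(row, sample)}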

We can also view the rows as constraints, as in Sec. IV-B. In this view, a row is a specification of a set of equality constraints which, if the sample s fits, indicate the classification associated with the row. The reference cases Si are a body of stored cases which constrain any changes to the classifier constraints, in that any alteration to the row must preserve the correct classification of those cases. This case-based reasoning view can be generalized by the use of other constraint schemes, for example the Hamming distance metric and its relatives.27,28

VI. SKETCH OF A SHELL

A. Construction and Maintenance

Once a decision table has been constructed and put into use, the maintenance problem arises. As discussed in Ref. 6, this problem can arise from cases encountered which are misclassified, as well as from new knowledge.

The ripple-down rule formalism25 is intended to deal with this situation. The formalism works with the decision tree formulation of the expert system. Assume that the decision tree is built so that when an attribute is tested, a branch is created for each possible value. The decision tree will always be complete, in that every case will proceed to a leaf node. When a case is misclassified, an additional decision is added to the leaf node reached. The attribute chosen (by the domain expert) must not disturb the cases which have correctly reached that node. A tool has been built by Compton which presents the expert with a constrained choice among permissible attributes.

A similar approach would work in the decision table formulation. The decision table is more general than a decision tree in that it is possible to build useful decision tables which are incomplete or ambiguous. Of course, the decision table equivalent of a decision tree is complete and unambiguous. If the case is misclassified by the table, and the row which fired is mutually exclusive with all others, then the same approach as ripple-down rules applies, which translates into splitting the row. It may be appropriate to select more than one attribute to perform the split, although doing so will make the table incomplete.
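
The row-splitting step can be sketched as follows. The sketch assumes rows constrain each listed variable to a single value, that the cases which correctly fired the parent row are available, and that the expert supplies candidate attributes in order of preference; the names are illustrative only.

    def split_row(table, row_id, correct_cases, new_case, new_conclusion, candidates):
        """Specialize table[row_id] so that its previously correct cases still fire it,
        and add a sibling row that fires on the misclassified case."""
        row, conclusion = table[row_id]
        for attr in candidates:
            old_values = {case[attr] for case in correct_cases}
            # the chosen attribute must separate the new case from the old cases
            # without disturbing them (here: they share a single value for it)
            if len(old_values) == 1 and new_case[attr] not in old_values:
                specialized = dict(row)
                specialized[attr] = old_values.pop()
                sibling = dict(row)
                sibling[attr] = new_case[attr]
                table[row_id] = (specialized, conclusion)
                table[str(row_id) + "-split"] = (sibling, new_conclusion)
                return True
        return False   # no single nominated attribute separates the cases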

Once a new row has been created, it may be desirable to remove from it attributes which were relevant to the cases fired by the parent row but which are not relevant to the new case. The possibility here is that the new row would make the table ambiguous. The variables added guarantee it to be exclusive with its sibling, but its guarantee of being mutually exclusive with all other rows is removed when any variables in the parent row are removed.

Suppose the table is incomplete and a case fails to fire any rows. One approach to correct the table is to first get the domain expert to select those attributes deemed to be relevant to the case, and replace all others with “don’t care.” The modified case would then either cause a row to fire, so that the preceding procedure could be applied, or not. If the latter, the modified case itself could be appended to the table as a new row. This last might make the table ambiguous.

The table may be ambiguous. There are two possibilities. First, the case may cause more than one row to fire. One or more of the rows may be correct, or all may be incorrect. If one or more is correct, the others must be modified to exclude the case, which modification may make the table incomplete. If all are incorrect, one row must be chosen for splitting and the others modified to exclude the case.

Secondly, although the case itself caused only one row to fire, a row added to the table may not be mutually exclusive with all other rows. In this case, the approach to ambiguity reduction described above applies.

The extension of ripple-down rules to the decision table formulation is summarized:

1. Decision table complete, unambiguous
   Split incorrect row
      Possibly incomplete if split on multiple attributes
      Possibly ambiguous if parent attributes removed
2. Decision table incomplete
   2.1 Case fires row: proceed as in 1.
   2.2 Case fires no row
      Strip irrelevant attributes from case
      Stripped case fires row: proceed as in 1
      Stripped case fires no row: add stripped case as new row
         Possibly ambiguous
3. Decision table ambiguous
   3.1 Case fires at least one correct row and some incorrect rows:
      Modify incorrect rows to exclude case
         Possibly incomplete
   3.2 Case fires no correct row but more than one incorrect row
      Select row to split, proceed as in 1 and as in 3.1 for others.
   3.3 Case fires no correct and one incorrect row: proceed as in 1.
4. Ambiguous row added: use ambiguity reduction procedure.


B. Explanations

An important feature of expert systems technology is the ability of a system to provide explanations of its conclusions and actions. In propositional systems, these explanations are typically derived from a trace of the rules executed. If the system is built by some form of induction directly as a decision table, this form of explanation is not available.

An alternative explanation strategy is possible, related to cover-and-differentiate systems.29 Consider the set of propositions whose conjunction forms a row of the decision table. The set of values for a particular proposition will have a probability distribution. It may be that some of the values are much more probable than others: for example a reading may be either normal or abnormal, with abnormal readings relatively rare. A row in the table may be

measurement-1 = abnormal    p = 0.01
measurement-2 = normal      p = 0.98
sex = male                  p = 0.45

It is plausible that measurement-1 is in the row because it is characteristic of the conclusion, while measurement-2 serves mainly to distinguish this conclusion from some other conclusion. The final proposition sex = male may be in either category.

If the propositions were identified in this way, an explanation could be of the form:

The conclusion is characterised by measurement-1 = abnormal and sex = male. This conclusion is distinguished from conclusion-2 by the fact that measurement-2 = normal.

This sort of explanation can be easily derived from the decision table representation by identifying the set of conclusions consistent with the charac- teristic propositions, then reporting which incorrect conclusions are excluded by the distinguishing propositions.

If the propositions are not identified by the domain expert and a set of cases is available, an approximation may be to compute the histogram of each proposition, and identify all those of probability less than a threshold as characteristic with the others identified as distinguishing. This measure may be viewed as computing the information content of each value of each proposition, with the characteristic propositions identified as the high information values.
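
When a set of cases is available, this split into characteristic and distinguishing propositions can be approximated very simply. The following sketch is illustrative only: cases are assumed to be dictionaries covering every variable, and the 0.05 threshold is an arbitrary choice standing in for whatever cutoff the knowledge engineer prefers.

    from collections import Counter

    def classify_propositions(row, cases, threshold=0.05):
        """Label each proposition of a row as characteristic (rare, high information)
        or distinguishing (common) according to its estimated probability."""
        characteristic, distinguishing = [], []
        for var, val in row.items():
            frequency = Counter(case[var] for case in cases)
            p = frequency[val] / len(cases)          # estimated probability of this value
            (characteristic if p < threshold else distinguishing).append((var, val, p))
        return characteristic, distinguishing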

This approach is distinguished from MOLE,29 in that it is much simpler. MOLE uses a propositional structure with a deep hierarchy of intermediate assertions. Although our approach can be adapted to such a hierarchical system of classifications, it is best adapted to a flat structure where the expert relates observable conditions to observable classifications. Eshelman29 in fact states that such flat structures seem to be the natural way experts express their knowledge.


VII. CONCLUSIONS

We have shown that propositional expert systems can be transformed into decision tables by a process closely related to assumption-based truth maintenance. The transformation algorithm is very general, and can be adapted to a number of considerations important in the implementation of practical systems. One benefit is that the decision table gives a global view of the expert system, which allows many of the benefits of ATMS to be obtained at low cost. The main benefit, however, is that it is possible to execute propositional expert systems rapidly and in bounded time, so that real-time implementations become more practical.

One way to get an efficient implementation of a decision table is to convert it into a decision tree. This procedure requires that the table be unambiguous. Analysis of a substantial test case reveals that ambiguity is a surprisingly serious problem. An automated method is presented for reducing ambiguity, which in fact completely corrects the problem in the test case. In the course of analysis of the test case for ambiguity, it was further discovered that the system was extremely redundant. The source of redundancy was investigated, and found to lie in an inherent weakness in the method of building an expert system by rule construction. It was demonstrated that rule induction methods are more general than rule construction methods, and that except in special circumstances rule induction is preferable. The analysis shows as well that propositional expert systems can be viewed as case-based reasoning. The analysis also leads to some suggestions for expert system shell tools.

This article is based in part on an article by the same authors, entitled “Very fast decision table execution of propositional expert systems,” which was presented at the Eighth National Conference on Artificial Intelligence, Boston, Massachusetts, 1990.

The authors wish to thank Andrew Parle, Claude Sammut, Paul Compton, Norman Foo, Kim Horn, and Ross Quinlan for suggestions and comments which have improved and clarified this work.

References

1. P. Beinat and R. Smart, "COLOSSUS: Expert assessor of third party claims," in Proceedings Fifth Australian Conference on Applications of Expert Systems, Sydney, Australia, 1989, pp. 70-85.

2. B. Buchanan, "Expert systems: Working systems and research literature," Expert Syst., 3, 31-51 (1986).

3. K.A. Horn, P. Compton, L. Lazarus, and J.R. Quinlan, "An expert computer system for the interpretation of thyroid assays in a clinical laboratory," Austral. Computer J., 17(1), 7-11 (1985).

4. T.A. Nguyen, W.A. Perkins, T.J. Laffey, and D. Pecora, "Checking an expert system knowledge base for consistency and completeness," in IJCAI-85, Morgan Kaufmann, Los Altos, CA, 1985.

5. K.R. Apt, H.A. Blair, and A. Walker, "Towards a theory of declarative knowledge," in Foundations of Deductive Databases and Logic Programming, J. Minker, Ed., Morgan Kaufmann, Los Altos, CA, 1988, pp. 89-148.

6. H. Tamaki and T. Sato, "Unfold/fold transformation of logic programs," in Proceedings of the Second International Logic Programming Conference, Uppsala, 1984, pp. 127-138.

7. H. Seki, "Unfold/fold transformations of stratified programs," in Logic Programming: Proceedings of the Sixth International Conference (Lisbon), G. Levi and M. Martelli, Eds., MIT Press, 1989, pp. 554-568.

8. R. Reiter, "Deductive question-answering on relational databases," in Logic and Databases, H. Gallaire and J. Minker, Eds., Morgan Kaufmann, Los Altos, CA, 1978, pp. 149-177.

9. J.D. Ullman, Principles of Database and Knowledge-Base Systems, Volume 2, Computer Science Press, Rockville, MD, 1989.

10. J.R. Metzner and B.H. Barnes, Decision Table Languages and Systems, Academic Press, New York, 1977.

11. J.R. Quinlan, "Semi-autonomous acquisition of pattern based knowledge," in Machine Intelligence 10, J.E. Hayes, D. Michie, and Y.-H. Pao, Eds., Ellis Horwood, Chichester, UK, 1982, pp. 159-172.

12. R.M. Colomb and M.W. Allen, "Architecture of the column computer," in Proceedings Conference on Computing Systems and Information Technology, Institution of Engineers, Australia, 1989.

13. R. Jansen and P. Compton, "The knowledge dictionary: An application of software engineering techniques to the design and maintenance of expert systems," in Proceedings AAAI-88 Workshop on Integration of Knowledge Acquisition and Performance Systems, Minnesota, 1988.

14. M. R.-Y. Lee, "The implementation of a knowledge dictionary in SQL," in Proceedings Oracle Asia-Pacific User Conference-1990, Adelaide, Australia, 1990.

15. J. de Kleer, "An assumption-based TMS," Artif. Intell., 28, 127-162 (1986).

16. C.L. Mason, R.R. Johnson, R.M. Searfus, and D. Lager, "SEA-An expert system for nuclear test ban treaty verification," in Proceedings of the Australian Joint Artificial Intelligence Conference, Sydney, 1987, pp. 11-25.

17. C. Sammut and R.M. Colomb, "Using truth maintenance to compile a real time expert system," in Proceedings Fourth Australian Joint Artificial Intelligence Conference, Perth, WA, Australia, 1990.

18. R. Reiter and J. de Kleer, "Foundations of assumption-based truth maintenance systems: Preliminary report," AAAI-87, 1987, pp. 183-188.

19. B.J. Cragun and H.J. Steudel, "A decision-table-based processor for checking completeness and consistency in rule-based expert systems," Int. J. Man-Machine Stud., 26, 633-648 (1987).

20. H. Kleine Büning, U. Löwen, and S. Schmitgen, "Inconsistency of production systems," Data Knowl. Eng., 3, 245-260 (1989).

21. J.R. Quinlan, "Induction of decision trees," Mach. Learn., 1, 81-106 (1986).

22. R. Evertsz, "The automated analysis of rule-based systems, based on their procedural semantics," in IJCAI-91, Morgan Kaufmann, Los Altos, CA, 1991, pp. 22-27.

23. C.A. Lindley and J.K. Debenham, "The knowledge analyst's assistant: Description and preliminary evaluation," in Proceedings Australian Software Engineering Conference, IREE Australia, 1990.

24. L. Brownston, R. Farrell, E. Kant, and N. Martin, Programming Expert Systems in OPS-5, Addison-Wesley, Reading, MA, 1985.

25. P. Compton and R. Jansen, "A philosophical basis for knowledge acquisition," Knowl. Acquisition, 2, 241-258 (1990).

26. P. Compton, K. Horn, J.R. Quinlan, L. Lazarus, and K. Ho, "Maintaining an expert system," in Proceedings Fourth Australian Conference on Applications of Expert Systems, University of Technology, Sydney, Australia, 1988.

27. R.M. Colomb, Artificial Intelligence Applications of a Fine-Grained Parallel Machine, Technical Report TR-FB-90-08, CSIRO Division of Information Technology, Sydney, Australia, 1990.

28. T. Kohonen, Content-Addressable Memories, Springer-Verlag, Berlin, 1980.

29. L. Eshelman, "MOLE: A knowledge acquisition tool for cover-and-differentiate systems," in Automating Knowledge Acquisition for Expert Systems, S. Marcus, Ed., Kluwer, Boston, 1988, pp. 37-97.