
New Generation Computing, 17(1999) 25-52, OHMSHA, LTD. and Springer-Verlag

© OHMSHA, LTD. 1999

Cautious Induction: An Alternative to Clause-at-a-Time Hypothesis Construction in Inductive Logic Programming

Simon ANTHONY and Alan M. FRISCH
Intelligent Systems Group, Department of Computer Science, University of York, York YO10 5DD, United Kingdom
{simona, frisch}@cs.york.ac.uk

Received 28 February 1998 Revised manuscript received 9 June 1998

Abstract
Hypotheses constructed by inductive logic programming (ILP) systems are finite sets of definite clauses. Top-down ILP systems usually adopt the following greedy clause-at-a-time strategy to construct such a hypothesis: start with the empty set of clauses and repeatedly add the clause that most improves the quality of the set. This paper formulates and analyses an alternative method for constructing hypotheses. The method, called cautious induction, consists of a first stage, which finds a finite set of candidate clauses, and a second stage, which selects a finite subset of these clauses to form a hypothesis. By using a less greedy method in the second stage, cautious induction can find hypotheses of higher quality than can be found with a clause-at-a-time algorithm. We have implemented a top-down, cautious ILP system called CILS. This paper presents CILS and compares it to Progol, a top-down clause-at-a-time ILP system. The sizes of the search spaces confronted by the two systems are analysed and an experiment examines their performance on a series of mutagenesis learning problems.

Keywords: Machine Learning, Inductive Learning, Inductive Logic Programming, Cautious Induction.

§1 Introduction

Any object in the world around us can be classified as either a positive or negative example of a particular concept. For instance, some molecules are


carcinogens, some aren't. The task faced by an inductive logic programming (ILP) system is to find a set of rules (a hypothesis) that accurately explains the classification of these examples in terms of properties already known about them (background knowledge). The quality of a rule or a hypothesis is usually judged in terms of its classification accuracy and its size. Once found, a hypothesis can be used to classify additional examples that were not involved in its construction.

Many top-down ILP systems, such as FOIL6) and Progol,4) adopt the following greedy clause-at-a-time strategy to construct such a hypothesis: start with the empty set of clauses and repeatedly add the clause that most improves the quality of the set. Though ILP research has developed sophisticated methods that can successfully find the best clause at each iteration, surprisingly little research has attempted to overcome the inherent greediness of the clause-at-a-time strategy.

To see that the greediness of the clause-at-a-time strategy can lead to a suboptimal hypothesis being constructed, consider the following problem of learning rules to classify whether an individual is a student. The learner is given a training sample of eight examples of students, and has some background information about them, as shown in both tabular and graphical form in Fig. 1. The learner's goal is to find a hypothesis which contains the smallest number of rules, each of which is chosen from those listed in Table 1, that correctly classifies all eight positive examples.

A clause-at-a-time hypothesis construction algorithm would begin by selecting rule (R3) as it correctly classifies more examples (five) than any of the other possible rules. It would then select rule (R1) since it covers more of the remaining three examples than the other possible rules. Finally, with one example left uncovered, the algorithm could select rule (RE).*1 Thus, this clause-at-a-time strategy would find the three-rule hypothesis {R3, R1, RE} even though a smaller hypothesis, {R1, R2}, exists.

Notice that the learner's initial greedy commitment to rule (R3) has prevented the construction of this higher quality hypothesis. This problem may become increasingly acute for learning problems which have large training samples or large numbers of possible rules.

It seems that this shortcoming of clause-at-a-time hypothesis construction

Name   Under-graduate   Post-graduate   Lives on campus
Anne         ✗                ✓                ✓
Bill         ✓                ✗                ✗
Cath         ✓                ✗                ✓
Dave         ✗                ✓                ✓
Eva          ✗                ✓                ✗
Fred         ✓                ✗                ✓
Gill         ✗                ✓                ✓
Hank         ✓                ✗                ✗

(The graphical form of the figure is not reproduced.)

Fig. 1 Students and their background properties

*1 It could have also covered this last example with (R2).


Table 1 The possible rules

Possible Rules                                       Examples Covered
(R1) X is a student if X is an under-graduate.              4
(R2) X is a student if X is a post-graduate.                4
(R3) X is a student if X lives on campus.                   5
(RA) Anne is a student.                                     1
(RB) Bill is a student.                                     1
  ...
(RH) Hank is a student.                                     1
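The greedy behaviour described above is easy to reproduce mechanically. The following is a minimal sketch, not code from the paper: each rule is encoded simply as the set of positive examples it covers, with names taken from Fig. 1 and Table 1.

```python
# Coverage sets for the rules of Table 1 (the single-example rules RA..RH
# are generated from the student names).
rules = {
    "R1": {"Bill", "Cath", "Fred", "Hank"},          # under-graduates
    "R2": {"Anne", "Dave", "Eva", "Gill"},           # post-graduates
    "R3": {"Anne", "Cath", "Dave", "Fred", "Gill"},  # lives on campus
    **{f"R{name[0]}": {name} for name in
       ["Anne", "Bill", "Cath", "Dave", "Eva", "Fred", "Gill", "Hank"]},
}

def greedy_cover(rules, examples):
    """Clause-at-a-time: repeatedly add the rule covering most uncovered examples."""
    uncovered, chosen = set(examples), []
    while uncovered:
        best = max(rules, key=lambda r: len(rules[r] & uncovered))
        chosen.append(best)
        uncovered -= rules[best]
    return chosen

students = {"Anne", "Bill", "Cath", "Dave", "Eva", "Fred", "Gill", "Hank"}
print(greedy_cover(rules, students))          # ['R3', 'R1', 'R2']: three rules
print(rules["R1"] | rules["R2"] == students)  # True: two rules suffice
```

Running the sketch shows the greedy strategy committing to R3 and ending with three rules, even though {R1, R2} alone covers all eight students.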

could be remedied by separating the task of finding single rules from that of deciding which rules should appear in a hypothesis. With this in mind, we propose a strategy called cautious induction, which has two stages:

Stage 1: A single, complete search through the rule space is performed and a finite set of candidate rules is retained.

Stage 2: A subset of these candidate rules is then selected to form a hypothesis, attempting to maximise some hypothesis quality function.

This paper describes cautious induction in an ILP setting. A top-down search of the rule space is developed, called cautious refinement, which accomplishes Stage 1. This algorithm, together with two standard weighted subset-cover algorithms for Stage 2, is implemented in the CILS system. CILS is compared against Progol,4) a state-of-the-art ILP system, by considering

• the complexity of the rule space searched by the two learners, and
• their performances on a large, complex dataset.

It seems clear that if a complete algorithm for weighted subset-cover is used for Stage 2, CILS will, in general, produce higher quality hypotheses than clause-at-a-time hypothesis construction algorithms, such as Progol. Surprisingly, our experiments also suggest that cautious induction can produce higher quality hypotheses than Progol when a greedy approximation algorithm for subset-cover is used.

The rest of this paper is organised as follows. The learning problem addressed by the paper is defined in Section 2. Section 3 describes cautious induction within this setting and presents an algorithm for cautious refinement. Section 4 describes the implementation of cautious induction in the CILS system. CILS is compared to Progol with a complexity analysis in Section 5 and an empirical analysis in Section 6. After a brief survey of some related work in Section 7, we draw our conclusions in Section 8.

We conclude this section with some comments on the notation and terminology used throughout this paper. The names of functions and relations are written in SMALL CAPITALS except for refinement operators, which are denoted ρ by convention. The names of sets are written in capitals and the cardinality of a set S is denoted |S|. The word clause is taken to mean definite clause throughout this paper. A clause is represented as a finite set of literals and


usually denoted c or c′ by convention. The head of a clause is the single positive literal in a clause, whilst the body is the possibly empty set of negative literals in a clause. The size of a clause, denoted |c|, is the cardinality of c.

§2 The Learning Problem

This paper assumes that learning takes place in the following setting.

The teacher selects a target concept and provides the learner with a training sample comprising a finite set of positive examples and a finite set of negative examples. The positive examples are those that the teacher indicates as being examples of the target concept; likewise, the negative examples are those indicated as not being examples of the target concept. The teacher's indications may be incorrect (that is, there may be classification noise) and the same example may be present in both the positive set and the negative set. The learner also has access to a finite, possibly empty set of background concepts, defined in its background knowledge. From the training sample and its background knowledge, the learner uses a hypothesis quality function to guide the construction of a high quality hypothesis of the target concept. High quality hypotheses

• agree as much as possible with the teacher's classification of the examples in the training sample, and
• consist of a small number of short rules.

Small, simple hypotheses are preferred as they are thought to provide better explanations than large, complex ones. Additionally, hypotheses should be accurate in their classification of previously unseen examples. This cannot be assessed directly during learning and thus we assume that the training sample contains examples that are typical (or representative) of any example the hypothesis must subsequently classify.

Within this setting we study a learning problem that is considered "top-down" in that its definition employs a refinement operator to define the set of clauses that can be included in a hypothesis. We refer to this problem as the top-down ILP problem or simply as the problem. An instance of the top-down ILP problem is a tuple (P, ℬ, T, B, IC, ρ) where

• P is a predicate symbol with finite arity called the target predicate, which represents the target concept.

• ℬ is a finite, possibly empty set of predicate symbols, each with finite arity, which does not contain P. Each symbol is called a background predicate and represents a background concept.

• T, the training sample, is a pair of finite sets whose members are atomic formulae that have the predicate symbol P. These two sets are the positive examples, written T⁺, and the negative examples, written T⁻.

• B, the background knowledge, is a finite, possibly empty set of clauses that define the predicates in ℬ. All literals in these clauses have predicate symbols from ℬ.

• IC, the initial clauses, is a finite, possibly empty set of clauses, each containing a single positive literal whose predicate symbol is P.


• ρ, a refinement operator, is a function mapping a single clause c, whose head literal has predicate P and whose body literals each have a predicate from ℬ, to a finite, possibly empty set of clauses. Each clause in this set is of the form c ∪ {¬l}, where ¬l is a negative literal that is not a member of c and whose predicate symbol is a member of ℬ. The clause c ∪ {¬l} is called a refinement of c.
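A refinement operator of this restricted form can be sketched directly in code. The following is our illustration, not notation from the paper; the literal pool and predicate names are hypothetical.

```python
# Sketch of a refinement operator: each refinement adds exactly one negative
# body literal (represented here as a plain string) not already in the clause.
def rho(clause, candidate_body_literals):
    return [clause | {lit} for lit in candidate_body_literals if lit not in clause]

c = frozenset({"student(X)"})                      # an initial clause: head only
pool = ["not under_grad(X)", "not post_grad(X)"]   # hypothetical literal pool
refinements = rho(c, pool)                         # two refinements, each of size 2
```

Each application grows the clause by one literal, which is why the refinement graph defined below is acyclic and node depth equals clause size.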

We now turn our attention to defining what constitutes a solution to a problem instance. Informally, a problem instance induces a clause space which is the set of all clauses that can be produced by repeated application of the refinement operator to the set of initial clauses. Every finite subset of the clause space is a potential hypothesis and a solution to the problem instance is a potential hypothesis that maximises a particular hypothesis quality function known as quasi-compression. We first define these principal concepts (clause space and quasi-compression) and then define the solutions to a problem instance.

Definition 2.1 (Clause Space)
Let I = (P, ℬ, T, B, IC, ρ) denote a problem instance. The clause space for I, denoted C_I, is the smallest set such that

• C_I ⊇ IC, and
• if c ∈ C_I then C_I ⊇ ρ(c).

For a problem instance I = (P, ℬ, T, B, IC, ρ) a directed graph, known as the refinement graph for I, can be constructed. The nodes in the refinement graph are clauses from C_I and there is an edge from a node c to a node c′ if and only if c′ is a member of ρ(c). Refinement graphs are acyclic since a refinement of a clause has one more literal than the clause itself. Paths in a refinement graph are known as refinement paths. If there exists a refinement path of length at least 1 from a clause c to a clause c′, clause c is said to be an ancestor of c′, and, conversely, clause c′ is said to be a descendant of c.

The root nodes of a refinement graph, that is the clauses in IC, are said to have depth 1. If a node has depth n then each of its refinements has depth n + 1. Therefore, the depth of a node in the graph is precisely the size of the clause constituting that node.

The quasi-compression of a hypothesis is a function of the number of literal occurrences in the hypothesis and of the positive and negative examples that it covers. We first define the coverage of a hypothesis and then define quasi-compression.

Definition 2.2 (Coverage)
Let I = (P, ℬ, T, B, IC, ρ) denote a problem instance and H a finite subset of C_I. The positive examples of I covered by H, written I⁺(H), and the negative examples of I covered by H, written I⁻(H), are defined as

I⁺(H) = {e ∈ T⁺ | e is entailed by B ∪ H}, and

I⁻(H) = {e ∈ T⁻ | e is entailed by B ∪ H}.


By extension, the positive and negative examples covered by a clause c ∈ C_I are precisely the positives and negatives covered by the hypothesis {c}, and are denoted I⁺(c) and I⁻(c) respectively.

Notice that the following relationship exists between the coverage and size of a clause and that of its refinements.

Proposition 2.1
Let I denote a problem instance and let c and c′ denote two clauses from C_I where c′ is a descendant of c. Then

I⁺(c) ⊇ I⁺(c′), I⁻(c) ⊇ I⁻(c′), and

|c| + 1 ≤ |c′|.

Definition 2.3 (Quasi-Compression)
Let I denote a problem instance and H a finite subset of C_I. The quasi-compression function for I, denoted QC_I, is given by:

QC_I(H) = |I⁺(H)| − Σ_{h ∈ H} (|I⁻(h)| + |h|).

By extension, the quasi-compression of a clause c is defined to be the quasi-compression of the hypothesis {c} and is written QC_I(c).

Notice that this function captures the desirable properties of hypotheses given in Section 2.

Proposition 2.2
Let I denote a problem instance and A and B finite subsets of C_I. Then

QC_I(A ∪ B) = QC_I(A) + QC_I(B) − |I⁺(A) ∩ I⁺(B)|,

from which it directly follows that

QC_I(A ∪ B) − QC_I(A) ≤ QC_I(B).

We are now able to define the solutions to an instance I of the top-down ILP problem. Recall that a potential hypothesis for I is any finite subset of the clause space C_I. A potential hypothesis H for I is a solution of I if and only if QC_I(H) ≥ QC_I(H′) for every potential hypothesis H′ for I. Notice that the empty set is a potential hypothesis for all problem instances, even those that have an empty clause space. Since QC_I(∅) = 0, every problem instance has a solution with non-negative quasi-compression.
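These definitions are easy to exercise numerically. The sketch below (our own, with hypothetical example names) computes quasi-compression from per-clause coverage records and can be used to check the identity of Proposition 2.2 and the fact that QC_I(∅) = 0.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Clause:
    pos: frozenset  # I+(c): the positive examples the clause covers
    neg: frozenset  # I-(c): the negative examples the clause covers
    size: int       # |c|: the number of literals in the clause

def qc(hypothesis):
    """QC_I(H) = |I+(H)| - sum over h in H of (|I-(h)| + |h|)."""
    covered = set().union(*(c.pos for c in hypothesis))
    return len(covered) - sum(len(c.neg) + c.size for c in hypothesis)

c1 = Clause(frozenset({"e1", "e2", "e3"}), frozenset(), 2)
c2 = Clause(frozenset({"e3", "e4"}), frozenset({"n1"}), 2)

print(qc(set()))     # the empty hypothesis: 0
print(qc({c1, c2}))  # 4 positives covered - (0 + 2) - (1 + 2) = -1
```

On this pair, qc({c1, c2}) = qc({c1}) + qc({c2}) − |I⁺(c1) ∩ I⁺(c2)| = 1 + (−1) − 1 = −1, as Proposition 2.2 requires.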

Our definition of the top-down ILP problem imposes three restrictions that are worth discussing.

(1) The hypothesis quality function is fixed. A variety of hypothesis quality functions could be considered by specifying them as an extra parameter of each problem instance. This has not been done because the novel pruning


method used during the search for candidate clauses exploits the properties of quasi-compression.

(2) Hypotheses are non-recursive. This is a consequence of the requirement that the predicate symbol that appears in the head of all hypothesis clauses does not appear in their bodies. This restriction is also exploited by the pruning method.

(3) The refinement operator does nothing other than add a single literal to a clause. That refinement cannot produce a new clause by substitution does not limit the set of refinements of a particular clause. For example, though refinement cannot produce the clause {P(x), ¬Q(x, a)} from {P(x), ¬Q(x, y)}, refinement could produce it directly from {P(x)}. Furthermore, by using equality as a background predicate, refinement could produce the clause {P(x), ¬Q(x, y), ¬(y = a)} from {P(x), ¬Q(x, y)}.

§3 Cautious Induction

Given a problem instance I, the aim of cautious induction is to find any one of the solutions of I. The two stages of cautious induction can now be defined:

Stage 1: Cautious Refinement: A single, complete search through the refinement graph for I is performed and a finite set of candidate clauses for I, known as the candidate set for I, is retained. This candidate set must be a superset of at least one solution of I, and must be finite despite the possibly infinite cardinality of C_I.

Stage 2: Clause Selection: A solution for I is then selected from these candidate clauses.

Our work so far has concentrated on the development of a cautious refinement algorithm. For clause selection we have only considered the use of standard algorithms for weighted subset cover.*2 Consequently, the remainder of this section presents our development of a cautious refinement algorithm. Section 3.1 identifies two conditions that justify excluding a clause from the candidate set and Section 3.2 identifies two conditions that justify pruning subtrees of the refinement graph while searching it. Finally, Section 3.3 presents a cautious refinement algorithm based on these conditions.

3.1 The Candidate Set
This subsection identifies two conditions that justify excluding a clause from the candidate set produced by cautious refinement. The first condition is that a clause can be excluded if it is not compressive; the second is that a clause can be excluded if there is another clause superior to it in the candidate set. We now define and examine these two conditions.

*2 Cormen, Leiserson and Rivest1) discuss the subset cover problem and present a greedy approximation algorithm for the problem.


Definition 3.1 (Compressive)
Let I denote a problem instance and c a clause from C_I. Clause c is compressive for I if and only if QC_I(c) > 0.

Theorem 3.1
Given a problem instance I, a clause c from C_I is absent from every solution of I if c is non-compressive.

Proof
We assume that c is a non-compressive clause and show that it is not a member of any solution. From Proposition 2.2 it follows that for any potential hypothesis H

QC_I(H ∪ {c}) − QC_I(H) ≤ QC_I(c).

This observation, coupled with the assumption that QC_I(c) < 0, shows that QC_I(H) > QC_I(H ∪ {c}). Thus, any potential hypothesis containing c is not a solution since its quasi-compression can be increased by removing c. ∎

Therefore the candidate set need only contain compressive clauses, of which there are a finite number, as is shown by the following theorem.

Theorem 3.2
Let I = (P, ℬ, T, B, IC, ρ) denote a problem instance. The clause space C_I contains only a finite number of compressive clauses, none of which is deeper than depth |T⁺| in the refinement graph for I.

Proof
Let c ∈ C_I be a clause whose depth in the refinement graph exceeds |T⁺|. Since the depth of a clause is equal to its size, |c| > |T⁺|. Furthermore, since |T⁺| ≥ |I⁺(c)|, it is the case that |c| > |I⁺(c)|. Therefore QC_I(c), which is defined to be |I⁺(c)| − |I⁻(c)| − |c|, must be negative. Given that the cardinality of IC is finite and that the maximum number of refinements of any clause is finite, the number of compressive clauses in C_I must also be finite. ∎
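A quick numeric illustration of this bound (our sketch, not from the paper): with |T⁺| = 3, even a clause with perfect positive coverage and no negative coverage has negative quasi-compression once its size, and hence its depth, exceeds 3.

```python
def qc(n_pos, n_neg, size):
    # QC_I(c) = |I+(c)| - |I-(c)| - |c| for a single clause
    return n_pos - n_neg - size

# Best case at each depth: all 3 positives covered, no negatives covered.
print([qc(3, 0, d) for d in range(1, 6)])  # [2, 1, 0, -1, -2]
```

Beyond depth |T⁺| the size term alone outweighs the best possible positive coverage, so the search for compressive clauses is confined to a finite portion of the refinement graph.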

Superiority is the second condition that allows clauses to be justifiably excluded from the candidate set. Let us begin by observing that for a given problem instance I, a clause c′ ∈ C_I need not be included in the candidate set for I if there exists another clause c in the candidate set for I such that

QC_I(H ∪ {c}) ≥ QC_I(H ∪ {c′}) for all finite H ⊆ C_I − {c′}.

Since this condition involves every potential hypothesis, it is not of any direct use in restricting the set of candidate clauses. Below we give an alternative condition that does not involve quantification over potential hypotheses; we then prove that the condition is equivalent to the one above and that it is a quasi-ordering.

Definition 3.2 (Superiority)
Let I denote a problem instance and c and c′ two clauses from C_I. Clause c is


superior to c′ with respect to I, denoted c ≥_I c′, if and only if

|I⁺(c) ∩ I⁺(c′)| − |I⁻(c)| − |c| ≥ QC_I(c′).

Theorem 3.3
Let I denote a problem instance and c and c′ two clauses from C_I. Then, c is superior to c′ if and only if

for all finite H ⊆ C_I − {c′}, QC_I(H ∪ {c}) ≥ QC_I(H ∪ {c′}).

Proof
For all finite H ⊆ C_I − {c′},

QC_I(H ∪ {c}) ≥ QC_I(H ∪ {c′})

iff (by Proposition 2.2) for all finite H ⊆ C_I − {c′},

QC_I(H) + QC_I(c) − |I⁺(H) ∩ I⁺(c)| ≥ QC_I(H) + QC_I(c′) − |I⁺(H) ∩ I⁺(c′)|

iff (by cancelling QC_I(H) and re-arranging) for all finite H ⊆ C_I − {c′},

QC_I(c) − QC_I(c′) ≥ |I⁺(H) ∩ I⁺(c)| − |I⁺(H) ∩ I⁺(c′)|

iff (by moving the quantifier for H to the right-hand side)

QC_I(c) − QC_I(c′) ≥ max_{H ⊆ C_I − {c′}} (|I⁺(H) ∩ I⁺(c)| − |I⁺(H) ∩ I⁺(c′)|).   (1)

At this point, notice that |I⁺(H) ∩ I⁺(c)| − |I⁺(H) ∩ I⁺(c′)| can be written as:

|I⁺(H) ∩ ((I⁺(c) − I⁺(c′)) ∪ (I⁺(c) ∩ I⁺(c′)))|
− |I⁺(H) ∩ ((I⁺(c′) − I⁺(c)) ∪ (I⁺(c) ∩ I⁺(c′)))|

which, by distributing ∩ over ∪, is equal to:

|(I⁺(H) ∩ (I⁺(c) − I⁺(c′))) ∪ (I⁺(H) ∩ I⁺(c) ∩ I⁺(c′))|
− |(I⁺(H) ∩ (I⁺(c′) − I⁺(c))) ∪ (I⁺(H) ∩ I⁺(c) ∩ I⁺(c′))|

which, since |A ∪ B| = |A| + |B| for any two disjoint sets A and B, is equal to:

|I⁺(H) ∩ (I⁺(c) − I⁺(c′))| + |I⁺(H) ∩ I⁺(c) ∩ I⁺(c′)|
− (|I⁺(H) ∩ (I⁺(c′) − I⁺(c))| + |I⁺(H) ∩ I⁺(c) ∩ I⁺(c′)|)

which, by simplification, is equal to:

|I⁺(H) ∩ (I⁺(c) − I⁺(c′))| − |I⁺(H) ∩ (I⁺(c′) − I⁺(c))|.

This expression is maximised when H = {c}. Therefore, continuing the proof from equation (1) with H = {c} gives

QC_I(c) − QC_I(c′) ≥ |I⁺(c)| − |I⁺(c) ∩ I⁺(c′)|


iff (by the definition of quasi-compression)

|I⁺(c)| − |I⁻(c)| − |c| − QC_I(c′) ≥ |I⁺(c)| − |I⁺(c) ∩ I⁺(c′)|

iff (by subtracting |I⁺(c)| from both sides and re-arranging)

|I⁺(c) ∩ I⁺(c′)| − |I⁻(c)| − |c| ≥ QC_I(c′). ∎

Theorem 3.4
For every problem instance I the superiority relation ≥_I is a quasi-ordering over C_I; that is, ≥_I is both reflexive and transitive.

Proof
In the following, let c, c′ and c″ denote three clauses from C_I.

Reflexivity: Clearly c ≥_I c since

|I⁺(c) ∩ I⁺(c)| − |I⁻(c)| − |c| ≥ QC_I(c).

Hence ≥_I is reflexive.

Transitivity: Assume that c ≥_I c′ and c′ ≥_I c″. We begin the proof with the following two observations. Firstly, because c ≥_I c′,

|I⁺(c) ∩ I⁺(c′)| − |I⁻(c)| − |c| ≥ |I⁺(c′)| − |I⁻(c′)| − |c′|

iff (by re-arranging)

|I⁺(c) ∩ I⁺(c′)| − |I⁺(c′)| ≥ |I⁻(c)| + |c| − |I⁻(c′)| − |c′|.   (2)

Secondly, since I⁺(c) ∩ I⁺(c′) ⊆ I⁺(c′), notice that:

|I⁺(c′)| − |I⁺(c′) ∩ I⁺(c″)| ≥ |I⁺(c) ∩ I⁺(c′)| − |I⁺(c) ∩ I⁺(c′) ∩ I⁺(c″)|

iff (by re-arranging)

|I⁺(c) ∩ I⁺(c′) ∩ I⁺(c″)| − |I⁺(c′) ∩ I⁺(c″)| ≥ |I⁺(c) ∩ I⁺(c′)| − |I⁺(c′)|.   (3)

By transitivity, equations (2) and (3) entail:

|I⁺(c) ∩ I⁺(c′) ∩ I⁺(c″)| − |I⁺(c′) ∩ I⁺(c″)| ≥ |I⁻(c)| + |c| − |I⁻(c′)| − |c′|

iff (by re-arranging)

|I⁺(c) ∩ I⁺(c′) ∩ I⁺(c″)| − |I⁻(c)| − |c| ≥ |I⁺(c′) ∩ I⁺(c″)| − |I⁻(c′)| − |c′|.   (4)


Notice that, since c′ ≥_I c″, it follows that

|I⁺(c′) ∩ I⁺(c″)| − |I⁻(c′)| − |c′| ≥ QC_I(c″).   (5)

Therefore, by transitivity, equations (4) and (5) entail:

|I⁺(c) ∩ I⁺(c′) ∩ I⁺(c″)| − |I⁻(c)| − |c| ≥ QC_I(c″),

and (since I⁺(c) ∩ I⁺(c′) ∩ I⁺(c″) ⊆ I⁺(c) ∩ I⁺(c″), removing the intersection with I⁺(c′) cannot decrease the left-hand side)

|I⁺(c) ∩ I⁺(c″)| − |I⁻(c)| − |c| ≥ QC_I(c″).

Hence c ≥_I c″, and so ≥_I is transitive. ∎

Clause c ∈ C_I is said to be maximally superior for I if there is no clause c′ ∈ C_I such that c′ ≥_I c and c ≱_I c′. Hence, Theorem 3.3 tells us that any solution to a problem instance contains only maximally superior clauses. Furthermore, recall that cautious induction aims to find any one of the solutions of I. Therefore, in cases where two clauses are superior to each other at most one is needed in the candidate set.

From the results of this subsection we see that it suffices for the candidate set to be a complete incomparable set of compressive clauses, as defined below.

Definition 3.3 (Complete Incomparable Set of Compressive Clauses)
Let I be a problem instance. A set of clauses S ⊆ C_I is a complete incomparable set of compressive clauses with respect to I if it is

• Complete: for any compressive clause c′ ∈ C_I, there exists a clause c ∈ S such that c ≥_I c′,
• Incomparable: for any distinct c, c′ ∈ S, c ≱_I c′, and
• Compressive: S contains only compressive clauses for I.

Notice that by Theorem 3.2 this set of clauses is always finite.

3.2 Pruning the Search Space
In searching the refinement graph for maximally superior compressive clauses, the cautious refinement algorithm should attempt to determine when a clause's descendants need not be examined. Indeed this pruning is necessary to ensure that the search only considers a finite set of clauses even if the clause space is infinite. This subsection identifies two conditions that justify pruning a clause's descendants. The first condition is that if a clause is not D-compressive then its descendants can be pruned; the justification, as proved below, is that in this case none of the descendants are compressive. The second condition is that the descendants of a clause c′ can be pruned if the candidate set contains a clause c that is D-superior to c′; the justification, as proved below, is that in this case c is superior to every descendant of c′.


Definition 3.4 (D-compression)
Let I denote a problem instance and let c denote a clause from C_I. The D-compression of c with respect to I, denoted DC_I(c), is given by:

DC_I(c) = |I⁺(c)| − (|c| + 1).

We say that c is D-compressive with respect to I if DC_I(c) ≥ 0.

Lemma 3.1
Let I denote a problem instance and c and c′ two clauses from C_I. If c′ is a descendant of c then DC_I(c) ≥ DC_I(c′) + 1.

Proof
From Proposition 2.1 we have I⁺(c) ⊇ I⁺(c′) and |c| + 1 ≤ |c′|. Hence

|I⁺(c)| ≥ |I⁺(c′)|, and   (6)

−|c| − 1 ≥ −|c′|.   (7)

Adding the left-hand sides of equations (6) and (7) together and their right-hand sides together yields

|I⁺(c)| − |c| − 1 ≥ |I⁺(c′)| − |c′|,

which, by the definition of D-compression, gives

DC_I(c) ≥ DC_I(c′) + 1. ∎

Theorem 3.5
Let I denote a problem instance and c and c′ two clauses from C_I. If c′ is a descendant of c then DC_I(c) ≥ QC_I(c′).

Proof
Let clause c′ be a descendant of clause c. From Lemma 3.1 we know that DC_I(c) ≥ DC_I(c′) + 1. Observe that for any clause in general, and c′ in particular, DC_I(c′) + 1 ≥ QC_I(c′). Thus, by transitivity, DC_I(c) ≥ QC_I(c′). ∎

Therefore, if a clause is not D-compressive none of its descendants are compressive and hence they can all be pruned.

Definition 3.5 below introduces D-superiority, which is followed by a theorem showing that if c is D-superior to c′ then c is superior to all the descendants of c′.

Definition 3.5 (D-superiority)
Let I denote a problem instance and let c and c′ denote two clauses from C_I. Clause c is D-superior to c′ with respect to I, denoted c ▷_I c′, if and only if

|I⁺(c) ∩ I⁺(c′)| − |I⁻(c)| − |c| ≥ DC_I(c′).

Theorem 3.6
Let I denote a problem instance and c and c′ two clauses from C_I. If c ▷_I c′ then for all descendants c″ ∈ C_I of c′, c ≥_I c″.


Proof
Let c, c′ and c″ be three clauses in C_I such that c″ is a descendant of c′ and c ▷_I c′. By the definition of D-superiority,

|I⁺(c) ∩ I⁺(c′)| − |I⁻(c)| − |c| ≥ DC_I(c′)

and thus, by the definition of D-compression,

|I⁺(c) ∩ I⁺(c′)| − |I⁻(c)| − |c| ≥ |I⁺(c′)| − (|c′| + 1)

iff (by re-arranging)

|I⁺(c) ∩ I⁺(c′)| − |I⁺(c′)| ≥ |I⁻(c)| + |c| − (|c′| + 1).   (8)

By Proposition 2.1, I⁺(c′) ⊇ I⁺(c″), and hence

|I⁺(c) ∩ I⁺(c″)| − |I⁺(c″)| ≥ |I⁺(c) ∩ I⁺(c′)| − |I⁺(c′)|.   (9)

By transitivity, equations (8) and (9) entail:

|I⁺(c) ∩ I⁺(c″)| − |I⁺(c″)| ≥ |I⁻(c)| + |c| − (|c′| + 1)

iff (by re-arranging)

|I⁺(c) ∩ I⁺(c″)| − |I⁻(c)| − |c| ≥ |I⁺(c″)| − (|c′| + 1).

Again, by Proposition 2.1, |c′| + 1 ≤ |c″|, and hence

|I⁺(c) ∩ I⁺(c″)| − |I⁻(c)| − |c| ≥ |I⁺(c″)| − |c″|

and, if |I⁻(c″)| is subtracted from the right-hand side,

|I⁺(c) ∩ I⁺(c″)| − |I⁻(c)| − |c| ≥ |I⁺(c″)| − |I⁻(c″)| − |c″|.

Hence, by the definition of superiority, c ≥_I c″. ∎

Therefore, for a problem instance I, all the descendants of a clause c′ ∈ C_I can be soundly pruned if either:

• DC_I(c′) < 0, since no descendant of c′ can be compressive, or
• the candidate set already contains a clause c ∈ C_I such that c ▷_I c′.

3.3 The Cautious Refinement Algorithm
This section examines our cautious refinement algorithm, which is shown in Fig. 2. Given a problem instance I the algorithm performs a generic search through the refinement graph of C_I. During the search a set of candidate clauses is collected in Can, which is returned when the algorithm terminates. Lines (6) and (7) implement the two candidacy conditions (compressive and superior) and line (9) implements the two pruning conditions (D-compressive and D-superior).

The following two theorems establish the correctness of the algorithm. The first proves that the algorithm always terminates and the second proves that upon termination it returns a complete incomparable set of compressive clauses.


Input: a problem instance I = ⟨P, B, T, B, IC, ρ⟩
Output: a candidate set Can

(1)  let Open = IC
(2)  let Can = ∅
(3)  while Open ≠ ∅
(4)      let c' be an arbitrary member of Open
(5)      let Open = Open − {c'}
(6)      if QC_I(c') > 0 and ∀c ∈ Can, c ≱_I c' then
(7)          let Can = Can − {c'' | c'' ∈ Can and c' ≥_I c''}
(8)          let Can = Can ∪ {c'}
(9)      if DC_I(c') > 0 and ∀c ∈ Can, c ⋫_I c' then
(10)         let Open = Open ∪ ρ(c')
(11) end while
(12) return Can

Fig. 2 A Cautious Refinement Algorithm
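The control structure of Fig. 2 can be sketched in executable form. This is a hedged illustration, not CILS itself: the parameters `qc`, `dc`, `superior`, `d_superior` and `refine` are hypothetical stand-ins for QC_I, DC_I, the superiority test ≥_I, the D-superiority test ▷_I and the refinement operator ρ.

```python
def cautious_refinement(initial_clauses, refine, qc, dc, superior, d_superior):
    """Generic search of Fig. 2; all parameter functions are
    hypothetical stand-ins for QC_I, DC_I, >=_I, |>_I and rho."""
    open_set = list(initial_clauses)                  # lines (1)-(2)
    candidates = []
    while open_set:                                   # line (3)
        c1 = open_set.pop()                           # lines (4)-(5)
        # lines (6)-(8): keep c1 if compressive and no candidate is superior
        if qc(c1) > 0 and not any(superior(c, c1) for c in candidates):
            candidates = [c for c in candidates if not superior(c1, c)]
            candidates.append(c1)
        # lines (9)-(10): expand c1 if D-compressive and no candidate D-superior
        if dc(c1) > 0 and not any(d_superior(c, c1) for c in candidates):
            open_set.extend(refine(c1))
    return candidates                                 # line (12)
```

As a toy run, representing clauses as frozensets of body literals and refining by adding one literal up to size two collects every clause in the graph exactly once, since superiority is reflexive and duplicates fail the test in line (6).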

Theorem 3.7  The algorithm in Fig. 2 terminates on all inputs.

Proof  Let I be the problem instance that is input to the algorithm. If c is a clause in C_I, let INDEX_I(c) be the greater of zero and DC_I(c) + 1; if S is a set of clauses, let INDEX_I(S) be the multiset {INDEX_I(c) | c ∈ S}. On each iteration a D-compressive clause c' is removed from Open and either discarded or replaced with all of its refinements. In particular, if INDEX_I(c') = 0 then c' is discarded. If c' is replaced by its refinements, then INDEX_I(c') > 0 and, from Lemma 3.1, it follows that INDEX_I(c') is strictly greater than the INDEX_I of each refinement of c'. Hence on each iteration of the algorithm INDEX_I(Open) decreases in the multiset ordering. Therefore, since the multiset ordering is well-founded, the algorithm terminates. ∎

Theorem 3.8  Given a problem instance I as input, the algorithm in Fig. 2 returns a finite complete incomparable set of compressive clauses for I.

Proof  We shall prove each of the three conditions from Definition 3.3 in turn.

Complete: First notice that the completeness of Can cannot be affected by line (7), since any time this line removes a clause from Can, a clause superior to it is added to Can in line (8). Hence we consider the algorithm with line (7) removed. In this modified algorithm, elements are never removed from Can and hence the set grows monotonically. Therefore, to show that the algorithm without line (7) produces a complete set of compressive clauses, it suffices to show that if c is a clause in C_I then either c is non-compressive or at some point during the execution of the algorithm Can contains a clause that is superior to c. Consider two cases:

• c is added to Open at some point in the execution of the algorithm: In this


case, since the algorithm only terminates when Open is empty, c must be removed from Open. This removal must happen in line (5), whereupon the test in line (6) is performed immediately. Either

  - the first part of the test fails: thus c is not compressive,
  - the second part of the test fails: thus Can contains a clause superior to c, or
  - the test succeeds: whereupon c is added to Can and, because superiority is reflexive, Can contains a clause superior to c.

• c is never added to Open: In this case, some ancestor of c was pruned by the test in line (9). Therefore, either the first part of the test succeeded (thus, by Theorem 3.5, c is not compressive) or the second part of the test succeeded (thus, by Theorem 3.6, Can contains a clause that is superior to c).

Incomparable and Compressive: At the start of the algorithm Can is the empty set, and hence is incomparable and compressive. We now argue that both properties are maintained throughout the execution of the algorithm. Can is modified only in lines (7) and (8) of the algorithm. Line (7) only removes elements from Can, hence the set remains incomparable and compressive after this line is executed. Line (8) adds c' to Can. In this case

• c' is compressive, since if it were not the test in line (6) would not allow line (8) to be executed,
• there is no c ∈ Can such that c ≥_I c', since if there were the test in line (6) would not allow line (8) to be executed,
• there is no c ∈ Can such that c' ≥_I c, since any such c would be removed from Can in line (7).

Hence, Can remains incomparable and compressive after line (8) is executed. ∎

§4  CILS: A Cautious Induction System

CILS is a top-down cautious induction ILP system that we have designed and implemented in Yap Prolog (version 4). This section describes the implementation of each of the system's two stages.

4.1  Cautious Refinement in CILS

To ask CILS to solve a problem instance I = ⟨P, B, T, B, IC, ρ⟩, the user directly specifies the first four elements of this six-tuple. CILS employs a system of usage declarations whereby a user can specify which of a fixed range of initial clauses and refinement operators the program is to use. These usage declarations are similar to the mode declarations employed by Progol. If CILS is provided with a finite set U of usage declarations, we write ρ_U and IC_U to denote the refinement operator and initial clauses, respectively, that it uses. We first describe how ρ_U is obtained from U and then how IC_U is obtained.

Each usage declaration is an atomic formula that is ordinary in all respects


except that it contains no variables other than the meta-variables V_input, V_output and Const. An atom l is said to pattern conform to a usage declaration u if l can be obtained from u by replacing all occurrences of V_input and V_output by object language variables and all occurrences of Const by object language constants. It is important to notice that this replacement differs from substitution in that different occurrences of the same meta-variable may be replaced by different object language symbols.

In refining clause c ∈ C_I, the literal ¬l can be added only if l pattern conforms to some usage declaration u. Furthermore, in obtaining l by replacing the meta-variables in u, all occurrences of V_input must be replaced by variables that occur in c. If this is the case, then we say that (l, c) I/O conforms to u.

With this understanding of usage declarations, we now define ρ_U and IC_U.

ρ_U(c) = {c ∪ {¬l} | ¬l is a literal whose predicate symbol is in B and    (10)
          ¬l ∉ c and, for some u ∈ U, l pattern conforms to u and
          (l, c) I/O conforms to u and, if u contains "Const", then
          I⁺(c ∪ {¬l}) ≠ ∅}.

IC_U = {l | l is a positive literal whose predicate symbol is P and        (11)
          for some u ∈ U, l pattern conforms to u and,
          if u contains "Const", then I⁺({l}) ≠ ∅}.

Recall that the definition of the top-down ILP problem stipulates that in every problem instance, both the set of initial clauses and every set of refinements produced by the refinement operator are finite. Section 5 presents proofs that this is indeed the case for ρ_U and IC_U.

For the most part, cautious refinement in CILS is a straightforward implementation of the algorithm in Fig. 2, where ρ_U is used as the refinement operator and IC_U as the initial clauses. Although the algorithm in Fig. 2 is presented as a generic search procedure that arbitrarily chooses nodes from Open, CILS implements this as a depth-first search. CILS allows the user to specify a maximum clause size, thus limiting the depth of this search. However, as is shown by Theorem 3.7, the search terminates even without imposing such a restriction.

CILS must compute coverage, and hence entailment, in order to compute quasi-compression, superiority, D-compression, D-superiority, ρ_U and IC_U. In all cases, CILS must test whether a finite set of definite clauses entails a ground atom. This is done using SLD resolution in Prolog. However, this set of coverage tests is not decidable and the Prolog deduction mechanism is not guaranteed to terminate. So the best that can be done is to approximate the tests, and CILS does this by bounding the depth of the SLD search using a user-supplied bound.
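The effect of depth-bounding can be illustrated in a deliberately simplified, propositional setting (CILS itself works with first-order definite clauses via Prolog's SLD resolution; the dictionary-based knowledge base here is purely illustrative):

```python
def entails(kb, goal, depth):
    """Depth-bounded, SLD-style proof of a single atom from a
    propositional definite program. `kb` maps each head atom to a
    list of bodies (each body a list of atoms). Returns False both
    on genuine failure and when the depth bound is exhausted, so it
    only approximates entailment, as CILS' coverage tests do."""
    if depth < 0:
        return False
    for body in kb.get(goal, []):
        if all(entails(kb, atom, depth - 1) for atom in body):
            return True
    return False
```

With `kb = {'a': [['b', 'c']], 'b': [[]], 'c': [['a']]}`, the query `'b'` succeeds immediately, while `'a'` loops through `'c'` and back; the bound forces the prover to give up instead of running forever.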

According to the definition of ρ_U(c) in equation (10), refinements produced from a usage declaration u containing "Const" must cover at least one positive example. A straightforward algorithm for producing such refinements would generate all combinations of constants to replace the occurrences of Const in u and test that each resulting clause covers at least one positive example. In fact,


CILS does something much more efficient than this: it replaces each occurrence of Const in u with a distinct variable, producing a literal l', and then, for each positive example e, constructs all SLD proofs of e from B ∪ {c ∪ {¬l'}}. From each successful proof, if any, a value for each constant position can be extracted by examining how the proof instantiated the variables of l'. A similar method is used for generating IC_U.

The conditions in the definitions of ρ_U and IC_U forbid the generation of clauses that cover no positive examples, on the grounds that such clauses cannot be compressive. Without these conditions, such non-compressive clauses would be pruned elsewhere in the algorithm. The reason for including these conditions in the definitions is that their compliance can be guaranteed by generating constants with the algorithm of the previous paragraph.

4.2  Clause Selection in CILS

The clause selection stage of CILS provides the user with a choice of two weighted subset-cover algorithms for optimising a hypothesis quality function. At one extreme, a greedy approximation algorithm is provided, resulting in problems similar to those described in Section 1. At the other extreme, a complete algorithm is provided, which is guaranteed to construct a hypothesis of highest quality, a feat inherently unattainable by the architectures of other current ILP systems. However, this boast must be weighed against the following two objections.

1. The weighted subset-cover problem underlying such a complete search is NP-complete.
2. The benefit of finding the best hypothesis depends greatly on the accuracy of the hypothesis quality function used and the problem being solved. This is discussed in more detail in Section 6.

We plan to implement and experiment with other approximation algorithms that lie between these two extremes, but for the remainder of this paper we only consider using the greedy approximation algorithm.
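For concreteness, the greedy extreme can be sketched as a standard weighted subset-cover approximation. The scoring rule used here (newly covered positives divided by a per-clause cost) is an assumption for illustration, not CILS' actual hypothesis quality function.

```python
def greedy_select(candidates, costs, positives):
    """Greedy weighted subset cover: repeatedly add the candidate
    clause covering the most still-uncovered positives per unit
    cost, until all positives are covered or no clause helps."""
    hypothesis, covered = [], set()
    while covered != positives:
        best, best_score = None, 0.0
        for name, covers in candidates.items():
            newly = covers & (positives - covered)
            if newly and len(newly) / costs[name] > best_score:
                best, best_score = name, len(newly) / costs[name]
        if best is None:      # remaining positives are uncoverable
            break
        hypothesis.append(best)
        covered |= candidates[best]
    return hypothesis
```

For example, with candidates `{'c1': {1, 2, 3}, 'c2': {3, 4}, 'c3': {4, 5, 6, 7}}`, unit costs, and positives {1, ..., 7}, the greedy pass picks c3 and then c1, never needing c2.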

§5  Complexity Analysis

In order to examine the efficiency of cautious induction, and in particular the efficiency of cautious refinement, this section compares the complexity of the refinement graph for a problem instance searched by Progol against that searched by CILS.

We begin by introducing Bell numbers, which are used in our analysis.

Definition 5.1 (Bell Numbers)
The Bell number of m, written BELL(m), is the number of distinct partitions of m objects. BELL(m) is defined by the following expression.

BELL(m) = 1                                     if m = 0
BELL(m) = Σ_{i=0}^{m−1} C(m−1, i) · BELL(i)     otherwise
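The recurrence of Definition 5.1 translates directly into code (`math.comb` supplies the binomial coefficient C(m−1, i)):

```python
from math import comb

def bell(m):
    """BELL(m) via the recurrence of Definition 5.1:
    BELL(0) = 1; BELL(m) = sum_{i=0}^{m-1} C(m-1, i) * BELL(i)."""
    table = [1]                                   # BELL(0) = 1
    for n in range(1, m + 1):
        table.append(sum(comb(n - 1, i) * table[i] for i in range(n)))
    return table[m]
```

The first few values are 1, 1, 2, 5, 15, 52; for instance BELL(3) = 5 counts the five ways to partition three objects.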


5.1  The Progol Algorithm

Progol is widely regarded as a current state-of-the-art ILP learning algorithm. Although a clause-at-a-time hypothesis construction algorithm, given a problem instance I = ⟨P, B, T, B, IC, ρ⟩, Progol does perform a complete search of a finite subset of the refinement graph for I.

To avoid generating clauses which cover no positive examples, and also to avoid redundancy, Progol bounds each refinement graph for I from below by a most specific clause, ⊥, constructed from a single positive example e ∈ T⁺. The clause ⊥ is required to cover the positive example e used in its construction, and, since every refinement must subsume ⊥, they will all cover e. This is shown in Fig. 3. Furthermore, each refinement preserves the ordering of literals in the body of ⊥ (a total order), which ensures that redundancy is avoided. Thus Progol uses two algorithms during learning: one to construct ⊥, and one to search the resulting refinement graph.

[1] Construction of ⊥

The clause ⊥ is constructed to be most specific, and therefore contains all the literals that may appear in any clause in the refinement graph. The target predicate and each background predicate is accompanied by a finite set of mode declarations, each of which consists of

• a usage declaration for the predicate, and
• a positive integer, known as the recall.

Given a literal l which satisfies a usage declaration, the recall determines how many solutions of l are found, and each is added as a separate literal to the body of ⊥. Thus, for recall r, all possible literals that satisfy a usage declaration are each added at most r times to ⊥, with a bound placed on variable depth to ensure that this process terminates.

[2] Search by Refinement

A complete top-down A* search of each refinement graph for I is performed. Since any clause in a refinement graph must subsume ⊥, the literals in the body of these clauses must be a finite subset of those in the body of ⊥.

Fig. 3  A refinement graph for Progol, bounded from above and below


However, notice that if a binding between variables in a literal in ⊥ is split when that literal is added to a clause, the resulting refinement will still subsume ⊥. Progol allows variable bindings to be split in this way, provided that each resulting literal satisfies a usage declaration for its predicate. Therefore, Progol's refinement operator passes left to right over the literals in the body of ⊥. For a particular literal ¬l and clause c being refined, the set of refinements of c consists of clauses of the form c ∪ {¬l'}, where each ¬l' is a copy of ¬l having a different split of the variable bindings in its arguments. The search terminates when the provably best clause in the graph has been found.

[3] Complexity Analysis

The following two theorems are based upon Theorems 26 and 39 of Muggleton4) and determine the complexity of Progol's bounded refinement graph. Muggleton's results place upper bounds on the number of V_input and V_output meta-variables that may appear in any usage declaration, denoted by j⁺ and j⁻ respectively. We replace both bounds with a single constant j, where j⁺ ≤ j and j⁻ ≤ j, in order to ease our presentation.

Theorem 5.1 (Size Complexity of ⊥)
Let |U| denote the cardinality of a finite set of usage declarations U, and let r denote an upper bound on the recall associated with these declarations. Furthermore, let i denote a teacher-specified upper bound on the depth of variables in ⊥, and let j denote an upper bound on the number of positions in the arguments of any predicate. The size of ⊥ is upper bounded by

(r · |U| · j)^(2ij).

Theorem 5.2 (Size Complexity of Progol's Refinement Graph)
Let |⊥| be the size of ⊥ given in Theorem 5.1, and let j denote an upper bound on the number of positions in the arguments of any predicate. The size of Progol's refinement graph to depth s is upper bounded by

|⊥|^s · j^(sj).

Notice that, in the worst case, Progol must construct ⊥ and search the entire refinement graph once for each positive example in the training sample.

[4] Criticisms of Progol

Whilst the use of ⊥ ensures that each refinement covers at least one positive example, a number of other problems are introduced. We outline two that are of particular interest to us here.

• Any constant symbol in any clause from a refinement graph must be present in the corresponding literal in ⊥. Thus all constant symbols must be determined from a single positive example.
• If some examples in the training sample are misclassified, ⊥ may be constructed from a misclassified positive example, possibly adding an unwanted


clause to the hypothesis and adversely affecting the remainder of the hypothesis' construction.

5.2  The CILS Algorithm

We establish an upper bound on the sets produced by CILS' refinement operator, and then determine the complexity of CILS' refinement graph.

Theorem 5.3 (The Cardinality of IC_U)
Let I = ⟨P, B, T, B, IC, ρ⟩ denote a problem instance and U a finite set of usage declarations, each of which has predicate P and at most j occurrences of V_output meta-variables. The cardinality of IC_U is upper bounded by

BELL(j) · |T⁺| · |U|.

Proof  Consider the j possible occurrences of V_output as a set S and object language variables as partitions of S. Up to variable renaming, there are BELL(j) distinct partitions of S. Thus, for each usage declaration u ∈ U, there are at most

1. BELL(j) combinations of object variables that can replace the occurrences of V_output meta-variables in u.
2. |T⁺| tuples of constants that can replace the occurrences of Const meta-variables in u, since CILS' constant selection algorithm uses each positive example to find possible constants.

Since there are |U| usage declarations, the cardinality of IC_U is upper bounded by

BELL(j) · |T⁺| · |U|.

Theorem 5.4 (Complexity of CILS' Refinement Operator)
Let I = ⟨P, B, T, B, IC, ρ⟩ denote a problem instance and U a finite set of usage declarations, each of which has a predicate symbol from B and at most j occurrences of meta-variables. Furthermore, let K denote the set of all constants that occur in either T⁺ or B. For any clause c ∈ C_I, the cardinality of ρ_U(c) is upper bounded by

max(((|c| + 1) · j)^j, |K|^j) · |U|.

Proof  Each of the |c| literals in c contains at most j distinct variables, and so there are at most |c| · j variables in c. For each usage declaration u ∈ U, there are at most

1. j occurrences of V_input meta-variables in u. These can be replaced in at most (|c| · j)^j distinct ways by variables already present in c.
2. j occurrences of V_output meta-variables in u. These can be replaced by either variables already present in c, or by any one of at most j new variables.


   Hence, there are at most (|c| · j + j)^j = ((|c| + 1) · j)^j ways in which this replacement may occur.

3. j occurrences of Const meta-variables in u. These can be replaced in at most |K|^j ways by constant symbols from the positive examples or the background knowledge.

Either expression 2 or 3 above will dominate the cardinality of ρ_U(c). Therefore, since there are |U| usage declarations, the cardinality of ρ_U(c) is upper bounded by

max(((|c| + 1) · j)^j, |K|^j) · |U|.

Theorem 5.5 (Size Complexity of CILS' Refinement Graph)
Let I, U, j and K be defined as in Theorem 5.4. The size of CILS' refinement graph, up to and including depth s, is upper bounded by

s · (max(s · j, |K|) · |U|)^(sj).

Proof  For branching factor b_i at depth i, it can be seen that the number of nodes at a given depth n (n > 0) is upper bounded by (b_{n−1})^n. Hence the total number of nodes in the refinement graph, up to and including those nodes at depth n, is upper bounded by n · (b_{n−1})^n. Furthermore, notice that the number of nodes at depth 1 is precisely the cardinality of IC_U. Therefore the size of CILS' refinement graph, up to and including depth s, is upper bounded by

s · (BELL(j) · |T⁺| · |U|) · (max((s · j)^j, |K|^j) · |U|)^(s−1).

The |T⁺| positive examples used to find constants for the clauses in IC_U will find only a subset of the combinations given by |K|^j. Therefore, from Theorems 5.3 and 5.4, the above expression is upper bounded by

s · (max((s · j)^j, |K|^j) · |U|)^s

which, by collecting the exponents together, is upper bounded by

s · (max(s · j, |K|) · |U|)^(sj).

The parameter s is either a user-supplied restriction on clause size or, if no such restriction is imposed, upper bounded by |T⁺|.

5.3  Discussion

The expression bounding the size of CILS' refinement graph is polynomial in |K|, the number of constants appearing in the positive examples or the background knowledge. For usage declarations with at most j occurrences of Const meta-variables, CILS' constant selection algorithm uses each positive example in turn to find possible constants, eventually generating a subset of K. Hence,


the size of CILS' refinement graph is polynomial in the number of positive examples in the training sample. In contrast, Progol uses the recall r of each usage declaration in the construction of ⊥ to generate r potential tuples of constants.

Furthermore, notice that the complexity of CILS' refinement graph is of a similar order to that of Progol. In other words, both are exponential in j and s, polynomial in |U|, and where CILS is polynomial in |K|, Progol is polynomial in r. However, although CILS' refinement graph is likely to be larger in general, CILS only searches one such graph. The time required by Progol is dependent on the number of refinement graphs constructed and searched. In other words, Progol runs in time proportional to the number of clauses in the hypothesis whilst CILS runs in time independent of this. Hence, CILS appears well suited to learning hypotheses that contain many clauses.
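Taking the size bounds of Theorems 5.1, 5.2 and 5.5 at face value, the comparison can be made concrete. The parameter values in the usage example are hypothetical and meant only to show how the expressions scale:

```python
def progol_graph_bound(r, u, j, i, s):
    """Progol: |bot| <= (r * |U| * j)^(2ij) (Theorem 5.1), and the
    refinement graph to depth s is <= |bot|^s * j^(sj) (Theorem 5.2).
    Remember this applies once per positive example."""
    bot = (r * u * j) ** (2 * i * j)
    return bot ** s * j ** (s * j)

def cils_graph_bound(k, u, j, s):
    """CILS: the single refinement graph searched, to depth s, is
    <= s * (max(s * j, |K|) * |U|)^(sj) (Theorem 5.5)."""
    return s * (max(s * j, k) * u) ** (s * j)
```

For instance, with r = 2, |U| = 4, j = 2, i = 2, s = 3 and |K| = 20 (all hypothetical), the CILS bound is 3 · 80⁶ for the whole search, while the Progol bound applies afresh for each ⊥ constructed.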

§6  Empirical Analysis

This section seeks to answer the following two questions by comparing CILS against Progol:

1. Does CILS construct significantly higher quality hypotheses from the examples in the training sample?
2. Are these hypotheses significantly more accurate in classifying new examples from a test sample?

In 1995, Srinivasan et al.7) reported a series of experiments studying the performance of Progol (version 2.1) in constructing hypotheses for mutagenesis prediction. In particular, they investigated how increasing the amount of available background knowledge affected the quality and predictive accuracy of hypotheses. Since our aim is to explore the differences between clause-at-a-time hypothesis construction and cautious induction using Progol and CILS, this section repeats and extends these experiments to include CILS.

6.1  The Dataset

The mutagenesis problem is to predict the mutagenic activity of nitroaromatic compounds. A set of 230 such compounds has been classified as either active or inactive using a procedure known as the Ames test. This test is not a perfect classifier, and hence the training sample provided to the learners contains some classification noise.

Chemists identify each of these 230 compounds as belonging to either a set of 188 "regression-friendly" compounds, or a set of 42 "regression-unfriendly" compounds. For the purposes of this study, Srinivasan et al.7) restrict their attention to the 188 regression-friendly compounds, of which 125 are active (serving as positive examples) whilst the remaining 63 are inactive (serving as negative examples). The task facing each learner is to construct a hypothesis that explains the Ames test classification in terms of the available background knowledge. Four sets of experiments were conducted using the background knowledge described in Table 2.


Table 2  The background knowledge used in each of the four experiments

Stage  Background Knowledge Content                              No. of atoms
b0     atom and bond structures of the compounds,                10138
       and numerical inequalities
b1     definitions in b0, and two indicator variables            10598
b2     definitions in b1, and two relevant bulk properties       10974
b3     definitions in b2, and generic 2-D structural templates   12686

Table 3  The parameter settings used by the learning systems

Algorithm  Max. variable depth  Max. clause size  Max. clause noise
Progol     2                    3                 100
CILS       -                    3                 -

6.2  The Algorithms and Their Parameter Settings

Recall that Progol is a clause-at-a-time hypothesis construction algorithm which, like CILS, attempts to find a hypothesis with maximal quasi-compression. Ordinarily, Progol requires hypotheses to correctly classify every example in the training sample. However, in order to handle training samples containing possible misclassification, a noise parameter is used, specifying an upper bound on the number of negative examples that each clause may cover. Each positive example must still be classified by a hypothesis as positive. In these experiments, Progol's noise parameter was set to 100. This number is greater than the number of negative examples and thus allows Progol to assess the quality of clauses in the same way as CILS.

Table 3 shows the various parameters that are used in the experiments to place upper bounds on certain properties of the hypothesis. Finally, it must be stressed that CILS used a greedy weighted subset-cover algorithm to construct hypotheses in these experiments.

6.3  Experimental Procedure

Following Srinivasan et al.,7) the examples were divided into ten partitions of approximately equal size. For each set of background knowledge b0, ..., b3, the following 10-fold cross-validation experiment was conducted.

1. Withdraw one partition for use as a test sample.
2. Construct a hypothesis using the remaining nine partitions as the training sample.
3. Use this hypothesis to predict the classifications of the examples in the test sample.

Notice that each example appears in nine training samples and one test sample during the ten folds. Thus the classification of each example is predicted precisely once in each 10-fold cross-validation experiment. The hypotheses produced were assessed using the three measures below.


Table 4  A contingency table for estimating each learner's predictive accuracy

                            Ames test result
                            Active    Inactive
Learner's     Active        n1        n2
Prediction    Inactive      n3        n4

Table 5  A contingency table for McNemar's test of differences

                            L1's prediction
                            Correct   Incorrect
L2's          Correct       n1        n2           N = n1 + n2 + n3 + n4
prediction    Incorrect     n3        n4

• Time Taken: The absolute time taken by the learners to construct their final hypotheses was recorded, since both learners are implemented in Yap Prolog (version 4) and all the experiments were executed on the same Silicon Graphics O2 workstation.

• Hypothesis Size: The total number of literal occurrences in the hypothesis was recorded.

• Predictive Accuracy of Test Examples: For each set of background knowledge, the predictive accuracy of a learner's hypotheses is estimated by comparing the predicted classifications of the test examples from each of the 10 folds against their actual classification by the Ames test. Therefore, each example falls into one of four possible categories, as shown in Table 4. Denoting the number of examples falling in each category as n_i (1 ≤ i ≤ 4), the estimated predictive accuracy and the expected error of this estimate are given by:

predictive accuracy  p = (n1 + n4) / N        expected error = √(p(1 − p) / N)

Now, let us consider comparing the classifications of two learning systems, denoted L1 and L2. Again, we shall let n_i denote the number of examples falling into each of the four categories and construct the contingency table shown in Table 5. Note that a prediction for an example is considered to be correct if it agrees with the Ames test result, and incorrect otherwise. McNemar's test3) is used to test the null hypothesis that L1 and L2 have the same error rate, against the alternative hypothesis that there is a significant difference in the two learners' performances.
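In code, the two assessments reduce to a few lines. The continuity-corrected form of McNemar's statistic below is an assumption on our part; the paper does not state which variant of the test was used.

```python
from math import sqrt

def predictive_accuracy(n1, n2, n3, n4):
    """Accuracy estimate p = (n1 + n4) / N from Table 4, with
    expected error sqrt(p * (1 - p) / N)."""
    total = n1 + n2 + n3 + n4
    p = (n1 + n4) / total
    return p, sqrt(p * (1 - p) / total)

def mcnemar_statistic(n2, n3):
    """Chi-squared statistic on the discordant cells n2, n3 of
    Table 5 (continuity-corrected variant, an assumption here;
    compare against 3.84 for significance at the 5% level with
    one degree of freedom)."""
    return (abs(n2 - n3) - 1) ** 2 / (n2 + n3)
```

For example, a learner agreeing with the Ames test on 163 of the 188 compounds has p ≈ 0.87 with an expected error of about 0.02, matching the scale of the figures reported in Table 6.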

6.4  Results and Discussion

The results of the experiments are given in Table 6. Each row gives one of the measures used to assess the hypotheses produced by either Progol or CILS. Each column gives the average scores of the hypotheses constructed in the 10-fold cross-validation experiment for a particular set of background knowledge.


Table 6  Experimental results

                                       Background Knowledge
Assessment Measure      Learner   b0            b1            b2            b3
Time (secs)             Progol    2.6           108.7         142.2         162.5
                        CILS      1972.7        1931.5        2118.4        2380.4
Training Accuracy (%)   Progol    66.5          86.8          92.9          92.9
                        CILS      75.8          96.3          97.3          97.3
Hypothesis Size         Progol    1.0           12.7          15.7          16.1
                        CILS      6.7           18.1          16.5          16.5
Test Accuracy (%)       Progol    66.5 (0.03)   82.4 (0.03)   85.1 (0.03)   84.0 (0.03)
                        CILS      70.2 (0.03)   88.8 (0.02)   87.2 (0.02)   87.2 (0.02)
Does Test Accuracy
Differ Significantly?             No            Yes           No            No

Perhaps the most striking result is the average time taken and the average size of the hypotheses constructed by Progol for background knowledge b0. In each of the 10 folds, Progol returned the single clause hypothesis that every compound is active. For an explanation, recall that approximately two-thirds of the examples in the training sample are positive, and one-third negative. Therefore, the clause stating that every compound is active does, at first glance, appear to be a good quality clause. For background knowledge b0, it is frequently the highest quality clause in Progol's refinement graph, and thus Progol includes it in a hypothesis. CILS avoids this problem for the following two reasons:

1. Possible constants for clauses are selected using all the positive examples in the training sample, and not just a single positive example as in Progol. Thus, CILS' refinement graph contains clauses that are absent from Progol's.
2. All superior candidate clauses are found before any commitments are made as to which clauses should appear in the hypothesis.

These two work in combination to allow CILS to locate some higher quality individual clauses, and also to combine them effectively.

As the background knowledge was expanded, several single high quality clauses became apparent. However, in a few of the folds, Progol initially selected a positive example to construct ⊥ that was not covered by any of these high quality clauses. Again, the best clause that remained was the most general clause, which became the overall hypothesis. This behaviour became less frequent as the background knowledge was expanded, but does explain why Progol's hypotheses are consistently smaller than CILS' hypotheses whilst their classification accuracy on the training examples is consistently worse.

Another important result is that, for background knowledge b1, the predictive accuracy on the test examples of CILS' hypotheses was significantly higher than that of Progol's hypotheses. The predicates added to b0 to produce b1 allow a single very high accuracy clause to stand out from the other possibilities. However, in 3 of the 10 folds for b1, Progol initially selected a positive example that was not covered by this clause, and as a result these 3 hypotheses contained only the most general clause. CILS was also able to find a better combination of other clauses to cover the remaining positives, explaining the difference in performance. It is


also worth noting that CILS' performance can be further improved by using a complete weighted subset-cover algorithm to construct hypotheses from the set of superior candidate clauses.

A final observation is that CILS was significantly slower than Progol in constructing its hypotheses. This is the result of three factors:

1. With the clause space generally containing several clauses of outstanding quality, Progol was able to quickly locate the best one of these with respect to _l_ in each refinement search.

2. Many of the literals in the clauses contained several constants. Since each of the compounds consisted of around 20 atoms, much of CILS' search time was spent finding possible constants, a weakness of CILS identified in Subsection 5.3.

3. Only a small number of clauses are required in the final hypothesis. In all the experiments, very few hypotheses contained more than 4 candidate clauses (the remainder being positive examples). Thus, Progol did not need to search many refinement graphs.

§7 Related Work

We are aware of one other related learning algorithm: CLAUDIEN. At face value, the process of Clausal Discovery in the CLAUDIEN system 2) has little relation to cautious induction. Learning from interpretations, CLAUDIEN constructs hypotheses with an aim to describe what is in the training sample, rather than to classify unseen examples. However, a closer inspection of the two approaches reveals certain similarities between CLAUDIEN and CILS. Although the search techniques used differ, the hypotheses produced by CLAUDIEN are similar in spirit to the finite, complete, incomparable set of compressive clauses found during cautious refinement. Exploring this similarity may allow the consequences of using a different learning setting, aim and search technique to be assessed.

§8 Conclusions

Addressing the problem of hypothesis construction, we have developed cautious induction, an approach that clearly distinguishes between searching for candidate clauses and selecting candidates to form a hypothesis. An initial implementation of cautious induction as a top-down ILP system, called CILS, compares favourably with a state-of-the-art top-down ILP system, Progol. In particular, this comparison has highlighted the applicability of cautious induction to problems involving

• hypotheses that contain many clauses, since CILS runs in time independent of the number of clauses in a hypothesis, and

• datasets containing misclassification, since the search for clauses is complete but not centred on a single, possibly misclassified, positive example.

Despite these encouraging initial results, there remain two significant shortcomings in CILS that need to be addressed, the first of which is redundancy. This can be avoided if a total ordering is defined over the literals that may appear in the bodies of clauses. Such an ordering is provided by Progol's most specific clause ⊥, although we would prefer a mechanism that is not centred on a single positive example. The second shortcoming is CILS' constant selection algorithm. Employing stochastic sampling of possible constants would remove CILS' unwanted linear relationship between training sample size and execution time.
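The stochastic remedy suggested for the second shortcoming could take a form like the sketch below. This is a hypothetical illustration, not CILS' design: the representation of examples as tuples of constant symbols and the sample size k are assumptions made for the example.

```python
import random

# Hedged sketch: draw a bounded random sample of candidate constants
# instead of enumerating every constant in the training sample.

def sample_constants(training_examples, k=20, seed=0):
    """Return at most k candidate constants drawn uniformly at random
    from those occurring in the training examples."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    # Pool of distinct constants; sorted so behaviour is deterministic.
    # (In a real system one might sample examples first, so that even
    # this pool-building step is independent of sample size.)
    pool = sorted({const for example in training_examples for const in example})
    if len(pool) <= k:
        return pool
    return rng.sample(pool, k)

examples = [("a", "b"), ("b", "c"), ("c", "d")]
print(sample_constants(examples, k=2))
```

Because the number of constants considered is capped at k, the time spent on constant selection no longer grows with the size of the training sample, at the cost of possibly missing the best constant in any one search.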

Acknowledgment

We thank Alistair Willis, Stephen Muggleton and David Page for their comments and suggestions concerning this work. We also thank James Cussens and Ashwin Srinivasan for their help with the experiments, and Stephen Watkinson and Sue Brassington for patiently reading several drafts of this paper. We are grateful for the thorough reports provided by the anonymous referees. The first author is funded by an EPSRC studentship.

References

1) Cormen, T. H., Leiserson, C. E. and Rivest, R. L., Introduction to Algorithms, MIT Press, 1990.

2) De Raedt, L. and Dehaspe, L., "Clausal Discovery," Technical Report CW 238, Department of Computing Science, K. U. Leuven, 1996.

3) Dietterich, T. G., "Statistical Tests for Comparing Machine Learning Algorithms," to appear in Neural Computation, 1998.

4) Muggleton, S., "Inverse Entailment and Progol," New Generation Computing 13, pp. 245-286, 1995.

5) Muggleton, S. and De Raedt, L., "Inductive Logic Programming - Theory and Methods," Journal of Logic Programming 19-20, pp. 629-679, 1994.

6) Quinlan, J. R., "Learning Logical Definitions from Relations," Machine Learning 5, pp. 239-266, 1990.

7) Srinivasan, A., Muggleton, S., Sternberg, M. J. E. and King, R. D., "Comparing the Use of Background Knowledge by Inductive Logic Programming Systems," Technical Report PRG-TR-9-95, Oxford University Computing Laboratory, 1995.


Simon Anthony, BEng.: Simon, perhaps better known as "Mr. Cautious" in Inductive Logic Programming (ILP) circles, completed a BEng in Information Engineering at the University of York in 1995. He remained at York as a research student in the Intelligent Systems Group. Concentrating on ILP, his research interests are Cautious Induction and developing number handling techniques using Constraint Logic Programming.

Alan M. Frisch, Ph.D.: He is the Reader in Intelligent Systems at the University of York (UK), and he heads the Intelligent Systems Group in the Department of Computer Science. He was awarded a Ph.D. in Computer Science from the University of Rochester (USA) in 1986 and has held faculty positions at the University of Sussex (UK) and the University of Illinois at Urbana-Champaign (USA). For over 15 years Dr. Frisch has been conducting research on a wide range of topics in the area of automated reasoning, including knowledge retrieval, probabilistic inference, constraint solving, parsing as deduction, inductive logic programming and the integration of constraint solvers into automated deduction systems.