
Robotics and Autonomous Systems 8 (1991) 145-159. North-Holland.

Prior knowledge and autonomous learning

Stuart J. Russell

Computer Science Division, University of California, Berkeley, CA 94720, USA

Abstract

Russell, S.J., Prior knowledge and autonomous learning, Robotics and Autonomous Systems, 8 (1991) 145-159.

This paper is concerned with the construction of autonomous learning agents, in particular those that use existing knowledge in the pursuit of new learning goals. Inductive learning has often been characterized as a search in a hypothesis space for hypotheses consistent with observations. It is shown that committing to a given hypothesis space is equivalent to believing a certain compact, first-order sentence. The process of learning a concept from examples can therefore be implemented as a derivation of the appropriate sentence corresponding to a hypothesis space for the goal concept, followed by a first-order deduction from this sentence and the facts describing the instances. At any point during the process, standard inductive methods can be used to select among the remaining hypotheses. Thus, by applying prior knowledge to the process of deriving a hypothesis space, the system is able to learn autonomously from a reasonable number of examples.

Keywords: Determination; Tree-structured bias; Hypothesis space; Knowledge-guided learning; Induction; Machine learning; Logic.

1. Introduction

The RALPH (Rational Agents with Limited Performance Hardware) research project, currently being conducted at the University of California at Berkeley, has as one of its aims the provision of an autonomous learning capability for situated agents. This has of course been a dream of artificial intelligence researchers for quite some time, but recent conceptual and theoretical developments give one reason to believe that at least partial success is close at hand.

Stuart Russell was born in 1962 in Portsmouth, England. He received his B.A. with first-class honours in Physics from Oxford University in 1982, and his Ph.D. in Computer Science from Stanford in 1986. He is currently on the faculty of the Computer Science Division of the University of California at Berkeley. His research interests include machine learning, limited rationality, real-time decision-making, game-playing, and representation and reasoning in common-sense domains. In 1990 he received the NSF Presidential Young Investigator Award. His hobbies include long walks on deserted coasts and looking after stray cats.

We begin by indicating some of the requirements for autonomous learning, and our assumptions about the architecture of the agent within which the learning takes place. We then focus on the vital role played by prior knowledge in each learning episode. We develop a formal basis for using prior knowledge in learning, and sketch the design of a learning system with these capabilities.

A system is autonomous to the extent that its behaviour is determined by its immediate inputs and past experience, rather than by its designer's. A system that operates on the basis of built-in assumptions will only operate successfully when those assumptions hold, and thus lacks flexibility. A truly autonomous system should be able to operate successfully in any universe, given sufficient time to adapt. We don't wish to equate autonomous systems with tabula rasa systems, however, since this seems a somewhat impractical way to proceed. A reasonable halfway-point is to design systems whose behaviour is determined in large part, at least initially, by the designer's knowledge of the world, but where all such assumptions are as far as possible made explicit and amenable to change by the agent. This sense of autonomy seems also to fit in reasonably well with our intuitive notions of intelligence.

For a notion of learning in such systems, the following definition seems acceptable: learning takes place when the system makes changes to its internal structure so as to improve some metric on its long-term future performance, as measured by a fixed performance standard (cf. Simon's definition in [64]). It also seems clear that the performance standard must ultimately be externally imposed [5], particularly since, for the purposes of building useful artifacts, modification of the performance standard to flatter one's behaviour does not exactly fit the bill.

The inputs to an autonomous learning agent must be quite restricted. There are three¹ essential aspects of experience:
(1) Perceptions that reflect the current state of the environment.
(2) Perception of the agent's own actions.
(3) Information as to the quality of the agent's performance.
The agent's perceptions may be partial, intermittent and unreliable. The truth of the agent's perceptions is irrelevant (or, to put it another way, each perception carries a guarantee of its own truth). What is important is that the perceptions be faithful in the following sense: there is a consistent relationship between the agent's perceptions and the performance feedback. The relationship can be arbitrarily complex and uncertain - the more so, the more difficult the learning problem. Beyond this, it doesn't matter what the perceptions signify; autonomous learning can take place in real or simulated environments, or in the proverbial vat.

Any attempt to provide a learning capability must make some assumptions about the execution architecture of the agent: what is the structure of the part that actually chooses the actions? Here we borrow another argument from Simon [63]. If we view learning as searching the space of possible selves for an optimal configuration, then immediately there is a complexity problem: for agents of non-trivial complexity (e.g., a simple computer with 1 MB of memory), the space of possible configurations is absurdly large. Without further constraints, learning will fail. This was the case in the early program mutation experiments of Friedberg and others [14,15]. The complexity argument goes as follows: The operation of the learning component effects changes on parts of the agent's internal structure (one might take this as a definition of 'parts'). Suppose we have a simple system with 1,000 parts, each of which can be in 10 states. If we can find a global optimum by independently optimizing each of the 1,000 parts, then the search will take on the order of 10,000 steps; on the other, more horrendous, hand, if the parts are not independently optimizable the search will take on the order of 10^1000 steps. A system composed of beliefs, by which is meant a system whose self-modification operations can be viewed as belief revision or acquisition,² can optimize its parts independently, since, to put it simply, making a belief truer must improve the performance of the system. Given the basic categories of inputs listed above, some obvious candidates for beliefs would include beliefs about the state of the world, beliefs about the effects of actions and beliefs about the relationship between the state of the world and the level of performance quality feedback. More 'compiled' components are also possible [59]. Thus it seems that an essential aspect of learning is the ability to acquire general beliefs, or universals, since only these allow extension of past experience to future performance. In artificial intelligence, acquisition of universals is commonly called concept learning, and it is on the problem of autonomous concept learning that this paper will focus.
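To make the arithmetic of this argument concrete, here is a minimal sketch (hypothetical code, not part of the paper) comparing the two search costs for the 1,000-part, 10-state agent described above.

```python
# Illustrative arithmetic only: a hypothetical agent with 1,000 parts,
# each of which can be in one of 10 states.
parts, states = 1_000, 10

# If each part can be optimized independently, we try every state of every part once.
independent_steps = parts * states            # 10,000 evaluations

# If the parts interact arbitrarily, only exhaustive search over joint configurations works.
joint_configurations = states ** parts        # 10^1000 configurations

print(independent_steps)                      # 10000
print(len(str(joint_configurations)) - 1)     # 1000 -> the joint space is ~10^1000
```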

2. Learning and prior knowledge

The object of concept learning is to come up with predictive rules that an intelligent agent can use to survive and prosper. For example, after being 'presented' with several instances, an agent might decide³ that it needed to discover a way to avoid being eaten, and eventually learns that large animals with long, pointy teeth and sharp claws are carnivorous. It is now fairly well accepted that the process of learning a concept from examples can be viewed as a search in a hypothesis space (or version space) for a concept definition consistent with all examples⁴, both positive and negative [36,1]. The original inputs to this process are episodes in full Technicolor, that is, sensory-level data. There may be some automatic bottom-up processing to establish more abstract descriptions, but the scenes are still extremely rich, and the agent may have much more information besides that which is immediately obvious from its sensors (such as that today is Tuesday). The agent's job is to form a mapping from this input data to an effective response, such as Run Away. Suppose that the agent begins with no prior constraints on what constitutes an appropriate mapping. Then basic results in theoretical machine learning [70] tell us that the agent will need to see an awful lot of examples in order to learn an appropriate response, and may in fact perish before learning to run away (or, equivalently, perish by continually fleeing from otherwise edible objects).

¹ One might argue that perception of the agent's internal computations is also necessary for certain kinds of learning; these can be included in the 'environment' and 'actions'.

² This is a fairly broad definition, since it includes, for example, the creation of direct, gated links from sensors to effectors, these being viewed as beliefs about the conditional optimality of a certain action.

Current learning systems 'solve' this fundamental problem by being given a highly restricted hypothesis space and highly abstracted instance descriptions carefully designed by the programmer for the purposes of learning the concept that the programmer wants learnt. The job of the learning program under these circumstances is to 'shoot down' inconsistent hypotheses as examples are analysed, rather like a sieve algorithm for finding prime numbers. In practice this task requires some extremely ingenious algorithms, but it is only one aspect of the whole learning problem. We need systems that can construct their own hypothesis spaces and instance descriptions, for their own goals.

³ The subject of the generation of goals for learning, though an important one, is not specifically addressed in this paper, although it forms part of the work on the RALPH project.

⁴ Consistency with all examples is in fact an over-strict criterion; in domains containing noise, it has been shown by Quinlan [48] that better predictive performance is obtained by allowing some inconsistency, to avoid overfitting the data. Exact consistency can also result in a very complex theory which is extremely hard to use in practice, whereas a simpler but incorrect theory can give better overall performance.

Bundy's charge [6] is worth repeating:

'Automatic provision ... of the description space is the most urgent open problem facing automatic learning.'

Consider the rather large space of all possible hypotheses about when to run away that are definable on the agent's sensory-level instance descriptions. Any effective method of autonomous learning must allow the agent to discard some of these hypotheses for reasons other than simple consistency with the instances for this particular learning problem. Since the discarded hypotheses are factual statements, contingent constraints on the world, discarding the right hypotheses must reflect knowledge on the part of the agent, either explicit or implicit. Discarding hypotheses at random (randomly, that is, with respect to their truth) would not improve the agent's learning abilities at all. Using prior knowledge to learn thus seems to be the only way to acquire complex skills quickly. Knowledge-free learning must have happened once in order to get things 'off the ground', but it seems that this is a rather less interesting, special case.

Prior knowledge can be used in the following simple way: the agent can generate all possible hypotheses expressible in terms of the primitive language, and test them for consistency with its prior knowledge. In this way the number of examples needed for learning will be reduced, but the agent will still be faced with an absurd amount of computation. What is needed is a more structured approach so that the agent can begin with a goal, the concept to be learned, and use its prior knowledge to construct a restricted hypothesis space and an appropriately abstracted instance description language. This paper summarizes and extends research developed in [53,55,57] that aims at solving this problem.

We can summarize the relationship between the kind of learning proposed herein and the other major lines of learning research. There appear to be four basic categories of learning systems, characterized by the entailment relation that the newly acquired knowledge must satisfy. Each entailment relation can be viewed as an 'equation' that must be solved for the 'unknown' NewKnowledge.


(1) PriorKnowledge ⊨ NewKnowledge
Explanation-based learning systems [38] satisfy this relation, and consequently are unable to generate knowledge outside the agent's initial deductive closure [12].

(2) PriorKnowledge + Observations ⊨ NewKnowledge
We describe a system below that satisfies this relation.

(3) PriorKnowledge + NewKnowledge ⊨ Observations
Muggleton and Buntine [42,41] have described a reasonably complete system, CIGOL, that satisfies this relation, a form of abduction.

(4) NewKnowledge ⊨ Observations
Knowledge-free induction, possibly with a hand-tailored hypothesis space.

An important note: a system based purely on type 2 learning would, if beginning with an empty knowledge base, only attain the deductive closure of its lifetime observations. Thus we are not claiming that this is the answer to the tabula rasa learning problem. An ampliative component, as provided by the third type of learning, is still needed. However, as we show below, type 2 learning can be used in a goal-directed fashion to create a highly-constrained hypothesis space in which a type 3 inductive system can search for, say, the simplest hypothesis consistent with the observations.
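The difference between the first two 'equations' can be made concrete with a toy propositional sketch; the vocabulary, the background rule and the observations below are invented for illustration, and a brute-force model check stands in for a real theorem prover.

```python
from itertools import product

# A minimal propositional illustration of the first two entailment 'equations'.
# The paper itself works in first-order logic; this sketch is only a toy.
ATOMS = ("large", "pointy_teeth", "carnivorous")

def entails(premises, conclusion):
    """KB |= conclusion iff conclusion holds in every model of the premises."""
    for values in product([False, True], repeat=len(ATOMS)):
        model = dict(zip(ATOMS, values))
        if all(p(model) for p in premises) and not conclusion(model):
            return False
    return True

# Hypothetical background knowledge, observations, and candidate new knowledge.
prior = [lambda m: (not m["pointy_teeth"]) or m["carnivorous"]]   # pointy_teeth => carnivorous
observations = [lambda m: m["large"], lambda m: m["pointy_teeth"]]
new_knowledge = lambda m: m["carnivorous"]

print(entails(prior, new_knowledge))                  # False: type 1 (pure EBL) cannot derive it
print(entails(prior + observations, new_knowledge))   # True:  type 2, the relation studied here
```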

3. Related research in inductive and knowledge-based learning

Dietterich [12] conjectured that useful inductive biases for concept learning could not be captured semantically - in other words that domain knowledge could not be used to select an appropriate hypothesis space. The example given by Russell and Grosof [55] seems to contradict this viewpoint, although it cannot be claimed that no purely syntactic biases should enter into learning systems. Several other researchers have begun to investigate theoretical questions associated with our approach, notably Sridhar Mahadevan and Prasad Tadepalli at Carnegie-Mellon [31], Manfred Warmuth at UC Santa Cruz [44], Oren Etzioni at Carnegie-Mellon and Jonathan Amsterdam at MIT. David Haussler [23] has called for further research by the theoretical machine learning community into the complexity of learning in the presence of background knowledge, and has contributed to the results presented in [57]. Benjamin Grosof, at Stanford and IBM T.J. Watson Labs, continues to collaborate on the project, and reports several contributions, particularly in the area of defeasible bias, in his forthcoming thesis.

In the cognitive tradition, Pazzani [43] has simulated the learning behaviour of young children, particularly their ability to generalize from a small number of examples. His OCCAM system uses a simple inductive process to acquire a rule from examples, then assumes that the premise predicates of the rule are typically involved in predicting outcomes of the kind to which the conclusion predicates belong. This assumption can be viewed as a determination of sorts, although its precise semantics is hard to ascertain.

Two other research efforts have been aimed specifically at the automatic modification of inductive bias. Work by Larry Rendell's group [50] on the Variable Bias Management System (VBMS) aims at finding the optimal settings on a parameterized syntactic bias used with his PLS induction system. VBMS attempts to find appropriate syntactic biases (such as 'at most three disjuncts') for recognizable problem classes. Research by Paul Utgoff [68,69] on the STABB (Shift To A Better Bias) system has focussed on adding new terms to the system's vocabulary in order to maintain conjunctive expressibility for the target concept, on the assumption that the terms so generated will be useful in future learning problems.

Research on structured induction at Edinburgh, particularly Alen Shapiro's thesis research [62], has emphasized the value of prior structuring of the domain to reduce the complexity of induction, by breaking the task into a hierarchy of smaller induction problems. Each of the smaller tasks yields a rule, and the rules together form a deeply-structured expert system for the goal concept. Muggleton's DUCE system [40] for inducing propositional theories extends this work by creating the domain structure information automatically, either by analysis of examples or by using the expert as an oracle for certain well-defined queries. Muggleton's recent work extending this approach to first-order theories is discussed above.


4. Declarative bias

The basic approach adopted in this paper is to take the hypothesis space as an appropriate intermediate stage between the original undifferentiated mass of prior knowledge and the final stage, that is, the induced theory. We express the hypothesis space as a first-order sentence, hence the term declarative bias. The idea is that, given suitable background knowledge, a system can derive its own hypothesis space, appropriate for its current goal, by logical reasoning of a particular kind. In other words, rather than having to be told what hypotheses to consider for each learning task, the system can figure out what to consider from what it knows about the domain. We therefore view an agent as having initially a very weak, partial theory of the domain of inquiry, a theory which is useless for predictive inference. From further observation, the agent can construct the needed predictive capability by combining prior knowledge with the information contained in its observations. This approach seems to be much more in accord with the nature of human inductive inquiry.

4.1. Basic definitions

The concept language, that is, the initial hypothesis space, is a set 𝒞 of candidate (concept) descriptions for the concept. Each concept description is a unary predicate schema (open formula) Cj(x), where the argument variable is intended to range over instances. The concept hierarchy is a partial order defined over 𝒞. The generality/specificity partial ordering is given by the non-strict ordering ≤, representing quantified implication, where we define (A ≤ B) iff ∀x. A(x) ⇒ B(x).

An instance is just an object a in the universe of discourse. Properties of the instance are represented by sentences involving a. An instance description is then a unary predicate schema D, where D(a) holds. The set of allowable instance descriptions forms the instance language 𝒟. The classification of the instance is given by Q(a) or ¬Q(a). Thus the i-th observation, say of a positive instance, would consist of the conjunction Di(ai) ∧ Q(ai). A concept description Cj matches an instance ai iff Cj(ai). The latter will be derived, in a logical system, from the description of the instance and the system's background knowledge.

Choosing a particular instance description language corresponds to believing that the instance descriptions in the language contain enough detail to guarantee that no considerations that might possibly affect whether or not an object satisfies the goal concept Q have been omitted from its description. For this reason, we call it the Complete Description Axiom (CDA). Its first-order representation is as follows:

Definition 1 (CDA): ⋀_{Di ∈ 𝒟} [(Di ≤ Q) ∨ (Di ≤ ¬Q)].

The heart of any search-based approach to concept learning is the assumption that the correct target description is a member of the concept language, i.e. that the concept language bias is in fact true. We can represent this assumption in first-order form as a single Disjunctive Definability Axiom (DDA):

Definition 2 (DDA): ⋁_{Cj ∈ 𝒞} (Q = Cj).

(Here we abbreviate quantified logical equivalence with '=' in the same way we defined '≤'.)

An important notion in concept learning is what Mitchell [35] calls the unbiased version space. This term denotes the hypothesis space consisting of all possible concepts definable on the instance language. A concept is extensionally equivalent to the subset of the instances it matches, hence we have

Definition 3 (Unbiased version space): { C | C matches exactly some element of 2^𝒟 }.

As it stands, the extensional formulation of the CDA is inappropriate for automatic derivation from the system's background knowledge. A compact form can be found using a determination [10], a type of first-order axiom that expresses the relevance of one property or schema to another. A determination is a logical statement connecting two relational schemata. The determination of a schema Q by a schema P is written P ≻ Q, and defined as follows:

Definition 4 (Determination): P ≻ Q iff
∀w,x [∃y [P(w,y) ∧ P(x,y)] ⇒ ∀z [Q(w,z) ⇒ Q(x,z)]].
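On finite data, Definition 4 can be checked directly: any two objects that agree on the determining schema must agree on the determined one. The sketch below is a hypothetical simplification to attribute-value form rather than first-order schemata; the table and attribute names are invented.

```python
from itertools import combinations

def determines(records, p_attrs, q_attr):
    """Empirical check of Definition 4 on a finite table: whenever two records
    agree on all of p_attrs, they must also agree on q_attr."""
    for r1, r2 in combinations(records, 2):
        if all(r1[a] == r2[a] for a in p_attrs) and r1[q_attr] != r2[q_attr]:
            return False
    return True

# Hypothetical instances: nationality determines language here, hair colour does not.
people = [
    {"nationality": "BR", "hair": "dark", "language": "Portuguese"},
    {"nationality": "BR", "hair": "fair", "language": "Portuguese"},
    {"nationality": "US", "hair": "dark", "language": "English"},
    {"nationality": "US", "hair": "fair", "language": "English"},
]

print(determines(people, ["nationality"], "language"))  # True
print(determines(people, ["hair"], "language"))         # False
```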

Determinations involving unary schemata (such as 'One's age determines whether or not one requires a measles vaccination in case of an outbreak') are best expressed using truth-valued variables as virtual second arguments. Following [10], the truth-valued variable is written as a prefix on the formula it modifies. The letters i, j, k, ... are typically used for such variables. Thus the measles determination is written

Age(x, y) ≻ kMeaslesVaccineNeeded(x).

The addition of truth-valued variables to the language significantly reduces the length of some formulae relevant to our purposes, and allows for uniform treatment.

4.2. Basic theorems

We now give the basic theorems that establish the possibility of automatic derivation of an initial hypothesis space. Proofs are given in detail in [58], and sketched in [57].

Theorem 1: The disjunctive definability axiom corresponding to an unbiased version space is logically equivalent to the complete description assumption.

Theorem 2: The complete description assumption can be expressed as a single determination of the form D(x, y) ≻ kQ(x), where D(x, yi) = Di(x).

Corollary: The unbiased version space can be expressed as a single determination of the form D(x, y) ≻ kQ(x).

As an example of the ability of determinations to express hypothesis spaces, consider the simple case of instance languages with one and two boolean predicates (G, and G and H respectively). The unbiased version spaces for these languages appear in Fig. 1. The corresponding determinations are iG(x) ≻ kQ(x) and iG(x) ∧ jH(x) ≻ kQ(x).

Fig. 1. Unbiased version spaces for one and two boolean predicates (the lattices of all boolean concepts over G, and over G and H, from T down to F).

5. The structure of an autonomous learning system

The basic procedures in an autonomous learning agent are as follows:
• Derive the instance language bias from background knowledge and knowledge of the goal concept Q. From the derivation, we extract a restricted hypothesis space called the tree-structured bias.
• Derive a stronger concept language bias from the tree-structured bias and additional knowledge contained in the concept hierarchy, plus syntactic biases concerning the preferred form of the ultimate concept definition.
• From the concept language bias and the instance descriptions with their classifications, derive a consistent rule for predicting the goal concept in future cases.

These procedures are illustrated in Fig. 2. We now briefly describe the various aspects of our picture of autonomous learning.

Fig. 2. Information flow in autonomous concept learning.

5.1. Deriving an initial bias

This section contains brief remarks on the considerations that apply to the process of deriving a suitable determination to form the initial hypothesis space for a concept learning problem. The first requirement is that the instance descriptions forming the most specific level of the hypothesis space must be such as to be easily observable by the agent. The second requirement is that the hypothesis space be as small as possible, since this impinges directly on the cost of the learning task. Below, a theorem is proved that indicates that the bias that the system derives can be quite restrictive, so that the resulting learning task is relatively simple.

Although, in principle, the inference of the determination could be performed as a resolution proof, a specialized reasoner is more appropriate. What we want to get out of the inference process is a determination for the goal concept such that the left-hand side forms a maximally operational schema. The notion of operationality of a concept definition is central in the literature on explanation-based learning [38,26], where it refers to the utility of a concept definition for recognizing instances of a concept. Our use of the term is essentially the same, since the left-hand side of the determination forms the instance language bias. This means that it should be easy to form a description of the instance within the instance language it generates. For example, to learn the DangerousCarnivore concept we would like to find a bias that refers to visible features of the animal such as size and teeth, rather than to features, such as diet, whose observation may involve considerable cost to the observer. The particular operationality criteria used will clearly depend on the situation and overall goals and capabilities of the agent. In our implementation we adopt the approach taken by Hirsh [24], who expresses knowledge about operationality as a set of meta-level sentences. Effectively, these sentences form an 'evaluation function' for biases, and help to guide the search for a suitable instance language bias.

As well as the operationality of the instance descriptions, the expected cost of doing the concept learning will depend critically on the size of the hypothesis space. A weak bias will mean that a large number of instances must be processed to arrive at a concept definition. Maximizing operationality for our system therefore means minimizing the size of the hypothesis space that is derived from the determination we obtain. The following section describes the computation of the size of the hypothesis space corresponding to a given declarative bias derivation.

But what form does the derivation of a bias take? Since we are beginning with a goal concept for which we must find an operational determination, we must be doing some kind of backward chaining. The inference rules used for the chaining will not, however, be standard modus ponens, since we are attempting to establish a universal and the premises used are usually other determinations, as opposed to simple implicative rules. Thus the basic process for deriving a suitable instance language bias is implemented as a backward chaining inference, guided by operationality criteria, and using inference rules appropriate for concluding determinations. These inference rules are given in [54]. An example is the extended transitivity rule, valid for functional relations:

A ≻ B,  B ∧ C ≻ D  ⊢  A ∧ C ≻ D.

An example of a derivation tree is given in Fig. 3. The tree corresponds to the derivation of the determination

P1 ∧ P2 ∧ P3 ∧ P4 ∧ P5 ∧ P6 ≻ Q.

Fig. 3. A bias derivation tree, with the goal concept Q at the root and the features P1, ..., P6 at the leaves.

If the features P1 through P6 are known to be operational, for example if they are easily ascertained through experiment, then the system will have designed an appropriate instance language for the goal concept Q, and hence an initial, 'unbiased' hypothesis space. It is worth noting that there might be a very large number of features potentially applicable to objects in the domain of Q, so this bias represents a considerable restriction. However, the derivation generates a much stronger restriction yet.
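The backward-chaining derivation just described can be sketched as a small search over a knowledge base of determinations. Everything below is a hypothetical simplification: the toy knowledge base, the predicate names, the depth-first control strategy, and the omission of the functionality side-condition on the transitivity rule and of any operationality-cost ordering among candidate biases.

```python
# Each determination is written as (lhs_predicates, rhs_predicate), read as P1 & ... & Pn > Q.
KB = [
    ({"H1"}, "Q"),           # H1 > Q
    ({"P1", "P2"}, "H1"),    # the internal predicate H1 is determined by P1, P2
    ({"P3"}, "H1"),          # ... or by P3 alone
]
OPERATIONAL = {"P1", "P2", "P3"}

def derive_bias(goal, kb, operational):
    """Backward-chain from the goal concept, repeatedly applying the extended
    transitivity rule (A > B, B & C > D  |-  A & C > D) until every predicate
    on the left-hand side is operational. Returns one derived LHS, or None."""
    frontier = [frozenset(lhs) for lhs, rhs in kb if rhs == goal]
    while frontier:
        lhs = frontier.pop()
        if lhs <= operational:
            return lhs                          # an operational instance language bias
        target = next(p for p in lhs if p not in operational)
        rest = lhs - {target}
        for sub_lhs, rhs in kb:                 # expand one non-operational predicate
            if rhs == target:
                frontier.append(frozenset(sub_lhs) | rest)
    return None

print(sorted(derive_bias("Q", KB, OPERATIONAL)))   # ['P3'] with this search order;
                                                   # ['P1', 'P2'] is the other derivable bias
```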

5.2. Tree-structured bias

It is clear that the unbiased hypothesis space derived by the above procedure will not allow successful inductive learning if used 'as is'. Some non-trivial generalization does occur, in the sense that each instance can be generalized to the class of instances with the same description in the derived instance language, but to obtain coverage of the domain the agent would require a number of examples exponential in the number of features in the determination. If more knowledge is available than just this determination, then it should be used to further restrict the hypothesis space. It turns out that the determinations used in the derivation of the bias themselves impose a strong additional restriction on the space of possible definitions for the goal concept.

Intuitively, the restriction comes about because the tree structure of the derivation limits the number of ways in which the different features can interact. For example, in Fig. 3, P1 and P2 cannot interact separately with Q, but only through the function which combines them. Another way to think about it is to consider q, the value of Q, as a function of the variables p1 through p6, which are the values of P1 through P6. The 'flat' bias determination derived above simply states that

q = f(p1, p2, p3, p4, p5, p6)

for some boolean function f. The tree-structured derivation in Fig. 3 shows that the form of the function is restricted:

q = f(g(h(p1, p2), p3, j(p4, p5)), p6)    (1)

for some functions f, g, h, j. It is possible to derive a general formula for the number of boolean functions having a given tree structure [57]. For example, the structure in Fig. 3 allows 204304 functions, as compared to about 10^19 for the corresponding flat bias. It seems surprising that simply organizing the functional expression for Q into a tree would cause a very large reduction in the number of possible functions. But in fact, the following general result holds:

Theorem 3: For a tree-structured bias whose degree of branching is bounded by a constant k, the number of rules consistent with the bias is bounded by (2^(2^k))^(n−1), where n is the number of leaf nodes.

Corollary: Given a tree-structured bias as described above, with probability greater than 1 − δ a concept can be learned that will have error less than ε from only m examples, where

m = (1/ε) [ln(1/δ) + (n − 1) 2^k].

That is, the number of examples needed is linear in the number of features in the instance language. Since the size of the 'unbiased' hypothesis space is doubly exponential in the number of features, requiring an exponential number of examples, it seems that the tree structure represents a very strong bias, even beyond that provided by the restriction to a circumscribed set of primitive features. For comparison, a strict conjunctive bias also requires a linear number of examples. In addition, having an explicit formula for the size of the hypothesis space from a given derivation allows the system to minimize the size of the hypothesis space by choosing appropriate derivation paths when generating a bias.
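The following sketch evaluates the bound of Theorem 3 together with a standard PAC-style sample bound of the kind used in the corollary, for an invented setting (n = 20 operational features, branching bound k = 3, ε = 0.1, δ = 0.05); the exact constants in the paper's corollary may differ.

```python
from math import log, ceil

def tree_bias_bound(n_leaves, k):
    """Upper bound from Theorem 3 on the number of rules consistent with a
    tree-structured bias: (2^(2^k))^(n-1)."""
    return (2 ** (2 ** k)) ** (n_leaves - 1)

def sample_bound(ln_hypotheses, eps, delta):
    """Standard PAC-style bound: m = (1/eps)(ln|H| + ln(1/delta))."""
    return ceil((ln_hypotheses + log(1 / delta)) / eps)

# Hypothetical setting: n = 20 operational features, branching factor at most k = 3.
n, k, eps, delta = 20, 3, 0.1, 0.05

ln_flat = (2 ** n) * log(2)            # |H| = 2^(2^n): every boolean function of n features
ln_tree = log(tree_bias_bound(n, k))   # |H| <= (2^(2^k))^(n-1)

print(int(ln_flat / log(10)) + 1)      # 315653 decimal digits in the flat hypothesis count
print(int(ln_tree / log(10)) + 1)      # 46 digits under the tree-structured bound
print(sample_bound(ln_tree, eps, delta))   # ~1.1e3 examples: linear in n
print(sample_bound(ln_flat, eps, delta))   # ~7.3e6 examples: exponential in n
```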

To achieve learnability in the sense of Valiant [70], we must find a polynomial-time algorithm for generating hypotheses consistent with the tree-structured bias and a set of examples. Such an algorithm has been found for the case in which the functions at each internal node of the tree are restricted to be monotone (the algorithm uses membership queries rather than randomly selected examples). The general case seems more difficult. The natural process for identifying the correct rule is simply to identify the correct rule for each subtree in a bottom-up fashion, by generating experiments that vary the features in the subtree, keeping other features constant. Since, by construction, internal nodes of the tree are not easily observable, the induction process is far from trivial. Warmuth (personal communication) has shown that a general solution to this problem would also provide a solution to the predictability problem for k-term DNF formulae. This has been an open problem since 1984. However, it should be noted that, for our purposes, even a demonstration of intractability would be only an inconvenience rather than a fundamental limitation. Our main claim is that prior knowledge can allow successful predictive behaviour from a small number of examples by an autonomous learning agent. The following subsections describe ways in which the hypothesis space can be further restricted to increase inductive efficiency.

5.3. Adding additional knowledge

Although the tree-structured bias imposes a strong restriction on the hypothesis space, we are still a few steps away from achieving powerful learning from examples in complex domains. Particularly when the individual features used in the language have large ranges of possible values, the tree-structured bias derived using a knowledge base of determinations does not allow the learner to generalize quickly, resulting in slow progress in covering the domain. For example, consider the Meta-DENDRAL bias derivation in Fig. 4, adapted from [55]: at the Element node, the learner could be forced to enumerate all 92 naturally-occurring elements, thereby creating a highly disjunctive theory. Instead, we would like it to consider appropriate, more general, classes of elements, such as Group IV elements, non-metals, highly electronegative elements, and so on. In standard learning systems, this is achieved using a 'concept hierarchy'. Rather than form a disjunctive rule (say involving Carbon OR Silicon), one 'climbs the generalization tree' by using a more general term such as Group IV element. This gives considerably greater predictive coverage, since a rule for Group IV elements could be formed without having to see examples of all of those elements. However, such generalizations do not come for free: a system designed without regard for the laws of chemistry could easily commit gross errors in generalizing from data. Generalization to the class of elements with long names would be inappropriate. Therefore, we claim that the use of a given concept hierarchy reflects definite domain knowledge.

It appears that the concept hierarchy above any predicate in a bias derivation reflects knowledge of how the determination involving that predicate came about. In other words, it forms a partial explanation of the determination. This may indicate the need for an additional 'phase' in the induction process using a tree-structured bias: after the tree is constructed, each determination link should be 'explained', by expansion into a local tree structure (possibly consisting of rules as well as determinations), in order to restrict the hypothesis space still further. In this way, the effect of a concept hierarchy appropriate to the situation is obtained.

This expansion technique may also help to alleviate combinatorial search problems that may arise in trying to find an operational instance language: just as in normal rule-based reasoning, determinations may be chained together to form new determinations that allow 'macro-steps' to be taken in the search space. Once the search has reached a suitable set of leaf nodes, the determinations used can be expanded out again to create a more detailed tree that therefore corresponds to a more restricted hypothesis space.

As we discuss in more detail below, the process of incorporating observations into the tree-structured hypothesis space to learn a rule amounts to identifying the initially unknown function at each internal node of the tree (for example, the functions f, g, h, j in Eq. (1) above). Obviously, if we have extra knowledge constraining the identity of these internal functions, then once a suitable tree has been constructed, this knowledge can be immediately accessed to provide additional guidance for the incorporation of examples. Mitchell [39] has found that an autonomous learning robot can benefit from additional knowledge stating that certain dependencies are monotonic. For example, his robot knows that the moment of a force about a point is determined by the distance from its point of application, but also that the dependence is a monotonically increasing one.

Fig. 4. Derivation of the Meta-DENDRAL bias (a tree of determinations leading from features such as Element, Orbitals, AtomChemistry, StructuralFormula, Topology and MSBehaviour to the goal concept Break(mol, site)).

6. Updating the hypothesis space

To complete the edifice built around the notion of declarative bias, it remains to show how the process of updating the hypothesis space with new instances can be implemented as a normal first-order deduction. We first describe the simplest approaches using the Disjunctive Definability Axiom, and its determination form. We then discuss more practical implementations for stronger versions of the concept language bias, in particular the tree-structured bias.

The simple-minded approach to updating the version space is to do forward resolution between the instance observation facts and the disjuncts in the DDA, i.e., the candidate concept descriptions. Effectively, each candidate Cj will resolve against an instance that contradicts it, with the help of the system's background knowledge (the articulation theory). Thus, as more instances are observed, the DDA will shrink, retaining only those concept descriptions that are consistent with all the instances. Classification of a new instance ai using an intermediate version space can be done by a resolution proof for the goals Q(ai) and ¬Q(ai) using the current DDA as the database. (Note that these processes are in general only semi-decidable.) The algorithms can be simply stated as follows:

Updating the version space (simple DDA method):
(1) For each instance description Di(ai), together with its classification Q(ai) or ¬Q(ai):
    (a) Resolve the instance description against each remaining disjunct of the DDA.
    (b) If a contradiction is found with a disjunct, remove the disjunct from the DDA.
    (c) Otherwise do nothing.
(2) If one disjunct remains in the DDA, return it as the concept definition.
(3) If no disjuncts remain, we have a contradiction, and the bias needs to be weakened.⁵

Classifying new instances (simple DDA method):
(1) Given an instance description Di(ai) (classification unknown).
(2) Add it to the DDA and attempt to prove a contradiction with the positive goal Q(ai). If a contradiction appears, ai is a negative instance.
(3) Add it to the DDA and attempt to prove a contradiction with the negated goal ¬Q(ai). If a contradiction appears, ai is a positive instance.
(4) Otherwise, there is insufficient information to classify ai.

⁵ Alternatively, if the domain is noisy, we may wish to allow a certain percentage of classification errors.

In the straightforward method using an explicit DDA, each disjunct of the DDA must be put in conjunctive normal form, since the resolutions are carried out separately between the instances and each disjunct. No useful resolutions are lost, since all the disjuncts of the DDA are mutually inconsistent by definition.
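A propositional sketch of the simple DDA updating method is given below; the candidate concept definitions and observations are hypothetical, and direct consistency testing stands in for the resolution proofs of the first-order formulation.

```python
# Candidate concept definitions (the disjuncts of a toy DDA), represented
# extensionally as boolean functions over attribute dictionaries.
candidates = {
    "large":           lambda d: d["large"],
    "pointy_teeth":    lambda d: d["pointy_teeth"],
    "large_and_teeth": lambda d: d["large"] and d["pointy_teeth"],
}

observations = [  # (instance description, classification: Q(a) or not Q(a))
    ({"large": True, "pointy_teeth": True},  True),
    ({"large": True, "pointy_teeth": False}, False),
]

def update(candidates, observations):
    """Keep only the disjuncts of the DDA consistent with every observation."""
    surviving = dict(candidates)
    for description, is_positive in observations:
        surviving = {name: c for name, c in surviving.items()
                     if c(description) == is_positive}
    return surviving

print(sorted(update(candidates, observations)))   # ['large_and_teeth', 'pointy_teeth']
```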

The determination representation for the hypothesis space can also be used directly with a deductive updating approach. This can be illustrated using the simple one-predicate language in Fig. 1. The determination form is

iP(x) ≻ kQ(x).

The truth-valued variables can be handled in a straightforward fashion. The formula is transformed into CNF as usual. The standard resolution test for complementary literals is altered to allow for the presence of the truth-valued variables, which effectively unify with the 'sign' of the literal being matched. Hence ¬P(a) is complementary to kP(a) with unifier {k/+}.⁶ The premises for the forward resolution inference are the determination and instance facts, here one negative and one positive instance:

¬iP(x) ∨ ¬iP(y) ∨ ¬kQ(x) ∨ kQ(y)    (2)
P(a)    (3)
Q(a)    (4)
¬P(b)    (5)
¬Q(b).    (6)

Resolving 2 and 3 with unifier {x/a, i/+} we get

¬P(y) ∨ ¬kQ(a) ∨ kQ(y).    (7)

Resolving 4 and 7 with unifier {k/+} we get

¬P(y) ∨ Q(y).    (8)

Resolving 2 and 5 with unifier {y/b, i/−} we get

P(x) ∨ ¬kQ(x) ∨ kQ(b).    (9)

Resolving 6 and 9 with unifier {k/+} we get

P(x) ∨ ¬Q(x).    (10)

8 and 10 together give us the concept definition Q = P. Clearly, with more than one predicate in the language, the determination method is much more efficient than the DDA method.

⁶ The propriety of this treatment is assured, since we can view it as theory resolution with a theory containing equivalences for all literals, such that each literal is equivalent to an extended literal with an extra argument. For example, P(a) becomes P(a, +), and ¬P(a) becomes P(a, −).
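Operationally, the deductive force of the determination can be sketched without the resolution machinery: under P ≻ Q, a new instance that matches a stored, classified instance on P must receive the same classification. The attribute names and data below are hypothetical.

```python
def classify(new_instance, memory, p_attrs, q_attr):
    """Under the determination P > Q, any new instance agreeing with a stored
    instance on the P-attributes is forced to share its classification."""
    for old in memory:
        if all(new_instance[a] == old[a] for a in p_attrs):
            return old[q_attr]          # forced by the determination
    return None                         # insufficient information to classify

memory = [
    {"P": True,  "Q": True},            # the positive instance a
    {"P": False, "Q": False},           # the negative instance b
]

print(classify({"P": True},  memory, ["P"], "Q"))   # True  (matches Q = P)
print(classify({"P": False}, memory, ["P"], "Q"))   # False
```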

6.1. Updating using a tree-structured bias

Clearly, the DDA approach is impractical when the space of hypotheses is too large. The determination form is compact and efficient for an unbiased space, but the updating procedure needs to be elaborated considerably to deal with a tree-structured bias. Although any hypothesis space can be searched by techniques that amount to simulating the current-best-hypothesis search of Winston [72], the obvious direction is to take advantage, as Mitchell did, of the generalization partial ordering between concept descriptions to express compactly the set of hypotheses consistent with both the examples and the original bias. The tree structure of the hypothesis space allows for a somewhat more localized updating process than is the case with Mitchell's candidate elimination process. Essentially, the tree-structured bias presents a set of smaller learning problems, namely to identify the unknown function at each internal node in the tree and at the root node. The identification can be done using any of a number of inductive methods. The most straightforward is to use a version space at each node, with the classification information needed to make updates being gradually propagated from the top and bottom of the tree as new examples come in. Additional constraints, such as concept hierarchies, monotonic dependencies, or even complete theories for the internal nodes, can be easily incorporated into such an algorithm. A preliminary version of this algorithm has been implemented and tested, but is still under development, and its complexity is not yet known. As pointed out above, a polynomial-time algorithm for finding a hypothesis consistent with a set of examples would solve a long-standing open problem in learning theory, namely the predictability of small DNF formulae. In practice, it is not necessary to find completely-specified rules consistent with the data: often it is sufficient simply to constrain the function at a node in the tree, so that for example after a few cases one might discover that the dependence of weight on calorie intake was monotonic, and this could enable successful classification of future cases.

A perhaps more radical and interesting approach is to solve each identification problem using a connectionist network, one for each node in the tree. In cases where little or no structuring information is available for the version space for a given node, the connectionist approach can help to induce additional structure and generate new terms to simplify the overall concept description. From the point of view of the connectionist enterprise, the knowledge-based derivation of the tree-structured bias provides an ideal way to integrate prior knowledge into a connectionist learning system, since it strongly restricts the typically enormous weight spaces that would otherwise be searched. A connectionist approach to learning the node functions has the additional advantage that information can propagate through the tree faster, since each subnetwork will classify its inputs sooner than the least-commitment version-space learning algorithm, and will be able to tolerate better the inevitable noise this will entail. An experimental research program has been initiated to explore this avenue [58].

7. Future work

Several meaty theoretical and implementation tasks remain before the model of autonomous learning can be fully realized. The tasks form an approximately well-structured sequence as follows:

Extension of the bias derivation subsystem
A pilot version of the bias derivation phase is already in operation, providing proof-of-concept, but needs to be extended to handle non-trivial operationality theories and new inference rules, and to take into account version space size using the formulae developed in [57]. As new inference rules are introduced, the formulae for version space size will need to be extended.

Development of domain theories
We are currently working to build a knowledge base of determinations concerned with the various aspects of molecular structure and resulting physical and chemical properties. We intend to use this as a basis for demonstrating autonomous knowledge-guided learning and experimentation on a variety of goal concepts. After this, we expect to work with other experts in the domains of mechanical device design and diagnosis, creditworthiness assessment, and medicine. Another possibility being explored with the robotics faculty at Berkeley is that of providing a robot with a partial theory of its environment and capabilities, and having it generate the necessary practical experiments to provide the information required by its problem-solver. A ball-throwing robot, for example, might run a few timing tests to determine the acceleration due to gravity in its neighbourhood.

Learning determinations
Determination knowledge bases can be constructed by induction over standard rule bases and case libraries, as discussed in [54]. However, the algorithms given there are quite rudimentary, and do not use prior knowledge to reduce search. It is likely that adapting the above ideas to the acquisition of determinations from examples should be fairly straightforward, since the unbiased hypothesis space for determinations is the powerset of the set of all predicates applicable to the domain of the goal concept. Some modifications will be needed to deal with the fact that determinations must be learned from pairs of matching examples.

Bias shift as nonmonotonic reasoning
It seems clear that the kind of reasoning leading to a strong inductive bias can seldom be guaranteed correct; typically, the premises in such derivations should be viewed as defaults. As in the STABB system [68,69], when experience of actual cases contradicts an inductive bias the system must fall back to a weaker bias. This process, called bias shift, can be formally modelled as a prioritized, non-monotonic reasoning process in a formalism developed by Grosof [21], and demonstrated in [55]. When the instances, which are usually considered as having the highest priority, contradict the original bias, then a weaker, lower-priority bias inference can go ahead, itself subject to revision if necessary. For example, in finding rules to predict the weather, one's knowledge of physics would suggest ignoring the day of the week, but when one cannot otherwise explain a weekly variation, one might add in the further consideration of weekday smog production. Work on the implementation of non-monotonic reasoning, for example [18], has progressed to a point at which it is now feasible to extend our model of autonomous learning to include this kind of bias shift capability.

Explicit uncertainty
As a complement to the ability to use non-monotonic inference to handle uncertainty in biases, it is important to be able to handle premises with attached probabilities, and to be able to generate explicitly probabilistic rules if necessary. Determinations that are less than completely certain are called partial determinations, and their probabilistic definition has been developed [54]. Mahadevan and Tadepalli [31] have shown that partial determinations in a background theory can also constrain a learning problem so as to be tractable, given a sufficiently small deviation from complete certainty. The certainty of a determination can be increased by adding extra premises, but this reduces its utility as an inductive bias by enlarging the corresponding hypothesis space. It should therefore be possible to find appropriate trade-offs between certainty and strength of bias. In addition, quantified uncertainty in the bias should enable the system to make principled choices between detailed, high-certainty but possibly overfitted theories, and general, efficient but possibly inaccurate theories [48].
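One simple way to estimate the strength of a partial determination from data is the fraction of matching pairs that also agree on the goal attribute; the sketch below uses that frequency estimate (the data are invented, and this is not necessarily the probabilistic definition developed in [54]).

```python
from itertools import combinations

def determination_degree(records, p_attrs, q_attr):
    """Fraction of record pairs agreeing on p_attrs that also agree on q_attr:
    1.0 for a full determination, lower values for partial ones."""
    agree_p = agree_pq = 0
    for r1, r2 in combinations(records, 2):
        if all(r1[a] == r2[a] for a in p_attrs):
            agree_p += 1
            agree_pq += (r1[q_attr] == r2[q_attr])
    return agree_pq / agree_p if agree_p else None

# Hypothetical data: the premise mostly, but not always, predicts the outcome.
cases = [
    {"smoker": True,  "disease": True},
    {"smoker": True,  "disease": True},
    {"smoker": True,  "disease": False},
    {"smoker": False, "disease": False},
    {"smoker": False, "disease": False},
]

print(determination_degree(cases, ["smoker"], "disease"))   # 0.5
```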

Incorporation into an autonomous agent
As part of the RALPH project at Berkeley, a simulated, multi-agent, non-deterministic environment has been built and serves as the testbed for our research on ralphs. Agents in this environment must process raw sensory data into theories of the environment which are used to make decisions. Inductive reasoning must take place at both the base-level (theories of the world) and the meta-level (theories for control reasoning). As each agent acquires a knowledge base of determinations, its inductive performance for subsequent tasks will improve. We will examine the effect of the nature of the environment on the efficacy of this bootstrapping process.

New term generation
In the inductive context, the main effect of new term generation will be to simplify the description of interesting concepts, and thereby to make them more accessible to resource-bounded inductive mechanisms which use a strong syntactic bias. Muggleton and Buntine [42] have implemented a system, CIGOL, that uses reverse resolution to generate the most compact theory that implies a set of ground facts. In the reverse resolution process, new predicates may be introduced that have partial definitions in terms of existing predicates. We have re-implemented the CIGOL system, and intend to analyse its behaviour when given a background theory consisting partly of determinations. The effect of combining new-term generation methods with the knowledge-based inductive methods described above will be investigated in the RALPH system. A particular goal of this investigation is to find a 'synthesis route' for the notions of position, space and motion, given that the agent begins with only sensory primitives.

Scientific inference

A complete model, however skeletal, of autonomous learning cannot avoid impinging on the study of scientific inference, as carried out by methodologists and philosophers of science for many centuries. Appropriately, and in anticipation of our efforts at mechanization, Putnam [46] urges us to abandon the search for a confirmation theory giving the relationship between data and degree of belief in a hypothesis; instead, we are to study the process of selecting a consistent theory from among a set of possibilities. As in other proposals, however, including confirmation theory itself, no attempt is made to identify mechanisms leading to the generation of a hypothesis or set of hypotheses for assessment or selection. Prior knowledge clearly plays an important role in the generation of hypotheses in scientific research and discovery. Several AI researchers have examined 'discovery learning' [27,28,29,51,73], but to date little work has been done to model the use of existing domain knowledge to guide the process of investigation. Machine learning investigators and scientists alike may spend many hours deciding on an appropriate space of hypotheses and an appropriate description language for examples, before embarking on actual experiments. For instance, a chemist attempting to understand the behaviour of molecules in a mass spectroscope ignores the nuclear spin states and isotopic species of her sample's molecules, concerning herself only with topological structure. A specialist in nuclear magnetic resonance does exactly the opposite. Without such skills, the range of possible experiments would be vast.

We hope to study cases of scientific experimentation to ascertain the possible state of knowledge of the experimenters leading to the selection of an original set of hypotheses, following the lines sketched above for the analysis of Meta-DENDRAL. Examples include the development of the laws of gravity and of electrostatic attraction, both simple theories, and the refinement of our understanding of gene expression mechanisms, a more complex task. The latter investigation will be based on the already well-developed formalization of the domain theory possessed by the original investigators, reported by Karp [25].

8. Summary

Lest it be swamped by vast hypothesis spaces, an autonomous system must use all the knowledge it possesses to make its inductive learning maximally effective. The declarative expression and deductive generation of bias hold promise for allowing the creation of autonomous learning systems. Autonomy is clearly essential in many situations, but in addition the proposed system will relieve humans of the exceedingly difficult task of hand-coding a bias for each individual learning task.⁷ It is fair to say that this bottleneck has been the major reason for the non-existence of learning systems as general add-ons to performance programs. It is hoped that this work will also shed light on the process of bias creation, and on the more general problem of hypothesis generation in scientific research.

The kind of system being proposed is one in which any and all available knowledge can be brought to bear on each learning problem, so that as more is learnt, more can be learnt. A theoretical basis has been described, directions for further work have been mapped out and a system architecture has been outlined. Although the ultimate goals of the research are long-term, it is hoped that payoff in the form of significantly more applicable machine learning systems will appear in the near future.

⁷ Quinlan [47] has reported a span of 3 months for the creation of an appropriate bias for his chess end game application.


References

[1] D. Angluin and C.H. Smith, Inductive inference: Theory and methods, Computing Surveys 15 (3) (1983) 237-269.

[2] L. Blum and M. Blum, Toward a mathematical theory of inductive inference, Information and Control 28 (1975) 125-155.

[3] J.S. Bruner, J.J. Goodnow and G.A. Austin, A Study of Thinking (Wiley, New York, 1956).

[4] B.G. Buchanan and T.M. Mitchell, Model-directed learning of production rules, in: D.A. Waterman and F. Hayes-Roth, eds., Pattern-Directed Inference Systems (Academic Press, New York, 1978).

[5] B.G. Buchanan, T.M. Mitchell, R.G. Smith and C.R. Johnson, Jr., Models of learning systems, Technical report STAN-CS-79-692, Computer Science Department, Stanford University, Stanford, CA (1979).

[6] A. Bundy, B. Silver and D. Plummer, An analytical comparison of some rule-learning programs, Artificial Intelligence 27 (1985).

[7] W. Buntine, Generalized subsumption and its application to induction and redundancy, Proc. of ECAI-86, Brighton (1986).

[8] E. Charniak and D. McDermott, Introduction to Artificial Intelligence (Addison-Wesley, Reading, MA, 1985).

[9] T. Davies, Analogy, Informal Note CSLI-IN-85-4, CSLI, Stanford, CA (1985).

[10] T.R. Davies and S.J. Russell, A logical approach to reasoning by analogy, Proc. of IJCAI-87, Milan (Morgan Kaufmann, Los Altos, CA, 1987).

[11] T.R. Davies and S.J. Russell, Relevance and uniformity: Extensions to determination-based learning. Unpublished manuscript (1988).

[12] T.G. Dietterich, Learning at the knowledge level, Machine Learning 1 (3) (1986).

[13] N.S. Flann and T.G. Dietterich, Selecting appropriate representations for learning from examples, Proc. of the Fifth National Conference on Artificial Intelligence (Morgan Kaufmann, Philadelphia, PA, 1986).

[14] R.M. Friedberg, A learning machine: Part 1, IBM Journal 2 (1958) 2-13.

[15] R. Friedberg, B. Dunham and T. North, A learning machine: Part 2, IBM Journal of Research and Development 3 (1959) 282-287.

[16] L.-M. Fu, Learning Object-level and Meta-level Knowledge for Expert Systems, Ph.D. thesis, Computer Science Department, Stanford University, Stanford, CA (1985).

[17] M.R. Genesereth, An overview of meta-level architecture, Proc. of AAAI-83, Austin, TX (Morgan Kaufmann, Los Altos, CA, 1983) 119-124.

[18] M.L. Ginsberg, A circumscriptive theorem prover: Preliminary report, Proc. of the Seventh National Conference on Artificial Intelligence, Minneapolis, MN (Morgan Kaufmann, Los Altos, CA, 1988).

[19] E.M. Gold, Language identification in the limit, Information and Control 10 (1967) 447-474.

[20] N. Goodman, Fact, Fiction and Forecast (Harvard University Press, Cambridge, MA, 1955).

[21] B.N. Grosof, Non-monotonic theories: Structure, inference, and applications (working title), Ph.D. thesis (in preparation), Stanford University, Stanford, CA.

[22] D. Haussler, Quantifying inductive bias: AI learning algorithms and Valiant's learning framework, Technical report, Department of Computer Science, University of California, Santa Cruz, CA (1988).

[23] D. Haussler, Theoretical results in machine learning, Invited talk, Fifth International Machine Learning Conference, Ann Arbor, MI (1988).

[24] H. Hirsh, Explanation-based generalization in a logic programming environment, Proc. of the Tenth International Joint Conference on Artificial Intelligence, Milan (1987).

[25] P.D. Karp, A process-oriented model of bacterial gene regulation, Unpublished manuscript, Knowledge Systems Laboratory, Stanford University, Stanford, CA.

[26] R.M. Keller, Defining operationality for explanation-based learning, Proc. of the Sixth National Conference on Artificial Intelligence, Seattle, WA (1987).

[27] D.B. Lenat, AM: An artificial intelligence approach to discovery in mathematics as heuristic search, Ph.D. thesis, Computer Science Department, Stanford University, Stanford, CA (1976).

[28] D.B. Lenat, Theory formation by heuristic search: The nature of heuristics II: Background and examples, Artificial Intelligence 21 (1983) 31-59.

[29] D.B. Lenat, EURISKO: A program that learns new heuristics and domain concepts, The nature of heuristics III: Program design and results, Artificial Intelligence 21 (1983).

[30] D. Lenat, M. Prakash and M. Shepherd, CYC: Using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks, AI Magazine 6 (1986) 65-85.

[31] S. Mahadevan and P. Tadepalli, On the tractability of learning from incomplete theories, Proc. of the Fifth International Machine Learning Conference, Ann Arbor, MI (Morgan Kaufmann, Los Altos, CA, 1988).

[32] R.S. Michalski, A theory and methodology of inductive learning, Artificial Intelligence 20 (2) (1983).

[33] J.S. Mill, (1843), System of Logic, Book III Ch XX 'Of Analogy', in: Vol. VIII of Collected Works of John Stuart Mill (University of Toronto Press, 1973).

[34] T.M. Mitchell, Version spaces: An approach to concept learning, Ph.D. thesis, Computer Science Department, Stanford University, Stanford, CA (1978).

[35] T.M. Mitchell, The need for biases in learning generalizations, Technical report CBM-TR-117, Computer Science Department, Rutgers University, New Brunswick, NJ (1980).

[36] T.M. Mitchell, Generalization as search, Artificial Intelligence 18 (2) (1982) 203-226.

[37] T.M. Mitchell, P. Utgoff and R. Banerji, Learning by experimentation: Acquiring and refining problem-solving heuristics, in: J.G. Carbonell, R. Michalski and T. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach (Tioga Press, Palo Alto, CA, 1983).

[38] T.M. Mitchell, R.M. Keller and S.T. Kedar-Cabelli, Explanation-based generalization: A unifying view, Machine Learning 1 (1986) 47-80.

[39] T.M. Mitchell, Can we build learning robots? Proc. of the workshop on representation and learning in an autonomous agent, Lagos, Portugal (in press).

[40] S.H. Muggleton, DUCE: An oracle based approach to constructive induction, Proc. of the Tenth International Joint Conference on Artificial Intelligence, Milan, Italy (1987).

[41] S.H. Muggleton, A strategy for constructing new predicates in first-order logic, Proc. of the Third European Working Session on Learning (Pitman, Glasgow, 1988).

[42] S.H. Muggleton and W. Buntine, Machine invention of first-order predicates by inverting resolution, Proc. of the Fifth International Machine Learning Conference, Ann Arbor, MI (Morgan Kaufmann, Los Altos, CA, 1988).

[43] M. Pazzani, M. Dyer and M. Flowers, The role of prior causal theories in generalization, Proc. of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA (Morgan Kaufmann, Los Altos, CA, 1986).

[44] L. Pitt and M. Warmuth, Prediction-preserving reducibility, Technical report UCSC-CRL-88-26, Computing Research Laboratory, University of California, Santa Cruz, CA (1988).

[45] G.D. Plotkin, A note on inductive generalization, in: B. Meltzer and D. Michie, eds., Machine Intelligence 5 (Elsevier, New York, 1970).

[46] H. Putnam, Probability and confirmation, in: Mathematics, Matter and Method (Cambridge University Press, Cambridge, 1975).

[47] J.R. Quinlan, Learning efficient classification procedures and their application to chess end games, in: J.G. Carbonell, R. Michalski and T. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach (Tioga Press, Palo Alto, CA, 1983).

[48] J.R. Quinlan, Induction of decision trees, Machine Learning 1 (1) (1986) 81-106.

[49] L. Rendell, A general framework for induction and a study of selective induction, Machine Learning 1 (1986).

[50] L. Rendell, R. Seshu and M. Tcheng, Dynamically variable bias management for robust concept learning, Proc. of the Tenth International Joint Conference on Artificial Intelligence, Milan, Italy (Morgan Kaufmann, Los Altos, CA, 1987).

[51] D. Rose and P. Langley, Chemical discovery as belief revision, Machine Learning 1 (1986) 423-451.

[52] S.J. Russell, The compleat guide to MRS, Technical Report No. STAN-CS-85-1080, Stanford University, Stanford, CA (1985).

[53] S.J. Russell, Preliminary steps toward the automation of induction, Proc. of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA (Morgan Kaufmann, Los Altos, CA, 1986).

[54] S.J. Russell, Analogical and inductive reasoning, Ph.D. thesis, Computer Science Department, Stanford University, Stanford, CA (1986).

[55] S.J. Russell and B.N. Grosof, A declarative approach to bias in concept learning, Proc. of the Sixth National Conference on Artificial Intelligence, Seattle, WA (1987).

[56] S.J. Russell and D. Subramanian, Mutual constraints on representation and inference, in: P. Brazdil, ed., Proc. of the Workshop on Machine Learning, Meta-Reasoning and Logics, Sesimbra, Portugal (1988).

[57] S.J. Russell, Tree-structured bias, Proc. of the Seventh National Conference on Artificial Intelligence, Minneapolis, MN (Morgan Kaufmann, Los Altos, CA, 1988).

[58] S.J. Russell, The Use of Knowledge in Analogy and Induction (Pitman, London, 1989).

[59] S.J. Russell, Execution architectures and compilation, Proc. of the Eleventh International Joint Conference on Artificial Intelligence, Detroit, MI (Morgan Kaufmann, Los Altos, CA, 1989).

[60] J.C. Schlimmer, Incremental adjustment of representations for learning, Proc. of the Fourth International Workshop on Machine Learning, University of California, Irvine, CA (Morgan Kaufmann, Los Altos, CA, 1987).

[61] E.Y. Shapiro, Inductive inference of theories from facts, Technical Report 192, Department of Computer Science, Yale University, New Haven, CT (1981).

[62] A. Shapiro, Structured Induction (Kluwer Academic Publishers, Amsterdam, 1987).

[63] H.A. Simon, The Sciences of the Artificial (MIT Press, Cambridge, MA, 1982).

[64] H.A. Simon, Why should machines learn? in: J.G. Carbonell, R. Michalski and T. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach (Tioga Press, Palo Alto, CA, 1983).

[65] H.A. Simon and G. Lea, Problem solving and rule induction: A unified view, in: L.W. Gregg, ed., Knowledge and Cognition (Erlbaum, Hillsdale, NJ, 1974).

[66] D. Subramanian and J. Feigenbaum, Factorization in experiment generation, Proc. of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA (Morgan Kaufmann, Los Altos, CA, 1986).

[67] J.D. Ullman, Principles of Database Systems (Computer Science Press, 1983).

[68] P.E. Utgoff, Shift of Bias for Inductive Concept Learning, Ph.D. thesis, Computer Science Department, Rutgers University, New Brunswick, NJ (1984).

[69] P.E. Utgoff, Shift of bias for inductive concept learning, in: J.G. Carbonell, R. Michalski and T. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach, Vol. II (Morgan Kaufmann, Los Altos, CA, 1986).

[70] L.G. Valiant, A theory of the learnable, Communications of the ACM 27 (1984) 1134-1142.

[71] S. Watanabe, Knowing and Guessing: A Formal and Quantitative Study (Wiley, New York, 1969).

[72] P. Winston, Learning structured descriptions from examples, Ph.D. thesis, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA (1970).

[73] J.M. Zytkow and H.A. Simon, A theory of historical discovery: The construction of componential models, Machine Learning 1 (1986) 107-136.