
Robotics and Autonomous Systems 8 (1991) 145-159. North-Holland.

Prior knowledge and autonomous learning

Stuart J. Russell

Computer Science Division, University of California, Berkeley, CA 94720, USA

Abstract

Russell, S.J., Prior knowledge and autonomous learning, Robotics and Autonomous Systems, 8 (1991) 145-159.

This paper is concerned with the construction of autonomous learning agents, in particular those that use existing knowledge in the pursuit of new learning goals. Inductive learning has often been characterized as a search in a hypothesis space for hypotheses consistent with observations. It is shown that committing to a given hypothesis space is equivalent to believing a certain compact, first-order sentence. The process of learning a concept from examples can therefore be implemented as a derivation of the appropriate sentence corresponding to a hypothesis space for the goal concept, followed by a first-order deduction from this sentence and the facts describing the instances. At any point during the process, standard inductive methods can be used to select among the remaining hypotheses. Thus, by applying prior knowledge to the process of deriving a hypothesis space, the system is able to learn autonomously from a reasonable number of examples.

Keywords: Determination; Tree-structured bias; Hypothesis space; Knowledge-guided learning; Induction; Machine learning; Logic.

1. Introduction

The RALPH (Rational Agents with Limited Performance Hardware) research project, currently being conducted at the University of California at Berkeley, has as one of its aims the provision of an autonomous learning capability for situated agents. This has of course been a dream of artificial intelligence researchers for quite some time, but recent conceptual and theoretical developments give one reason to believe that at least partial success is close at hand.

Stuart Russell was born in 1962 in Portsmouth, England. He received his B.A. with first-class honours in Physics from Oxford University in 1982, and his Ph.D. in Computer Science from Stanford in 1986. He is currently on the faculty of the Computer Science Division of the University of California at Berkeley. His research interests include machine learning, limited rationality, real-time decision-making, game-playing, and representation and reasoning in common-sense domains. In 1990 he received the NSF Presidential Young Investigator Award. His hobbies include long walks on deserted coasts and looking after stray cats.

We begin by indicating some of the requirements for autonomous learning, and our assumptions about the architecture of the agent within which the learning takes place. We then focus on the vital role played by prior knowledge in each learning episode. We develop a formal basis for using prior knowledge in learning, and sketch the design of a learning system with these capabilities.

A system is autonomous to the extent that its behaviour is determined by its immediate inputs and past experience, rather than by its designer's. A system that operates on the basis of built-in assumptions will only operate successfully when those assumptions hold, and thus lacks flexibility. A truly autonomous system should be able to operate successfully in any universe, given sufficient time to adapt. We don't wish to equate autonomous systems with tabula rasa systems, however, since this seems a somewhat impractical way to proceed. A reasonable halfway-point is to design systems whose behaviour is determined in large part, at least initially, by the designer's knowledge of the world, but where all such assumptions are as far as possible made explicit and amenable to change by the agent. This sense of autonomy seems also to fit in reasonably well with our intuitive notions of intelligence.

For a notion of learning in such systems, the following definition seems acceptable: learning takes place when the system makes changes to its internal structure so as to improve some metric on its long-term future performance, as measured by a fixed performance standard (cf. Simon's definition in [64]). It also seems clear that the performance standard must ultimately be externally imposed [5], particularly since, for the purposes of building useful artifacts, modification of the performance standard to flatter one's behaviour does not exactly fit the bill.

The inputs to an autonomous learning agent must be quite restricted. There are three¹ essential aspects of experience:
(1) Perceptions that reflect the current state of the environment.
(2) Perception of the agent's own actions.
(3) Information as to the quality of the agent's performance.
The agent's perceptions may be partial, intermittent and unreliable. The truth of the agent's perceptions is irrelevant (or, to put it another way, each perception carries a guarantee of its own truth). What is important is that the perceptions be faithful in the following sense: there is a consistent relationship between the agent's perceptions and the performance feedback. The relationship can be arbitrarily complex and uncertain - the more so, the more difficult the learning problem. Beyond this, it doesn't matter what the perceptions signify; autonomous learning can take place in real or simulated environments, or in the proverbial vat.

Any attempt to provide a learning capability must make some assumptions about the execution architecture of the agent: what is the structure of the part that actually chooses the actions? Here we borrow another argument from Simon [63]. If we view learning as searching the space of possible selves for an optimal configuration, then immediately there is a complexity problem: for agents of non-trivial complexity (e.g., a simple computer with 1 MB of memory), the space of possible configurations is absurdly large. Without further constraints, learning will fail. This was the case in the early program mutation experiments of Friedberg and others [14,15]. The complexity argument goes as follows: The operation of the learning component effects changes on parts of the agent's internal structure (one might take this as a definition of 'parts'). Suppose we have a simple system with 1,000 parts, each of which can be in 10 states. If we can find a global optimum by independently optimizing each of the 1,000 parts, then the search will take on the order of 10,000 steps; on the other, more horrendous, hand, if the parts are not independently optimizable the search will take on the order of 10^1000 steps. A system composed of beliefs, by which is meant a system whose self-modification operations can be viewed as belief revision or acquisition,² can optimize its parts independently, since, to put it simply, making a belief truer must improve the performance of the system. Given the basic categories of inputs listed above, some obvious candidates for beliefs would include beliefs about the state of the world, beliefs about the effects of actions and beliefs about the relationship between the state of the world and the level of performance quality feedback. More 'compiled' components are also possible [59]. Thus it seems that an essential aspect of learning is the ability to acquire general beliefs, or universals, since only these allow extension of past experience to future performance. In artificial intelligence, acquisition of universals is commonly called concept learning, and it is on the problem of autonomous concept learning that this paper will focus.
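To make the arithmetic of this argument concrete, here is a minimal sketch (hypothetical code, not part of the paper) comparing the two search costs for the 1,000-part, 10-state agent described above.

```python
# Illustrative arithmetic only: a hypothetical agent with 1,000 parts,
# each of which can be in one of 10 states.
parts, states = 1_000, 10

# If each part can be optimized independently, we try every state of every part once.
independent_steps = parts * states            # 10,000 evaluations

# If the parts interact arbitrarily, only exhaustive search over joint configurations works.
joint_configurations = states ** parts        # 10^1000 configurations

print(independent_steps)                      # 10000
print(len(str(joint_configurations)) - 1)     # 1000 -> the joint space is ~10^1000
```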

2. Learning and prior knowledge

The object of concept learning is to come up with predictive rules that an intelligent agent can use to survive and prosper. For example, after being 'presented' with several instances, an agent might decide³ that it needed to discover a way to avoid being eaten, and eventually learns that large animals with long, pointy teeth and sharp claws are carnivorous. It is now fairly well accepted that the process of learning a concept from examples can be viewed as a search in a hypothesis space (or version space) for a concept definition consistent with all examples⁴, both positive and negative [36,1]. The original inputs to this process are episodes in full Technicolor, that is, sensory-level data. There may be some automatic bottom-up processing to establish more abstract descriptions, but the scenes are still extremely rich, and the agent may have much more information besides that which is immediately obvious from its sensors (such as that today is Tuesday). The agent's job is to form a mapping from this input data to an effective response, such as Run Away. Suppose that the agent begins with no prior constraints on what constitutes an appropriate mapping. Then basic results in theoretical machine learning [70] tell us that the agent will need to see an awful lot of examples in order to learn an appropriate response, and may in fact perish before learning to run away (or, equivalently, perish by continually fleeing from otherwise edible objects).

¹ One might argue that perception of the agent's internal computations is also necessary for certain kinds of learning; these can be included in the 'environment' and 'actions'.

² This is a fairly broad definition, since it includes, for example, the creation of direct, gated links from sensors to effectors, these being viewed as beliefs about the conditional optimality of a certain action.

Current learning systems 'solve' this fundamental problem by being given a highly restricted hypothesis space and highly abstracted instance descriptions carefully designed by the programmer for the purposes of learning the concept that the programmer wants learnt. The job of the learning program under these circumstances is to 'shoot down' inconsistent hypotheses as examples are analysed, rather like a sieve algorithm for finding prime numbers. In practice this task requires some extremely ingenious algorithms, but it is only one aspect of the whole learning problem. We need systems that can construct their own hypothesis spaces and instance descriptions, for their own goals.

³ The subject of the generation of goals for learning, though an important one, is not specifically addressed in this paper, although it forms part of the work on the RALPH project.

⁴ Consistency with all examples is in fact an over-strict criterion; in domains containing noise, it has been shown by Quinlan [48] that better predictive performance is obtained by allowing some inconsistency, to avoid overfitting the data. Exact consistency can also result in a very complex theory which is extremely hard to use in practice, whereas a simpler but incorrect theory can give better overall performance.

Bundy's charge [6] is worth repeating:

'Automatic provision ... of the description space is the most urgent open problem facing automatic learning.'

Consider the rather large space of all possible hypotheses about when to run away that are definable on the agent's sensory-level instance descriptions. Any effective method of autonomous learning must allow the agent to discard some of these hypotheses for reasons other than simple consistency with the instances for this particular learning problem. Since the discarded hypotheses are factual statements, contingent constraints on the world, discarding the right hypotheses must reflect knowledge on the part of the agent, either explicit or implicit. Discarding hypotheses at random (randomly, that is, with respect to their truth) would not improve the agent's learning abilities at all. Using prior knowledge to learn thus seems to be the only way to acquire complex skills quickly. Knowledge-free learning must have happened once in order to get things 'off the ground', but it seems that this is a rather less interesting, special case.

Prior knowledge can be used in the following simple way: the agent can generate all possible hypotheses expressible in terms of the primitive language, and test them for consistency with its prior knowledge. In this way the number of examples needed for learning will be reduced, but the agent will still be faced with an absurd amount of computation. What is needed is a more structured approach so that the agent can begin with a goal, the concept to be learned, and use its prior knowledge to construct a restricted hypothesis space and an appropriately abstracted instance description language. This paper summarizes and extends research developed in [53,55,57] that aims at solving this problem.

We can summarize the relationship between the kind of learning proposed herein and the other major lines of learning research. There appear to be four basic categories of learning systems, characterized by the entailment relation that the newly acquired knowledge must satisfy. Each entailment relation can be viewed as an 'equation' that must be solved for the 'unknown' NewKnowledge.


(1) PriorKnowledge ⊨ NewKnowledge
Explanation-based learning systems [38] satisfy this relation, and consequently are unable to generate knowledge outside the agent's initial deductive closure [12].

(2) PriorKnowledge + Observations ⊨ NewKnowledge
We describe a system below that satisfies this relation.

(3) PriorKnowledge + NewKnowledge ⊨ Observations
Muggleton and Buntine [42,41] have described a reasonably complete system, CIGOL, that satisfies this relation, a form of abduction.

(4) NewKnowledge ⊨ Observations
Knowledge-free induction, possibly with a hand-tailored hypothesis space.

An important note: a system based purely on type 2 learning would, if beginning with an empty knowledge base, only attain the deductive closure of its lifetime observations. Thus we are not claiming that this is the answer to the tabula rasa learning problem. An ampliative component, as provided by the third type of learning, is still needed. However, as we show below, type 2 learning can be used in a goal-directed fashion to create a highly-constrained hypothesis space in which a type 3 inductive system can search for, say, the simplest hypothesis consistent with the observations.
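The difference between the first two 'equations' can be made concrete with a toy propositional sketch; the vocabulary, the background rule and the observations below are invented for illustration, and a brute-force model check stands in for a real theorem prover.

```python
from itertools import product

# A minimal propositional illustration of the first two entailment 'equations'.
# The paper itself works in first-order logic; this sketch is only a toy.
ATOMS = ("large", "pointy_teeth", "carnivorous")

def entails(premises, conclusion):
    """KB |= conclusion iff conclusion holds in every model of the premises."""
    for values in product([False, True], repeat=len(ATOMS)):
        model = dict(zip(ATOMS, values))
        if all(p(model) for p in premises) and not conclusion(model):
            return False
    return True

# Hypothetical background knowledge, observations, and candidate new knowledge.
prior = [lambda m: (not m["pointy_teeth"]) or m["carnivorous"]]   # pointy_teeth => carnivorous
observations = [lambda m: m["large"], lambda m: m["pointy_teeth"]]
new_knowledge = lambda m: m["carnivorous"]

print(entails(prior, new_knowledge))                  # False: type 1 (pure EBL) cannot derive it
print(entails(prior + observations, new_knowledge))   # True:  type 2, the relation studied here
```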

3. Related research in inductive and knowledge-based learning

Dietterich [12] conjectured that useful inductive biases for concept learning could not be captured semantically - in other words that domain knowledge could not be used to select an appropriate hypothesis space. The example given by Russell and Grosof [55] seems to contradict this viewpoint, although it cannot be claimed that no purely syntactic biases should enter into learning systems. Several other researchers have begun to investigate theoretical questions associated with our approach, notably Sridhar Mahadevan and Prasad Tadepalli at Carnegie-Mellon [31], Manfred Warmuth at UC Santa Cruz [44], Oren Etzioni at Carnegie-Mellon and Jonathan Amsterdam at MIT. David Haussler [23] has called for further research by the theoretical machine learning community into the complexity of learning in the presence of background knowledge, and has contributed to the results presented in [57]. Benjamin Grosof, at Stanford and IBM T.J. Watson Labs, continues to collaborate on the project, and reports several contributions, particularly in the area of defeasible bias, in his forthcoming thesis.

In the cognitive tradition, Pazzani [43] has simulated the learning behaviour of young children, particularly their ability to generalize from a small number of examples. His OCCAM system uses a simple inductive process to acquire a rule from examples, then assumes that the premise predicates of the rule are typically involved in predicting outcomes of the kind to which the conclusion predicates belong. This assumption can be viewed as a determination of sorts, although its precise semantics is hard to ascertain.

Two other research efforts have been aimed specifically at the automatic modification of inductive bias. Work by Larry Rendell's group [50] on the Variable Bias Management System (VBMS) aims at finding the optimal settings on a parameterized syntactic bias used with his PLS induction system. VBMS attempts to find appropriate syntactic biases (such as 'at most three disjuncts') for recognizable problem classes. Research by Paul Utgoff [68,69] on the STABB (Shift To A Better Bias) system has focussed on adding new terms to the system's vocabulary in order to maintain conjunctive expressibility for the target concept, on the assumption that the terms so generated will be useful in future learning problems.

Research on structured induction at Edinburgh, particularly Alen Shapiro's thesis research [62], has emphasized the value of prior structuring of the domain to reduce the complexity of induction, by breaking the task into a hierarchy of smaller induction problems. Each of the smaller tasks yields a rule, and the rules together form a deeply-structured expert system for the goal concept. Muggleton's DUCE system [40] for inducing propositional theories extends this work by creating the domain structure information automatically, either by analysis of examples or by using the expert as an oracle for certain well-defined queries. Muggleton's recent work extending this approach to first-order theories is discussed above.


4. Declarative bias

The basic approach adopted in this paper is to take the hypothesis space as an appropriate intermediate stage between the original undifferentiated mass of prior knowledge and the final stage, that is, the induced theory. We express the hypothesis space as a first-order sentence, hence the term declarative bias. The idea is that, given suitable background knowledge, a system can derive its own hypothesis space, appropriate for its current goal, by logical reasoning of a particular kind. In other words, rather than having to be told what hypotheses to consider for each learning task, the system can figure out what to consider from what it knows about the domain. We therefore view an agent as having initially a very weak, partial theory of the domain of inquiry, a theory which is useless for predictive inference. From further observation, the agent can construct the needed predictive capability by combining prior knowledge with the information contained in its observations. This approach seems to be much more in accord with the nature of human inductive inquiry.

4.1. Basic definitions

The concept language, that is, the initial hypothesis space, is a set 𝒞 of candidate (concept) descriptions for the concept. Each concept description is a unary predicate schema (open formula) Cj(x), where the argument variable is intended to range over instances. The concept hierarchy is a partial order defined over 𝒞. The generality/specificity partial ordering is given by the non-strict ordering ≤, representing quantified implication, where we define (A ≤ B) iff ∀x. A(x) ⇒ B(x).

An instance is just an object a in the universe of discourse. Properties of the instance are represented by sentences involving a. An instance description is then a unary predicate schema D, where D(a) holds. The set of allowable instance descriptions forms the instance language 𝒟. The classification of the instance is given by Q(a) or ¬Q(a). Thus the i-th observation, say of a positive instance, would consist of the conjunction Di(ai) ∧ Q(ai). A concept description Cj matches an instance ai iff Cj(ai). The latter will be derived, in a logical system, from the description of the instance and the system's background knowledge.

Choosing a particular instance description language corresponds to believing that the instance descriptions in the language contain enough detail to guarantee that no considerations that might possibly affect whether or not an object satisfies the goal concept Q have been omitted from its description. For this reason, we call it the Complete Description Axiom (CDA). Its first-order representation is as follows:

Definition 1 (CDA): ⋀_{Di ∈ 𝒟} [(Di ≤ Q) ∨ (Di ≤ ¬Q)].

The heart of any search-based approach to concept learning is the assumption that the correct target description is a member of the concept language, i.e. that the concept language bias is in fact true. We can represent this assumption in first-order form as a single Disjunctive Definability Axiom (DDA):

Definition 2 (DDA): ⋁_{Cj ∈ 𝒞} (Q = Cj).

(Here we abbreviate quantified logical equivalence with '=' in the same way we defined '≤'.)

An important notion in concept learning is what Mitchell [35] calls the unbiased version space. This term denotes the hypothesis space consisting of all possible concepts definable on the instance language. A concept is extensionally equivalent to the subset of the instances it matches, hence we have

Definition 3 (Unbiased version space): { C | C matches exactly some element of 2^𝒟 }.

As it stands, the extensional formulation of the CDA is inappropriate for automatic derivation from the system's background knowledge. A compact form can be found using a determination [10], a type of first-order axiom that expresses the relevance of one property or schema to another. A determination is a logical statement connecting two relational schemata. The determination of a schema Q by a schema P is written P ≻ Q, and defined as follows:

Definition 4 (Determination): P ≻ Q iff
∀w,x [∃y [P(w,y) ∧ P(x,y)] ⇒ ∀z [Q(w,z) ⇒ Q(x,z)]].
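On finite data, Definition 4 can be checked directly: any two objects that agree on the determining schema must agree on the determined one. The sketch below is a hypothetical simplification to attribute-value form rather than first-order schemata; the table and attribute names are invented.

```python
from itertools import combinations

def determines(records, p_attrs, q_attr):
    """Empirical check of Definition 4 on a finite table: whenever two records
    agree on all of p_attrs, they must also agree on q_attr."""
    for r1, r2 in combinations(records, 2):
        if all(r1[a] == r2[a] for a in p_attrs) and r1[q_attr] != r2[q_attr]:
            return False
    return True

# Hypothetical instances: nationality determines language here, hair colour does not.
people = [
    {"nationality": "BR", "hair": "dark", "language": "Portuguese"},
    {"nationality": "BR", "hair": "fair", "language": "Portuguese"},
    {"nationality": "US", "hair": "dark", "language": "English"},
    {"nationality": "US", "hair": "fair", "language": "English"},
]

print(determines(people, ["nationality"], "language"))  # True
print(determines(people, ["hair"], "language"))         # False
```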

Determinations involving unary schemata (such as 'One's age determines whether or not one requires a measles vaccination in case of an outbreak') are best expressed using truth-valued variables as virtual second arguments. Following [10], the truth-valued variable is written as a prefix on the formula it modifies. The letters i, j, k, ... are typically used for such variables. Thus the measles determination is written

Age(x, y) ≻ kMeaslesVaccineNeeded(x).

The addition of truth-valued variables to the language significantly reduces the length of some formulae relevant to our purposes, and allows for uniform treatment.

4.2. Basic theorems

We now give the basic theorems that establish the possibility of automatic derivation of an initial hypothesis space. Proofs are given in detail in [58], and sketched in [57].

Theorem 1: The disjunctive definability axiom corresponding to an unbiased version space is logically equivalent to the complete description assumption.

Theorem 2: The complete description assumption can be expressed as a single determination of the form D(x, y) ≻ kQ(x), where D(x, yi) = Di(x).

Corollary: The unbiased version space can be expressed as a single determination of the form D(x, y) ≻ kQ(x).

As an example of the ability of determinations to express hypothesis spaces, consider the simple case of instance languages with one and two boolean predicates (G, and G and H respectively). The unbiased version spaces for these languages appear in Fig. 1. The corresponding determinations are iG(x) ≻ kQ(x) and iG(x) ∧ jH(x) ≻ kQ(x).

Fig. 1. Unbiased version spaces for one and two boolean predicates (the lattices of all boolean concepts over G, and over G and H, from T down to F).

5. The structure of an autonomous learning system

The basic procedures in an autonomous learning agent are as follows:
• Derive the instance language bias from background knowledge and knowledge of the goal concept Q. From the derivation, we extract a restricted hypothesis space called the tree-structured bias.
• Derive a stronger concept language bias from the tree-structured bias and additional knowledge contained in the concept hierarchy, plus syntactic biases concerning the preferred form of the ultimate concept definition.
• From the concept language bias and the instance descriptions with their classifications, derive a consistent rule for predicting the goal concept in future cases.

These procedures are illustrated in Fig. 2. We now briefly describe the various aspects of our picture of autonomous learning.

Fig. 2. Information flow in autonomous concept learning.

5.1. Deriving an initial bias

This section contains brief remarks on the considerations that apply to the process of deriving a suitable determination to form the initial hypothesis space for a concept learning problem. The first requirement is that the instance descriptions forming the most specific level of the hypothesis space must be such as to be easily observable by the agent. The second requirement is that the hypothesis space be as small as possible, since this impinges directly on the cost of the learning task. Below, a theorem is proved that indicates that the bias that the system derives can be quite restrictive, so that the resulting learning task is relatively simple.

Although, in principle, the inference of the determination could be performed as a resolution proof, a specialized reasoner is more appropriate. What we want to get out of the inference process is a determination for the goal concept such that the left-hand side forms a maximally operational schema. The notion of operationality of a concept definition is central in the literature on explanation-based learning [38,26], where it refers to the utility of a concept definition for recognizing instances of a concept. Our use of the term is essentially the same, since the left-hand side of the determination forms the instance language bias. This means that it should be easy to form a description of the instance within the instance language it generates. For example, to learn the DangerousCarnivore concept we would like to find a bias that refers to visible features of the animal such as size and teeth, rather than to features, such as diet, whose observation may involve considerable cost to the observer. The particular operationality criteria used will clearly depend on the situation and overall goals and capabilities of the agent. In our implementation we adopt the approach taken by Hirsh [24], who expresses knowledge about operationality as a set of meta-level sentences. Effectively, these sentences form an 'evaluation function' for biases, and help to guide the search for a suitable instance language bias.

As well as the operationality of the instance descriptions, the expected cost of doing the concept learning will depend critically on the size of the hypothesis space. A weak bias will mean that a large number of instances must be processed to arrive at a concept definition. Maximizing operationality for our system therefore means minimizing the size of the hypothesis space that is derived from the determination we obtain. The following section describes the computation of the size of the hypothesis space corresponding to a given declarative bias derivation.

But what form does the derivation of a bias take? Since we are beginning with a goal concept for which we must find an operational determination, we must be doing some kind of backward chaining. The inference rules used for the chaining will not, however, be standard modus ponens, since we are attempting to establish a universal and the premises used are usually other determinations, as opposed to simple implicative rules. Thus the basic process for deriving a suitable instance language bias is implemented as a backward chaining inference, guided by operationality criteria, and using inference rules appropriate for concluding determinations. These inference rules are given in [54]. An example is the extended transitivity rule, valid for functional relations:

A ≻ B,  B ∧ C ≻ D  ⊢  A ∧ C ≻ D.

An example of a derivation tree is given in Fig. 3. The tree corresponds to the derivation of the determination

P1 ∧ P2 ∧ P3 ∧ P4 ∧ P5 ∧ P6 ≻ Q.

Fig. 3. A bias derivation tree, with the goal concept Q at the root and the features P1, ..., P6 at the leaves.

If the features P1 through P6 are known to be operational, for example if they are easily ascertained through experiment, then the system will have designed an appropriate instance language for the goal concept Q, and hence an initial, 'unbiased' hypothesis space. It is worth noting that there might be a very large number of features potentially applicable to objects in the domain of Q, so this bias represents a considerable restriction. However, the derivation generates a much stronger restriction yet.
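The backward-chaining derivation just described can be sketched as a small search over a knowledge base of determinations. Everything below is a hypothetical simplification: the toy knowledge base, the predicate names, the depth-first control strategy, and the omission of the functionality side-condition on the transitivity rule and of any operationality-cost ordering among candidate biases.

```python
# Each determination is written as (lhs_predicates, rhs_predicate), read as P1 & ... & Pn > Q.
KB = [
    ({"H1"}, "Q"),           # H1 > Q
    ({"P1", "P2"}, "H1"),    # the internal predicate H1 is determined by P1, P2
    ({"P3"}, "H1"),          # ... or by P3 alone
]
OPERATIONAL = {"P1", "P2", "P3"}

def derive_bias(goal, kb, operational):
    """Backward-chain from the goal concept, repeatedly applying the extended
    transitivity rule (A > B, B & C > D  |-  A & C > D) until every predicate
    on the left-hand side is operational. Returns one derived LHS, or None."""
    frontier = [frozenset(lhs) for lhs, rhs in kb if rhs == goal]
    while frontier:
        lhs = frontier.pop()
        if lhs <= operational:
            return lhs                          # an operational instance language bias
        target = next(p for p in lhs if p not in operational)
        rest = lhs - {target}
        for sub_lhs, rhs in kb:                 # expand one non-operational predicate
            if rhs == target:
                frontier.append(frozenset(sub_lhs) | rest)
    return None

print(sorted(derive_bias("Q", KB, OPERATIONAL)))   # ['P3'] with this search order;
                                                   # ['P1', 'P2'] is the other derivable bias
```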

5.2. Tree-structured bias

It is clear that the unbiased hypothesis space derived by the above procedure will not allow successful inductive learning if used 'as is'. Some non-trivial generalization does occur, in the sense that each instance can be generalized to the class of instances with the same description in the derived instance language, but to obtain coverage of the domain the agent would require a number of examples exponential in the number of features in the determination. If more knowledge is available than just this determination, then it should be used to further restrict the hypothesis space. It turns out that the determinations used in the derivation of the bias themselves impose a strong additional restriction on the space of possible definitions for the goal concept.

Intuitively, the restriction comes about because the tree structure of the derivation limits the number of ways in which the different features can interact. For example, in Fig. 3, P1 and P2 cannot interact separately with Q, but only through the function which combines them. Another way to think about it is to consider q, the value of Q, as a function of the variables p1 through p6, which are the values of P1 through P6. The 'flat' bias determination derived above simply states that

q = f(p1, p2, p3, p4, p5, p6)

for some boolean function f. The tree-structured derivation in Fig. 3 shows that the form of the function is restricted:

q = f(g(h(p1, p2), p3, j(p4, p5)), p6)    (1)

for some functions f, g, h, j. It is possible to derive a general formula for the number of boolean functions having a given tree structure [57]. For example, the structure in Fig. 3 allows 204304 functions, as compared to about 10^19 for the corresponding flat bias. It seems surprising that simply organizing the functional expression for Q into a tree would cause a very large reduction in the number of possible functions. But in fact, the following general result holds:

Theorem 3: For a tree-structured bias whose degree of branching is bounded by a constant k, the number of rules consistent with the bias is bounded by (2^(2^k))^(n−1), where n is the number of leaf nodes.

Corollary: Given a tree-structured bias as described above, with probability greater than 1 − δ a concept can be learned that will have error less than ε from only m examples, where

m = (1/ε) [ln(1/δ) + (n − 1) 2^k].

That is, the number of examples needed is linear in the number of features in the instance language. Since the size of the 'unbiased' hypothesis space is doubly exponential in the number of features, requiring an exponential number of examples, it seems that the tree structure represents a very strong bias, even beyond that provided by the restriction to a circumscribed set of primitive features. For comparison, a strict conjunctive bias also requires a linear number of examples. In addition, having an explicit formula for the size of the hypothesis space from a given derivation allows the system to minimize the size of the hypothesis space by choosing appropriate derivation paths when generating a bias.
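The following sketch evaluates the bound of Theorem 3 together with a standard PAC-style sample bound of the kind used in the corollary, for an invented setting (n = 20 operational features, branching bound k = 3, ε = 0.1, δ = 0.05); the exact constants in the paper's corollary may differ.

```python
from math import log, ceil

def tree_bias_bound(n_leaves, k):
    """Upper bound from Theorem 3 on the number of rules consistent with a
    tree-structured bias: (2^(2^k))^(n-1)."""
    return (2 ** (2 ** k)) ** (n_leaves - 1)

def sample_bound(ln_hypotheses, eps, delta):
    """Standard PAC-style bound: m = (1/eps)(ln|H| + ln(1/delta))."""
    return ceil((ln_hypotheses + log(1 / delta)) / eps)

# Hypothetical setting: n = 20 operational features, branching factor at most k = 3.
n, k, eps, delta = 20, 3, 0.1, 0.05

ln_flat = (2 ** n) * log(2)            # |H| = 2^(2^n): every boolean function of n features
ln_tree = log(tree_bias_bound(n, k))   # |H| <= (2^(2^k))^(n-1)

print(int(ln_flat / log(10)) + 1)      # 315653 decimal digits in the flat hypothesis count
print(int(ln_tree / log(10)) + 1)      # 46 digits under the tree-structured bound
print(sample_bound(ln_tree, eps, delta))   # ~1.1e3 examples: linear in n
print(sample_bound(ln_flat, eps, delta))   # ~7.3e6 examples: exponential in n
```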

To achieve learnability in the sense of Valiant [70], we must find a polynomial-time algorithm for generating hypotheses consistent with the tree-structured bias and a set of examples. Such an algorithm has been found for the case in which the functions at each internal node of the tree are restricted to be monotone (the algorithm uses membership queries rather than randomly selected examples). The general case seems more difficult. The natural process for identifying the correct rule is simply to identify the correct rule for each subtree in a bottom-up fashion, by generating experiments that vary the features in the subtree, keeping other features constant. Since, by construction, internal nodes of the tree are not easily observable, the induction process is far from trivial. Warmuth (personal communication) has shown that a general solution to this problem would also provide a solution to the predictability problem for k-term DNF formulae. This has been an open problem since 1984. However, it should be noted that, for our purposes, even a demonstration of intractability would be only an inconvenience rather than a fundamental limitation. Our main claim is that prior knowledge can allow successful predictive behaviour from a small number of examples by an autonomous learning agent. The following subsections describe ways in which the hypothesis space can be further restricted to increase inductive efficiency.

5.3. Adding additional knowledge

Although the tree-structured bias imposes a strong restriction on the hypothesis space, we are still a few steps away from achieving powerful learning from examples in complex domains. Particularly when the individual features used in the language have large ranges of possible values, the tree-structured bias derived using a knowledge base of determinations does not allow the learner to generalize quickly, resulting in slow progress in covering the domain. For example, consider the Meta-DENDRAL bias derivation in Fig. 4, adapted from [55]: at the Element node, the learner could be forced to enumerate all 92 naturally-occurring elements, thereby creating a highly disjunctive theory. Instead, we would like it to consider appropriate, more general, classes of elements, such as Group IV elements, non-metals, highly electronegative elements, and so on. In standard learning systems, this is achieved using a 'concept hierarchy'. Rather than form a disjunctive rule (say involving Carbon OR Silicon), one 'climbs the generalization tree' by using a more general term such as Group IV element. This gives considerably greater predictive coverage, since a rule for Group IV elements could be formed without having to see examples of all of those elements. However, such generalizations do not come for free: a system designed without regard for the laws of chemistry could easily commit gross errors in generalizing from data. Generalization to the class of elements with long names would be inappropriate. Therefore, we claim that the use of a given concept hierarchy reflects definite domain knowledge.

It appears that the concept hierarchy above any predicate in a bias derivation reflects knowledge of how the determination involving that predicate came about. In other words, it forms a partial explanation of the determination. This may indicate the need for an additional 'phase' in the induction process using a tree-structured bias: after the tree is constructed, each determination link should be 'explained', by expansion into a local tree structure (possibly consisting of rules as well as determinations), in order to restrict the hypothesis space still further. In this way, the effect of a concept hierarchy appropriate to the situation is obtained.

This expansion technique may also help to alleviate combinatorial search problems that may arise in trying to find an operational instance language: just as in normal rule-based reasoning, determinations may be chained together to form new determinations that allow 'macro-steps' to be taken in the search space. Once the search has reached a suitable set of leaf nodes, the determinations used can be expanded out again to create a more detailed tree that therefore corresponds to a more restricted hypothesis space.

As we discuss in more detail below, the process of incorporating observations into the tree-structured hypothesis space to learn a rule amounts to identifying the initially unknown function at each internal node of the tree (for example, the functions f, g, h, j in Eq. (1) above). Obviously, if we have extra knowledge constraining the identity of these internal functions, then once a suitable tree has been constructed, this knowledge can be immediately accessed to provide additional guidance for the incorporation of examples. Mitchell [39] has found that an autonomous learning robot can benefit from additional knowledge stating that certain dependencies are monotonic. For example, his robot knows that the moment of a force about a point is determined by the distance from its point of application, but also that the dependence is a monotonically increasing one.

Fig. 4. Derivation of the Meta-DENDRAL bias (a tree of determinations leading from features such as Element, Orbitals, AtomChemistry, StructuralFormula, Topology and MSBehaviour to the goal concept Break(mol, site)).

6. Updating the hypothesis space

To complete the edifice built around the notion of declarative bias, it remains to show how the process of updating the hypothesis space with new instances can be implemented as a normal first-order deduction. We first describe the simplest approaches using the Disjunctive Definability Axiom, and its determination form. We then discuss more practical implementations for stronger versions of the concept language bias, in particular the tree-structured bias.

The simple-minded approach to updating the version space is to do forward resolution between the instance observation facts and the disjuncts in the DDA, i.e., the candidate concept descriptions. Effectively, each candidate Cj will resolve against an instance that contradicts it, with the help of the system's background knowledge (the articulation theory). Thus, as more instances are observed, the DDA will shrink, retaining only those concept descriptions that are consistent with all the instances. Classification of a new instance ai using an intermediate version space can be done by a resolution proof for the goals Q(ai) and ¬Q(ai) using the current DDA as the database. (Note that these processes are in general only semi-decidable.) The algorithms can be simply stated as follows:

Updating the version space (simple DDA method):
(1) For each instance description Di(ai), together with its classification Q(ai) or ¬Q(ai):
    (a) Resolve the instance description against each remaining disjunct of the DDA.
    (b) If a contradiction is found with a disjunct, remove the disjunct from the DDA.
    (c) Otherwise do nothing.
(2) If one disjunct remains in the DDA, return it as the concept definition.
(3) If no disjuncts remain, we have a contradiction, and the bias needs to be weakened.⁵

Classifying new instances (simple DDA method):
(1) Given an instance description Di(ai) (classification unknown).
(2) Add it to the DDA and attempt to prove a contradiction with the positive goal Q(ai). If a contradiction appears, ai is a negative instance.
(3) Add it to the DDA and attempt to prove a contradiction with the negated goal ¬Q(ai). If a contradiction appears, ai is a positive instance.
(4) Otherwise, there is insufficient information to classify ai.

⁵ Alternatively, if the domain is noisy, we may wish to allow a certain percentage of classification errors.

In the straightforward method using an explicit DDA, each disjunct of the DDA must be put in conjunctive normal form, since the resolutions are carried out separately between the instances and each disjunct. No useful resolutions are lost, since all the disjuncts of the DDA are mutually inconsistent by definition.
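A propositional sketch of the simple DDA updating method is given below; the candidate concept definitions and observations are hypothetical, and direct consistency testing stands in for the resolution proofs of the first-order formulation.

```python
# Candidate concept definitions (the disjuncts of a toy DDA), represented
# extensionally as boolean functions over attribute dictionaries.
candidates = {
    "large":           lambda d: d["large"],
    "pointy_teeth":    lambda d: d["pointy_teeth"],
    "large_and_teeth": lambda d: d["large"] and d["pointy_teeth"],
}

observations = [  # (instance description, classification: Q(a) or not Q(a))
    ({"large": True, "pointy_teeth": True},  True),
    ({"large": True, "pointy_teeth": False}, False),
]

def update(candidates, observations):
    """Keep only the disjuncts of the DDA consistent with every observation."""
    surviving = dict(candidates)
    for description, is_positive in observations:
        surviving = {name: c for name, c in surviving.items()
                     if c(description) == is_positive}
    return surviving

print(sorted(update(candidates, observations)))   # ['large_and_teeth', 'pointy_teeth']
```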

The determination representation for the hypothesis space can also be used directly with a deductive updating approach. This can be illustrated using the simple one-predicate language in Fig. 1. The determination form is

iP(x) ≻ kQ(x).

The truth-valued variables can be handled in a straightforward fashion. The formula is transformed into CNF as usual. The standard resolution test for complementary literals is altered to allow for the presence of the truth-valued variables, which effectively unify with the 'sign' of the literal being matched. Hence ¬P(a) is complementary to kP(a) with unifier {k/+}.⁶ The premises for the forward resolution inference are the determination and instance facts, here one negative and one positive instance:

¬iP(x) ∨ ¬iP(y) ∨ ¬kQ(x) ∨ kQ(y)    (2)
P(a)    (3)
Q(a)    (4)
¬P(b)    (5)
¬Q(b).    (6)

Resolving 2 and 3 with unifier {x/a, i/+} we get

¬P(y) ∨ ¬kQ(a) ∨ kQ(y).    (7)

Resolving 4 and 7 with unifier {k/+} we get

¬P(y) ∨ Q(y).    (8)

Resolving 2 and 5 with unifier {y/b, i/−} we get

P(x) ∨ ¬kQ(x) ∨ kQ(b).    (9)

Resolving 6 and 9 with unifier {k/+} we get

P(x) ∨ ¬Q(x).    (10)

8 and 10 together give us the concept definition Q = P. Clearly, with more than one predicate in the language, the determination method is much more efficient than the DDA method.

⁶ The propriety of this treatment is assured, since we can view it as theory resolution with a theory containing equivalences for all literals, such that each literal is equivalent to an extended literal with an extra argument. For example, P(a) becomes P(a, +), and ¬P(a) becomes P(a, −).
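Operationally, the deductive force of the determination can be sketched without the resolution machinery: under P ≻ Q, a new instance that matches a stored, classified instance on P must receive the same classification. The attribute names and data below are hypothetical.

```python
def classify(new_instance, memory, p_attrs, q_attr):
    """Under the determination P > Q, any new instance agreeing with a stored
    instance on the P-attributes is forced to share its classification."""
    for old in memory:
        if all(new_instance[a] == old[a] for a in p_attrs):
            return old[q_attr]          # forced by the determination
    return None                         # insufficient information to classify

memory = [
    {"P": True,  "Q": True},            # the positive instance a
    {"P": False, "Q": False},           # the negative instance b
]

print(classify({"P": True},  memory, ["P"], "Q"))   # True  (matches Q = P)
print(classify({"P": False}, memory, ["P"], "Q"))   # False
```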

6.1. Updating using a tree-structured bias

Clearly, the DDA approach is impractical when the space of hypotheses is too large. The determination form is compact and efficient for an unbiased space, but the updating procedure needs to be elaborated considerably to deal with a tree-structured bias. Although any hypothesis space can be searched by techniques that amount to simulating the current-best-hypothesis search of Winston [72], the obvious direction is to take advantage, as Mitchell did, of the generalization partial ordering between concept descriptions to express compactly the set of hypotheses consistent with both the examples and the original bias. The tree structure of the hypothesis space allows for a somewhat more localized updating process than is the case with Mitchell's candidate elimination process. Essentially, the tree-structured bias presents a set of smaller learning problems, namely to identify the unknown function at each internal node in the tree and at the root node. The identification can be done using any of a number of inductive methods. The most straightforward is to use a version space at each node, with the classification information needed to make updates being gradually propagated from the top and bottom of the tree as new examples come in. Additional constraints, such as concept hierarchies, monotonic dependencies, or even complete theories for the internal nodes, can be easily incorporated into such an algorithm. A preliminary version of this algorithm has been implemented and tested, but is still under development, and its complexity is not yet known. As pointed out above, a polynomial-time algorithm for finding a hypothesis consistent with a set of examples would solve a long-standing open problem in learning theory, namely the predictability of small DNF formulae. In practice, it is not necessary to find completely-specified rules consistent with the data: often it is sufficient simply to constrain the function at a node in the tree, so that for example after a few cases one might discover that the dependence of weight on calorie intake was monotonic, and this could enable successful classification of future cases.

A perhaps more radical and interesting approach is to solve each identification problem using a connectionist network, one for each node in the tree. In cases where little or no structuring information is available for the version space for a given node, the connectionist approach can help to induce additional structure and generate new terms to simplify the overall concept description. From the point of view of the connectionist enterprise, the knowledge-based derivation of the tree-structured bias provides an ideal way to integrate prior knowledge into a connectionist learning system, since it strongly restricts the typically enormous weight spaces that would otherwise be searched. A connectionist approach to learning the node functions has the additional advantage that information can propagate through the tree faster, since each subnetwork will classify its inputs sooner than the least-commitment version-space learning algorithm, and will be able to tolerate better the inevitable noise this will entail. An experimental research program has been initiated to explore this avenue [58].

7. Future work

Several meaty theoretical and implementation tasks remain before the model of autonomous learning can be fully realized. The tasks form an approximately well-structured sequence as follows:

Extension of the bias derivation subsystem
A pilot version of the bias derivation phase is already in operation, providing proof-of-concept, but needs to be extended to handle non-trivial operationality theories and new inference rules, and to take into account version space size using the formulae developed in [57]. As new inference rules are introduced, the formulae for version space size will need to be extended.

Development of domain theories
We are currently working to build a knowledge base of determinations concerned with the various aspects of molecular structure and resulting physical and chemical properties. We intend to use this as a basis for demonstrating autonomous knowledge-guided learning and experimentation on a variety of goal concepts. After this, we expect to work with other experts in the domains of mechanical device design and diagnosis, creditworthiness assessment, and medicine. Another possibility being explored with the robotics faculty at Berkeley is that of providing a robot with a partial theory of its environment and capabilities, and having it generate the necessary practical experiments to provide the information required by its problem-solver. A ball-throwing robot, for example, might run a few timing tests to determine the acceleration due to gravity in its neighbourhood.

Learning determinations
Determination knowledge bases can be constructed by induction over standard rule bases and case libraries, as discussed in [54]. However, the algorithms given there are quite rudimentary, and do not use prior knowledge to reduce search. It is likely that adapting the above ideas to the acquisition of determinations from examples should be fairly straightforward, since the unbiased hypothesis space for determinations is the powerset of the set of all predicates applicable to the domain of the goal concept. Some modifications will be needed to deal with the fact that determinations must be learned from pairs of matching examples.

Bias shift as nonmonotonic reasoning
It seems clear that the kind of reasoning leading to a strong inductive bias can seldom be guaranteed correct; typically, the premises in such derivations should be viewed as defaults. As in the STABB system [68,69], when experience of actual cases contradicts an inductive bias the system must fall back to a weaker bias. This process, called bias shift, can be formally modelled as a prioritized, non-monotonic reasoning process in a formalism developed by Grosof [21], and demonstrated in [55]. When the instances, which are usually considered as having the highest priority, contradict the original bias, then a weaker, lower-priority bias inference can go ahead, itself subject to revision if necessary. For example, in finding rules to predict the weather, one's knowledge of physics would suggest ignoring the day of the week, but when one cannot otherwise explain a weekly variation, one might add in the further consideration of weekday smog production. Work on the implementation of non-monotonic reasoning, for example [18], has progressed to a point at which it is now feasible to extend our model of autonomous learning to include this kind of bias shift capability.

Explicit uncertainty
As a complement to the ability to use non-monotonic inference to handle uncertainty in biases, it is important to be able to handle premises with attached probabilities, and to be able to generate explicitly probabilistic rules if necessary. Determinations that are less than completely certain are called partial determinations, and their probabilistic definition has been developed [54]. Mahadevan and Tadepalli [31] have shown that partial determinations in a background theory can also constrain a learning problem so as to be tractable, given a sufficiently small deviation from complete certainty. The certainty of a determination can be increased by adding extra premises, but this reduces its utility as an inductive bias by enlarging the corresponding hypothesis space. It should therefore be possible to find appropriate trade-offs between certainty and strength of bias. In addition, quantified uncertainty in the bias should enable the system to make principled choices between detailed, high-certainty but possibly overfitted theories, and general, efficient but possibly inaccurate theories [48].
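One simple way to estimate the strength of a partial determination from data is the fraction of matching pairs that also agree on the goal attribute; the sketch below uses that frequency estimate (the data are invented, and this is not necessarily the probabilistic definition developed in [54]).

```python
from itertools import combinations

def determination_degree(records, p_attrs, q_attr):
    """Fraction of record pairs agreeing on p_attrs that also agree on q_attr:
    1.0 for a full determination, lower values for partial ones."""
    agree_p = agree_pq = 0
    for r1, r2 in combinations(records, 2):
        if all(r1[a] == r2[a] for a in p_attrs):
            agree_p += 1
            agree_pq += (r1[q_attr] == r2[q_attr])
    return agree_pq / agree_p if agree_p else None

# Hypothetical data: the premise mostly, but not always, predicts the outcome.
cases = [
    {"smoker": True,  "disease": True},
    {"smoker": True,  "disease": True},
    {"smoker": True,  "disease": False},
    {"smoker": False, "disease": False},
    {"smoker": False, "disease": False},
]

print(determination_degree(cases, ["smoker"], "disease"))   # 0.5
```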

Incorporation into an autonomous agent
As part of the RALPH project at Berkeley, a simulated, multi-agent, non-deterministic environment has been built and serves as the testbed for our research on ralphs. Agents in this environment must process raw sensory data into theories of the environment which are used to make decisions. Inductive reasoning must take place at both the base-level (theories of the world) and the meta-level (theories for control reasoning). As each agent acquires a knowledge base of determinations, its inductive performance for subsequent tasks will improve. We will examine the effect of the nature of the environment on the efficacy of this bootstrapping process.

New term generation
In the inductive context, the main effect of new term generation will be to simplify the description of interesting concepts, and thereby to make them more accessible to resource-bounded inductive mechanisms which use a strong syntactic bias. Muggleton and Buntine [42] have implemented a system, CIGOL, that uses reverse resolution to generate the most compact theory that implies a set of ground facts. In the reverse resolution process, new predicates may be introduced that have partial definitions in terms of existing predicates. We have re-implemented the CIGOL system, and intend to analyse its behaviour when given a background theory consisting partly of determinations. The effect of combining new-term generation methods with the knowledge-based inductive methods described above will be investigated in the RALPH system. A particular goal of this investigation is to find a 'synthesis route' for the notions of position, space and motion, given that the agent begins with only sensory primitives.

Scientific inference

A complete model, however skeletal, of autonomous learning cannot avoid impinging on the study of scientific inference, as carried out by methodologists and philosophers of science for many centuries. Appropriately, and in anticipation of our efforts at mechanization, Putnam [46] urges us to abandon the search for a confirmation theory giving the relationship between data and degree of belief in a hypothesis; instead, we are to study the process of selecting a consistent theory from among a set of possibilities. As in other proposals, however, including confirmation theory itself, no attempt is made to identify mechanisms leading to the generation of a hypothesis or set of hypotheses for assessment or selection. Prior knowledge clearly plays an important role in the generation of hypotheses in scientific research and discovery. Several AI researchers have examined 'discovery learning' [27,28,29,51,73], but to date little work has been done to model the use of existing domain knowledge to guide the process of investigation. Machine learning investigators and scientists alike may spend many hours deciding on an appropriate space of hypotheses and an appropriate description language for examples, before embarking on actual experiments. For instance, a chemist attempting to understand the behaviour of molecules in a mass spectroscope ignores the nuclear spin states and isotopic species of her sample's molecules, concerning herself only with topological structure. A specialist in nuclear magnetic resonance does exactly the opposite. Without such skills, the range of possible experiments would be vast.

We hope to study cases of scientific experimentation to ascertain the possible state of knowledge of the experimenters leading to the selection of an original set of hypotheses, following the lines sketched above for the analysis of Meta-DENDRAL. Examples include the development of the laws of gravity and of electrostatic attraction, both simple theories, and the refinement of our understanding of gene expression mechanisms, a more complex task. The latter investigation will be based on the already well-developed formalization of the domain theory possessed by the original investigators, reported by Karp [25].

8. Summary

Lest it be swamped by vast hypothesis spaces, an autonomous system must use all the knowledge it possesses to make its inductive learning maximally effective. The declarative expression and deductive generation of bias hold promise for allowing the creation of autonomous learning systems. Autonomy is clearly essential in many situations, but in addition the proposed system will relieve humans of the exceedingly difficult task of hand-coding a bias for each individual learning task.⁷ It is fair to say that this bottleneck has been the major reason for the non-existence of learning systems as general add-ons to performance programs. It is hoped that this work will also shed light on the process of bias creation, and on the more general problem of hypothesis generation in scientific research.

The kind of system being proposed is one in which any and all available knowledge can be brought to bear on each learning problem, so that as more is learnt, more can be learnt. A theoretical basis has been described, directions for further work have been mapped out and a system architecture has been outlined. Although the ultimate goals of the research are long-term, it is hoped that payoff in the form of significantly more applicable machine learning systems will appear in the near future.

⁷ Quinlan [47] has reported a span of 3 months for the creation of an appropriate bias for his chess end game application.


References

[1] D. Angluin and C.H. Smith, Inductive inference: Theory and methods, Computing Surveys 15 (3) (1983) 237-269.

[2] L. Blum and M. Blum, Toward a mathematical theory of inductive inference, Information and Control 28 (1975) 125-155.

[3] J.S. Bruner, J.J. Goodnow and G.A. Austin, A Study of Thinking (Wiley, New York, 1956).

[4] B.G. Buchanan and T.M. Mitchell, Model-directed learning of production rules, in: D.A. Waterman and F. Hayes-Roth, eds., Pattern-Directed Inference Systems (Academic Press, New York, 1978).

[5] B.G. Buchanan, T.M. Mitchell, R.G. Smith and C.R. Johnson, Jr., Models of learning systems, Technical report STAN-CS-79-692, Computer Science Department, Stanford University, Stanford, CA (1979).

[6] A. Bundy, B. Silver and D. Plummer, An analytical comparison of some rule-learning programs, Artificial Intelligence 27 (1985).

[7] W. Buntine, Generalized subsumption and its application to induction and redundancy, Proc. of ECAI-86, Brighton (1986).

[8] E. Charniak and D. McDermott, Introduction to Artificial Intelligence (Addison-Wesley, Reading, MA, 1985).

[9] T. Davies, Analogy, Informal Note CSLI-IN-85-4, CSLI, Stanford, CA (1985).

[10] T.R. Davies and S.J. Russell, A logical approach to reasoning by analogy, Proc. of IJCAI-87, Milan (Morgan Kaufmann, Los Altos, CA, 1987).

[11] T.R. Davies and S.J. Russell, Relevance and uniformity: Extensions to determination-based learning. Unpublished manuscript (1988).

[12] T.G. Dietterich, Learning at the knowledge level, Machine Learning 1 (3) (1986).

[13] N.S. Flann and T.G. Dietterich, Selecting appropriate representations for learning from examples, Proc. of the Fifth National Conference on Artificial Intelligence (Morgan Kaufmann, Philadelphia, PA, 1986).

[14] R.M. Friedberg, A learning machine: Part 1, IBM Journal 2 (1958) 2-13.

[15] R. Friedberg, B. Dunham and T. North, A learning machine: Part 2, IBM Journal of Research and Development 3 (1959) 282-287.

[16] L.-M. Fu, Learning Object-level and Meta-level Knowledge for Expert Systems, Ph.D. thesis, Computer Science Department, Stanford University, Stanford, CA (1985).

[17] M.R. Genesereth, An overview of meta-level architecture, Proc. of AAAI-83, Austin, TX (Morgan Kaufmann, Los Altos, CA, 1983) 119-124.

[18] M.L. Ginsberg, A circumscriptive theorem prover: Preliminary report, Proc. of the Seventh National Conference on Artificial Intelligence, Minneapolis, MN (Morgan Kaufmann, Los Altos, CA, 1988).

[19] E.M. Gold, Language identification in the limit, Information and Control 10 (1967) 447-474.

[20] N. Goodman, Fact, Fiction and Forecast (Harvard University Press, Cambridge, MA, 1955).

[21] B.N. Grosof, Non-monotonic theories: Structure, inference, and applications (working title), Ph.D. thesis (in preparation), Stanford University, Stanford, CA.

[22] D. Haussler, Quantifying inductive bias: AI learning algorithms and Valiant's learning framework, Technical report, Department of Computer Science, University of California, Santa Cruz, CA (1988).

[23] D. Haussler, Theoretical results in machine learning, Invited talk, Fifth International Machine Learning Conference, Ann Arbor, MI (1988).

[24] H. Hirsh, Explanation-based generalization in a logic programming environment, Proc. of the Tenth International Joint Conference on Artificial Intelligence, Milan (1987).

[25] P.D. Karp, A process-oriented model of bacterial gene regulation, Unpublished manuscript, Knowledge Systems Laboratory, Stanford University, Stanford, CA.

[26] R.M. Keller, Defining operationality for explanation-based learning, Proc. of the Sixth National Conference on Artificial Intelligence, Seattle, WA (1987).

[27] D.B. Lenat, AM: An artificial intelligence approach to discovery in mathematics as heuristic search, Ph.D. thesis, Computer Science Department, Stanford University, Stanford, CA (1976).

[28] D.B. Lenat, Theory formation by heuristic search: The nature of heuristics II: Background and examples, Artificial Intelligence 21 (1983) 31-59.

[29] D.B. Lenat, EURISKO: A program that learns new heuristics and domain concepts, The nature of heuristics III: Program design and results, Artificial Intelligence 21 (1983).

[30] D. Lenat, M. Prakash and M. Shepherd, CYC: Using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks, AI Magazine 6 (1986) 65-85.

[31] S. Mahadevan and P. Tadepalli, On the tractability of learning from incomplete theories, Proc. of the Fifth International Machine Learning Conference, Ann Arbor, MI (Morgan Kaufmann, Los Altos, CA, 1988).

[32] R.S. Michalski, A theory and methodology of inductive learning, Artificial Intelligence 20 (2) (1983).

[33] J.S. Mill, (1843), System of Logic, Book III Ch XX 'Of Analogy', in: Vol. VIII of Collected Works of John Stuart Mill (University of Toronto Press, 1973).

[34] T.M. Mitchell, Version spaces: An approach to concept learning, Ph.D. thesis, Computer Science Department, Stanford University, Stanford, CA (1978).

[35] T.M. Mitchell, The need for biases in learning generalizations, Technical report CBM-TR-117, Computer Science Department, Rutgers University, New Brunswick, NJ (1980).

[36] T.M. Mitchell, Generalization as search, Artificial Intelligence 18 (2) (1982) 203-226.

[37] T.M. Mitchell, P. Utgoff and R. Banerji, Learning by experimentation: Acquiring and refining problem-solving heuristics, in: J.G. Carbonell, R. Michalski and T. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach (Tioga Press, Palo Alto, CA, 1983).

[38] T.M. Mitchell, R.M. Keller and S.T. Kedar-Cabelli, Explanation-based generalization: A unifying view, Machine Learning 1 (1986) 47-80.

[39] T.M. Mitchell, Can we build learning robots? Proc. of the workshop on representation and learning in an autonomous agent, Lagos, Portugal (in press).

[40] S.H. Muggleton, DUCE: An oracle based approach to constructive induction, Proc. of the Tenth International Joint Conference on Artificial Intelligence, Milan, Italy (1987).

[41] S.H. Muggleton, A strategy for constructing new predicates in first-order logic, Proc. of the Third European Working Session on Learning (Pitman, Glasgow, 1988).

[42] S.H. Muggleton and W. Buntine, Machine invention of first-order predicates by inverting resolution, Proc. of the Fifth International Machine Learning Conference, Ann Arbor, MI (Morgan Kaufmann, Los Altos, CA, 1988).

[43] M. Pazzani, M. Dyer and M. Flowers, The role of prior causal theories in generalization, Proc. of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA (Morgan Kaufmann, Los Altos, CA, 1986).

[44] L. Pitt and M. Warmuth, Prediction-preserving reducibility, Technical report UCSC-CRL-88-26, Computing Research Laboratory, University of California, Santa Cruz, CA (1988).

[45] G.D. Plotkin, A note on inductive generalization, in: B. Meltzer and D. Michie, eds., Machine Intelligence 5 (Elsevier, New York, 1970).

[46] H. Putnam, Probability and confirmation, in: Mathematics, Matter and Method (Cambridge University Press, Cambridge, 1975).

[47] J.R. Quinlan, Learning efficient classification procedures and their application to chess end games, in: J.G. Carbonell, R. Michalski and T. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach (Tioga Press, Palo Alto, CA, 1983).

[48] J.R. Quinlan, Induction of decision trees, Machine Learning 1 (1) (1986) 81-106.

[49] L. Rendell, A general framework for induction and a study of selective induction, Machine Learning 1 (1986).

[50] L. Rendell, R. Seshu and M. Tcheng, Dynamically variable bias management for robust concept learning, Proc. of the Tenth International Joint Conference on Artificial Intelligence, Milan, Italy (Morgan Kaufmann, Los Altos, CA, 1987).

[51] D. Rose and P. Langley, Chemical discovery as belief revision, Machine Learning 1 (1986) 423-451.

[52] S.J. Russell, The compleat guide to MRS, Technical Report No. STAN-CS-85-1080, Stanford University, Stanford, CA (1985).

[53] S.J. Russell, Preliminary steps toward the automation of induction, Proc. of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA (Morgan Kaufmann, Los Altos, CA, 1986).

[54] S.J. Russell, Analogical and inductive reasoning, Ph.D. thesis, Computer Science Department, Stanford University, Stanford, CA (1986).

[55] S.J. Russell and B.N. Grosof, A declarative approach to bias in concept learning, Proc. of the Sixth National Conference on Artificial Intelligence, Seattle, WA (1987).

[56] S.J. Russell and D. Subramanian, Mutual constraints on representation and inference, in: P. Brazdil, ed., Proc. of the Workshop on Machine Learning, Meta-Reasoning and Logics, Sesimbra, Portugal (1988).

[57] S.J. Russell, Tree-structured bias, Proc. of the Seventh National Conference on Artificial Intelligence, Minneapolis, MN (Morgan Kaufmann, Los Altos, CA, 1988).

[58] S.J. Russell, The Use of Knowledge in Analogy and Induction (Pitman, London, 1989).

[59] S.J. Russell, Execution architectures and compilation, Proc. of the Eleventh International Joint Conference on Artificial Intelligence, Detroit, MI (Morgan Kaufmann, Los Altos, CA, 1989).

[60] J.C. Schlimmer, Incremental adjustment of representations for learning, Proc. of the Fourth International Workshop on Machine Learning, University of California, Irvine, CA (Morgan Kaufmann, Los Altos, CA, 1987).

[61] E.Y. Shapiro, Inductive inference of theories from facts, Technical Report 192, Department of Computer Science, Yale University, New Haven, CT (1981).

[62] A. Shapiro, Structured Induction (Kluwer Academic Publishers, Amsterdam, 1987).

[63] H.A. Simon, The Sciences of the Artificial (MIT Press, Cambridge, MA, 1982).

[64] H.A. Simon, Why should machines learn? in: J.G. Carbonell, R. Michalski and T. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach (Tioga Press, Palo Alto, CA, 1983).

[65] H.A. Simon and G. Lea, Problem solving and rule induction: A unified view, in: L.W. Gregg, ed., Knowledge and Cognition (Erlbaum, Hillsdale, NJ, 1974).

[66] D. Subramanian and J. Feigenbaum, Factorization in experiment generation, Proc. of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA (Morgan Kaufmann, Los Altos, CA, 1986).

[67] J.D. Ullman, Principles of Database Systems (Computer Science Press, 1983).

[68] P.E. Utgoff, Shift of Bias for Inductive Concept Learning, Ph.D. thesis, Computer Science Department, Rutgers University, New Brunswick, NJ (1984).

[69] P.E. Utgoff, Shift of bias for inductive concept learning, in: J.G. Carbonell, R. Michalski and T. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach, Vol. II (Morgan Kaufmann, Los Altos, CA, 1986).

[70] L.G. Valiant, A theory of the learnable, Communications of the ACM 27 (1984) 1134-1142.

[71] S. Watanabe, Knowing and Guessing: A Formal and Quantitative Study (Wiley, New York, 1969).

[72] P. Winston, Learning structured descriptions from examples, Ph.D. thesis, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA (1970).

[73] J.M. Zytkow and H.A. Simon, A theory of historical discovery: The construction of componential models, Machine Learning 1 (1986) 107-136.