
Learning

Learning is the holy grail of AI. If we can build systems that learn, then we can begin with minimal information and high-level strategies and have the systems improve themselves, avoiding the "knowledge engineering bottleneck" where everything must be hand-coded. Effective learning is, however, very difficult.

Goal

"Any change in a system that allows it to perform better the second time on repetition of the same task or on another task drawn from the same population" (Herbert Simon, 1983).

Machine Learning

Symbol-based approach: A set of symbols represents the entities and relationships of a problem domain; the task is to infer useful generalizations of concepts.

Connectionist approach: Knowledge is represented by patterns in a network of small, simple processing units; the task is to recognize invariant patterns in data and represent them in the network's structure.

Machine Learning (cont'd)

Genetic algorithms: A population of candidate solutions mutate, combine with one another, and are selected according to a fitness measure.

Stochastic methods: New results are based on both the knower's expectations and the data (Bayes' rule); often implemented using Markov processes.

Types of Learning

Supervised learning: Training examples, both positive and negative, are classified by a teacher for use by the learning algorithm.

Unsupervised learning: Training examples are not classified by a teacher; category formation (conceptual clustering) is an example.

Reinforcement learning: The agent receives feedback from the environment as it acts.

Categorization: Symbol-based

What is the data?
What are the goals?
How is knowledge represented?
What is the concept space?
What operations may be performed on concepts?
How is the concept space searched (heuristics)?

Example – Arch recognition

Problem: How to recognize the concept of 'arch' from building blocks (Winston).

Symbolist
Supervised learning
Both positive and negative examples (near-misses)
KR is by semantic networks
Graph modification, node generalization
Search is data-driven

Example (cont'd)

part(arch, x), part(arch, y), part(arch, z)

type(x, brick), type(y, brick), type(z, brick)

supports(x, z), supports(y,z)

Example (cont'd)

part(arch, x), part(arch, y), part(arch, z)

type(x, brick), type(y, brick), type(z, pyramid)

supports(x, z), supports(y,z)

Example (cont'd)

Background knowledge: isa(brick, polygon),

isa(pyramid, polygon)

Generalization:

part(arch, x), part(arch, y), part(arch, z)

type(x, brick), type(y, brick), type(z, polygon)

supports(x, z), supports(y,z)

Negative Example: Near Miss

part(arch, x), part(arch, y), part(arch, z)

type(x, brick), type(y, brick), type(z, brick)

supports(x, z), supports(y,z)

touches(x,y), touches(y,x)

Generalization

part(arch, x), part(arch, y), part(arch, z)

type(x, brick), type(y, brick), type(z, brick)

supports(x, z), supports(y,z)

~touches(x,y), ~touches(y,x)
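The following is a minimal Python sketch, not Winston's actual program, of how these two updates might be carried out when descriptions are encoded as sets of predicate tuples; the encoding, the generalize/constrain helpers, and the one-level isa lookup are illustrative assumptions.

ISA = {"brick": "polygon", "pyramid": "polygon"}   # background knowledge

def common_ancestor(a, b):
    # Climb the (one-level) isa hierarchy to a type covering both values.
    if a == b:
        return a
    if ISA.get(a) is not None and ISA.get(a) == ISA.get(b):
        return ISA[a]
    return None

def generalize(concept, positive):
    # Relax type constraints so the concept also matches the new positive
    # example (replace a property by a more general property); conjuncts with
    # no counterpart in the positive example are dropped.
    new = set()
    for pred in concept:
        if pred in positive:
            new.add(pred)
        elif pred[0] == "type":
            other = next(q for q in positive if q[:2] == pred[:2])
            anc = common_ancestor(pred[2], other[2])
            if anc:
                new.add(("type", pred[1], anc))
    return new

def constrain(concept, near_miss, positive):
    # The relations present in the near-miss but absent from a known positive
    # example are what disqualify it, so they become must-not constraints.
    return concept | {("not",) + p for p in near_miss - positive}

arch1 = {("part", "arch", "x"), ("part", "arch", "y"), ("part", "arch", "z"),
         ("type", "x", "brick"), ("type", "y", "brick"), ("type", "z", "brick"),
         ("supports", "x", "z"), ("supports", "y", "z")}
arch2 = (arch1 - {("type", "z", "brick")}) | {("type", "z", "pyramid")}
near_miss = arch1 | {("touches", "x", "y"), ("touches", "y", "x")}

concept = generalize(arch1, arch2)              # type(z, brick) -> type(z, polygon)
concept = constrain(concept, near_miss, arch1)  # adds not-touches constraints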

Version Space Search (Mitchell)

The problem is to find a general concept (or set of concepts) that includes the positive examples and excludes the negative ones.

Symbolist
Supervised learning
Both positive and negative examples
Predicate calculus
Generalization operations
Search is data-driven

Generalization Operators

Replace a constant with a variable:
color(ball, red) -> color(X, red)

Drop a conjunct:
shape(X, round) ^ size(X, small) ^ color(X, red) -> shape(X, round) ^ color(X, red)

Add a disjunct:
shape(X, round) ^ color(X, red) -> shape(X, round) ^ (color(X, red) v color(X, blue))

Replace a property by a more general property:
color(X, red) -> color(X, primary_color)

More General Concept

Concept p is more general than concept q (or p

covers q) if the set of elements that satisfy p is a

superset of the set of elements that satisfy q. If p(x)

and q(x) are descriptions that classify objects as

positive examples, then

p(x) -> positive(x) |= q(x) -> positive(x).
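For the sketches that follow it is convenient to use a simplified attribute-vector encoding of concepts (an illustrative assumption; the slides work in predicate calculus), where '?' stands for "any value". The covering and more-general-than relations then reduce to an attribute-by-attribute check:

def covers(p, instance):
    # p covers the instance if every non-'?' attribute agrees.
    return all(a == "?" or a == b for a, b in zip(p, instance))

def more_general(p, q):
    # p is at least as general as q if p covers everything q covers.
    return all(a == "?" or a == b for a, b in zip(p, q))

assert covers(("round", "?"), ("round", "red"))
assert more_general(("?", "red"), ("round", "red"))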

Version Space

The version space is the set of all concept descriptions that are consistent with the training examples. Mitchell created three algorithms for finding the version space: specific-to-general search, general-to-specific search, and the candidate elimination algorithm, which works in both directions.

Specific to General Search

S = {first positive training instance};
N = {};  // set of all negative instances seen so far

for each positive instance p {
    for every s ∊ S, if s does not match p, replace s in S with its most specific generalization that matches p;
    delete from S all hypotheses more general than others in S;
    delete from S all hypotheses that match any n ∊ N;
}

for each negative instance n {
    delete from S all hypotheses that match n;
    N = N ∪ {n};
}
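As a concrete illustration, here is a hedged sketch of the specific-to-general loop for conjunctive attribute-vector hypotheses (the same illustrative encoding as above, not the slides' predicate-calculus version). With this encoding the most specific generalization is unique, so S stays a singleton and the "more general than others in S" deletion is a no-op.

def covers(h, x):                     # same attribute-wise test as above
    return all(a == "?" or a == b for a, b in zip(h, x))

def min_generalization(s, p):
    # Most specific generalization of s that matches p: keep the attributes
    # that agree and widen the rest to '?'.
    return tuple(a if a == b else "?" for a, b in zip(s, p))

def specific_to_general(examples):
    # examples: list of (instance, is_positive) pairs, processed in arrival order.
    S, N = [], []                     # specific boundary, negatives seen so far
    for x, positive in examples:
        if positive:
            if not S:
                S = [x]               # first positive training instance
                continue
            S = [s if covers(s, x) else min_generalization(s, x) for s in S]
            S = [s for s in S if not any(covers(s, n) for n in N)]
        else:
            S = [s for s in S if not covers(s, x)]
            N.append(x)
    return S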

General to Specific Search

G = {most general concept in the concept space};
P = {};  // set of all positive instances seen so far

for each negative instance n {
    for every g ∊ G, if g matches n, replace g in G with its most general specialization that does not match n;
    delete from G all hypotheses more specific than others in G;
    delete from G all hypotheses that fail to match some p ∊ P;
}

for each positive instance p {
    delete from G all hypotheses that fail to match p;
    P = P ∪ {p};
}

Candidate Elimination Algorithm

G = {most general concept in the concept space};
S = {first positive training instance};

for each new positive instance p {
    delete from G all hypotheses that fail to match p;
    for every s ∊ S, if s does not match p, replace s in S with its most specific generalization that matches p;
    delete from S all hypotheses more general than others in S;
    delete from S all hypotheses more general than some hypothesis in G;
}

CAE (cont'd)

for each new negative instance n {
    delete from S all hypotheses that match n;
    for every g ∊ G, if g matches n, replace g in G with its most general specialization that does not match n;
    delete from G all hypotheses more specific than others in G;
    delete from G all hypotheses more specific than some hypothesis in S;
}

If G == S and both are singletons, the algorithm has found a single

concept that is consistent with the data and the algorithm halts.

If G and S become empty, there is no concept that satisfies the data.
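A minimal sketch of the whole candidate elimination loop under the same attribute-vector assumption; the attribute domains passed in are needed to compute minimal specializations of G and are not part of the slides' formulation.

def covers(h, x):
    # True if h matches x attribute by attribute ('?' matches anything).
    return all(a == "?" or a == b for a, b in zip(h, x))

def min_generalization(s, p):
    return tuple(a if a == b else "?" for a, b in zip(s, p))

def min_specializations(g, domains, n):
    # All minimal specializations of g that exclude the negative instance n:
    # fill one '?' with any domain value other than n's value there.
    out = []
    for i, a in enumerate(g):
        if a == "?":
            for v in domains[i]:
                if v != n[i]:
                    out.append(g[:i] + (v,) + g[i + 1:])
    return out

def candidate_elimination(examples, domains):
    G = [("?",) * len(domains)]          # most general concept in the space
    S = []                               # seeded by the first positive instance
    for x, positive in examples:
        if positive:
            if not S:
                S = [x]
                continue
            G = [g for g in G if covers(g, x)]
            S = [s if covers(s, x) else min_generalization(s, x) for s in S]
            S = [s for s in S if any(covers(g, s) for g in G)]      # keep S below G
        else:
            S = [s for s in S if not covers(s, x)]
            G = [h for g in G
                   for h in (min_specializations(g, domains, x) if covers(g, x) else [g])]
            if S:
                G = [g for g in G if any(covers(g, s) for s in S)]  # keep G above S
            G = [g for g in G                                       # drop G members more
                 if not any(g2 != g and covers(g2, g) for g2 in G)] # specific than others
    return S, G

# Example: learning "any shape, color red" over (shape, color) instances.
domains = [("round", "square"), ("red", "blue")]
examples = [(("round", "red"), True),
            (("square", "red"), True),
            (("round", "blue"), False)]
S, G = candidate_elimination(examples, domains)
# S == G == [("?", "red")]: the boundaries have converged to a single concept.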

Candidate Elimination Algorithm

G should always be a superset of S, and the concepts that lie between them satisfy the data.
Incremental in nature: it can process one training example at a time and form a usable, though incomplete, generalization.
Sensitive to noise and inconsistency in the training data.
Essentially breadth-first search; heuristics can be used to trim the search space.

LEX: Integrating Algebraic Exprs.

LEX (Mitchell et al.) integrates algebraic expressions by starting with an initial expression and searching the space of expressions until it finds an equivalent expression with no integral signs. The system induces heuristics that improve its performance, based on data obtained from its problem solver.

LEX (cont'd)

The operations are the rules of expression

transformation:

OP1: ∫ r f(x) dx -> r ∫ f(x) dx
OP2: ∫ u dv -> uv - ∫ v du
OP3: 1 * f(x) -> f(x)
OP4: ∫ (f1(x) + f2(x)) dx -> ∫ f1(x) dx + ∫ f2(x) dx

Heuristics

Heuristics are of the form:

If the current problem state matches P then apply

operator O with bindings B.

Example:

If a problem state matches ∫ x transcendental(x) dx,

then apply OP2 with bindings

u = x

dv = transcendental(x) dx

Symbol Hierarchy

There is a hierarchy of symbols and types: cos, trig,

transcendental, etc.

LEX Architecture

LEX consists of four components:

A generalizer that uses the candidate elimination algorithm to find heuristics,
A problem solver that produces traces of problem solutions,
A critic that produces positive and negative instances from the problem traces, and
A problem generator that produces new candidate problems.

How it works

LEX maintains a version space for each operator. Each version space represents the partially learned heuristic for that operator. The version space is updated from the positive and negative examples generated by the critic.

The problem solver builds a tree of the space searched in solving an integration problem. It does best-first search using the partial heuristics.

How it works (cont'd)

Deciding whether an instance is positive or negative is an example of the credit assignment problem. After solving a problem, LEX finds the shortest path from the initial expression to the solution. Operator applications on that shortest path are classified as positive instances, and those off the path are classified as negative. Since the search is not admissible, the path found may not actually be the shortest one.

ID3 Decision Tree Algorithm

A different approach to machine learning is to construct decision trees. At each node we test one property of the object and proceed to the appropriate child node, until reaching a leaf, at which point we can classify the object. We try to construct the best decision tree: the one with the fewest nodes (decisions). Here there may be many categories, not just positive and negative.

ID3

Problem: Classify a set of instances based on their values of given properties.

Symbolist
Supervised learning
Each instance is classified to a finite type
KR is the tree and the operations are tree creation
All instances must be known in advance (non-incremental)

Simple Tree Formation

Choose a property.
The property divides the set of examples into subsets according to their value of that property.
Recursively create a sub-tree for each subset.
Make all the sub-trees children of the root, which tests the given property.
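A minimal recursive sketch of this procedure, assuming examples are Python dicts of property values with the classification stored under a "class" key (an illustrative encoding); here properties are simply taken in the order given, and choosing the most informative property is deferred to the information-gain discussion below.

def build_tree(examples, properties):
    # examples: list of dicts mapping property names to values, plus "class";
    # properties: list of property names still available for testing.
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                  # all examples agree: make a leaf
        return classes.pop()
    if not properties:                     # no tests left: majority-class leaf
        labels = [e["class"] for e in examples]
        return max(set(labels), key=labels.count)
    prop, rest = properties[0], properties[1:]
    node = {"test": prop, "children": {}}
    for v in {e[prop] for e in examples}:  # one subtree per observed value
        subset = [e for e in examples if e[prop] == v]
        node["children"][v] = build_tree(subset, rest)
    return node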

Caveat

The tree that is formed depends heavily on the order in which the properties are chosen. The idea is to choose the most informative property first and use it to subdivide the space of examples. This leads to the best (smallest) tree.

Information Theory

The amount of information in a message (Shannon) is a function of the probability of occurrence p of each possible message, namely -log2(p). Given a universe of messages M = {m1, m2, ..., mn} and a probability p(mi) for the occurrence of each message, the expected information content of a message in M is:

I[M] = ∑i=1..n -p(mi) log2(p(mi)) = E[-log2 p(mi)]
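A small sketch of this formula (the helper name is illustrative). For a fair coin the two messages each have p = 0.5 and the expected information is 1 bit.

import math

def information(probs):
    # I[M] = sum over messages of -p(m) * log2(p(m)); terms with p = 0 contribute 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)

information([0.5, 0.5])      # fair coin: 1.0 bit
information([1.0])           # a certain message carries no information: 0.0 bits
information([0.25] * 4)      # four equally likely messages: 2.0 bits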

Choosing the Property

The information gain provided by choosing property A at the root of the tree is equal to the total information of the tree minus the amount of information needed to complete the classification of the tree. The amount of information needed to complete the tree is defined as the weighted average of the information in all its subtrees.

Choosing the Property (cont'd)

Assuming a set of training instances C, if we make property P with n values the root of the tree, then C will be partitioned into subsets {C1, C2, ..., Cn}. The expected value of the information needed to complete the tree is:

E[P] = ∑i=1..n (|Ci| / |C|) * I[Ci]

and the information gain from choosing property P is:

gain(P) = I[C] - E[P].
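A short sketch of E[P] and gain(P) under the same dict-based encoding used in the tree-formation sketch above (an illustrative assumption); the most informative property is then the one with the largest gain.

import math

def info(examples):
    # I[C]: expected information of the class distribution in examples.
    labels = [e["class"] for e in examples]
    probs = [labels.count(c) / len(labels) for c in set(labels)]
    return sum(-p * math.log2(p) for p in probs if p > 0)

def expected_info(examples, prop):
    # E[P]: weighted average of the information in the subsets induced by prop.
    total = 0.0
    for v in {e[prop] for e in examples}:
        subset = [e for e in examples if e[prop] == v]
        total += len(subset) / len(examples) * info(subset)
    return total

def gain(examples, prop):
    # gain(P) = I[C] - E[P]; ID3 puts the property with the largest gain at the root.
    return info(examples) - expected_info(examples, prop)

# best_property = max(properties, key=lambda p: gain(examples, p))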