CS 484 – Artificial Intelligence
Announcements
• List of 5 sources for research paper
• Homework 5 due Tuesday, October 30
• Book Review due Tuesday, October 30
Classification problems and Machine Learning
Lecture 10
EnjoySport concept learning task
• Given
  • Instances X: Possible days, each described by the attributes
    • Sky (with possible values Sunny, Cloudy, and Rainy)
    • AirTemp (with values Warm and Cold)
    • Humidity (with values Normal and High)
    • Wind (with values Strong and Weak)
    • Water (with values Warm and Cool), and
    • Forecast (with values Same and Change)
  • Hypotheses H: Each hypothesis is described by a conjunction of constraints on the attributes. Each constraint may be "?" (any value), "Ø" (no value), or a specific value
  • Target concept c: EnjoySport : X → {0,1}
  • Training examples D: Positive and negative examples of the target function
• Determine
  • A hypothesis h in H such that h(x) = c(x) for all x in X
Find-S: Finding a Maximally Specific Hypothesis (review)
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   • For each attribute constraint ai in h
     • If the constraint ai is satisfied by x, then do nothing
     • Else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
• Begin: h ← <Ø, Ø, Ø, Ø, Ø, Ø>
(A runnable sketch of Find-S follows the training table below.)
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes
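To make the procedure concrete, here is a minimal Python sketch of Find-S for this task (the tuple encoding, with None standing in for Ø and '?' for the fully general constraint, and the function names are my own):

# Find-S sketch: hypotheses are 6-tuples of attribute constraints.
# None plays the role of the maximally specific constraint Ø; '?' matches any value.

def generalize(h, x):
    """Minimally generalize hypothesis h so that it covers instance x."""
    return tuple(xi if hi is None else (hi if hi == xi else '?')
                 for hi, xi in zip(h, x))

def find_s(examples):
    h = (None,) * 6                    # start with <Ø, Ø, Ø, Ø, Ø, Ø>
    for x, positive in examples:
        if positive:                   # Find-S ignores negative examples
            h = generalize(h, x)
    return h

# The EnjoySport training data from the table above.
data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True),
]

print(find_s(data))   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')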
Candidate Elimination
• Candidate Elimination finds every hypothesis consistent with the training data (including the negative examples), representing this version space by its boundary sets S (most specific) and G (most general).
The final version space for the EnjoySport examples, bounded below by S and above by G (the three hypotheses in between also belong to the version space):

S: {<Sunny, Warm, ?, Strong, ?, ?>}
     <Sunny, ?, ?, Strong, ?, ?>   <Sunny, Warm, ?, ?, ?, ?>   <?, Warm, ?, Strong, ?, ?>
G: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
Candidate-Elimination Learning Algorithm
• Initialize G to the set of maximally general hypotheses in H
• Initialize S to the set of maximally specific hypotheses in H
• For each training example d, do
  • If d is a positive example
    • Remove from G any hypothesis inconsistent with d
    • For each hypothesis s in S that is not consistent with d
      • Remove s from S
      • Add to S all minimal generalizations h of s such that h is consistent with d, and some member of G is more general than h
    • Remove from S any hypothesis that is more general than another hypothesis in S
  • If d is a negative example
    • Remove from S any hypothesis inconsistent with d
    • For each hypothesis g in G that is not consistent with d
      • Remove g from G
      • Add to G all minimal specializations h of g such that h is consistent with d, and some member of S is more specific than h
    • Remove from G any hypothesis that is less general than another hypothesis in G
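The same algorithm as a runnable Python sketch, specialized to conjunctive hypotheses like EnjoySport's (the tuple encoding and helper names are my own; this is an illustration rather than a general-purpose implementation):

# Hypotheses are tuples; '?' matches any value, None (standing in for Ø) matches none.

def matches(h, x):
    return all(a == '?' or a == b for a, b in zip(h, x))

def more_general(h1, h2):
    """True if h1 covers every instance that h2 covers (h1 is at least as general)."""
    return all(a == '?' or a == b or b is None for a, b in zip(h1, h2))

def generalize(s, x):
    """The minimal generalization of s that covers x."""
    return tuple(b if a is None else (a if a == b else '?') for a, b in zip(s, x))

def specializations(g, x, domains):
    """The minimal specializations of g that exclude x."""
    for i, values in enumerate(domains):
        if g[i] == '?':
            for v in values:
                if v != x[i]:
                    yield g[:i] + (v,) + g[i + 1:]

def candidate_elimination(examples, domains):
    S = {(None,) * len(domains)}           # maximally specific boundary
    G = {('?',) * len(domains)}            # maximally general boundary
    for x, positive in examples:
        if positive:
            G = {g for g in G if matches(g, x)}
            S = {generalize(s, x) for s in S}
            S = {s for s in S if any(more_general(g, s) for g in G)}
            S = {s for s in S if not any(s != t and more_general(s, t) for t in S)}
        else:
            S = {s for s in S if not matches(s, x)}
            G_kept = {g for g in G if not matches(g, x)}
            for g in G:
                if matches(g, x):
                    for h in specializations(g, x, domains):
                        if any(more_general(h, s) for s in S):
                            G_kept.add(h)
            G = {g for g in G_kept if not any(g != h and more_general(h, g) for h in G_kept)}
    return S, G

# Attribute domains for EnjoySport (Sky, AirTemp, Humidity, Wind, Water, Forecast).
domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]
data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True),
]
S, G = candidate_elimination(data, domains)
# S: {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
# G: {('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')}

Feeding the examples in one at a time reproduces the step-by-step trace on the following slides.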
Example
• G0 ← {<?, ?, ?, ?, ?, ?>}
• G1 ← {<?, ?, ?, ?, ?, ?>} (unchanged: the positive E1 is consistent with G0)
• G2 ← {<?, ?, ?, ?, ?, ?>} (unchanged)
• S0 ← {<Ø, Ø, Ø, Ø, Ø, Ø>}
• S1 ← {<Sunny, Warm, Normal, Strong, Warm, Same>}
• S2 ← {<Sunny, Warm, ?, Strong, Warm, Same>}
E1 = <Sunny, Warm, Normal, Strong, Warm, Same> positive
E2 = <Sunny, Warm, High, Strong, Warm, Same> positive
Example (cont. 2)
• G2 ← {<?, ?, ?, ?, ?, ?>}
• G3 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
• S2 ← {<Sunny, Warm, ?, Strong, Warm, Same>}
• S3 ← {<Sunny, Warm, ?, Strong, Warm, Same>} (unchanged: S2 is already consistent with the negative example E3)
E3 = <Rainy, Cold, High, Strong, Warm, Change> negative
Example (cont. 3)
• G3 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
• G4 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>} (the hypothesis <?, ?, ?, ?, ?, Same> is dropped because it is inconsistent with the positive example E4)
• S3 ← {<Sunny, Warm, ?, Strong, Warm, Same>}
• S4 ← {<Sunny, Warm, ?, Strong, ?, ?>}
E4 = <Sunny, Warm, High, Strong, Cool, Change> positive
Decision Tree Learning
• Has two major benefits over Find-S and Candidate Elimination
  • Can cope with noisy data
  • Is capable of learning disjunctive expressions
• Limitations
  • There may be many valid decision trees for a given set of training data
  • It prefers small trees over large trees
• Applies to a broad range of learning tasks
  • Classify medical patients by their disease
  • Classify equipment malfunctions by their cause
  • Classify loan applicants by their likelihood of defaulting on payments
Decision Tree Example
A decision tree for "days on which to play tennis":

Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes
Decision Tree Induction (1)
• Decision tree induction builds, from a set of training data, a decision tree that correctly classifies that data.
• ID3 is an example of a decision tree learning algorithm.
• ID3 builds the decision tree from the top down, selecting the features from the training data that provide the most information at each stage.
Decision Tree Induction (2)
• ID3 selects attributes based on information gain.
• Information gain is the reduction in entropy caused by a decision.
• Entropy is defined as:
  H(S) = - p1 log2 p1 - p0 log2 p0
  • p1 is the proportion of the training data which are positive examples
  • p0 is the proportion which are negative examples
• Intuition about H(S)
  • Zero (minimum value) when all the examples are the same (all positive or all negative)
  • One (maximum value) when half are positive and half are negative
Example – Training Data

Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Calculate Information Gain
• Initial entropy
  • Treat all 14 examples as one collection: 9 positive examples, 5 negative examples
  • H(init) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = -.643 log2 .643 - .357 log2 .357 = 0.940
• Calculate the entropy for each value of an attribute, then combine them as a weighted sum
• Entropy of "Outlook"
  • Sunny
    • 5 examples, 2 positives, 3 negatives
    • H(Sunny) = -(2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971
  • Overcast
    • 4 examples, 4 positives, 0 negatives
    • H(Overcast) = -1 log2 (1) - 0 log2 (0) = 0 (0 log2 0 is defined as 0)
  • Rain
    • 5 examples, 3 positives, 2 negatives
    • H(Rain) = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.971
• H(Outlook) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = .357(0.971) + .286(0) + .357(0.971) = 0.694
• Information Gain = H(init) - H(Outlook) = 0.940 - 0.694 = 0.246
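The same computation in Python (the per-value counts are read off the training table above):

import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

# (positives, negatives) within each value of Outlook
branches = {'Sunny': (2, 3), 'Overcast': (4, 0), 'Rain': (3, 2)}
total = 14

h_init = entropy(9, 5)                                     # 0.940
h_outlook = sum((p + n) / total * entropy(p, n)
                for p, n in branches.values())             # 0.694
print(h_init - h_outlook)   # 0.2467...; the slide's 0.246 comes from rounded intermediates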
Maximize Information Gain
• Gain of each attribute
  • Gain(Outlook) = 0.246
  • Gain(Humidity) = 0.151
  • Gain(Wind) = 0.048
  • Gain(Temperature) = 0.029
Outlook is chosen as the root, partitioning the full set {D1, D2, …, D14} [9+,5-]:

Outlook
├─ Sunny → ?        {D1, D2, D8, D9, D11} [2+,3-]
├─ Overcast → Yes   {D3, D7, D12, D13} [4+,0-]
└─ Rain → ?         {D4, D5, D6, D10, D14} [3+,2-]
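Putting the pieces together, here is a compact recursive ID3 in Python (a sketch under the usual textbook simplifications: categorical attributes, no missing values, no pruning; the nested-dictionary tree format is my own choice). Run on the PlayTennis table, it reproduces the tree shown earlier:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                  # pure node: all examples agree
        return labels[0]
    if not attrs:                              # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                               # information gain of splitting on a
        g = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [lab for r, lab in zip(rows, labels) if r[a] == v]
            g -= len(sub) / len(labels) * entropy(sub)
        return g

    best = max(attrs, key=gain)                # ID3's greedy attribute choice
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [lab for r, lab in zip(rows, labels) if r[best] == v],
                          rest)
                   for v in set(r[best] for r in rows)}}

attrs = ['Outlook', 'Temperature', 'Humidity', 'Wind']
data = [  # the 14 PlayTennis examples: attribute values, then the label
    ('Sunny', 'Hot', 'High', 'Weak', 'No'),          ('Sunny', 'Hot', 'High', 'Strong', 'No'),
    ('Overcast', 'Hot', 'High', 'Weak', 'Yes'),      ('Rain', 'Mild', 'High', 'Weak', 'Yes'),
    ('Rain', 'Cool', 'Normal', 'Weak', 'Yes'),       ('Rain', 'Cool', 'Normal', 'Strong', 'No'),
    ('Overcast', 'Cool', 'Normal', 'Strong', 'Yes'), ('Sunny', 'Mild', 'High', 'Weak', 'No'),
    ('Sunny', 'Cool', 'Normal', 'Weak', 'Yes'),      ('Rain', 'Mild', 'Normal', 'Weak', 'Yes'),
    ('Sunny', 'Mild', 'Normal', 'Strong', 'Yes'),    ('Overcast', 'Mild', 'High', 'Strong', 'Yes'),
    ('Overcast', 'Hot', 'Normal', 'Weak', 'Yes'),    ('Rain', 'Mild', 'High', 'Strong', 'No'),
]
rows = [dict(zip(attrs, d[:4])) for d in data]
labels = [d[4] for d in data]
print(id3(rows, labels, attrs))
# Structure of the printed tree:
# {'Outlook': {'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#              'Overcast': 'Yes',
#              'Rain': {'Wind': {'Weak': 'Yes', 'Strong': 'No'}}}}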
Unbiased Learner
• Provide a hypothesis space capable of representing every teachable concept
  • Every possible subset of the instances X (the power set of X)
• How large is this space?
  • For EnjoySport, there are 3 * 2 * 2 * 2 * 2 * 2 = 96 instances in X
  • The power set has 2^|X| elements, so EnjoySport has 2^96, roughly 10^28, distinct target concepts
• This space allows disjunctions, conjunctions, and negations
• But the learner can no longer generalize beyond the observed examples
Inductive Bias
• All learning methods have an inductive bias.
• The inductive bias of a learning method is the set of assumptions (restrictions) the method uses to justify generalizing beyond the training data.
• Without an inductive bias, a learning method could not generalize:
  • A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances.
Bias in Learning Algorithms
• Rote-Learner: If the instance is found in memory, the stored classification is returned; otherwise the system refuses to classify the new instance. (It has no inductive bias: nothing beyond the stored examples is assumed.)
• Find-S: Finds the most specific hypothesis consistent with the training examples, then uses this hypothesis to classify all subsequent instances. (Its bias: the target concept is contained in H, and it is the most specific consistent hypothesis.)
Candidate-Elimination Bias
• Candidate-Elimination will converge to the true target concept provided it is given accurate training examples and its initial hypothesis space contains the true target concept
  • It considers only conjunctions of attribute values
  • It cannot represent "Sky = Sunny or Sky = Cloudy"
• What if the target concept is not contained in the hypothesis space? Consider the following examples, whose target concept is "Sky = Sunny or Sky = Cloudy":
Example Sky AirTemp Humidity Wind Water Forecast EnjoySport
1 Sunny Warm Normal Strong Cool Change Yes
2 Cloudy Warm Normal Strong Cool Change Yes
3 Rainy Warm Normal Strong Cool Change No
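Working through the algorithm shows the failure concretely: the most specific conjunction covering examples 1 and 2 is <?, Warm, Normal, Strong, Cool, Change>, and this hypothesis also covers example 3, which is negative. No conjunctive hypothesis is consistent with all three examples, so the version space collapses to the empty set.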
Bias of ID3
• ID3 chooses the first acceptable tree it encounters in its simple-to-complex, hill-climbing search
  • Favors shorter trees over longer ones
  • Selects trees that place the attributes with the highest information gain closest to the root
• The interaction between the attribute-selection heuristic and the training examples makes it difficult to characterize ID3's bias precisely
ID3 vs. Candidate Elimination
• The two differ in the type of inductive bias
• Hypothesis space
  • ID3 searches a complete hypothesis space, but searches it incompletely
    • Its inductive bias is a consequence of the ordering of hypotheses imposed by its search strategy
  • Candidate-Elimination searches an incomplete hypothesis space, but searches that space completely
    • Its inductive bias is a consequence of the expressive power of its hypothesis representation
Why Prefer Short Hypotheses?
• Occam's razor
  • Prefer the simplest hypothesis that fits the data
• Applying Occam's razor
  • There are fewer short hypotheses than long ones, so it is less likely that a short hypothesis will coincidentally fit the training data
  • A 5-node tree that fits the data is less likely to be a statistical coincidence than a 500-node tree that fits it, so we prefer the 5-node hypothesis
• Problems with this argument
  • By the same argument, you could single out trees with any other rare combination of qualifications and prefer those. Would that be better?
  • Size is determined by the particular representation used internally by the learner
• Don't reject Occam's razor altogether
  • Evolution will create internal representations that make the learning algorithm's inductive bias a self-fulfilling prophecy, simply because it can alter the representation more easily than it can alter the learning algorithm
The Problem of Overfitting
Black dots represent positive examples, white dots negative.
The two lines represent two different hypotheses.
In the first diagram, there are just a few items of training data, correctly classified by the hypothesis represented by the darker line.
In the second and third diagrams we see the complete set of data: the simpler hypothesis, which matched the training data less well, fits the remaining data better than the more complex hypothesis, which overfits the training data.
The Nearest Neighbor Algorithm (1)
• This is an example of instance-based learning.
• Instance-based learning involves storing training data and using it to attempt to classify new data as it arrives.
• The nearest neighbor algorithm works with data that consists of vectors of numeric attributes.
• Each vector represents a point in n-dimensional space.
The Nearest Neighbor Algorithm (2)
• When an unseen data item is to be classified, the Euclidean distance is calculated between this item and all of the training data.
  • The distance between <x1, y1> and <x2, y2> is:
    d = sqrt((x1 - x2)^2 + (y1 - y2)^2)
• The classification for the unseen data is usually the class that is most common among its few nearest neighbors.
• Shepard's method instead allows all of the training data to contribute to the classification, with each contribution weighted in inverse proportion to its distance from the data item being classified.
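A small Python sketch of both variants (the function names and toy data points are mine, purely illustrative):

import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training, k=3):
    """Majority vote among the k training points nearest to the query.
    training is a list of (vector, label) pairs."""
    nearest = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def shepard_classify(query, training):
    """Distance-weighted variant: every training point votes,
    weighted by the inverse of its distance to the query."""
    votes = Counter()
    for vec, label in training:
        d = euclidean(vec, query)
        if d == 0:
            return label               # exact match: return its label outright
        votes[label] += 1.0 / d
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points with two classes.
train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((4.0, 4.2), 'B'), ((3.8, 4.0), 'B')]
print(knn_classify((1.1, 1.0), train))      # 'A'
print(shepard_classify((3.9, 4.1), train))  # 'B'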