CS 478 – Tools for Machine Learning and Data Mining
The Need for and Role of Bias
Learning
• Rote:
  – Until you discover the rule/concept(s), the very BEST you can ever expect to do is:
    • Remember what you observed
    • Guess on everything else
• Inductive:
  – What you do when you GENERALIZE from your observations and make (accurate) predictions
Claim: “All [most of] the laws of nature were discovered by inductive reasoning”
The Big Question
• All you have is what you have OBSERVED
• Your generalization should at least be consistent with those observations
• But beyond that…
– How do you know that your generalization is any good?
– How do you choose among various candidate generalizations?
The Answer Is
BIAS
Any basis for choosing one decision over another, other than strict consistency with past observations
Why Bias?
• If you have no bias, you cannot go beyond mere memorization
  – Mitchell’s proof using UGL and VS
• The power of a generalization system follows directly from its biases
• Progress towards understanding learning mechanisms depends upon understanding the sources of, and justification for, various biases
Concept Learning
Given:
• A language of observations/instances
• A language of concepts/generalizations
• A matching predicate
• A set of observations
Find generalizations that:
1. Are consistent with the observations, and
2. Classify instances beyond those observed
Claim
The absence of bias makes it impossible to solve part 2 of the Concept Learning problem, i.e.,
learning is limited to rote learning
Unbiased Generalization Language
• Generalization = set of instances it matches
• An Unbiased Generalization Language (UGL), relative to a given language of instances, allows describing every possible subset of instances
• UGL = power set of the given instance language
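As a quick illustrative sketch (Python, with hypothetical names): enumerating the power set of a small Boolean instance language shows that the UGL over N features has 2^(2^N) members.

```python
from itertools import combinations

def powerset(xs):
    """Every subset of xs: the Unbiased Generalization Language."""
    xs = list(xs)
    return [frozenset(c) for r in range(len(xs) + 1)
            for c in combinations(xs, r)]

N = 2
instances = [(a, b) for a in (0, 1) for b in (0, 1)]  # 2^N = 4 instances
ugl = powerset(instances)

print(len(ugl))  # 16 = 2^(2^N)
```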
Unbiased Generalization Procedure
• Uses the Unbiased Generalization Language
• Computes the Version Space (VS) relative to UGL
• VS = set of all expressible generalizations consistent with the training instances
Version Space (I)
• Let:
  – S be the set of maximally specific generalizations consistent with the training data
  – G be the set of maximally general generalizations consistent with the training data
Version Space (II)
• Intuitively:
  – S keeps generalizing to accommodate new positive instances
  – G keeps specializing to avoid new negative instances
• The key point is that each does so only to the smallest extent necessary to maintain consistency with the training data; that is, G remains as general as possible and S remains as specific as possible
Version Space (III)
• The sets S and G precisely delimit the version space (i.e., the set of all plausible versions of the emerging concept).
• A generalization g is in the version space represented by S and G if and only if:
  – g is more specific than or equal to some member of G, and
  – g is more general than or equal to some member of S
Version Space (IV)
• Initialize G to the most general concept in the space
• Initialize S to the first positive training instance
• For each new positive training instance p:
  – Delete all members of G that do not cover p
  – For each s in S:
    • If s does not cover p, replace s with its most specific generalizations that cover p
  – Remove from S any element more general than some other element in S
  – Remove from S any element not more specific than some element in G
• For each new negative training instance n:
  – Delete all members of S that cover n
  – For each g in G:
    • If g covers n, replace g with its most general specializations that do not cover n
  – Remove from G any element more specific than some other element in G
  – Remove from G any element more specific than some element in S
• If G = S and both are singletons:
  – A single concept consistent with the training data has been found
• If G and S become empty:
  – There is no concept consistent with the training data
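The update loop above can be sketched in Python for the conjunctive {0, 1, *} hypothesis language used later in these slides (a sketch with hypothetical helper names; in this language the minimal generalization of S is unique, so S is kept as a single hypothesis rather than a set).

```python
STAR = '*'

def covers(h, x):
    """h covers instance x if every constrained feature agrees."""
    return all(hv == STAR or hv == xv for hv, xv in zip(h, x))

def more_general(h1, h2):
    """True if h1 is more general than or equal to h2."""
    return all(a == STAR or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, n_features):
    G = {tuple([STAR] * n_features)}  # most general concept
    S = None                          # set by the first positive instance
    for x, positive in examples:
        x = tuple(x)
        if positive:
            G = {g for g in G if covers(g, x)}  # drop g not covering p
            if S is None:
                S = x
            elif not covers(S, x):
                # unique minimal generalization: relax exactly the
                # features where S and x disagree
                S = tuple(sv if sv == xv else STAR
                          for sv, xv in zip(S, x))
        else:
            if S is not None and covers(S, x):
                return set(), None  # no consistent conjunctive concept
            new_G = set()
            for g in G:
                if not covers(g, x):
                    new_G.add(g)
                    continue
                # most general specializations of g that exclude x
                for i, gv in enumerate(g):
                    if gv == STAR:
                        spec = list(g)
                        spec[i] = 1 - x[i]  # constrain one feature
                        spec = tuple(spec)
                        if S is None or more_general(spec, S):
                            new_G.add(spec)
            # prune members of G more specific than another member of G
            G = {g for g in new_G
                 if not any(h != g and more_general(h, g) for h in new_G)}
    return G, S

# Target concept: "first feature is 1" over 3 Boolean features
examples = [((1, 0, 1), True), ((1, 1, 0), True), ((0, 0, 1), False)]
G, S = candidate_elimination(examples, 3)
print(G, S)  # G and S converge to the single concept (1, '*', '*')
```

Here G = S = {(1, '*', '*')} after the three examples, illustrating the convergence case in the last bullet.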
Lemma 1
Any new instance, NI, is classified as positive if and only if NI is identical to some observed positive instance
Proof of Lemma 1
• (⇐) If NI is identical to some observed positive instance, then NI is classified as positive
  – Follows directly from the definition of VS
• (⇒) If NI is classified as positive, then NI is identical to some observed positive instance
  – Let g = {p : p is an observed positive instance}
    • UGL ⇒ g ∈ VS
    • NI matches all of VS ⇒ NI matches g ⇒ NI is an observed positive instance
Lemma 2
Any new instance, NI, is classified as negative if and only if NI is identical to some observed negative instance
Proof of Lemma 2
• (⇐) If NI is identical to some observed negative instance, then NI is classified as negative
  – Follows directly from the definition of VS
• (⇒) If NI is classified as negative, then NI is identical to some observed negative instance
  – Let G = {all subsets containing observed negative instances}
    • UGL ⇒ G ∪ VS = UGL
    • NI matches none in VS ⇒ every subset containing NI is in G; in particular {NI} is ⇒ NI was observed
Lemma 3
If NI is any instance which was not observed, then NI matches exactly one half of VS, and so cannot be classified
Proof of Lemma 3
• If NI was not observed, then NI matches exactly one half of VS, and so cannot be classified
  – Let g = {p : p is an observed positive instance}
  – Let G′ = {all subsets of unobserved instances}
    • UGL ⇒ VS = {g ∪ g′ : g′ ∈ G′}
    • NI was not observed ⇒ NI matches exactly ½ of G′ ⇒ NI matches exactly ½ of VS
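Lemmas 1–3 can be checked by brute force on a tiny Boolean instance space. A minimal sketch, assuming two features, one observed positive, and one observed negative (names are illustrative):

```python
from itertools import combinations

def powerset(xs):
    xs = list(xs)
    return [frozenset(c) for r in range(len(xs) + 1)
            for c in combinations(xs, r)]

# Two Boolean features -> four instances
instances = [(a, b) for a in (0, 1) for b in (0, 1)]
positives = {(1, 1)}
negatives = {(0, 0)}

# VS under the UGL: every subset that contains all observed positives
# and none of the observed negatives
vs = [g for g in powerset(instances)
      if positives <= g and not (negatives & g)]

for ni in instances:
    matched = sum(ni in g for g in vs)
    if matched == len(vs):
        print(ni, 'positive')        # Lemma 1: observed positives only
    elif matched == 0:
        print(ni, 'negative')        # Lemma 2: observed negatives only
    else:
        print(ni, 'unclassifiable')  # Lemma 3: matches half of VS
```

The two unobserved instances, (0, 1) and (1, 0), each match exactly 2 of the 4 members of VS, so neither can be classified.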
Theorem
An unbiased generalization procedure can never make the inductive leap necessary to classify instances beyond those it has observed
Proof of the Theorem
• The result follows immediately from Lemmas 1, 2 and 3
• Practical consequence: If a learning system is to be useful, it must have some form of bias
Sources of Bias in Learning
• The representation language cannot express all possible classes of observations
• The generalization procedure is biased
  – Domain knowledge (e.g., double bonds rarely break)
  – Intended use (e.g., ICU – relative cost)
  – Shared assumptions (e.g., crown, bridge – dentistry)
  – Simplicity and generality (e.g., white men can’t jump)
  – Analogy (e.g., heat vs. water flow, thin ice)
  – Commonsense (e.g., social interactions, pain, etc.)
Fall 2004 CS 478 - Machine Learning
Representation Language
• Decrease the number of expressible generalizations ⇒ Increase the ability to make the inductive leap
• Example: Restrict generalizations to conjunctive constraints on features in a Boolean domain
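To illustrate the restriction, here is a small sketch (hypothetical names) that enumerates the {0, 1, *} conjunctive hypotheses for N = 2 and counts how many of the 2^(2^N) − 1 nonempty subsets of instances they can express:

```python
from itertools import product

N = 2
STAR = '*'
instances = list(product((0, 1), repeat=N))         # 2^N instances
hypotheses = list(product((0, 1, STAR), repeat=N))  # 3^N conjunctions

def covers(h, x):
    """Matching predicate: '*' matches anything, 0/1 must agree."""
    return all(hv == STAR or hv == xv for hv, xv in zip(h, x))

expressible = {frozenset(x for x in instances if covers(h, x))
               for h in hypotheses}

print(len(hypotheses))        # 9 hypotheses (= 3^N)
print(len(expressible))       # 9 distinct expressible subsets
print(2 ** (2 ** N) - 1)      # 15 nonempty subsets in total
```

Even at N = 2, only 9 of the 15 nonempty subsets are expressible; the gap widens dramatically as N grows, as the next slides quantify.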
Proof of Concept (I)
• Let N = number of features ⇒ there are 2^(2^N) subsets of instances
• Let GL = {0, 1, *} ⇒ GL can only denote subsets of size 2^p, for 0 ≤ p ≤ N
Proof of Concept (II)
• For each p, there are only C(N, N−p) · 2^(N−p) expressible subsets of size 2^p:
  – Fix N−p features (there are C(N, N−p) ways of choosing which)
  – Set values for the selected features (there are 2^(N−p) possible settings)
Proof of Concept (III)
• Excluding the empty set, the ratio of expressible to total generalizations is given by:

  [ Σ_{p=0}^{N} C(N, N−p) · 2^(N−p) ] / (2^(2^N) − 1) = 3^N / (2^(2^N) − 1)
Proof of Concept (IV)
• For example, if N = 5, then only about 1 in 10^7 subsets may be represented ⇒ strong bias
• Double-edged sword: the representation could be too sparse
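The N = 5 figure can be checked directly with the counting argument from the previous slides:

```python
from math import comb

N = 5
# Expressible subsets: sum over p of C(N, N-p) * 2^(N-p), which equals 3^N
expressible = sum(comb(N, N - p) * 2 ** (N - p) for p in range(N + 1))
total = 2 ** (2 ** N) - 1  # nonempty subsets of the 2^N instances

print(expressible)          # 243 (= 3^5)
print(total)                # 4294967295 (= 2^32 - 1)
print(expressible / total)  # ~5.7e-8, i.e., roughly 1 in 10^7
```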
Generalization Procedure
• Domain knowledge (e.g., double bonds rarely break)
• Intended use (e.g., ICU – relative cost)
• Shared assumptions (e.g., crown, bridge – dentistry)
• Simplicity and generality (e.g., white men can’t jump)
• Analogy (e.g., heat vs. water flow, thin ice)
• Commonsense (e.g., social interactions, pain, etc.)
Conclusion
• Absence of bias = rote learning
• Efforts should focus on the combined use of prior knowledge and observations in guiding the learning process
• Make biases and their use as explicit as observations and their use