CS546: Machine Learning and Natural Language, Lecture 7: Introduction to Classification (2009)
1
CS546: Machine Learning and Natural Language
Lecture 7: Introduction to Classification
2009
2
Class Presentations:
You do not need to (and in fact, cannot) present the whole paper.
Focus on what’s new, what’s interesting, what’s unique. Don’t re-do material presented in class; assume it’s known and mention it in passing if needed.
Notes
3
- Illinois’ bored of education [board]
- Nissan Car and truck plant; plant and animal kingdom
- (This Art) (can N) (will MD) (rust V); V, N, N
- The dog bit the kid. He was taken to a veterinarian; a hospital
- Tiger was in Washington for the PGA Tour: Finance; Banking; World News; Sports
- Important or not important; love or hate
Classification: Ambiguity Resolution
4
The goal is to learn a function f: X → Y that maps observations in a domain to one of several categories.
Task: Decide which of {board, bored} is more likely in the given context:
X: some representation of "The Illinois’ _______ of education met yesterday…"; Y: {board, bored}
Typical learning protocol: Observe a collection of labeled examples (x, y) ∈ X × Y. Use it to learn a function f: X → Y that is consistent with the observed examples and (hopefully) performs well on new, previously unobserved examples.
Classification
5
Theoretically: Generalization
- It is possible to say something rigorous about the future behavior of learners.
- Good understanding of the issues that affect the quality of generalization, e.g., how many examples one needs to see in order to guarantee good behavior on previously unobserved examples.
Algorithmically: good learning algorithms for linear representations.
- Can deal with very high dimensionality (10^6 features).
- Very efficient in terms of computation and number of examples. On-line.
- Understanding that many algorithms behave about the same.
Key issues remaining:
- Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking.
- What are the features? No good theoretical understanding here.
- Building systems that make use of multiple, possibly dependent classifiers.
Classification is Well Understood
6
Classification Model and Justification (discriminative model)
Probabilistic Models: relations to discriminative models; justification
Algorithms: Linear Classification (Perceptron, SVM, etc.); Max Entropy; Features and Kernels
Learning Protocols: how to get supervision (semi-supervised; co-training; co-ranking)
Multi-class Classification, Structured Prediction, etc.: as a generalization of Boolean classification models; as a way to understand structured learning and inference
Global Models and Inference: joint training, modular training
Coarse Plan (not Chronological)
7
- Illinois’ bored of education. [board]
- We took a walk it the park two. [in, too]
- We fill it need no be this way [feel, not]
- The amount of chairs in the room is… [number]
- I’d like a peace of cake for desert [piece, dessert]
Context Sensitive Text Correction
8
Disambiguation Problems
Middle Eastern ____ are known for their sweetness.
Task: Decide which of {deserts, desserts} is more likely in the given context.
Ambiguity is modeled as confusion sets (class labels C):
C = {deserts, desserts}
C = {Noun, Adj, …, Verb, …}
C = {topic=Finance, topic=Computing}
C = {NE=Person, NE=Location}
9
Disambiguation Problems
- Archetypical disambiguation problem
- Data is available
- In principle, a solved problem (Golding & Roth, Mangu & Brill, …)
- But many issues are involved in making an "in principle" solution a realistic one
10
Learning to Disambiguate
Given a confusion set C = {deserts, desserts} and a sentence s: "Middle Eastern ____ are known for their sweetness"
- Map s into a feature-based representation
- Learn a function F_C that determines which element of C = {deserts, desserts} is more likely in a given context
- Evaluate the function on future C-sentences
11
S = I don’t know whether to laugh or cry [target window marked]
Consider words, POS tags, and relative location in a window around the target word. Generate binary features representing the presence of:
- a word/POS within a window around the target word: don’t within +/-3; know within +/-3; Verb at -1; to within +/-3; laugh within +/-3; to at +1
- conjunctions of size 2, within a window of size 3: words: know __ to; __ to laugh; POS+words: Verb __ to; __ to Verb
(A sketch of this feature extraction appears below.)
Learning Approach: Representation
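A minimal sketch of this kind of window-based feature extraction (the helper name and feature-string formats are illustrative; this is not the FEX tool used later in the course):

```python
# Sketch: binary window features plus a size-2 conjunction for a target position.
# Input is a list of (word, pos) pairs; feature names are illustrative only.

def window_features(tagged, target, k=3):
    """tagged: list of (word, pos); target: index of the confusion-set slot."""
    feats = set()
    lo, hi = max(0, target - k), min(len(tagged), target + k + 1)
    for i in range(lo, hi):
        if i == target:
            continue
        w, p = tagged[i]
        feats.add(f"word={w}_within+-{k}")   # e.g. "word=know_within+-3"
        feats.add(f"pos={p}_at{i - target}")  # e.g. "pos=Verb_at-1"
    # conjunction of the two words immediately around the target
    if target - 1 >= 0 and target + 1 < len(tagged):
        feats.add(f"conj={tagged[target - 1][0]}__{tagged[target + 1][0]}")
    return feats

s = [("I", "Pro"), ("don't", "Verb"), ("know", "Verb"), ("whether", "Conj"),
     ("to", "To"), ("laugh", "Verb"), ("or", "Conj"), ("cry", "Verb")]
print(sorted(window_features(s, target=3)))
```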
12
S = I don’t know whether to laugh or cry is represented as the set of its active features:
S = (don’t at -2, know within +/-3, …, __ to Verb, …)
Label = the confusion set element that occurs in the text.
Hope: S = I don’t care whether to laugh or cry has almost the same representation.
This representation can be used by any propositional learning algorithm (features, examples).
Previous works: TBL (Decision Lists), NB, SNoW, DT, …
Learning Approach: Representation
13
There is a huge number of potential features (~10^5). Out of these, only a small number is actually active in each example.
The representation can be significantly smaller if we list only the features that are active in each example.
Some algorithms can take this into account; some cannot (more on this later).
Notes on Representation
14
Formally: a feature is a characteristic function over sentences, χ: S → {0,1}.
When the number of features n is fixed, each example is a point in {0,1}^n: x = (χ_1(s), χ_2(s), …, χ_n(s)).
When we do not want to fix the number of features (very large number, on-line algorithms, …) we can work in the infinite attribute domain: x = (χ_1(s), χ_2(s), …, χ_n(s), …) ∈ {0,1}^∞.
Notes on Representation (2)
15
Consider all the training data S: {(l, f, f, …)}, i.e., each example is a label l with its set of active features.
Represent S as: S = {(f, #(l=0), #(l=1))} for all features f.
1. Choose the best feature f* (and the label it suggests).
2. S ← S \ {examples covered in step (1)}.
3. Go to 1.
An Algorithm
"best" can be defined in multiple ways, e.g., sort features by P(label | feature). (A sketch of this learner appears below.)
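A minimal sketch of this greedy decision-list learner, assuming each example is a (label, set of active features) pair; here "best" is taken to be the most frequent (feature, label) pair on the remaining data, whereas the slide suggests sorting by P(label | feature):

```python
from collections import Counter

def learn_decision_list(examples):
    """examples: list of (label, frozenset_of_active_features).
    Returns an ordered list of (feature, label) rules."""
    examples = list(examples)
    rules = []
    while examples:
        # count (feature, label) co-occurrences on the remaining data
        counts = Counter((f, l) for l, feats in examples for f in feats)
        if not counts:
            break
        (f_star, label), _ = counts.most_common(1)[0]
        rules.append((f_star, label))
        # remove the examples this rule fires on
        examples = [(l, feats) for l, feats in examples if f_star not in feats]
    return rules

data = [(1, frozenset({"know_+-3", "__to_Verb"})),
        (0, frozenset({"sunny_+-3"})),
        (1, frozenset({"laugh_+-3", "__to_Verb"}))]
print(learn_decision_list(data))
```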
16
If f1, then label; else, if f2, then label; else …; else, default label.
A decision list
Issues: How well will this do? We train on the training data, what about new data?
An Algorithm: Hypothesis
17
- I saw the girl it the park
- The from needs to be completed
- I maybe there tomorrow
These are new sentences you have not seen before. Can you recognize and correct them this time?
Intuitively, there are some regularities in the language, “identified” from previous examples, which can be utilized on future examples.
Two technical ways to formalize this intuition
Generalization
18
Model the problem of text correction as a problem of learning from examples.
Goal: learn directly how to make predictions.
PARADIGM: Look at many examples. Discover some regularities in the data. Use these to construct a prediction policy. A policy (a function, a predictor) needs to be specific.
[it/in] rule: if "the" occurs after the target, predict "in" (in most cases, it won’t be that simple, though).
1: Direct Learning (Distribution-Free)
Non-Parametric Induction; Empirical Risk Minimization; Induction principle: Large deviation probabilistic bounds.
19
Model the problem of text correction as that of generating correct sentences.
Goal: learn a model of the language; use it to predict.
PARADIGM Learn a probability distribution over all sentences
Use it to estimate which sentence is more likely: Pr(I saw the girl it the park) vs. Pr(I saw the girl in the park).
[In the same paradigm we sometimes learn a conditional probability distribution.]
In practice: make assumptions on the distribution’s type
In practice: a decision policy depends on the assumptions
2: Generative Model (or Prob models in general)
Parametric Inference: Induction principle: Maximum Likelihood
20
Model 1: There are 5 characters, A, B, C, D, E. At any point the model can generate any of them, according to:
P(A) = 0.3; P(B) = 0.1; P(C) = 0.2; P(D) = 0.2; P(E) = 0.1; P(END) = 0.1
A sentence in the language: AAACCCDEABB. A less likely sentence: DEEEEBBBBEEEEBBBBEEE.
Instantiating the Generative Model:
- Given the model, we can compute the probability of a sentence, decide which sentence is more likely, and predict the next character.
- Given a family of models, choose a specific model.
Example: Model of Language
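A minimal sketch of Model 1, computing the (log-)probability the model assigns to a string, with the probabilities copied from the slide:

```python
import math

# Character probabilities from Model 1; END is the stop symbol.
P = {"A": 0.3, "B": 0.1, "C": 0.2, "D": 0.2, "E": 0.1, "END": 0.1}

def sentence_logprob(sentence):
    """Log-probability of generating `sentence` character by character, then stopping."""
    logp = sum(math.log(P[ch]) for ch in sentence)
    return logp + math.log(P["END"])

print(sentence_logprob("AAACCCDEABB"))           # the "likely" sentence
print(sentence_logprob("DEEEEBBBBEEEEBBBBEEE"))  # the less likely one
```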
21
Model 2: A probabilistic finite state model.
Start: Ps(A)=0.4; Ps(B)=0.4; Ps(C)=0.2
From A: PA(A)=0.5; PA(B)=0.3; PA(C)=0.1; PA(S)=0.1
From B: PB(A)=0.1; PB(B)=0.4; PB(C)=0.4; PB(S)=0.1
From C: PC(A)=0.3; PC(B)=0.4; PC(C)=0.2 ; PC(S)=0.1
Practical issues:
- What is the space over which we define the model? Characters? Words? Ideas?
- How do we acquire the model? Estimation; smoothing.
Example: Model of Language
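A minimal sketch of Model 2 as a first-order Markov (probabilistic finite state) model, using the transition probabilities from the slide and reading "S" as the stop event (an assumption):

```python
import math

start = {"A": 0.4, "B": 0.4, "C": 0.2}
trans = {
    "A": {"A": 0.5, "B": 0.3, "C": 0.1, "S": 0.1},
    "B": {"A": 0.1, "B": 0.4, "C": 0.4, "S": 0.1},
    "C": {"A": 0.3, "B": 0.4, "C": 0.2, "S": 0.1},
}

def sentence_logprob(sentence):
    """Log-probability of a string over {A,B,C}: start, transitions, then stop."""
    logp = math.log(start[sentence[0]])
    for prev, cur in zip(sentence, sentence[1:]):
        logp += math.log(trans[prev][cur])
    return logp + math.log(trans[sentence[-1]]["S"])

print(sentence_logprob("AAB"))  # 0.4 * 0.5 * 0.3 * 0.1
```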
22
The difference is not along probabilistic/deterministic or statistical/symbolic lines. Both paradigms can do both.
The difference is in the basic assumptions underlying the paradigms, and why they work.
1st: Distribution Free: uncover regularities in the past; hope they will be there in the future.
2nd: Know the (type of) probabilistic model of the language (target phenomenon). Use it.
Distr.-Free vs. Probabilistic/Generative: major philosophical debate in learning. Interesting computational issues too.
Learning Paradigms: Comments
23
Goal: discover some regularities from examples and generalize to previously unseen data.
What are the examples we learn from? Instance Space X: the space of all examples, X = {0,1}^n or {0,1}^∞.
How do we represent our hypothesis? Hypothesis Space H: the space of potential functions h: X → {0,1}.
Goal: given training data S ⊆ X, find a good h ∈ H.
Distribution Free Learning: Formalism
24
Learning is impossible, unless…
The outcome of learning cannot be trusted, unless…
How can we quantify the expected generalization?
Assume h is good on the training data; what can be said about h’s performance on previously unseen data?
These are some of the topics studied in Computational Learning Theory (COLT)
Why Does Learning Work?
25
Given: Training examples (x, f(x)) of an unknown function f
Find: A good approximation to f
[Figure: an unknown function y = f(x1, x2, x3, x4) with inputs x1, x2, x3, x4]
| Example | x1 | x2 | x3 | x4 | y |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 | 1 |
| 4 | 1 | 0 | 0 | 1 | 1 |
| 5 | 0 | 1 | 1 | 0 | 0 |
| 6 | 1 | 1 | 0 | 0 | 0 |
| 7 | 0 | 1 | 0 | 1 | 0 |
Learning is impossible, unless…
26
Complete Ignorance: There are 2^16 = 65536 possible functions over four input features.
We can’t figure out which one is correct until we’ve seen every possible input-output pair.
Even after seven examples we still have 2^9 possibilities for f.
Is Learning Possible?
| x1 | x2 | x3 | x4 | y |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | ? |
| 0 | 0 | 0 | 1 | ? |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 1 | 1 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 | ? |
| 1 | 0 | 0 | 0 | ? |
| 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 1 | 0 | ? |
| 1 | 0 | 1 | 1 | ? |
| 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | ? |
| 1 | 1 | 1 | 0 | ? |
| 1 | 1 | 1 | 1 | ? |
Why Does Learning Work (2)?
27
Simple Rules: There are only 16 simple conjunctive rules of the form y = x_i ∧ x_j ∧ x_k.
Try to learn a function of this form that explains the data (try it: there isn’t one).
m-of-n rules: There are 29 possible rules of the form "y = 1 if and only if at least m of the following n variables are 1" (try it: there is one).
Hypothesis Space
28
Bias
Learning requires guessing a good, small hypothesis class.
We can start with a very small class and enlarge it until it contains a hypothesis that fits the data.
(model selection)
We could be wrong! There is a robustness issue here; why aren’t we wrong more often?
This also relates to the probabilistic view of prediction, and to the relation between the paradigms.
One way to think about this robustness is via the stability of our prediction.
29
Can We Trust the Hypothesis?
There is a hidden conjunction the learner is to learn: f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
How many examples are needed to learn it ? How ?
Protocol: Some random source (e.g., Nature) provides training
examples; Teacher (Nature) provides the labels (f(x))
Not the only possible protocol (membership query; teaching)
<(1,1,1,1,1,1,…,1,1), 1> <(1,1,1,0,0,0,…,0,0), 0> <(1,1,1,1,1,0,...0,1,1),1> <(1,0,1,1,1,0,...0,1,1), 0> <(1,1,1,1,1,0,...0,0,1),1> <(1,0,1,0,0,0,...0,1,1), 0> <(1,1,1,1,1,1,…,0,1), 1> <(0,1,0,1,0,0,...0,1,1), 0>
30
Algorithm: Elimination
- Start with the set of all literals as candidates.
- Eliminate a literal that is not active (0) in a positive example.
<(1,1,1,1,1,1,…,1,1), 1>: f = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ … ∧ x100
<(1,1,1,0,0,0,…,0,0), 0>: learned nothing
<(1,1,1,1,1,0,...,0,1,1), 1>: f = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x99 ∧ x100
<(1,0,1,1,0,0,...,0,0,1), 0>: learned nothing
<(1,1,1,1,1,0,...,0,0,1), 1>: f = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
<(1,0,1,0,0,0,...,0,1,1), 0> <(1,1,1,1,1,1,…,0,1), 1> <(0,1,0,1,0,0,...,0,1,1), 0>
Final: f = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
(A sketch of this elimination algorithm appears below.)
Learning Conjunction
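A minimal sketch of the elimination algorithm for learning a monotone conjunction (the tiny 6-variable example is illustrative):

```python
def learn_conjunction(examples, n):
    """examples: list of (x, y) with x a tuple of n bits, y in {0,1}.
    Start with all n literals; drop any literal that is 0 in a positive example."""
    literals = set(range(n))
    for x, y in examples:
        if y == 1:
            literals -= {i for i in literals if x[i] == 0}
    return sorted(literals)

# Tiny 6-variable illustration of the trace above (the long tail of zeros compressed away).
examples = [
    ((1, 1, 1, 1, 1, 1), 1),
    ((1, 1, 1, 0, 0, 0), 0),   # negative examples are ignored by elimination
    ((1, 1, 1, 1, 1, 1), 1),
    ((1, 1, 1, 1, 0, 1), 1),
]
print(learn_conjunction(examples, 6))  # indices of literals kept in the hypothesis
```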
31
Instance Space: X. Hypothesis Space: H (set of possible hypotheses). Training instances S:
- positive and negative examples of the target f
- S is sampled according to a fixed, unknown probability distribution D over X
Determine: a hypothesis h ∈ H such that
- h(x) = f(x) for all x ∈ S?
- h(x) = f(x) for all x ∈ X?
Evaluated on future instances sampled according to D.
f = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
Prototypical Learning Scenario
32
Have seen many examples (drawn according to D )
Since in all the positive examples x1 was active, it is likely to be active in future positive examples
Even if not: under D, x1 is inactive in only relatively few positive examples, so our error will be small.
[Figure: target f and hypothesis h as regions; the shaded area is where f and h disagree]
Error_D(h) = Pr_{x∼D}[f(x) ≠ h(x)]
The error can be bounded via Chernoff bounds: a distribution-free notion!
PAC Learning: Intuition
33
Generalization for Consistent Learners
Claim: The probability that there exists a hypothesis h ∈ H that (1) is consistent with m examples and (2) satisfies err(h) > ε is less than |H|(1−ε)^m.
Equivalently: For any distribution D governing the IID generation of training and test instances, for all h ∈ H, for all 0 < ε, δ < 1, if
m > (ln|H| + ln(1/δ)) / ε
then, with probability at least 1−δ (over the choice of the training set of size m), err(h) < ε.
(A small numeric check of this bound appears below.)
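A small numeric check of this sample-complexity bound, m > (ln|H| + ln(1/δ))/ε; the hypothesis-space size used below is only an illustration:

```python
import math

def sample_complexity(H_size, eps, delta):
    """Number of examples sufficient so that, with probability >= 1 - delta,
    any hypothesis consistent with the sample has error < eps."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

# Illustration: conjunctions over n = 100 Boolean variables, where each variable
# appears positively, negatively, or not at all, giving |H| = 3^100.
print(sample_complexity(3 ** 100, eps=0.1, delta=0.05))
```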
34
Generalization for Consistent Learners
Claim: The probability that there exists a hypothesis h ∈ H that (1) is consistent with m examples and (2) satisfies err(h) > ε is less than |H|(1−ε)^m.
Proof: Let h be such a bad hypothesis (as in (2)). The probability that h is consistent with one example of f is
Pr_{x∼D}[f(x) = h(x)] < 1 − ε
Since the m examples are drawn independently of each other, the probability that h is consistent with m examples is less than (1−ε)^m.
The probability that some hypothesis in H is consistent with m examples is less than |H|(1−ε)^m.
35
Generalization for Consistent Learners
We want this probability to be smaller than δ, that is: |H|(1−ε)^m < δ.
Using 1−x < e^{−x}: ln|H| − εm < ln δ.
Hence: For any distribution D governing the IID generation of training and test instances, for all h ∈ H, for all 0 < ε, δ < 1, if m > (ln|H| + ln(1/δ))/ε, then with probability at least 1−δ (over the choice of the training set of size m), err(h) < ε.
What kind of hypothesis spaces do we want? Large? Small?
Do we want the smallest H possible?
36
Generalization (Agnostic Learners)
In general we try to learn a concept f using hypotheses in H, but f ∉ H.
Our goal should be to find a hypothesis h ∈ H with a small training error:
Err_TR(h) = Pr_{x∈S}[f(x) ≠ h(x)]
We want a guarantee that a hypothesis with a small training error will have good accuracy on unseen examples:
Err_D(h) = Pr_{x∼D}[f(x) ≠ h(x)]
Hoeffding bounds characterize the deviation between the true probability of an event and its observed frequency over m independent trials:
Pr(p > E(p) + ε) < exp(−2mε²)
(p is the underlying probability of the binary variable being 1)
37
Generalization (Agnostic Learners)
Therefore, the probability that an element h ∈ H will have training error which is off by more than ε can be bounded as follows:
Pr(Err_D(h) > Err_TR(h) + ε) < exp(−2mε²)
As in the consistent case: use the union bound to get a uniform bound over all of H; requiring |H|exp(−2mε²) < δ gives the following generalization bound, a bound on how much the true error can deviate from the observed error.
For any distribution D generating training and test instances, with probability at least 1−δ over the choice of the training set of size m (drawn IID), for all h ∈ H:
Err_D(h) ≤ Err_TR(h) + √[ (log|H| + log(1/δ)) / (2m) ]
38
Summary: Generalization
Learnability depends on the size of the hypothesis space.
In the case of a finite hypothesis space:
Err_D(h) ≤ Err_TR(h) + √[ (log|H| + log(1/δ)) / (2m) ]
In the case of an infinite hypothesis space:
Err_D(h) ≤ Err_TR(h) + √[ k (VC(H) + log(1/δ)) / (2m) ]
where VC(H) is the Vapnik-Chervonenkis dimension of the hypothesis class, a combinatorial measure of its complexity, and k is a constant.
Later we will see that this understanding of generalization has also some immediate algorithmic implications (essentially, model selection implications).
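A small sketch comparing the two generalization gaps above for illustrative values of |H|, VC(H), δ, and m; the constant k is set to 1 purely for illustration:

```python
import math

def gap_finite(H_size, delta, m):
    """sqrt((log|H| + log(1/delta)) / (2m)): finite hypothesis space."""
    return math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * m))

def gap_vc(vc_dim, delta, m, k=1.0):
    """sqrt(k(VC(H) + log(1/delta)) / (2m)): VC-dimension version, k illustrative."""
    return math.sqrt(k * (vc_dim + math.log(1 / delta)) / (2 * m))

m, delta = 10_000, 0.05
print(gap_finite(H_size=3 ** 100, delta=delta, m=m))  # |H| for conjunctions over 100 variables
print(gap_vc(vc_dim=101, delta=delta, m=m))           # VC dim of linear separators over 100 features
```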
39
Learning Theory: Summary (1)
Labeled observations S = {(x_i, l_i)}_{i=1}^{m} sampled according to a distribution D on X × {0,1}.
Goal: compute a hypothesis h ∈ H that performs well on future, unseen observations.
Assumption: test examples are also sampled according to D (the label is not observed).
40
Look for h ∈ H that minimizes the true error:
Err_D(h) = Pr_{(x,l)∼D}[h(x) ≠ l]
All we get to see is the empirical error:
Err_S(h) = |{x ∈ S : h(x) ≠ l}| / |S|
Basic theorem: with probability at least 1−δ,
Err_D(h) ≤ Err_S(h) + √[ k (VC(H) + ln(1/δ)) / m ]
Learning Theory:Summary(2) [Why does it work?]
41
Use a hypothesis space with small expressivity. E.g., prefer a function that is linear in the feature space over higher-order functions: f(x) = Σ_i c_i x_i.
The VC dimension of a linear function over N dimensions is N+1.
Data-dependent VC notions:
- The VC dimension of the class of separators with margin γ grows as γ becomes smaller.
- Sparsity: if there are at most k active features in each example, then the VC dimension is k+1.
Algorithmic issues: there are good algorithms for linear functions; learning higher-order functions is computationally hard.
This implies that we will make an effort to learn linear functions even when we know that our target function isn’t linear in the raw input.
Practical Lesson
Three directions we can go from here:
-Generalization
-Details of the Direct Learning Paradigm
-The probabilistic Paradigm.
42
VC-dimension-based bounds are unrealistic. Their value is mostly in providing a quantitative understanding of "why learning works" and of what the important complexity parameters are.
In recent years, this understanding has helped both to drive new algorithms and to develop new methods that can actually provide somewhat realistic generalization bounds:
PAC-Bayes Methods (McAllester, McAllester&Langford)
Random Projection/Weighted margin Methods (Garg, Har-Peled, Roth)
This method can be shown to have some algorithmic implications.
Advances in Theory of Generalization
43
Model the problem of text correction as that of generating correct sentences.
Goal: learn a model of the language; use it to predict.
PARADIGM Learn a probability distribution over all sentences
Use it to estimate which sentence is more likely: Pr(I saw the girl it the park) vs. Pr(I saw the girl in the park).
[In the same paradigm we sometimes learn a conditional probability distribution.]
In practice: make assumptions on the distribution’s type
In practice: a decision policy depends on the assumptions
2: Generative Model
44
An Example
I don’t know {whether, weather} to laugh or cry
How can we make this a learning problem?
We will look for a function F: Sentences → {whether, weather}. We need to define the domain of this function better.
An option: for each word w in English define a Boolean feature x_w:
[x_w = 1] iff w is in the sentence. This maps a sentence to a point in {0,1}^50,000.
In this space: some points are whether points, some are weather points. (A sketch of this mapping appears below.)
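A minimal sketch of this Boolean bag-of-words mapping; the tiny vocabulary is illustrative, whereas the slide's space has roughly 50,000 dimensions:

```python
# Map a sentence to a point in {0,1}^|V|: x_w = 1 iff word w occurs in the sentence.
vocab = ["i", "don't", "know", "whether", "weather", "to", "laugh", "or", "cry", "rain"]

def to_boolean_vector(sentence):
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(to_boolean_vector("I don't know whether to laugh or cry"))
```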
45
What’s Good?
Learning problem: Find a function that best separates the data
What function? What’s best? How to find it?
A possibility: Define the learning problem to be: Find a (linear) function that best separates the data.
46
Exclusive-OR (XOR)
In general: a parity function.
x_i ∈ {0,1}; f(x1, x2, …, xn) = 1 iff Σ_i x_i is even.
This function is not linearly separable.
[Figure: XOR in the (x1, x2) plane]
47
Functions Can be Made Linear
x1 x2 x4 + x2 x4 x5 + x1 x3 x7
Space: X = (x1, x2, …, xn)
Input transformation
New space: Y = {y1, y2, …} = {x_i, x_i x_j, x_i x_j x_k}
[Figure: weather vs. whether points in the new space]
The new discriminator y3 + y4 + y7 is functionally simpler. (A sketch of such an input transformation appears below.)
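A minimal sketch of such an input transformation: expand Boolean inputs into monomials of degree up to 3, so that a function like x1x2x4 + x2x4x5 + x1x3x7 becomes linear in the new feature space (indices and helper names are illustrative):

```python
from itertools import combinations

def monomial_features(x, max_degree=3):
    """x: tuple of 0/1 inputs. Returns a dict mapping monomial index-tuples to values."""
    n = len(x)
    feats = {}
    for d in range(1, max_degree + 1):
        for idx in combinations(range(n), d):
            val = 1
            for i in idx:
                val *= x[i]
            feats[idx] = val
    return feats

x = (1, 1, 0, 1, 0, 0, 0)  # x1..x7, 0-indexed internally
y = monomial_features(x)
# In the new space, "x1 x2 x4 + x2 x4 x5 + x1 x3 x7" is the linear function
# y[(0,1,3)] + y[(1,3,4)] + y[(0,2,6)]
print(y[(0, 1, 3)] + y[(1, 3, 4)] + y[(0, 2, 6)])
```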
48
Data are not separable in one dimension.
Not separable if you insist on using a specific class of functions.
[Figure: points on the x axis]
Feature Space
49
Blown Up Feature Space
Data are separable in the (x, x²) space.
[Figure: the same points plotted against x and x²]
Key issue: what features to use.
Computationally, can be done implicitly (kernels)
But there are warnings.
50
A General Framework for Learning
Goal: predict an unobserved output value y based on an observed input vector x
Estimate a functional relationship y ~ f(x) from a set {(x, y)_i}, i = 1…n.
Most relevant is classification: y ∈ {0,1} (or y ∈ {1,2,…,k}). (But within the same framework one can also talk about regression or structured prediction.)
What do we want f(x) to satisfy?
We want to minimize the loss (risk): L(f) = E_{X,Y}[ [f(x) ≠ y] ], where E_{X,Y} denotes the expectation with respect to the true distribution.
Simply: the number of mistakes; [·] is an indicator function.
51
A General Framework for Learning (II)
We want to minimize the loss: L(f) = E_{X,Y}[ [f(X) ≠ Y] ], where E_{X,Y} denotes the expectation with respect to the true distribution.
We cannot do that. Instead, we try to minimize the empirical classification error. For a set of training examples {(X_i, Y_i)}, i = 1…n, try to minimize:
L'(f) = (1/n) Σ_i [f(X_i) ≠ Y_i]
This minimization problem is typically NP-hard. To alleviate this computational problem, minimize a new function: a convex upper bound of the classification error function
I(f(x), y) = [f(x) ≠ y] = {1 when f(x) ≠ y; 0 otherwise}
(A small sketch of such a convex upper bound appears below.)
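A small sketch of one standard convex upper bound, the hinge loss, compared against the 0-1 error for a linear score; the hinge loss is one common choice, and the slide does not commit to a particular surrogate:

```python
# For a label y in {-1, +1} and a linear score s = w.x, the hinge loss
# max(0, 1 - y*s) is convex in w and upper-bounds the 0-1 error [y*s <= 0].

def zero_one(y, score):
    return 1.0 if y * score <= 0 else 0.0

def hinge(y, score):
    return max(0.0, 1.0 - y * score)

for y, score in [(+1, 2.0), (+1, 0.3), (+1, -0.5), (-1, 1.2)]:
    print(y, score, zero_one(y, score), hinge(y, score))  # hinge >= 0-1 in every case
```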
52
Learning as an Optimization Problem
A Loss Function L(f(x),y) measures the penalty incurred by a classifier f on example (x,y).
There are many different loss functions one could define:
Misclassification error: L(f(x), y) = 0 if f(x) = y; 1 otherwise.
Squared loss: L(f(x), y) = (f(x) − y)².
Input-dependent loss: L(f(x), y) = 0 if f(x) = y; c(x) otherwise.
A continuous convex loss function allows a simpler optimization algorithm.
[Figure: loss L plotted as a function of f(x) − y]
53
How to Learn?
Local search: Start with a linear threshold function. See how well you are doing. Correct it. Repeat until you converge (??).
Optimization: Define an objective function such that its solution separates the data well.
Direct computation of hypothesis. (A perceptron-style local-search sketch appears below.)
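A minimal sketch of the local-search idea using a perceptron-style correction rule, one standard instance of "see how well you are doing, correct, repeat" (the slide does not fix a particular update):

```python
def perceptron(examples, n_features, epochs=10, lr=1.0):
    """examples: list of (x, y) with x a list of floats and y in {-1, +1}."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                 # mistake: correct the hypothesis
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:                      # converged on the training data
            break
    return w, b

data = [([1.0, 0.0], +1), ([0.0, 1.0], -1), ([1.0, 1.0], +1), ([0.0, 0.0], -1)]
print(perceptron(data, n_features=2))
```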
54
Assignment: Verb Prediction
The goal is to predict a verb given a context:
Supervised learning, with examples like: She <<dropped>> the ball.
She << ? >> the ball: a good answer is probably "drop" or "kick".
Two versions: predict the verb form ("dropped") or just the base form ("drop").
There will be two approaches to compare:
- Language Modeling: you will have to implement the model yourself (incl. smoothing).
- Classification: use available packages, FEX for feature extraction and SNoW for classification. You need to decide on features, algorithm, parameters.
Original ideas: other datasets, web, clustering, etc.
55
Assignment: Verb Prediction
Additional annotation is available; an example sentence is shown on the slide.
56
Assignment: Verb Prediction See the course web-page for:
Description of the assignment (see detailed description at the bottom)
Pointers to relevant papers, tutorials for SNoW and FEX
Teams: Team 1: Sarah Borys, Maryam Karimzadehgan, Majid
Kazemian Team 2: Kavita Ganesan, Hyun Duk Kim, Parikshit Sondhi Team 3: Ryan Cunningham, Gourab Kundu, Daniel Schreiber Team 4: Oscar Sanchez Plazas, Mehwish Riaz, Scott Wegner
Feel free to swap; this arrangement is based on surveys (but no surveys from Kavita and Parikshit)
Due: Feb 25; next review due Mar 6.