Computational Learning Theory: An Analysis of a Conjunction Learner
Machine Learning, Spring 2018
The slides are mainly from Vivek Srikumar
This section
1. Analyze a simple algorithm for learning conjunctions
2. Define the PAC model of learning
3. Make formal connections to the principle of Occam’s razor
Learning Conjunctions

Training data:
– <(1,1,1,1,1,1,…,1,1), 1>
– <(1,1,1,0,0,0,…,0,0), 0>
– <(1,1,1,1,1,0,...,0,1,1), 1>
– <(1,0,1,1,1,0,...,0,1,1), 0>
– <(1,1,1,1,1,0,...,0,0,1), 1>
– <(1,0,1,0,0,0,...,0,1,1), 0>
– <(1,1,1,1,1,1,…,0,1), 1>
– <(0,1,0,1,0,0,...,0,1,1), 0>

The true function: f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

A simple learning algorithm (Elimination):
• Discard all negative examples.
• Build a conjunction using the features that are common to all positive examples.

Positive examples eliminate irrelevant features. On this training data, the algorithm produces:

h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Clearly this algorithm produces a conjunction that is consistent with the data, that is, errS(h) = 0, if the target function is a monotone conjunction. Exercise: Why?
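The elimination algorithm is short enough to write out in full. Below is a minimal sketch in Python; the function names, the 0-based feature indexing, and the list-of-pairs data format are illustrative choices, not something fixed by the slides.

```python
def learn_conjunction(examples):
    """Elimination algorithm for monotone conjunctions.

    examples: list of (x, y) pairs, where x is a tuple of 0/1 feature
    values and y is the 0/1 label.  Returns the set of (0-based)
    feature indices that are 1 in every positive example.
    """
    positives = [x for x, y in examples if y == 1]  # discard all negative examples
    h = set(range(len(positives[0])))               # start with every feature
    for x in positives:
        h = {i for i in h if x[i] == 1}             # keep features common to all positives
    return h


def predict(h, x):
    """The learned conjunction fires iff every feature in h is 1."""
    return int(all(x[i] == 1 for i in h))
```

Each positive example can only remove features from h, which is why h ends up at least as specific as the target f.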
Learning Conjunctions: Analysis

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Claim 1: h will only make mistakes on future positive examples. Why?

[Figure: the region h labels positive sits inside the region f labels positive; positive examples lie inside, negative examples outside.]

A mistake will occur only if some feature z (in our example, x1) is present in h but not in f. Such a mistake causes a positive example to be predicted as negative by h. Specifically: x1 = 0, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x100 = 1.

The reverse situation can never happen: for an example to be predicted as positive by h, every feature in h must be present, and h contains every feature of f.
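Since h requires a superset of the features that f requires, any example h accepts is also accepted by f, so false positives are impossible. A quick randomized check of Claim 1 (a sketch; representing examples as index-to-bit dictionaries is my own choice):

```python
import random

f = {2, 3, 4, 5, 100}   # feature indices required by the target (1-based, as in the slides)
h = f | {1}             # the learned hypothesis requires one extra feature, x1

def conj(required, x):
    # a monotone conjunction fires iff every required feature is 1
    return all(x[i] == 1 for i in required)

random.seed(0)
for _ in range(10_000):
    x = {i: random.randint(0, 1) for i in range(1, 101)}
    # "h positive but f negative" should never happen
    assert not (conj(h, x) and not conj(f, x))
print("no false positives observed")
```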
Learning Conjunctions: Analysis

Theorem: Suppose we are learning a conjunctive concept with n-dimensional Boolean features using m training examples. If

m > (n/ε)(ln n + ln(1/δ))

then, with probability > 1 − δ, the error of the learned hypothesis errD(h) will be less than ε.

If we see this many training examples, then the algorithm will produce a conjunction that, with high probability, will make few errors. Note that m is polynomial in n, 1/δ, and 1/ε.

Let’s prove this assertion.
Proof Intuition

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

What kinds of examples would drive a hypothesis to make a mistake? Positive examples where x1 is absent: f would say true and h would say false.

None of these examples appeared during training; otherwise x1 would have been eliminated. If they never appeared during training, maybe their appearance in the future would also be rare!

Let’s quantify our surprise at seeing such examples.
Learning Conjunctions: Analysis

Let p(z) be the probability that, in an example drawn from D, the feature z is 0 but the example has a positive label.
• That is, after training is done, p(z) is the probability that feature z causes a mistake on a randomly drawn example.
• For any z in the target function, p(z) = 0.

Remember that there will only be mistakes on positive examples for this toy problem. For instance, with

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

the example <(0,1,1,1,1,0,...,0,1,1), 1> is predicted as negative by h, and p(x1) is the probability that this situation occurs.

We know that

errD(h) ≤ Σz∈h p(z)

via direct application of the union bound. (Union bound: for a set of events, the probability that at least one of them happens is at most the sum of the probabilities of the individual events.)
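The union bound here is easy to check empirically: draw examples from a synthetic D and compare the error rate of h with the sum of the estimated p(z). A rough Monte Carlo sketch (the distribution, uniform over a small n, is made up purely for illustration):

```python
import random

random.seed(1)
n = 10
f = {2, 3, 4, 5}   # a small target conjunction so that positives are common
h = f | {1}        # the learned hypothesis with the spurious feature x1

def sample():
    # D: each feature is an independent fair coin; the label comes from f
    x = [random.randint(0, 1) for _ in range(n)]
    return x, int(all(x[i] for i in f))

N = 100_000
mistakes = 0
p = {z: 0 for z in h}
for _ in range(N):
    x, y = sample()
    mistakes += int(all(x[i] for i in h)) != y
    for z in h:
        p[z] += (x[z] == 0 and y == 1)   # the event whose probability is p(z)

print("errD(h)  ~", mistakes / N)
print("sum p(z) ~", sum(p.values()) / N)   # errD(h) <= sum of p(z)
```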
Learning Conjunctions: Analysis

• Call a feature z bad if p(z) ≥ ε/n (with n the dimensionality).
• Intuitively, a bad feature is one that has a significant probability of not appearing with a positive example. (And, if it appears in all positive training examples, it can cause errors.)

If there are no bad features, then errD(h) < ε. Why? Because errD(h) ≤ Σz∈h p(z), h has at most n features, and each contributes less than ε/n. Let us try to see when this will not happen.

What if there are bad features? Let z be a bad feature. What is the probability that it will not be eliminated by one training example? Recall that z is eliminated exactly by a positive example in which z is 0; in the training data above, <(1,1,1,1,1,0,...,0,1,1), 1> was one example of this kind.
Learning Conjunctions: Analysis

What we know so far:

Pr(a bad feature z is not eliminated by one example) ≤ 1 − ε/n

But say we have m training examples. Then

Pr(a bad feature survives m examples) ≤ (1 − ε/n)^m

There are at most n bad features. So

Pr(any bad feature survives m examples) ≤ n(1 − ε/n)^m
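These bounds are easy to evaluate numerically. A small sketch tabulating the failure bound n(1 − ε/n)^m as m grows (the parameter values are just the running example):

```python
n, eps = 100, 0.1
for m in (1000, 3000, 6908):
    # probability that some bad feature survives all m examples
    print(m, n * (1 - eps / n) ** m)
# m=1000 -> ~36.8 (vacuous), m=3000 -> ~5.0, m=6908 -> ~0.0996
```

For small m the bound exceeds 1 and says nothing; by m = 6908 it has dropped below 0.1.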
Learning Conjunctions: Analysis

We want this probability to be small. Why? So that we can choose enough training examples that the probability that any bad feature z survives all of them is less than some δ. That is, we want

n(1 − ε/n)^m < δ

We know that 1 − x < e^(−x). So it is sufficient to require

n·e^(−mε/n) < δ

Or equivalently,

m > (n/ε)(ln n + ln(1/δ))
Learning Conjunctions: Analysis

To guarantee that the probability of failure (i.e., of error > ε) is less than δ, the number of examples we need is

m > (n/ε)(ln n + ln(1/δ))

which is polynomial in n, 1/ε, and 1/δ.

That is, if m has this property, then:
• With probability 1 − δ, there will be no bad features.
• Or equivalently, with probability 1 − δ, we will have errD(h) < ε.

What does this mean in practice? (The sketch below reproduces these numbers.)
• If ε = 0.1 and δ = 0.1, then for n = 100, we need 6908 training examples.
• If ε = 0.1 and δ = 0.1, then for n = 10, we need only 461 examples.
• If ε = 0.1 and δ = 0.01, then for n = 10, we need 691 examples.
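A two-line computation reproduces the sample sizes above; the function name is my own, and rounding up gives the smallest integer satisfying the bound:

```python
from math import ceil, log

def sample_size(n, eps, delta):
    """Smallest integer m with m > (n/eps) * (ln n + ln(1/delta))."""
    return ceil((n / eps) * (log(n) + log(1.0 / delta)))

print(sample_size(100, 0.1, 0.1))   # -> 6908
print(sample_size(10,  0.1, 0.1))   # -> 461
print(sample_size(10,  0.1, 0.01))  # -> 691
```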
What we have here is a PAC guarantee: our algorithm is Probably Approximately Correct.