Computational Learning Theory: An Analysis of a Conjunction Learner
Machine Learning, Spring 2018
The slides are mainly from Vivek Srikumar
This section
1. Analyze a simple algorithm for learning conjunctions
2. Define the PAC model of learning
3. Make formal connections to the principle of Occam’s razor
Learning Conjunctions

Training data:
– <(1,1,1,1,1,1,…,1,1), 1>
– <(1,1,1,0,0,0,…,0,0), 0>
– <(1,1,1,1,1,0,...,0,1,1), 1>
– <(1,0,1,1,1,0,...,0,1,1), 0>
– <(1,1,1,1,1,0,...,0,0,1), 1>
– <(1,0,1,0,0,0,...,0,1,1), 0>
– <(1,1,1,1,1,1,…,0,1), 1>
– <(0,1,0,1,0,0,...,0,1,1), 0>

The true function: f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

A simple learning algorithm (Elimination):
• Discard all negative examples.
• Build a conjunction using the features that are common to all positive examples.

Positive examples eliminate irrelevant features. On this training data, the algorithm produces:

h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Clearly this algorithm produces a conjunction that is consistent with the data, that is, errS(h) = 0, if the target function is a monotone conjunction. Exercise: Why?
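The elimination algorithm is short enough to write out in full. Below is a minimal sketch in Python; the function names, the 0-based feature indexing, and the list-of-pairs data format are illustrative choices, not something fixed by the slides.

```python
def learn_conjunction(examples):
    """Elimination algorithm for monotone conjunctions.

    examples: list of (x, y) pairs, where x is a tuple of 0/1 feature
    values and y is the 0/1 label.  Returns the set of (0-based)
    feature indices that are 1 in every positive example.
    """
    positives = [x for x, y in examples if y == 1]  # discard all negative examples
    h = set(range(len(positives[0])))               # start with every feature
    for x in positives:
        h = {i for i in h if x[i] == 1}             # keep features common to all positives
    return h


def predict(h, x):
    """The learned conjunction fires iff every feature in h is 1."""
    return int(all(x[i] == 1 for i in h))
```

Each positive example can only remove features from h, which is why h ends up at least as specific as the target f.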
Learning Conjunctions: Analysis

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Claim 1: h will only make mistakes on future positive examples. Why?

[Figure: the region h labels positive sits inside the region f labels positive; positive examples lie inside, negative examples outside.]

A mistake will occur only if some feature z (in our example, x1) is present in h but not in f. Such a mistake causes a positive example to be predicted as negative by h. Specifically: x1 = 0, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x100 = 1.

The reverse situation can never happen: for an example to be predicted as positive by h, every feature in h must be present, and h contains every feature of f.
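Since h requires a superset of the features that f requires, any example h accepts is also accepted by f, so false positives are impossible. A quick randomized check of Claim 1 (a sketch; representing examples as index-to-bit dictionaries is my own choice):

```python
import random

f = {2, 3, 4, 5, 100}   # feature indices required by the target (1-based, as in the slides)
h = f | {1}             # the learned hypothesis requires one extra feature, x1

def conj(required, x):
    # a monotone conjunction fires iff every required feature is 1
    return all(x[i] == 1 for i in required)

random.seed(0)
for _ in range(10_000):
    x = {i: random.randint(0, 1) for i in range(1, 101)}
    # "h positive but f negative" should never happen
    assert not (conj(h, x) and not conj(f, x))
print("no false positives observed")
```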
Learning Conjunctions: Analysis

Theorem: Suppose we are learning a conjunctive concept with n-dimensional Boolean features using m training examples. If

m > (n/ε)(ln n + ln(1/δ))

then, with probability > 1 − δ, the error of the learned hypothesis errD(h) will be less than ε.

If we see this many training examples, then the algorithm will produce a conjunction that, with high probability, will make few errors. Note that m is polynomial in n, 1/δ, and 1/ε.

Let’s prove this assertion.
Proof Intuition

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

What kinds of examples would drive a hypothesis to make a mistake? Positive examples where x1 is absent: f would say true and h would say false.

None of these examples appeared during training; otherwise x1 would have been eliminated. If they never appeared during training, maybe their appearance in the future would also be rare!

Let’s quantify our surprise at seeing such examples.
Learning Conjunctions: Analysis

Let p(z) be the probability that, in an example drawn from D, the feature z is 0 but the example has a positive label.
• That is, after training is done, p(z) is the probability that feature z causes a mistake on a randomly drawn example.
• For any z in the target function, p(z) = 0.

Remember that there will only be mistakes on positive examples for this toy problem. For instance, with

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

the example <(0,1,1,1,1,0,...,0,1,1), 1> is predicted as negative by h, and p(x1) is the probability that this situation occurs.

We know that

errD(h) ≤ Σz∈h p(z)

via direct application of the union bound. (Union bound: for a set of events, the probability that at least one of them happens is at most the sum of the probabilities of the individual events.)
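The union bound here is easy to check empirically: draw examples from a synthetic D and compare the error rate of h with the sum of the estimated p(z). A rough Monte Carlo sketch (the distribution, uniform over a small n, is made up purely for illustration):

```python
import random

random.seed(1)
n = 10
f = {2, 3, 4, 5}   # a small target conjunction so that positives are common
h = f | {1}        # the learned hypothesis with the spurious feature x1

def sample():
    # D: each feature is an independent fair coin; the label comes from f
    x = [random.randint(0, 1) for _ in range(n)]
    return x, int(all(x[i] for i in f))

N = 100_000
mistakes = 0
p = {z: 0 for z in h}
for _ in range(N):
    x, y = sample()
    mistakes += int(all(x[i] for i in h)) != y
    for z in h:
        p[z] += (x[z] == 0 and y == 1)   # the event whose probability is p(z)

print("errD(h)  ~", mistakes / N)
print("sum p(z) ~", sum(p.values()) / N)   # errD(h) <= sum of p(z)
```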
Learning Conjunctions: Analysis

• Call a feature z bad if p(z) ≥ ε/n (with n the dimensionality).
• Intuitively, a bad feature is one that has a significant probability of not appearing with a positive example. (And, if it appears in all positive training examples, it can cause errors.)

If there are no bad features, then errD(h) < ε. Why? Because errD(h) ≤ Σz∈h p(z), h has at most n features, and each contributes less than ε/n. Let us try to see when this will not happen.

What if there are bad features? Let z be a bad feature. What is the probability that it will not be eliminated by one training example? Recall that z is eliminated exactly by a positive example in which z is 0; in the training data above, <(1,1,1,1,1,0,...,0,1,1), 1> was one example of this kind.
Learning Conjunctions: Analysis

What we know so far:

Pr(a bad feature z is not eliminated by one example) ≤ 1 − ε/n

But say we have m training examples. Then

Pr(a bad feature survives m examples) ≤ (1 − ε/n)^m

There are at most n bad features. So

Pr(any bad feature survives m examples) ≤ n(1 − ε/n)^m
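These bounds are easy to evaluate numerically. A small sketch tabulating the failure bound n(1 − ε/n)^m as m grows (the parameter values are just the running example):

```python
n, eps = 100, 0.1
for m in (1000, 3000, 6908):
    # probability that some bad feature survives all m examples
    print(m, n * (1 - eps / n) ** m)
# m=1000 -> ~36.8 (vacuous), m=3000 -> ~5.0, m=6908 -> ~0.0996
```

For small m the bound exceeds 1 and says nothing; by m = 6908 it has dropped below 0.1.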
Learning Conjunctions: Analysis

We want this probability to be small. Why? So that we can choose enough training examples that the probability that any bad feature z survives all of them is less than some δ. That is, we want

n(1 − ε/n)^m < δ

We know that 1 − x < e^(−x). So it is sufficient to require

n·e^(−mε/n) < δ

Or equivalently,

m > (n/ε)(ln n + ln(1/δ))
Learning Conjunctions: Analysis

To guarantee that the probability of failure (i.e., of error > ε) is less than δ, the number of examples we need is

m > (n/ε)(ln n + ln(1/δ))

which is polynomial in n, 1/ε, and 1/δ.

That is, if m has this property, then:
• With probability 1 − δ, there will be no bad features.
• Or equivalently, with probability 1 − δ, we will have errD(h) < ε.

What does this mean in practice? (The sketch below reproduces these numbers.)
• If ε = 0.1 and δ = 0.1, then for n = 100, we need 6908 training examples.
• If ε = 0.1 and δ = 0.1, then for n = 10, we need only 461 examples.
• If ε = 0.1 and δ = 0.01, then for n = 10, we need 691 examples.
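A two-line computation reproduces the sample sizes above; the function name is my own, and rounding up gives the smallest integer satisfying the bound:

```python
from math import ceil, log

def sample_size(n, eps, delta):
    """Smallest integer m with m > (n/eps) * (ln n + ln(1/delta))."""
    return ceil((n / eps) * (log(n) + log(1.0 / delta)))

print(sample_size(100, 0.1, 0.1))   # -> 6908
print(sample_size(10,  0.1, 0.1))   # -> 461
print(sample_size(10,  0.1, 0.01))  # -> 691
```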
What we have here is a PAC guarantee: our algorithm is Probably Approximately Correct.