Università di Milano-Bicocca, Laurea Magistrale in Informatica
Corso di APPRENDIMENTO E APPROSSIMAZIONE
Prof. Giancarlo Mauri
Lezione 4 - Computational Learning Theory


TRANSCRIPT


1

Università di Milano-Bicocca, Laurea Magistrale in Informatica

Corso di

APPRENDIMENTO E APPROSSIMAZIONE

Prof. Giancarlo Mauri

Lezione 4 - Computational Learning Theory

2

Computational models of cognitive phenomena

Computing capabilities: Computability theory

Reasoning/deduction: Formal logic

Learning/induction: ?

3

A theory of the learnable (Valiant '84)

[…] The problem is to discover good models that are interesting to study for their own sake and that promise to be relevant both to explaining human experience and to building devices that can learn […] Learning machines must have all 3 of the following properties:

the machines can provably learn whole classes of concepts, and these classes can be characterized

the classes of concepts are appropriate and nontrivial for general-purpose knowledge

the computational process by which the machine builds the desired programs requires a "feasible" (i.e. polynomial) number of steps

4

A theory of the learnable

We seek general laws that constrain inductive learning, relating:

Probability of successful learning
Number of training examples
Complexity of hypothesis space
Accuracy to which target concept is approximated
Manner in which training examples are presented

5

Probably approximately correct learning

A formal computational model intended to shed light on the limits of what can be learned by a machine, by analysing the computational cost of learning algorithms.

6

What we want to learn

That is

to determine uniformly good approximations of an unknown function from its values at some sample points:

interpolation, pattern matching, concept learning

CONCEPT = recognizing algorithm

LEARNING = computational description of recognizing algorithms, starting from examples or incomplete specifications

7

What's new in PAC learning

Accuracy of results and running time for learning algorithms are explicitly quantified and related.

A general problem: the use of resources (time, space, …) by computations → COMPLEXITY THEORY

Example:

Sorting: n·log n time (polynomial, feasible)

Boolean satisfiability: 2ⁿ time (exponential, intractable)

8

Learning from examples

[Diagram: a LEARNER receives EXAMPLES from a DOMAIN and outputs A REPRESENTATION OF A CONCEPT.]

CONCEPT: a subset of the domain

EXAMPLES: elements of the concept (positive)

REPRESENTATION: domain → expressions

GOOD LEARNER? EFFICIENT LEARNER?

9

The PAC model

A domain X (e.g. {0,1}ⁿ, Rⁿ)

A concept: a subset of X, f ⊆ X, or f: X → {0,1}

A class of concepts F ⊆ 2^X

A probability distribution P on X

Example 1: X ≡ a square, F ≡ the triangles in the square

10

The PAC model

Example 2

X ≡ {0,1}ⁿ, F ≡ a family of boolean functions:

fr(x1, …, xn) = 1 if there are at least r ones in (x1, …, xn), 0 otherwise

P: a probability distribution on X, uniform or non-uniform

11

The PAC model

The learning process

Labeled sample: ((x0, f(x0)), (x1, f(x1)), …, (xn, f(xn)))

Hypothesis: a function h consistent with the sample (i.e. h(xi) = f(xi) ∀i)

Error probability: Perr = P{h(x) ≠ f(x), x ∈ X}

12

[Diagram: an examples generator with probability distribution P over X draws t examples; a TEACHER, who knows f ∈ F, labels them as (x1, f(x1)), …, (xt, f(xt)); the LEARNER applies an inference procedure A and outputs a hypothesis h (an implicit representation of a concept).]

The learning algorithm A is good if the hypothesis h is "ALMOST ALWAYS" "CLOSE TO" the target concept c.

The PAC model

13

[Diagram: the target concept f and the hypothesis h as subsets of X; x is a random choice from X.]

"CLOSE TO": a metric, given P: dP(f, h) = Perr = P{x : f(x) ≠ h(x)}

Given an approximation parameter ε (0 < ε ≤ 1), h is an ε-approximation of f if dP(f, h) ≤ ε.

"ALMOST ALWAYS": a confidence parameter δ (0 < δ ≤ 1).

The "measure" of the sequences of examples, randomly chosen according to P, for which h is an ε-approximation of f is at least 1-δ.

The PAC model
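To make the metric concrete: the sketch below (not from the slides; the domain, target f, hypothesis h and sampler are made-up examples) estimates dP(f, h) = P{x : f(x) ≠ h(x)} by Monte Carlo sampling from P, which is how one would check empirically whether h is an ε-approximation of f.

```python
import random

def d_p(f, h, sample_from_P, trials=100_000):
    """Monte Carlo estimate of dP(f, h) = P{x : f(x) != h(x)}."""
    mistakes = 0
    for _ in range(trials):
        x = sample_from_P()
        if f(x) != h(x):
            mistakes += 1
    return mistakes / trials

# Toy domain X = {0,1}^10 with the uniform distribution P (made-up example).
n = 10
sample_uniform = lambda: tuple(random.randint(0, 1) for _ in range(n))
f = lambda x: int(sum(x) >= 5)   # target concept f_5: "at least 5 ones"
h = lambda x: int(sum(x) >= 6)   # some learned hypothesis

eps = 0.1
err = d_p(f, h, sample_uniform)
print(f"estimated dP(f,h) = {err:.3f}; eps-approximation: {err <= eps}")
```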

14

[Diagram: a generator of examples feeds the learner, which outputs a hypothesis h.]

F: a concept class; S: a set of labeled samples from a concept in F; A: S → F such that:

I) A(S) is consistent with S

II) P(Perr < ε) > 1-δ

for all 0 < ε, δ < 1, for all f ∈ F, there exists m ∈ N such that for every S with |S| ≥ m.

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning)
COMPUTATION TIME (Polynomial PAC learning)

DEF 1: a concept class F = ∪_{n≥1} Fn is statistically PAC learnable if there is a learning algorithm with sample size t = t(n, 1/ε, 1/δ) bounded by some polynomial function in n, 1/ε, 1/δ.

Look for algorithms which use a "reasonable" amount of computational resources.

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC vs. STATISTICAL PAC

DEF 2: a concept class F = ∪_{n≥1} Fn is polynomially PAC learnable if there is a learning algorithm with running time bounded by some polynomial function in n, 1/ε, 1/δ.

17

Bn = {f : {0,1}ⁿ → {0,1}}: the set of boolean functions in n variables

Fn ⊆ Bn: a class of concepts

Example 1: Fn = clauses with literals in {x1, x̄1, …, xn, x̄n} (e.g. x1 ∨ x̄2 ∨ x3 ∨ … ∨ xk)

Example 2: Fn = linearly separable functions in n variables, f(X) = HS(Σk wk·xk − λ)

REPRESENTATION:
- TRUTH TABLE (explicit)
- BOOLEAN CIRCUITS (implicit)

[Diagram: boolean circuits map to the boolean functions they compute.]

Learning boolean functions
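As a small illustration of the two example classes (my own sketch; the specific weights, threshold and literals are invented), here is how a clause and a linearly separable function f(X) = HS(Σ wk·xk − λ) can be represented and evaluated:

```python
def HS(z):
    """Heaviside step: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def linearly_separable(weights, threshold):
    """A concept in Fn (Example 2): f(x) = HS(sum_k w_k x_k - lambda)."""
    return lambda x: HS(sum(w * xi for w, xi in zip(weights, x)) - threshold)

def clause(literals):
    """A concept in Fn (Example 1): a disjunction of literals.
    A literal is (index, positive?), e.g. (1, False) means x̄2 (0-based index)."""
    return lambda x: int(any(x[i] == 1 if pos else x[i] == 0 for i, pos in literals))

# x1 OR not-x2 OR x3 on {0,1}^4 (made-up instance)
c = clause([(0, True), (1, False), (2, True)])
# majority of 4 variables: HS(x1 + x2 + x3 + x4 - 2)
f = linearly_separable([1, 1, 1, 1], 2)

x = (0, 1, 0, 1)
print(c(x), f(x))   # -> 0 1
```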

18

• BASIC OPERATIONS: ∧, ∨, ¬
• COMPOSITION: [f(g1, …, gm)](x) = f(g1(x), …, gm(x)), where f is in m variables and the gi are in n variables

CIRCUIT: a finite acyclic directed graph with input nodes (the variables), internal nodes labeled with basic operations, and an output node.

Given an assignment x1, …, xn ∈ {0,1} to the input variables, the output node computes the corresponding value.

[Diagram: an example circuit on inputs x1, x2, x3 built from ∨ and ∧ gates.]

Boolean functions and circuits

19

Fn ⊆ Bn

Cn: the class of circuits which compute all and only the functions in Fn

F = ∪_{n≥1} Fn, C = ∪_{n≥1} Cn

Algorithm A to learn F by C:
• INPUT: (n, ε, δ)
• The learner computes t = t(n, 1/ε, 1/δ) (t = number of examples sufficient to learn with accuracy ε and confidence δ)
• The learner asks the teacher for a labelled t-sample
• The learner receives the t-sample S and computes c = An(S)
• Output: c (c = a representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on {0,1}ⁿ, and outputs An(S) = A(n, S).

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n, 1/ε, 1/δ) for a concept class F = ∪_{n≥1} Fn, using the class of representations C = ∪_{n≥1} Cn, if for all n ≥ 1, for all f ∈ Fn, for all 0 < ε, δ < 1 and for every probability distribution P over {0,1}ⁿ the following holds:

if the inference procedure An receives as input a t-sample, it outputs a representation c ∈ Cn of a function g that is probably approximately correct, that is, with probability at least 1-δ a t-sample is chosen such that the inferred function g satisfies

P{x : f(x) ≠ g(x)} ≤ ε

g is ε-good: g is an ε-approximation of f; g is ε-bad: g is not an ε-approximation of f

NOTE: distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF: An inference procedure An for the class Fn is consistent if, given the target function f ∈ Fn, for every t-sample S = (⟨x1, b1⟩, …, ⟨xt, bt⟩), An(S) is a representation of a function g "consistent" with S, i.e. g(x1) = b1, …, g(xt) = bt.

DEF: A learning algorithm A is consistent if its inference procedure is consistent.

PROBLEM: Estimate upper and lower bounds on the sample size t = t(n, 1/ε, 1/δ).

Upper bounds will be given for consistent algorithms.

Lower bounds will be given for arbitrary algorithms.

22

THEOREM: t(n, 1/ε, 1/δ) ≤ (1/ε)·(ln|Fn| + ln(1/δ))

PROOF:
Prob{(x1, …, xt) : ∃g (g(x1) = f(x1), …, g(xt) = f(xt), g ε-bad)}
≤ Σ_{g ε-bad} Prob{g(x1) = f(x1), …, g(xt) = f(xt)}        (since P(A ∪ B) ≤ P(A) + P(B))
≤ Σ_{g ε-bad} Π_{i=1,…,t} Prob{g(xi) = f(xi)}              (independent events)
≤ Σ_{g ε-bad} (1-ε)^t ≤ |Fn|·(1-ε)^t ≤ |Fn|·e^{-εt}        (since g is ε-bad)

Impose |Fn|·e^{-εt} ≤ δ.

NOTE: Fn must be finite.

A simple upper bound
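The theorem above gives an explicit sufficient sample size; the snippet below (an illustration, with made-up values for |Fn|, ε and δ) simply evaluates t ≥ (1/ε)(ln|Fn| + ln(1/δ)):

```python
import math

def sample_size_finite_class(card_Fn, eps, delta):
    """Sufficient sample size for a consistent learner over a finite class:
    t >= (1/eps) * (ln|Fn| + ln(1/delta))."""
    return math.ceil((math.log(card_Fn) + math.log(1 / delta)) / eps)

# Example: monomials over n = 10 variables, |Fn| = 3^n
# (each variable is positive, negated, or absent).
n = 10
print(sample_size_finite_class(3 ** n, eps=0.1, delta=0.05))   # ~140 examples
```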

23

X: domain; F ⊆ 2^X: a class of concepts; S = (x1, …, xt): a t-sample

f ≡S g iff f(xi) = g(xi) ∀xi ∈ S: f and g are indistinguishable by S

πF(S) = |F/≡S|: the index of F with respect to S

mF(t) = max{πF(S) : S is a t-sample}: the growth function

Problem: uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

24

THEOREM:
Prob{(x1, …, xt) : ∃g (g ε-bad, g(x1) = f(x1), …, g(xt) = f(xt))} ≤ 2·mF(2t)·e^{-εt/2}

FACT:
mF(t) ≤ 2^t
mF(t) ≤ |F| (this condition immediately gives the simple upper bound)
if mF(t) = 2^t, then mF(j) = 2^j for every j < t

A general upper bound

25

[Figure: graph of the growth function mF(t); it coincides with 2^t up to t = d, then grows polynomially, staying below |F|.]

DEFINITION: d = VCdim(F) = max{t : mF(t) = 2^t}

FUNDAMENTAL PROPERTY:
mF(t) = 2^t for t ≤ d
mF(t) ≤ Σ_{k=0,…,d} C(t, k) for t > d, i.e. BOUNDED BY A POLYNOMIAL IN t (of degree d)

Graph of the growth function

26

THEOREM: If dn = VCdim(Fn), then t(n, 1/ε, 1/δ) ≤ max(4/ε·log(2/δ), (8dn/ε)·log(13/ε))

PROOF: Impose 2·mFn(2t)·e^{-εt/2} ≤ δ.

A lower bound on t(n, 1/ε, 1/δ): the number of examples which are necessary for arbitrary algorithms.

THEOREM: For 0 ≤ ε ≤ 1 and δ ≤ 1/100,

t(n, 1/ε, 1/δ) ≥ max(((1-ε)/ε)·ln(1/δ), (dn-1)/32)

Upper and lower bounds
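A small helper (my own sketch; it simply transcribes the two theorems, taking the logarithm base 2 and the lower-bound constants as reconstructed above, both assumptions) to evaluate the VC-based sample-size bounds:

```python
import math

def vc_upper_bound(d, eps, delta):
    """Sufficient sample size: max(4/eps * log2(2/delta), 8d/eps * log2(13/eps))."""
    return math.ceil(max((4 / eps) * math.log2(2 / delta),
                         (8 * d / eps) * math.log2(13 / eps)))

def vc_lower_bound(d, eps, delta):
    """Necessary sample size: max(((1-eps)/eps) * ln(1/delta), (d-1)/32)."""
    return math.ceil(max(((1 - eps) / eps) * math.log(1 / delta),
                         (d - 1) / 32))

# e.g. ~5618 sufficient, ~27 necessary for d=10, eps=0.1, delta=0.05
print(vc_upper_bound(d=10, eps=0.1, delta=0.05),
      vc_lower_bound(d=10, eps=0.1, delta=0.05))
```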

27

πF(S) = |{f⁻¹(1) ∩ {x1, …, xt} : f ∈ F}|, i.e. the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F.

If πF(S) = 2^|S|, we say that S is shattered by F.

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S ⊆ X that is shattered by F.

An equivalent definition of VCdim
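For a finite class over a tiny domain the definition can be checked by brute force; the sketch below (mine, purely illustrative: the "intervals on a line" class is an invented toy example) computes πF(S), tests shattering and searches for VCdim(F):

```python
from itertools import combinations

def index_of(F, S):
    """pi_F(S): number of distinct subsets of S cut out by concepts in F.
    Each concept is given as the set f^{-1}(1) of points it labels 1."""
    return len({frozenset(f & set(S)) for f in F})

def is_shattered(F, S):
    return index_of(F, S) == 2 ** len(S)

def vc_dim(F, X):
    d = 0
    for t in range(1, len(X) + 1):
        if any(is_shattered(F, S) for S in combinations(X, t)):
            d = t
    return d

# Tiny made-up example: X = {0,1,2,3}, F = all "intervals" {i, i+1, ..., j} plus the empty set
X = [0, 1, 2, 3]
F = [frozenset(range(i, j + 1)) for i in range(4) for j in range(i, 4)] + [frozenset()]
print(vc_dim(F, X))   # intervals on a line shatter any 2 points but no 3 -> 2
```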

28

Learn the family F of circles contained in the square.

VCdim(F) = 3; take ε = δ = 0.01.

Plugging these values into the general upper bound gives ≈ 24000 examples: sufficient.

Plugging them into the lower bound gives ≈ 690 examples: necessary.

Example 1

29

Learn the family Ln of linearly separable boolean functions in n variables:

f ∈ Ln ⟺ there exist weights W = (w1, …, wn) and a threshold λ such that f(X) = HS(Σk wk·xk − λ), where HS(x) = 1 if x ≥ 0 and 0 otherwise.

VCdim(Ln) = n + 1; |Ln| ≤ 2^(n²)

SIMPLE UPPER BOUND: t(n, 1/ε, 1/δ) ≤ (1/ε)·(n² + ln(1/δ))

UPPER BOUND USING VCdim(Ln): t(n, 1/ε, 1/δ) ≤ max{4/ε·log(2/δ), 8(n+1)/ε·log(13/ε)} — GROWS LINEARLY WITH n

Example 2

30

Consider the class L2 of linearly separable functions in two variables

VCdim(Ln) = n + 1, so VCdim(L2) = 3: VCdim(L2) ≥ 3 and VCdim(L2) < 4.

[Figure: four points in the plane; the green point cannot be separated from the other three, since no straight line can separate the green from the red points.]

Example 2

31

Classes of boolean formulas

Monomials: x1 ∧ x2 ∧ … ∧ xk

DNF: m1 ∨ m2 ∨ … ∨ mj (the mj are monomials)

Clauses: x1 ∨ x2 ∨ … ∨ xk

CNF: c1 ∧ c2 ∧ … ∧ cj (the cj are clauses)

k-DNF: at most k literals in each monomial

k-term-DNF: at most k monomials

k-CNF: at most k literals in each clause

k-clause-CNF: at most k clauses

Monotone formulas: contain no negated literals

m-formulas: each variable appears at most once

32

Th. (Valiant): Monomials are learnable from positive examples with 2(n + log(1/ε)) examples (ε = the tolerated error), by keeping xi if xi = 1 in all the examples and x̄i if xi = 0 in all the examples, and outputting the conjunction of the surviving literals.

N.B.: Learnability is non-monotone: A ⊆ B with B learnable does not imply that A is learnable.

H := x1 x̄1 x2 x̄2 … xn x̄n
for i := 1 to B do
begin
  es := generate-example()
  for j := 1 to n do
    if es(j) = 0 then delete xj from H
    else delete x̄j from H
end

Th.: Monomials are not learnable from negative examples only.

The results
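A runnable version of the elimination algorithm above (a sketch; the encoding of literals and the example data are mine): start from all 2n literals and delete every literal contradicted by a positive example; the surviving conjunction is the hypothesis.

```python
def learn_monomial(positive_examples, n):
    """Learn a monomial (conjunction of literals) from positive examples only.
    A literal is (j, True) for x_j or (j, False) for not-x_j (0-based)."""
    H = {(j, True) for j in range(n)} | {(j, False) for j in range(n)}
    for es in positive_examples:
        for j in range(n):
            # a positive example with es[j] == 0 contradicts x_j, otherwise it contradicts not-x_j
            H.discard((j, True) if es[j] == 0 else (j, False))
    return H

def eval_monomial(H, x):
    return int(all(x[j] == 1 if pos else x[j] == 0 for j, pos in H))

# Target concept (made up): x1 AND not-x3 over {0,1}^4
pos = [(1, 0, 0, 1), (1, 1, 0, 0), (1, 0, 0, 0)]
H = learn_monomial(pos, n=4)
print(sorted(H))            # literals consistent with every positive example
print(eval_monomial(H, (1, 1, 0, 1)), eval_monomial(H, (0, 1, 0, 1)))   # 1 0
```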

33

For every k:

1) k-CNF are learnable from positive examples only
1b) k-DNF are learnable from negative examples only
2) (k-DNF ∧ k-CNF) and (k-DNF ∨ k-CNF) are learnable from positive and negative examples
3) the class of k-decision lists is learnable

k-DL ≡ ((m1, b1), …, (mj, bj)) with the mi monomials of at most k literals and bi ∈ {0,1}; for C ∈ k-DL and a boolean vector v, C(v) = bi with i = min{i : mi(v) = 1} (C(v) = 0 if no such i exists).

Th.: Every k-DNF (or k-CNF) formula can be represented by a small k-DL.

Positive results
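A minimal sketch (mine, not from the slides) of how a k-decision list ((m1, b1), …, (mj, bj)) is evaluated: scan the monomials in order and output the bit attached to the first one that is satisfied, or 0 if none is.

```python
def make_monomial(literals):
    """Conjunction of at most k literals; a literal is (index, positive?)."""
    return lambda x: all(x[i] == 1 if pos else x[i] == 0 for i, pos in literals)

def eval_decision_list(C, x):
    """C = [(m1, b1), ..., (mj, bj)]: output b_i of the first monomial with m_i(x) = 1, else 0."""
    for m, b in C:
        if m(x):
            return b
    return 0

# A 2-decision list over {0,1}^3 (made-up example):
# if x1 and x2 then 1; elif not-x3 then 0; elif x2 then 1; else 0
C = [(make_monomial([(0, True), (1, True)]), 1),
     (make_monomial([(2, False)]), 0),
     (make_monomial([(1, True)]), 1)]

print(eval_decision_list(C, (1, 1, 0)),   # first rule fires -> 1
      eval_decision_list(C, (0, 1, 1)),   # third rule fires -> 1
      eval_decision_list(C, (0, 0, 1)))   # no rule fires -> 0
```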

34

If RP ≠ NP (in the distribution-free sense):

1) m-formulas are not learnable

2) Boolean threshold functions are not learnable

3) For k ≥ 2, k-term-DNF formulas are not learnable

Negative results

35

Mistake bound model

So far: how many examples are needed to learn? What about: how many mistakes before convergence?

Let's consider a setting similar to PAC learning:

Instances drawn at random from X according to distribution D

Learner must classify each instance before receiving the correct classification from the teacher

Can we bound the number of mistakes the learner makes before converging?

36

Mistake bound model

The learner: receives a sequence of training examples x; predicts the target value f(x); receives the correct target value from the trainer; is evaluated by the total number of mistakes it makes before converging to the correct hypothesis.

I.e. learning takes place during the use of the system, not off-line. Ex.: prediction of fraudulent use of credit cards.

37

Mistake bound for Find-S

Consider Find-S when H = conjunctions of boolean literals.

FIND-S:

Initialize h to the most specific hypothesis in H: x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ … ∧ xn ∧ ¬xn

For each positive training instance x: remove from h any literal not satisfied by x

Output h
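A direct transcription of FIND-S into Python, run online so that its mistakes can be counted as in the mistake bound model (a sketch; the literal encoding and the tiny stream are made up):

```python
def find_s_online(stream, n):
    """Run FIND-S online and count its mistakes (only positive examples can be misclassified).
    h is a set of literals (j, positive?); initially the most specific hypothesis."""
    h = {(j, True) for j in range(n)} | {(j, False) for j in range(n)}
    predict = lambda x: int(all(x[j] == 1 if pos else x[j] == 0 for j, pos in h))
    mistakes = 0
    for x, label in stream:
        if predict(x) != label:       # predict before the label is revealed
            mistakes += 1
        if label == 1:                # generalize h on positive examples only
            h = {(j, pos) for j, pos in h if (x[j] == 1) == pos}
    return h, mistakes

# Target c = x1 AND x2 on {0,1}^3 (made-up stream); the worst case gives at most n+1 = 4 mistakes
stream = [((1, 1, 0), 1), ((0, 1, 1), 0), ((1, 1, 1), 1), ((1, 0, 1), 0)]
h, m = find_s_online(stream, n=3)
print(sorted(h), m)
```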

38

Mistake bound for Find-S

If C ⊆ H and the training data are noise-free, Find-S converges to an exact hypothesis.

How many errors to learn c ∈ H? (Only positive examples can be misclassified.)

The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated.

Each subsequent error eliminates at least one literal ⟹ #mistakes ≤ n+1 (worst case for the "total" concept: ∀x, c(x) = 1).

39

Mistake bound for Halving

A version space is maintained and refined (e.g. Candidate-elimination).

Prediction is based on majority vote among the hypotheses in the current version space.

"Wrong" hypotheses are removed (even if x is exactly classified).

How many errors to exactly learn c ∈ H (H finite)?

Mistake when the majority of the hypotheses misclassifies x; these hypotheses are removed, so for each mistake the version space is at least halved.

At most log2(|H|) mistakes before exact learning (e.g. a single hypothesis remaining).

Note: learning without mistakes is possible.
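A brute-force sketch of Halving (mine; it enumerates an explicit finite H, feasible only for toy classes): predict by majority vote over the current version space, then discard every hypothesis that misclassified x.

```python
from itertools import product

def halving(stream, H):
    """H: list of hypotheses (functions). Returns the surviving version space and #mistakes."""
    VS = list(H)
    mistakes = 0
    for x, label in stream:
        votes_1 = sum(h(x) for h in VS)
        prediction = int(votes_1 * 2 > len(VS))        # majority vote (ties -> 0)
        if prediction != label:
            mistakes += 1
        VS = [h for h in VS if h(x) == label]          # remove wrong hypotheses
    return VS, mistakes

# H = all monotone conjunctions over 3 variables (2^3 = 8 hypotheses, made-up toy class)
def conj(mask):
    return lambda x: int(all(x[j] == 1 for j in range(3) if mask[j]))
H = [conj(mask) for mask in product([0, 1], repeat=3)]

stream = [((1, 0, 1), 1), ((0, 1, 1), 0), ((1, 1, 0), 0)]   # consistent with c = x1 AND x3
VS, m = halving(stream, H)
print(len(VS), m)   # at most log2(|H|) = 3 mistakes
```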

40

Optimal mistake bound

Question: what is the optimal mistake bound (i.e. the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H = C?

Formally, for any learning algorithm A and any target concept c:

MA(c) = max number of mistakes made by A to exactly learn c, over all possible training sequences

MA(C) = max_{c∈C} MA(c)

Note: MFind-S(C) = n+1, MHalving(C) ≤ log2(|C|)

Opt(C) = min_A MA(C)

i.e. the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm.

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2(|C|)

There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

e.g. the power set 2^X of X, for which it holds

VC(2^X) = |X| = log2(|2^X|)

There exist concept classes for which VC(C) < Opt(C) < MHalving(C).

42

Weighted majority algorithm

Generalizes Halving. Makes predictions by taking a weighted vote among a pool of prediction algorithms.

Learns by altering the weight associated with each prediction algorithm.

It does not eliminate hypotheses (i.e. algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent training data.

43

Weighted majority algorithm

∀i: wi := 1

For each training example (x, c(x)):

  q0 := q1 := 0

  For each prediction algorithm ai:
    if ai(x) = 0 then q0 := q0 + wi
    if ai(x) = 1 then q1 := q1 + wi

  if q1 > q0 then predict c(x) = 1
  if q1 < q0 then predict c(x) = 0
  if q1 = q0 then predict c(x) = 0 or 1 at random

  For each prediction algorithm ai do:
    if ai(x) ≠ c(x) then wi := β·wi   (0 ≤ β < 1)
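The same procedure in runnable form (a sketch; the pool of prediction algorithms and the stream are made-up toy choices):

```python
import random

def weighted_majority(stream, pool, beta=0.5):
    """Weighted Majority: weighted vote over the pool; multiply wrong predictors' weights by beta."""
    w = [1.0] * len(pool)
    mistakes = 0
    for x, cx in stream:
        q0 = sum(wi for wi, a in zip(w, pool) if a(x) == 0)
        q1 = sum(wi for wi, a in zip(w, pool) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0 if q1 < q0 else random.randint(0, 1)
        if prediction != cx:
            mistakes += 1
        w = [wi * beta if a(x) != cx else wi for wi, a in zip(w, pool)]
    return w, mistakes

# A pool of three simple predictors over {0,1}^3 (made up): each looks at one bit
pool = [lambda x: x[0], lambda x: x[1], lambda x: 1 - x[2]]
stream = [((1, 0, 1), 1), ((0, 0, 0), 0), ((1, 1, 1), 1), ((0, 1, 0), 0)]  # target c(x) = x[0]
w, m = weighted_majority(stream, pool)
print([round(wi, 3) for wi in w], m)   # the first predictor is never wrong and keeps weight 1.0
```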

44

Weighted majority algorithm (WM)

Coincides with Halving for β = 0.

Theorem: Let D be any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any aj ∈ A on D, and β = 1/2. Then WM makes at most

2.4·(k + log2 n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof: Since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k. The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM it is reduced to at most (3/4)W, because the "wrong" algorithms hold at least 1/2 of the total weight, which is reduced by a factor of 1/2.

The final total weight W is at most n·(3/4)^M, where M is the total number of mistakes made by WM over D.

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the final total weight W, hence

(1/2)^k ≤ n·(3/4)^M

from which

M ≤ (k + log2 n) / (-log2(3/4)) ≤ 2.4·(k + log2 n)

I.e. the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the size of the pool.
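Spelling out the algebra of the last step (my own rearrangement; the slides jump straight to the bound):

```latex
\left(\tfrac{1}{2}\right)^{k} \le n\left(\tfrac{3}{4}\right)^{M}
\;\Rightarrow\; -k \le \log_2 n + M\,\log_2\tfrac{3}{4}
\;\Rightarrow\; M\,\log_2\tfrac{4}{3} \le k + \log_2 n
\;\Rightarrow\; M \le \frac{k + \log_2 n}{\log_2\tfrac{4}{3}} \approx 2.4\,(k + \log_2 n),
\qquad \log_2\tfrac{4}{3} \approx 0.415 .
```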

  • Università di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant '84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • What's new in PAC learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classes of boolean formulas
  • The results
  • Positive results
  • Negative results
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 2: Università di Milano-Bicocca Laurea Magistrale in Informatica

2

Computational models of cognitive phenomena

Computing capabilities Computability theory

Reasoningdeduction Formal logic

Learninginduction

3

A theory of the learnable (Valiant lsquo84)

[hellip] The problem is to discover good models that are interesting to study for their own sake and that promise to be relevant both to explaining human experience and to building devices that can learn [hellip] Learning machines must have all 3 of the following properties

the machines can provably learn whole classes of concepts these classes can be characterized

the classes of concepts are appropriate and nontrivial for general-purpose knowledge

the computational process by which the machine builds the desired programs requires a ldquofeasiblerdquo (ie polynomial) number of steps

4

A theory of the learnable

We seek general laws that constrain inductive learning relating

Probability of successful learning Number of training examples Complexity of hypothesis space Accuracy to which target concept is approximated Manner in which training examples are presented

5

Probably approximately correct learning

formal computational model which want shed

light on the limits of what can be

learned by a machine analysing the

computational cost of learning algorithms

6

What we want to learn

That is

to determine uniformly good approximations of an unknown function from its value in some sample points

interpolation pattern matching concept learning

CONCEPT = recognizing algorithm

LEARNING = computational description of recognizing algorithms starting from - examples - incomplete specifications

7

Whatrsquos new in pac learning

Accuracy of results and running time for learning

algorithms

are explicitly quantified and related

A general problem

use of resources (time spacehellip) by computations COMPLEXITY THEORY

Example

Sorting nlogn time (polynomial feasible)

Bool satisfiability 2ⁿ time (exponential intractable)

8

Learning from examples

DOMAIN

ConceptLEARNER

EXAMPLES

A REPRESENTATION OF A CONCEPTCONCEPT subset of domain

EXAMPLES elements of concept (positive)

REPRESENTATION domainrarrexpressions GOOD LEARNER

EFFICIENT LEARNER

9

The PAC model

A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X

A probability distribution P on X

Example 1

X equiv a square F equiv triangles in the square

10

The PAC model

Example 2

Xequiv01ⁿ F equiv family of boolean functions

1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =

0 otherwise

P a probability distribution on X

Uniform Non uniform

11

The PAC model

The learning process

Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))

Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)

Error probability Perr(h(x)nef(x) xX)

12

LEARNERExamples generatorwith probabilitydistribution p

Inference procedure A

t examples

Hypothesis h (implicit representation of a concept)

The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c

TEACHER

The PAC model

X fF X F

(x1f(x1)) hellip (xtf(xt)))

13

f h

x random choice

Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le

ldquoALMOST ALWAYSrdquo

Confidence parameter

(0 lt le 1)

The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-

ldquoCLOSE TOrdquo

METRIC given P

dp(fh) = Perr = Px f(x)neh(x)

The PAC model

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 3: Università di Milano-Bicocca Laurea Magistrale in Informatica

3

A theory of the learnable (Valiant lsquo84)

[hellip] The problem is to discover good models that are interesting to study for their own sake and that promise to be relevant both to explaining human experience and to building devices that can learn [hellip] Learning machines must have all 3 of the following properties

the machines can provably learn whole classes of concepts these classes can be characterized

the classes of concepts are appropriate and nontrivial for general-purpose knowledge

the computational process by which the machine builds the desired programs requires a ldquofeasiblerdquo (ie polynomial) number of steps

4

A theory of the learnable

We seek general laws that constrain inductive learning relating

Probability of successful learning Number of training examples Complexity of hypothesis space Accuracy to which target concept is approximated Manner in which training examples are presented

5

Probably approximately correct learning

formal computational model which want shed

light on the limits of what can be

learned by a machine analysing the

computational cost of learning algorithms

6

What we want to learn

That is

to determine uniformly good approximations of an unknown function from its value in some sample points

interpolation pattern matching concept learning

CONCEPT = recognizing algorithm

LEARNING = computational description of recognizing algorithms starting from - examples - incomplete specifications

7

Whatrsquos new in pac learning

Accuracy of results and running time for learning

algorithms

are explicitly quantified and related

A general problem

use of resources (time spacehellip) by computations COMPLEXITY THEORY

Example

Sorting nlogn time (polynomial feasible)

Bool satisfiability 2ⁿ time (exponential intractable)

8

Learning from examples

DOMAIN

ConceptLEARNER

EXAMPLES

A REPRESENTATION OF A CONCEPTCONCEPT subset of domain

EXAMPLES elements of concept (positive)

REPRESENTATION domainrarrexpressions GOOD LEARNER

EFFICIENT LEARNER

9

The PAC model

A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X

A probability distribution P on X

Example 1

X equiv a square F equiv triangles in the square

10

The PAC model

Example 2

Xequiv01ⁿ F equiv family of boolean functions

1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =

0 otherwise

P a probability distribution on X

Uniform Non uniform

11

The PAC model

The learning process

Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))

Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)

Error probability Perr(h(x)nef(x) xX)

12

LEARNERExamples generatorwith probabilitydistribution p

Inference procedure A

t examples

Hypothesis h (implicit representation of a concept)

The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c

TEACHER

The PAC model

X fF X F

(x1f(x1)) hellip (xtf(xt)))

13

f h

x random choice

Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le

ldquoALMOST ALWAYSrdquo

Confidence parameter

(0 lt le 1)

The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-

ldquoCLOSE TOrdquo

METRIC given P

dp(fh) = Perr = Px f(x)neh(x)

The PAC model

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 4: Università di Milano-Bicocca Laurea Magistrale in Informatica

4

A theory of the learnable

We seek general laws that constrain inductive learning relating

Probability of successful learning Number of training examples Complexity of hypothesis space Accuracy to which target concept is approximated Manner in which training examples are presented

5

Probably approximately correct learning

formal computational model which want shed

light on the limits of what can be

learned by a machine analysing the

computational cost of learning algorithms

6

What we want to learn

That is

to determine uniformly good approximations of an unknown function from its value in some sample points

interpolation pattern matching concept learning

CONCEPT = recognizing algorithm

LEARNING = computational description of recognizing algorithms starting from - examples - incomplete specifications

7

Whatrsquos new in pac learning

Accuracy of results and running time for learning

algorithms

are explicitly quantified and related

A general problem

use of resources (time spacehellip) by computations COMPLEXITY THEORY

Example

Sorting nlogn time (polynomial feasible)

Bool satisfiability 2ⁿ time (exponential intractable)

8

Learning from examples

DOMAIN

ConceptLEARNER

EXAMPLES

A REPRESENTATION OF A CONCEPTCONCEPT subset of domain

EXAMPLES elements of concept (positive)

REPRESENTATION domainrarrexpressions GOOD LEARNER

EFFICIENT LEARNER

9

The PAC model

A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X

A probability distribution P on X

Example 1

X equiv a square F equiv triangles in the square

10

The PAC model

Example 2

Xequiv01ⁿ F equiv family of boolean functions

1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =

0 otherwise

P a probability distribution on X

Uniform Non uniform

11

The PAC model

The learning process

Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))

Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)

Error probability Perr(h(x)nef(x) xX)

12

LEARNERExamples generatorwith probabilitydistribution p

Inference procedure A

t examples

Hypothesis h (implicit representation of a concept)

The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c

TEACHER

The PAC model

X fF X F

(x1f(x1)) hellip (xtf(xt)))

13

f h

x random choice

Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le

ldquoALMOST ALWAYSrdquo

Confidence parameter

(0 lt le 1)

The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-

ldquoCLOSE TOrdquo

METRIC given P

dp(fh) = Perr = Px f(x)neh(x)

The PAC model

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn ⊆ Bn; Cn: a class of circuits which compute all and only the functions in Fn

F = ∪_{n=1..∞} Fn, C = ∪_{n=1..∞} Cn

Algorithm A to learn F by C:

• INPUT: (n, ε, δ)
• The learner computes t = t(n, 1/ε, 1/δ) (t = number of examples sufficient to learn with accuracy ε and confidence δ)
• The learner asks the teacher for a labelled t-sample
• The learner receives the t-sample S and computes C = An(S)
• Output: C (C = representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on {0,1}ⁿ, and outputs An(S) = A(n, S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n, 1/ε, 1/δ) for a concept class F = ∪_{n=1..∞} Fn, using the class of representations C = ∪_{n=1..∞} Cn, if for all n ≥ 1, for all f ∈ Fn, for all 0 < ε, δ < 1 and for every probability distribution p over {0,1}ⁿ the following holds:

if the inference procedure An receives as input a t-sample, it outputs a representation c ∈ Cn of a function g that is probably approximately correct, that is, with probability at least 1 - δ a t-sample is chosen such that the function g inferred satisfies

P{x : f(x) ≠ g(x)} ≤ ε

g is ε-good: g is an ε-approximation of f; g is ε-bad: g is not an ε-approximation of f

NOTE: distribution free (the requirement holds for every p)

Boolean functions and circuits

21

Statistical PAC learning

DEF: An inference procedure An for the class Fn is consistent if, given the target function f ∈ Fn, for every t-sample S = (⟨x1, b1⟩, …, ⟨xt, bt⟩), An(S) is a representation of a function g "consistent" with S, i.e. g(x1) = b1, …, g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM: estimate upper and lower bounds on the sample size t = t(n, 1/ε, 1/δ).

Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM: t(n, 1/ε, 1/δ) ≤ (1/ε)·(ln|Fn| + ln(1/δ))

PROOF:

Prob{(x1, …, xt) : ∃g (g ε-bad and g(x1) = f(x1), …, g(xt) = f(xt))} ≤

≤ ∑_{g ε-bad} Prob{g(x1) = f(x1), …, g(xt) = f(xt)}        [P(A∪B) ≤ P(A) + P(B)]

= ∑_{g ε-bad} ∏_{i=1..t} Prob{g(xi) = f(xi)}               [independent events]

≤ ∑_{g ε-bad} (1-ε)^t ≤ |Fn|·(1-ε)^t ≤ |Fn|·e^(-εt)        [for an ε-bad g, Prob{g(xi) = f(xi)} ≤ 1-ε]

Impose |Fn|·e^(-εt) ≤ δ and solve for t.

NOTE: Fn must be finite

A simple upper bound
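The bound above is easy to evaluate numerically; here is a small Python helper of mine (the 3^n below is just an example value for |Fn|, namely the standard count of monomials over n variables).

import math

def simple_sample_bound(card_Fn, eps, delta):
    """Sufficient sample size for a consistent learner over a finite class F_n:
    t >= (1/eps) * (ln|F_n| + ln(1/delta))."""
    return math.ceil((math.log(card_Fn) + math.log(1.0 / delta)) / eps)

# example: monomials over n = 20 variables (|F_n| = 3^20), accuracy 0.1, confidence 0.95
print(simple_sample_bound(3 ** 20, eps=0.1, delta=0.05))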

23

X: domain; F ⊆ 2^X: class of concepts; S = (x1, …, xt): a t-sample

f ≡S g iff f(xi) = g(xi) ∀xi ∈ S (f and g are indistinguishable by S)

πF(S) = |F / ≡S|: the index of F with respect to S (the number of equivalence classes)

Problem: uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

mF(t) = max{πF(S) : S is a t-sample}: the growth function

24

FACT:

mF(t) ≤ 2^t

mF(t) ≤ |F| (this condition gives immediately the simple upper bound)

if mF(t) = 2^t then ∀j < t, mF(j) = 2^j

THEOREM:

Prob{(x1, …, xt) : ∃g (g ε-bad and g(x1) = f(x1), …, g(xt) = f(xt))} ≤ 2·mF(2t)·e^(-εt/2)

A general upper bound

25

(Figure) mF(t) coincides with 2^t up to t = d, then bends away and is eventually bounded by mF(∞) ≤ |F|.

DEFINITION: d = VCdim(F) = max{t : mF(t) = 2^t}

FUNDAMENTAL PROPERTY: mF(t) = 2^t for t ≤ d, while for t > d

mF(t) ≤ ∑_{k=0..d} C(t, k)

which is BOUNDED BY A POLYNOMIAL IN t (of degree d)

Graph of the growth function
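A quick numerical check of the fundamental property (my own illustration, not from the slides): up to t = d the bound equals 2^t, beyond that it grows only polynomially in t.

from math import comb

def growth_bound(t, d):
    """Upper bound on the growth function m_F(t) when VCdim(F) = d:
    sum_{k=0..d} C(t, k), which equals 2^t for t <= d and is polynomial in t for t > d."""
    return sum(comb(t, k) for k in range(min(t, d) + 1))

for t in (2, 3, 5, 10, 20):
    print(t, growth_bound(t, d=3), 2 ** t)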

26

THEOREM: if dn = VCdim(Fn), then t(n, 1/ε, 1/δ) ≤ max{(4/ε)·log(2/δ), (8dn/ε)·log(13/ε)}

PROOF: impose 2·mFn(2t)·e^(-εt/2) ≤ δ

A lower bound on t(n, 1/ε, 1/δ): the number of examples which are necessary for arbitrary algorithms.

THEOREM: for 0 < ε ≤ 1 and δ ≤ 1/100,

t(n, 1/ε, 1/δ) ≥ max{((1-ε)/ε)·ln(1/δ), (dn - 1)/(32ε)}

Upper and lower bounds

27

πF(S) = |{f⁻¹(1) ∩ {x1, …, xt} : f ∈ F}|

i.e. the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If πF(S) = 2^|S|, we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S ⊆ X that is shattered by F

An equivalent definition of VCdim

28

Learn the family F of circles contained in the square.

VCdim(F) = 3; take ε = 0.01, δ = 0.001.

t(1/ε, 1/δ) ≤ max{400·log(2000), 2400·log(1300)} ≈ 24000: Sufficient

t(1/ε, 1/δ) ≥ 100·ln(1000) ≈ 690: Necessary (for any algorithm)

Example 1
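The two theorems of the previous slide can be evaluated directly. The sketch below is mine, with the formulas as reconstructed above (so the constants are only indicative); for the circle family with VCdim = 3, ε = 0.01 and δ = 0.001 it reproduces the order of magnitude of the figures quoted on this slide.

import math

def vc_upper_bound(d, eps, delta):
    """Sample size sufficient for consistent learners (as reconstructed from the upper-bound theorem):
    max( (4/eps)*log2(2/delta), (8d/eps)*log2(13/eps) )."""
    return max(4 / eps * math.log2(2 / delta), 8 * d / eps * math.log2(13 / eps))

def vc_lower_bound(d, eps, delta):
    """Sample size necessary for any algorithm (as reconstructed from the lower-bound theorem):
    max( ((1-eps)/eps)*ln(1/delta), (d-1)/(32*eps) )."""
    return max((1 - eps) / eps * math.log(1 / delta), (d - 1) / (32 * eps))

print(vc_upper_bound(3, 0.01, 0.001))   # about 2.5e4: the "~24000 sufficient" figure
print(vc_lower_bound(3, 0.01, 0.001))   # about 6.8e2: close to the "~690 necessary" figure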

29

f ∈ Ln ⇔ ∃ w1, …, wn, λ such that f(X1, …, Xn) = HS(∑k wk·Xk - λ), where HS(x) = 1 if x ≥ 0 and HS(x) = 0 otherwise

VCdim(Ln) = n + 1, and |Ln| ≤ 2^(n²)

SIMPLE UPPER BOUND: t(n, 1/ε, 1/δ) ≤ (1/ε)·(n² + ln(1/δ))

UPPER BOUND USING VCdim(Ln): t(n, 1/ε, 1/δ) ≤ max{(4/ε)·log(2/δ), (8(n+1)/ε)·log(13/ε)}, which GROWS LINEARLY WITH n

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

VCdim(Ln) = n + 1, so VCdim(L2) = 3: three points in general position can be shattered (VCdim(L2) ≥ 3), while VCdim(L2) < 4.

(Figure) With four points, some labelling cannot be realized: the green point cannot be separated from the other three; no straight line can separate the green from the red points.

Example 2

31

Classes of boolean formulas

Monomials: x1 ∧ x2 ∧ … ∧ xk (conjunctions of literals)

DNF: m1 ∨ m2 ∨ … ∨ mj (the mi are monomials)

Clauses: x1 ∨ x2 ∨ … ∨ xk (disjunctions of literals)

CNF: c1 ∧ c2 ∧ … ∧ cj (the ci are clauses)

k-DNF: at most k literals in each monomial

k-term-DNF: at most k monomials

k-CNF: at most k literals in each clause

k-clause-CNF: at most k clauses

Monotone formulas: contain no negated literals

m-formulas: each variable appears at most once

32

Th. (Valiant): Monomials are learnable from positive examples only, with about 2/ε·(n + log 1/ε) examples (ε = the tolerated error), by keeping the literal xi if xi = 1 in all the examples and the literal ¬xi if xi = 0 in all the examples; the hypothesis is the conjunction of the surviving literals.

The algorithm (cleaned-up reconstruction of the slide's pseudocode):

H := x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ … ∧ xn ∧ ¬xn
for i := 1 to B do
begin
  generate a positive example es
  for j := 1 to n do
    if es(j) = 1 then delete ¬xj from H
    else delete xj from H
end

N.B. Learnability is non-monotone: A ⊆ B with B learnable does not imply that A is learnable.

Th. Monomials are not learnable from negative examples.

The results
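A minimal Python sketch of the algorithm above (my reconstruction; the names are illustrative): start from the conjunction of all 2n literals and delete every literal contradicted by some positive example. This is essentially Find-S, described later in these slides, run on positive examples only.

def learn_monomial(positive_examples, n):
    """Keep only the literals consistent with every positive example.
    A literal is a pair (i, b): the variable x_i if b == 1, its negation if b == 0."""
    hypothesis = {(i, b) for i in range(n) for b in (0, 1)}
    for x in positive_examples:                 # each x is a tuple in {0,1}^n labeled 1
        hypothesis = {(i, b) for (i, b) in hypothesis if x[i] == b}
    return hypothesis                           # the conjunction of the surviving literals

def predict(hypothesis, x):
    """The monomial accepts x iff every surviving literal is satisfied."""
    return int(all(x[i] == b for (i, b) in hypothesis))

# e.g. target x1 AND NOT x3 over n = 4 variables (0-based literals (0,1) and (2,0))
pos = [(1, 0, 0, 1), (1, 1, 0, 0), (1, 0, 0, 0)]
h = learn_monomial(pos, n=4)
print(h, predict(h, (1, 1, 0, 1)), predict(h, (0, 1, 0, 1)))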

33

∀K:

1) K-CNF formulas are learnable from positive examples only

1b) K-DNF formulas are learnable from negative examples only

2) (K-DNF ∧ K-CNF) and (K-DNF ∨ K-CNF) are learnable from positive and negative examples

3) the class of K-decision lists is learnable

K-DL ≡ ((m1, b1), …, (mj, bj)) where the mi are monomials with |mi| ≤ k and bi ∈ {0,1}; for a boolean vector v, C(v) = bi with i = min{i : mi(v) = 1} (C(v) = 0 if no such i exists)

Th. Every K-DNF (or K-CNF) formula can be represented by a small K-DL

Positive results
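For concreteness, a tiny evaluator of mine for K-decision lists as defined above (the example list and the names are made up): scan the pairs (mi, bi) and output the bit of the first satisfied monomial, or the default bit if none fires.

def evaluate_decision_list(dl, x, default=0):
    """dl is a list of (monomial, bit) pairs; each monomial is a set of (index, required_value)
    literals, at most K of them. Return the bit of the first monomial satisfied by x."""
    for monomial, bit in dl:
        if all(x[i] == v for i, v in monomial):
            return bit
    return default

# the 2-decision list ((x1 AND x3, 1), (NOT x2, 0)) over x = (x1, x2, x3), 0-based indices
dl = [({(0, 1), (2, 1)}, 1), ({(1, 0)}, 0)]
print(evaluate_decision_list(dl, (1, 0, 1)),
      evaluate_decision_list(dl, (0, 0, 1)),
      evaluate_decision_list(dl, (0, 1, 1), default=1))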

34

If RP ≠ NP (in the distribution-free sense):

1) m-formulas are not learnable

2) Threshold boolean functions are not learnable

3) For K ≥ 2, K-term-DNF formulas are not learnable

Negative results

35

Mistake bound model

So far: how many examples are needed to learn? What about how many mistakes are made before convergence?

Let's consider a setting similar to PAC learning:

• Instances are drawn at random from X according to distribution D

• The learner must classify each instance before receiving the correct classification from the teacher

• Can we bound the number of mistakes the learner makes before converging?

36

Mistake bound model

The learner:

• receives a sequence of training examples x

• predicts the target value f(x)

• receives the correct target value from the trainer

• is evaluated by the total number of mistakes it makes before converging to the correct hypothesis

I.e. learning takes place during the use of the system, not off-line. Ex.: prediction of fraudulent use of credit cards.

37

Mistake bound for Find-S

Consider Find-S when H = conjunctions of boolean literals

FIND-S:

• Initialize h to the most specific hypothesis in H: x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ … ∧ xn ∧ ¬xn

• For each positive training instance x: remove from h any literal not satisfied by x

• Output h

38

Mistake bound for Find-S

If C ⊆ H and the training data are noise-free, Find-S converges to an exact hypothesis

How many errors to learn c ∈ H? (Only positive examples can be misclassified)

The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal ⇒ number of mistakes ≤ n + 1 (worst case: the "total" concept, ∀x c(x) = 1)

39

Mistake bound for Halving:

• A version space is maintained and refined (e.g. by Candidate-elimination)

• Prediction is based on a majority vote among the hypotheses in the current version space

• "Wrong" hypotheses are removed (even if x is correctly classified)

• How many errors to exactly learn c ∈ H (H finite)?

• A mistake is made when the majority of the hypotheses misclassifies x; these hypotheses are then removed

• For each mistake, the version space is at least halved

• At most log2(|H|) mistakes before exact learning (e.g. a single hypothesis remaining)

• Note: learning without mistakes is possible
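A minimal Python sketch of the Halving scheme just described (mine; the threshold hypothesis space and the data stream are invented for the demo). Each mistake at least halves the version space, so the mistake count stays within log2|H|.

def halving_predict(version_space, x):
    """Predict by (unweighted) majority vote of the hypotheses still in the version space."""
    votes = sum(h(x) for h in version_space)
    return int(2 * votes >= len(version_space))          # ties broken in favour of 1

def halving_update(version_space, x, label):
    """Once the teacher reveals the true label, drop every hypothesis that got x wrong."""
    return [h for h in version_space if h(x) == label]

# toy hypothesis space: h_k(x) = [x >= k] for k = 0..7; the target is the threshold k = 5
H = [(lambda k: (lambda x: int(x >= k)))(k) for k in range(8)]
target = lambda x: int(x >= 5)
mistakes = 0
for x in [3, 6, 5, 1, 7, 4]:
    if halving_predict(H, x) != target(x):
        mistakes += 1
    H = halving_update(H, x, target(x))
print("mistakes:", mistakes, "<= log2|H| = 3;", "remaining hypotheses:", len(H))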

40

Optimal mistake bound

Question: what is the optimal mistake bound (i.e. the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H = C?

Formally, for any learning algorithm A and any target concept c:

MA(c) = max number of mistakes made by A to exactly learn c, over all possible training sequences

MA(C) = max_{c∈C} MA(c)

Note: MFind-S(C) = n+1, MHalving(C) ≤ log2(|C|)

Opt(C) = min_A MA(C)

i.e. the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2(|C|)

There exist concept classes for which VC(C) = Opt(C) = MHalving(C) = log2(|C|), e.g. the power set 2^X of X, for which VC(2^X) = |X| = log2(|2^X|)

There exist concept classes for which VC(C) < Opt(C) < MHalving(C)

42

Weighted majority algorithm

Generalizes Halving: makes predictions by taking a weighted vote among a pool of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (i.e. algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent training data

43

Weighted majority algorithm

∀i: wi := 1

∀ training example (x, c(x)):

q0 := q1 := 0

∀ prediction algorithm ai:
  if ai(x) = 0 then q0 := q0 + wi
  if ai(x) = 1 then q1 := q1 + wi

if q1 > q0 then predict c(x) = 1
if q1 < q0 then predict c(x) = 0
if q1 = q0 then predict c(x) = 0 or 1 at random

∀ prediction algorithm ai: if ai(x) ≠ c(x) then wi := β·wi   (0 ≤ β < 1)
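The same pseudocode in runnable form, as a sketch of mine (the pool of predictors and the data stream are invented for the demo); β = 1/2 matches the theorem on the next slide, and β = 0 recovers Halving.

import random

def weighted_majority(predictors, stream, beta=0.5):
    """Weighted Majority: vote with one weight per predictor, then multiply the weight of
    every predictor that was wrong on (x, c(x)) by beta, 0 <= beta < 1."""
    w = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        q0 = sum(wi for wi, a in zip(w, predictors) if a(x) == 0)
        q1 = sum(wi for wi, a in zip(w, predictors) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0 if q1 < q0 else random.randint(0, 1)
        if prediction != label:
            mistakes += 1
        w = [wi * beta if a(x) != label else wi for wi, a in zip(w, predictors)]
    return mistakes

# toy pool: predictor k says 1 iff the instance is >= k; one of them coincides with the target
pool = [(lambda k: (lambda x: int(x >= k)))(k) for k in range(5)]
stream = [(x, int(x >= 3)) for x in [0, 4, 2, 3, 1, 4, 0, 3]]
print("WM mistakes:", weighted_majority(pool, stream))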

44

Weighted majority algorithm (WM)

Coincides with Halving for β = 0.

Theorem: let D be any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any aj ∈ A over D, and β = 1/2. Then WM makes at most

2.4·(k + log2 n)

mistakes over D.

45

Weighted majority algorithm (WM)

Proof: since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k. The sum W of the weights associated with all n algorithms in A is initially n and, for each mistake made by WM, it is reduced to at most (3/4)·W, because the "wrong" algorithms hold at least 1/2 of the total weight, and that half is reduced by a factor of 1/2.

The final total weight W is therefore at most n·(3/4)^M, where M is the total number of mistakes made by WM over D.

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the final total weight W, hence

(1/2)^k ≤ n·(3/4)^M

from which

M ≤ (k + log2 n) / (-log2(3/4)) ≤ 2.4·(k + log2 n)

I.e. the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the size of the pool.
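A one-line check (mine) of where the 2.4 factor comes from:

import math
# 1 / (-log2(3/4)) = 1 / log2(4/3), about 2.41, hence the constant 2.4 in the bound
print(1.0 / -math.log2(3.0 / 4.0))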

  • Università di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant '84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • What's new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classes of boolean formulas
  • The results
  • Positive results
  • Negative results
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 5: Università di Milano-Bicocca Laurea Magistrale in Informatica

5

Probably approximately correct learning

formal computational model which want shed

light on the limits of what can be

learned by a machine analysing the

computational cost of learning algorithms

6

What we want to learn

That is

to determine uniformly good approximations of an unknown function from its value in some sample points

interpolation pattern matching concept learning

CONCEPT = recognizing algorithm

LEARNING = computational description of recognizing algorithms starting from - examples - incomplete specifications

7

Whatrsquos new in pac learning

Accuracy of results and running time for learning

algorithms

are explicitly quantified and related

A general problem

use of resources (time spacehellip) by computations COMPLEXITY THEORY

Example

Sorting nlogn time (polynomial feasible)

Bool satisfiability 2ⁿ time (exponential intractable)

8

Learning from examples

DOMAIN

ConceptLEARNER

EXAMPLES

A REPRESENTATION OF A CONCEPTCONCEPT subset of domain

EXAMPLES elements of concept (positive)

REPRESENTATION domainrarrexpressions GOOD LEARNER

EFFICIENT LEARNER

9

The PAC model

A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X

A probability distribution P on X

Example 1

X equiv a square F equiv triangles in the square

10

The PAC model

Example 2

Xequiv01ⁿ F equiv family of boolean functions

1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =

0 otherwise

P a probability distribution on X

Uniform Non uniform

11

The PAC model

The learning process

Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))

Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)

Error probability Perr(h(x)nef(x) xX)

12

LEARNERExamples generatorwith probabilitydistribution p

Inference procedure A

t examples

Hypothesis h (implicit representation of a concept)

The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c

TEACHER

The PAC model

X fF X F

(x1f(x1)) hellip (xtf(xt)))

13

f h

x random choice

Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le

ldquoALMOST ALWAYSrdquo

Confidence parameter

(0 lt le 1)

The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-

ldquoCLOSE TOrdquo

METRIC given P

dp(fh) = Perr = Px f(x)neh(x)

The PAC model

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 6: Università di Milano-Bicocca Laurea Magistrale in Informatica

6

What we want to learn

That is

to determine uniformly good approximations of an unknown function from its value in some sample points

interpolation pattern matching concept learning

CONCEPT = recognizing algorithm

LEARNING = computational description of recognizing algorithms starting from - examples - incomplete specifications

7

Whatrsquos new in pac learning

Accuracy of results and running time for learning

algorithms

are explicitly quantified and related

A general problem

use of resources (time spacehellip) by computations COMPLEXITY THEORY

Example

Sorting nlogn time (polynomial feasible)

Bool satisfiability 2ⁿ time (exponential intractable)

8

Learning from examples

DOMAIN

ConceptLEARNER

EXAMPLES

A REPRESENTATION OF A CONCEPTCONCEPT subset of domain

EXAMPLES elements of concept (positive)

REPRESENTATION domainrarrexpressions GOOD LEARNER

EFFICIENT LEARNER

9

The PAC model

A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X

A probability distribution P on X

Example 1

X equiv a square F equiv triangles in the square

10

The PAC model

Example 2

Xequiv01ⁿ F equiv family of boolean functions

1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =

0 otherwise

P a probability distribution on X

Uniform Non uniform

11

The PAC model

The learning process

Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))

Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)

Error probability Perr(h(x)nef(x) xX)

12

LEARNERExamples generatorwith probabilitydistribution p

Inference procedure A

t examples

Hypothesis h (implicit representation of a concept)

The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c

TEACHER

The PAC model

X fF X F

(x1f(x1)) hellip (xtf(xt)))

13

f h

x random choice

Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le

ldquoALMOST ALWAYSrdquo

Confidence parameter

(0 lt le 1)

The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-

ldquoCLOSE TOrdquo

METRIC given P

dp(fh) = Perr = Px f(x)neh(x)

The PAC model

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 7: Università di Milano-Bicocca Laurea Magistrale in Informatica

7

Whatrsquos new in pac learning

Accuracy of results and running time for learning

algorithms

are explicitly quantified and related

A general problem

use of resources (time spacehellip) by computations COMPLEXITY THEORY

Example

Sorting nlogn time (polynomial feasible)

Bool satisfiability 2ⁿ time (exponential intractable)

8

Learning from examples

DOMAIN

ConceptLEARNER

EXAMPLES

A REPRESENTATION OF A CONCEPTCONCEPT subset of domain

EXAMPLES elements of concept (positive)

REPRESENTATION domainrarrexpressions GOOD LEARNER

EFFICIENT LEARNER

9

The PAC model

A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X

A probability distribution P on X

Example 1

X equiv a square F equiv triangles in the square

10

The PAC model

Example 2

Xequiv01ⁿ F equiv family of boolean functions

1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =

0 otherwise

P a probability distribution on X

Uniform Non uniform

11

The PAC model

The learning process

Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))

Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)

Error probability Perr(h(x)nef(x) xX)

12

LEARNERExamples generatorwith probabilitydistribution p

Inference procedure A

t examples

Hypothesis h (implicit representation of a concept)

The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c

TEACHER

The PAC model

X fF X F

(x1f(x1)) hellip (xtf(xt)))

13

f h

x random choice

Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le

ldquoALMOST ALWAYSrdquo

Confidence parameter

(0 lt le 1)

The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-

ldquoCLOSE TOrdquo

METRIC given P

dp(fh) = Perr = Px f(x)neh(x)

The PAC model

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for β=0.

Theorem - Let D be any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any aj ∈ A on D, and β = 1/2. Then WM makes at most

2.4 (k + log2 n)

mistakes over D.

45

Weighted majority algorithm (WM)

Proof. Since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k.

The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM it is reduced to at most (3/4)W, because the "wrong" algorithms hold at least 1/2 of the total weight, and that half is reduced by a factor of 1/2.

The final total weight W is therefore at most n(3/4)^M, where M is the total number of mistakes made by WM over D.

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the final total weight W, hence

(1/2)^k ≤ n (3/4)^M

from which

M ≤ (k + log2 n) / (-log2(3/4)) ≤ 2.4 (k + log2 n)

I.e., the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the size of the pool.
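For completeness, the step from the inequality on the weights to the mistake bound is just a logarithm; a short LaTeX rendering of the algebra (added here, it is not spelled out on the original slides) is:

```latex
% Take \log_2 on both sides of (1/2)^k \le n\,(3/4)^M:
%   -k \;\le\; \log_2 n + M \log_2(3/4)
% Since \log_2(3/4) < 0, solving for M gives
\[
  M \;\le\; \frac{k + \log_2 n}{-\log_2(3/4)}
    \;=\; \frac{k + \log_2 n}{\log_2(4/3)}
    \;\approx\; 2.4\,\bigl(k + \log_2 n\bigr),
\]
% using 1/\log_2(4/3) \approx 2.41.
```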

Page 8: Università di Milano-Bicocca Laurea Magistrale in Informatica

8

Learning from examples

DOMAIN

ConceptLEARNER

EXAMPLES

A REPRESENTATION OF A CONCEPTCONCEPT subset of domain

EXAMPLES elements of concept (positive)

REPRESENTATION domainrarrexpressions GOOD LEARNER

EFFICIENT LEARNER

9

The PAC model

A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X

A probability distribution P on X

Example 1

X equiv a square F equiv triangles in the square

10

The PAC model

Example 2

Xequiv01ⁿ F equiv family of boolean functions

1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =

0 otherwise

P a probability distribution on X

Uniform Non uniform

11

The PAC model

The learning process

Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))

Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)

Error probability Perr(h(x)nef(x) xX)

12

LEARNERExamples generatorwith probabilitydistribution p

Inference procedure A

t examples

Hypothesis h (implicit representation of a concept)

The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c

TEACHER

The PAC model

X fF X F

(x1f(x1)) hellip (xtf(xt)))

13

f h

x random choice

Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le

ldquoALMOST ALWAYSrdquo

Confidence parameter

(0 lt le 1)

The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-

ldquoCLOSE TOrdquo

METRIC given P

dp(fh) = Perr = Px f(x)neh(x)

The PAC model

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 9: Università di Milano-Bicocca Laurea Magistrale in Informatica

9

The PAC model

A domain X (eg 01ⁿ Rⁿ) A concept subset of X f sube X or f Xrarr01 A class of concepts F sube 2X

A probability distribution P on X

Example 1

X equiv a square F equiv triangles in the square

10

The PAC model

Example 2

Xequiv01ⁿ F equiv family of boolean functions

1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =

0 otherwise

P a probability distribution on X

Uniform Non uniform

11

The PAC model

The learning process

Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))

Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)

Error probability Perr(h(x)nef(x) xX)

12

LEARNERExamples generatorwith probabilitydistribution p

Inference procedure A

t examples

Hypothesis h (implicit representation of a concept)

The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c

TEACHER

The PAC model

X fF X F

(x1f(x1)) hellip (xtf(xt)))

13

f h

x random choice

Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le

ldquoALMOST ALWAYSrdquo

Confidence parameter

(0 lt le 1)

The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-

ldquoCLOSE TOrdquo

METRIC given P

dp(fh) = Perr = Px f(x)neh(x)

The PAC model

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 10: Università di Milano-Bicocca Laurea Magistrale in Informatica

10

The PAC model

Example 2

Xequiv01ⁿ F equiv family of boolean functions

1 if there are at least r ones in (x1hellipxn)fr(x1hellipxn) =

0 otherwise

P a probability distribution on X

Uniform Non uniform

11

The PAC model

The learning process

Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))

Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)

Error probability Perr(h(x)nef(x) xX)

12

LEARNERExamples generatorwith probabilitydistribution p

Inference procedure A

t examples

Hypothesis h (implicit representation of a concept)

The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c

TEACHER

The PAC model

X fF X F

(x1f(x1)) hellip (xtf(xt)))

13

f h

x random choice

Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le

ldquoALMOST ALWAYSrdquo

Confidence parameter

(0 lt le 1)

The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-

ldquoCLOSE TOrdquo

METRIC given P

dp(fh) = Perr = Px f(x)neh(x)

The PAC model

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 11: Università di Milano-Bicocca Laurea Magistrale in Informatica

11

The PAC model

The learning process

Labeled sample ((x0 f(x0)) (x1 f(x1)) hellip (xn f(xn))

Hypothesis a function h consistent with the sample (ie h(xi) = f(xi) i)

Error probability Perr(h(x)nef(x) xX)

12

LEARNERExamples generatorwith probabilitydistribution p

Inference procedure A

t examples

Hypothesis h (implicit representation of a concept)

The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c

TEACHER

The PAC model

X fF X F

(x1f(x1)) hellip (xtf(xt)))

13

f h

x random choice

Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le

ldquoALMOST ALWAYSrdquo

Confidence parameter

(0 lt le 1)

The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-

ldquoCLOSE TOrdquo

METRIC given P

dp(fh) = Perr = Px f(x)neh(x)

The PAC model

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunctions of boolean literals.

FIND-S:
- Initialize h to the most specific hypothesis in H:  x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ … ∧ xn ∧ ¬xn
- For each positive training instance x: remove from h any literal not satisfied by x
- Output h
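
A possible Python rendering of FIND-S in the on-line (mistake-bound) setting, with a mistake counter added; this is a sketch of mine, not reference code from the course.

```python
def find_s_online(stream, n):
    """FIND-S run on-line, counting mistakes. h starts as x1·¬x1·...·xn·¬xn (encoded as
    (index, value) pairs) and is generalized only on positive examples; a mistake can
    only occur on a positive example, and each one removes at least one literal,
    so the number of mistakes is at most n + 1."""
    h = {(i, 0) for i in range(n)} | {(i, 1) for i in range(n)}
    mistakes = 0
    for x, label in stream:                      # x: tuple of n bits, label: c(x)
        prediction = int(all(x[i] == v for (i, v) in h))
        if prediction != label:
            mistakes += 1
        if label == 1:
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h, mistakes

stream = [((1, 0, 1), 1), ((1, 0, 0), 1), ((0, 0, 0), 0)]
print(find_s_online(stream, 3))                  # hypothesis {(0, 1), (1, 0)}, 2 mistakes
```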

38

Mistake bound for Find-S

If C ⊆ H and the training data are noise-free, Find-S converges to an exact hypothesis.

How many errors are made before learning c ∈ H? (Only positive examples can be misclassified.)

The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated.
Each subsequent error eliminates at least one literal, so #mistakes ≤ n + 1 (the worst case is the "total" concept: ∀x, c(x) = 1).

39

Mistake bound for Halving

- A version space is maintained and refined (e.g., by Candidate-elimination)
- Prediction is based on a majority vote among the hypotheses in the current version space
- "Wrong" hypotheses are removed (even when x is classified correctly)
- How many errors before exactly learning c ∈ H (H finite)?
- A mistake occurs when the majority of the hypotheses misclassifies x; these hypotheses are then removed
- For each mistake, the version space is at least halved
- At most log2(|H|) mistakes before exact learning (e.g., a single hypothesis remaining)
- Note: learning without any mistake is possible
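
A compact sketch of the Halving algorithm as described above; hypotheses are modelled as plain Python predicates, and the toy threshold class in the usage line is my own illustration.

```python
def halving(version_space, stream):
    """Predict by majority vote over the current version space, then drop every
    hypothesis that disagrees with the revealed label. If the target is in the initial
    version space and labels are noise-free, mistakes <= log2(|H|)."""
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in version_space)
        prediction = int(2 * votes > len(version_space))   # majority vote
        if prediction != label:
            mistakes += 1
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes

H = [lambda x, a=a: int(x >= a) for a in range(4)]          # toy threshold hypotheses
print(halving(H, [(2, 1), (0, 0), (1, 1)])[1])              # number of mistakes made
```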

40

Optimal mistake bound

Question: what is the optimal mistake bound (i.e., the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H = C?

Formally, for any learning algorithm A and any target concept c:

M_A(c) = max number of mistakes made by A to exactly learn c, over all possible training sequences
M_A(C) = max_{c ∈ C} M_A(c)

Note: M_Find-S(C) = n + 1,   M_Halving(C) ≤ log2(|C|)

Opt(C) = min_A M_A(C),
i.e., the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm.

41

Optimal mistake bound

Theorem (Littlestone, 1987):

VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log2(|C|)

There exist concept classes for which VC(C) = Opt(C) = M_Halving(C) = log2(|C|), e.g. the power set 2^X of X, for which VC(2^X) = |X| = log2(|2^X|).

There exist concept classes for which VC(C) < Opt(C) < M_Halving(C).

42

Weighted majority algorithm

- Generalizes Halving
- Makes predictions by taking a weighted vote among a pool of prediction algorithms
- Learns by altering the weight associated with each prediction algorithm
- It does not eliminate hypotheses (i.e., algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent (noisy) training data

43

Weighted majority algorithm

∀i: wi := 1
∀ training example (x, c(x)):
    q0 := q1 := 0
    ∀ prediction algorithm ai:
        if ai(x) = 0 then q0 := q0 + wi
        if ai(x) = 1 then q1 := q1 + wi
    if q1 > q0 then predict c(x) = 1
    if q1 < q0 then predict c(x) = 0
    if q1 = q0 then predict c(x) = 0 or 1 at random
    ∀ prediction algorithm ai: if ai(x) ≠ c(x) then wi := β·wi   (0 ≤ β < 1)
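
The same procedure rendered in Python as an illustrative sketch; ties are broken deterministically here instead of at random, and β is kept as a parameter.

```python
def weighted_majority(predictors, stream, beta=0.5):
    """Weighted vote over a pool of 0/1 predictors; every predictor that errs has its
    weight multiplied by beta (0 <= beta < 1). beta = 0 behaves like Halving; ties are
    broken deterministically in favour of 0 here, instead of at random."""
    weights = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        q = [0.0, 0.0]
        for w, a in zip(weights, predictors):
            q[a(x)] += w                          # a(x) must be 0 or 1
        prediction = 1 if q[1] > q[0] else 0
        if prediction != label:
            mistakes += 1
        weights = [w * beta if a(x) != label else w
                   for w, a in zip(weights, predictors)]
    return weights, mistakes
```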

44

Weighted majority algorithm (WM)

Coincides with Halving for β = 0.

Theorem: let D be any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any aj ∈ A over D, and β = 1/2. Then WM makes at most

2.4·(k + log2 n)

mistakes over D.

45

Weighted majority algorithm (WM)

Proof: since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k. The sum W of the weights associated with all n algorithms in A is initially n and, for each mistake made by WM, it is reduced to at most (3/4)·W, because the "wrong" algorithms hold at least 1/2 of the total weight, and that part is reduced by a factor of 1/2.

The final total weight W is therefore at most n·(3/4)^M, where M is the total number of mistakes made by WM over D.

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the final total weight W, hence

(1/2)^k ≤ n·(3/4)^M

from which

M ≤ (k + log2 n) / (−log2(3/4)) ≈ 2.4·(k + log2 n)

I.e., the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the size of the pool.
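
For completeness, the algebraic step between the two displayed inequalities can be spelled out as follows; only the inequality (1/2)^k ≤ n(3/4)^M comes from the slides, the rearrangement is mine.

```latex
\[
\Bigl(\tfrac{1}{2}\Bigr)^{k} \le n\Bigl(\tfrac{3}{4}\Bigr)^{M}
\;\Longrightarrow\;
-k \le \log_2 n + M \log_2 \tfrac{3}{4}
\;\Longrightarrow\;
M \le \frac{k + \log_2 n}{-\log_2 \tfrac{3}{4}},
\qquad -\log_2 \tfrac{3}{4} \approx 0.415,
\]
```

so the factor multiplying (k + log2 n) is about 2.4, the constant quoted in the theorem.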

Page 12: Università di Milano-Bicocca Laurea Magistrale in Informatica

12

LEARNERExamples generatorwith probabilitydistribution p

Inference procedure A

t examples

Hypothesis h (implicit representation of a concept)

The learning algorithm A is good if the hypothesis h is ldquoALMOST ALWAYSrdquoldquoCLOSE TOrdquo the target concept c

TEACHER

The PAC model

X fF X F

(x1f(x1)) hellip (xtf(xt)))

13

f h

x random choice

Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le

ldquoALMOST ALWAYSrdquo

Confidence parameter

(0 lt le 1)

The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-

ldquoCLOSE TOrdquo

METRIC given P

dp(fh) = Perr = Px f(x)neh(x)

The PAC model

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 13: Università di Milano-Bicocca Laurea Magistrale in Informatica

13

f h

x random choice

Given an approximation parameter (0ltle1) h is an ε-approximationof f if dp(fh)le

ldquoALMOST ALWAYSrdquo

Confidence parameter

(0 lt le 1)

The ldquomeasurerdquo of sequences of examples randomly choosen according to P such that h is an ε-approximation of f is at least 1-

ldquoCLOSE TOrdquo

METRIC given P

dp(fh) = Perr = Px f(x)neh(x)

The PAC model

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 14: Università di Milano-Bicocca Laurea Magistrale in Informatica

14

Generator ofexamples

Learner h

F concept classS set of labeled samples from a concept in F A S F such that

I) A(S) consistent with S

II) P(Perrlt ) gt 1-

0ltlt1 fF mN S st |S|gem

Learning algorithm

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 15: Università di Milano-Bicocca Laurea Magistrale in Informatica

15

COMPUTATIONAL RESOURCES

SAMPLE SIZE (Statistical PAC learning) COMPUTATION TIME (Polynomial PAC learning)

DEF 1 a concept class F = n=1F n is statistically PAC learnable if there

is a learning algorithm with sample size t = t(n 1 1) bounded by some polynomial function in n 1 1

Look for algorithms which use ldquoreasonablerdquo amount of computational resources

The efficiency issue

16

The efficiency issue

POLYNOMIAL PAC STATISTICAL PAC

DEF 2 a concept class F = n=1F n is polynomially PAC learnable

if there is a learning algorithm with running time bounded by some polynomial function in n 1 1

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question: what is the optimal mistake bound (i.e., the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H = C?

Formally, for any learning algorithm A and any target concept c:

M_A(c) = max # of mistakes made by A to exactly learn c, over all possible training sequences

M_A(C) = max_{c ∈ C} M_A(c)

Note: M_{Find-S}(C) = n+1, M_{Halving}(C) ≤ log_2(|C|)

Opt(C) = min_A M_A(C)

i.e., the # of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) ≤ Opt(C) ≤ M_{Halving}(C) ≤ log_2(|C|)

There exist concept classes for which VC(C) = Opt(C) = M_{Halving}(C) = log_2(|C|),
e.g., the power set 2^X of X, for which it holds that VC(2^X) = |X| = log_2(|2^X|)

There exist concept classes for which VC(C) < Opt(C) < M_{Halving}(C)

42

Weighted majority algorithm

Generalizes Halving

Makes predictions by taking a weighted vote among a pool of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (i.e., algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent training data

43

Weighted majority algorithm

∀i: w_i := 1
For each training example (x, c(x)):
  q_0 := q_1 := 0
  For each prediction algorithm a_i:
    if a_i(x) = 0 then q_0 := q_0 + w_i
    if a_i(x) = 1 then q_1 := q_1 + w_i
  if q_1 > q_0 then predict c(x) = 1
  if q_1 < q_0 then predict c(x) = 0
  if q_1 = q_0 then predict c(x) = 0 or 1 at random
  For each prediction algorithm a_i do: if a_i(x) ≠ c(x) then w_i := β·w_i   (0 ≤ β < 1)
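A minimal Python sketch of the procedure above (the parameter name beta and the callable representation of the prediction algorithms are assumptions):

import random

def weighted_majority(predictors, labelled_stream, beta=0.5):
    weights = [1.0] * len(predictors)
    mistakes = 0
    for x, c_x in labelled_stream:
        guesses = [a(x) for a in predictors]
        q = [0.0, 0.0]
        for g, w in zip(guesses, weights):
            q[g] += w                         # weighted votes for 0 and for 1
        if q[1] > q[0]:
            prediction = 1
        elif q[1] < q[0]:
            prediction = 0
        else:
            prediction = random.randint(0, 1) # tie: predict 0 or 1 at random
        if prediction != c_x:
            mistakes += 1
        # penalize every predictor that was wrong on this example
        weights = [w * beta if g != c_x else w for g, w in zip(guesses, weights)]
    return weights, mistakes

With beta = 0 this behaves like Halving (wrong predictors get weight 0); with beta = 1/2 it matches the bound discussed on the next slides.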

44

Weighted majority algorithm (WM)

Coincides with Halving for β = 0

Theorem - Let D be any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any a_j ∈ A over D, and β = 1/2. Then WM makes at most

2.4 (k + log_2 n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof: Since a_j makes k mistakes (the best in A), its final weight w_j will be (1/2)^k

The sum W of the weights associated with all n algorithms in A is initially n and, for each mistake made by WM, is reduced to at most (3/4)W, because the "wrong" algorithms hold at least 1/2 of the total weight, which is reduced by a factor of 1/2

The final total weight W is at most n(3/4)^M, where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

Page 17: Università di Milano-Bicocca Laurea Magistrale in Informatica

17

n = f 0 1n 0 1 The set of boolean functions in n

variables

Fn n A class of conceptsExample 1Fn = clauses with literals in

Example 2Fn = linearly separable functions in n variables

nn xxxx 11

nk xxxxxx ororororor 2123

( ) sum minus λkkXWHS

REPRESENTATION

- TRUTH TABLE (EXPLICIT)- BOOLEAN CIRCUITS (IMPLICIT)

BOOLEANCIRCUITS

BOOLEANFUNCTIONS

Learning boolean functions

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 18: Università di Milano-Bicocca Laurea Magistrale in Informatica

18

bull BASIC OPERATIONSbull COMPOSITION

( )minusorand

in m variables in n variables

CIRCUIT Finite acyclic directed graph

or

Output node

Basic operations

Input nodes

Given an assignment x1 hellip xn 0 1 to the input variables the output node computes the corresponding value

oror

orand

1X 2X 3X

or

[f(g1 hellip gm)](x) = f(g1(x) hellip gm(x))

Boolean functions and circuits

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 19: Università di Milano-Bicocca Laurea Magistrale in Informatica

19

Fn n

Cn class of circuits which compute all and only the functions in Fn

Uinfin

=

=1n

nFF Uinfin

=

=1n

nCC

Algorithm A to learn F by C bull INPUT (nεδ)

bullThe learner computes t = t(n 1 1) (t=number of examples sufficient to learn with accuracy ε and confidence δ)bull The learner asks the teacher for a labelled t-sample

bull The learner receives the t-sample S and computes C = An(S)

bull Output C (C= representation of the hypothesis)

Note that the inference procedure A receives as input the integer n and a t-sample on 01n and outputs An(S) = A(n S)

Boolean functions and circuits

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 20: Università di Milano-Bicocca Laurea Magistrale in Informatica

20

An algorithm A is a learning algorithm with sample size t(n 1 1) for a concept class using the class of representations If for all nge1 for all fFn for all 0lt lt1 and for every probability distribution p over 01n the following holds

nn FUF 1=infin= mm CUC 1=

infin=

If the inference procedure An receives as input a t-sample it outputs a representation cCn of a function g that is probably approximately correct that is with probability at least 1- a t-sample is chosen such that the function g inferred satisfies

Px f(x)neg(x) le

g is ndashgood g is an ndashapproximation of fg is ndashbad g is not an ndashapproximation of f

NOTE distribution free

Boolean functions and circuits

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

An equivalent definition of VCdim

ΠF(S) = |{ f⁻¹(1) ∩ {x1, …, xt} : f ∈ F }|

i.e. the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If ΠF(S) = 2^|S| we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S ⊆ X that is shattered by F
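
For a finite class on a tiny domain, shattering and the VC dimension can be checked exhaustively; the sketch reuses the assumed interval class from the growth-function example.

from itertools import combinations

def is_shattered(S, F):
    # S is shattered by F iff all 2^|S| labellings of S are realised by concepts in F
    return len({tuple(x in f for x in S) for f in F}) == 2 ** len(S)

def vc_dimension(F, X):
    # cardinality of the largest subset of X shattered by F (brute force)
    return max((k for k in range(1, len(X) + 1)
                if any(is_shattered(S, F) for S in combinations(X, k))), default=0)

X = list(range(6))
F = [set(range(a, b + 1)) for a in X for b in X if a <= b]   # discrete intervals again
print(vc_dimension(F, X))   # 2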

28

VCdim(F) = 3,  ε = δ = 1/100. Plugging into the two theorems above:

Sufficient: about 24000 examples

Necessary: about 690 examples

Learn the family f of circles contained in the square

Example 1

29

Learn the family Ln of linearly separable boolean functions in n variables:

Ln = { f : {0,1}^n → {0,1} | ∃ w1, …, wn, λ such that f(x1, …, xn) = HS(Σ (k = 1, …, n) wk·xk − λ) }

where HS(x) = 1 if x ≥ 0, 0 otherwise

|Ln| ≤ 2^(n²),  VCdim(Ln) = n + 1

SIMPLE UPPER BOUND: t(n, 1/ε, 1/δ) ≤ (1/ε)(n² + ln(1/δ))

UPPER BOUND USING VCdim(Ln): t(n, 1/ε, 1/δ) ≤ max{4/ε · log(2/δ), 8(n+1)/ε · log(13/ε)} — GROWS LINEARLY WITH n

Example 2
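
To see the linear-versus-quadratic contrast concretely, the sketch below plugs |Ln| ≤ 2^(n²) and VCdim(Ln) = n+1 into the two upper bounds; the ε, δ values and the base-2 logs are assumptions of the demo.

import math

def simple_bound_Ln(n, eps, delta):
    # simple bound via ln|Ln| <= n^2 * ln 2: grows quadratically with n
    return round((n * n * math.log(2) + math.log(1 / delta)) / eps)

def vc_bound_Ln(n, eps, delta):
    # VC-based bound with d = n + 1: grows linearly with n
    return round(max(4 / eps * math.log2(2 / delta),
                     8 * (n + 1) / eps * math.log2(13 / eps)))

for n in (10, 50, 100):
    print(n, simple_bound_Ln(n, 0.1, 0.1), vc_bound_Ln(n, 0.1, 0.1))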

30

Consider the class L2 of linearly separable functions in two variables

VCdim(Ln) = n + 1, hence VCdim(L2) = 3:

VCdim(L2) ≥ 3: some set of three points is shattered

VCdim(L2) < 4: no set of four points is shattered — (figure) the green point cannot be separated from the other three; no straight line can separate the green from the red points

Example 2

31

Classes of Boolean formulas

Monomials: x1 ∧ x2 ∧ … ∧ xk

DNF: m1 ∨ m2 ∨ … ∨ mj (mj monomials)

Clauses: x1 ∨ x2 ∨ … ∨ xk

CNF: c1 ∧ c2 ∧ … ∧ cj (cj clauses)

k-DNF: at most k literals in each monomial

k-term-DNF: at most k monomials

k-CNF: at most k literals in each clause

k-clause-CNF: at most k clauses

Monotone formulas: contain no negated literals

m-formulas: each variable appears at most once

32

Theorem (Valiant): Monomials are learnable from positive examples only, with 2(n + log(1/δ))/ε examples (ε = tolerated error), by setting

g ≡ Π xi (over the i such that xi = 1 in all positive examples) · Π x̄i (over the i such that xi = 0 in all positive examples)

N.B. Learnability is non-monotone: A ⊆ B with B learnable does not imply that A is learnable

The algorithm:

begin
  H := x1·x̄1·x2·x̄2 ··· xn·x̄n
  for i := 1 to B do
  begin
    generate a positive example es(·)
    for j := 1 to n do
      if es(j) = 0 then delete xj from H
      else delete x̄j from H
  end
end

Theorem: monomials are not learnable from negative examples only

The results
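
A runnable rendering of the deletion algorithm above (a sketch: the literal encoding and the example target are assumptions made for the demo).

def learn_monomial(positive_examples, n):
    # start from the monomial with every literal and delete those contradicted
    # by a positive example; a literal is (index, value): xi if value=1, negated xi if value=0
    H = {(j, 0) for j in range(n)} | {(j, 1) for j in range(n)}
    for ex in positive_examples:          # ex is a tuple of n bits satisfying the target
        for j in range(n):
            H.discard((j, 1 - ex[j]))     # the opposite literal cannot be in the target
    return H

# assumed target: x1 and not-x3 over n = 4 variables (0-based indices 0 and 2)
examples = [(1, 0, 0, 1), (1, 1, 0, 0), (1, 1, 0, 1)]
print(sorted(learn_monomial(examples, 4)))   # [(0, 1), (2, 0)]: variable 0 positive, variable 2 negated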

33

1) K-CNF formulas are learnable from positive examples only
1b) K-DNF formulas are learnable from negative examples only
2) (K-DNF ∧ K-CNF) and (K-DNF ∨ K-CNF) are learnable from positive and negative examples
3) for every K, the class of K-decision lists is learnable

K-DL ≡ ((m1, b1), …, (mj, bj)) with mi monomials of at most K literals, bi ∈ {0,1};
for a boolean vector v, C(v) = bi where i = min{i : mi(v) = 1} (C(v) = 0 if no such i exists)

Theorem: every K-DNF (or K-CNF) formula can be represented by a small K-DL

Positive results
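
A minimal sketch of how a K-decision list as defined above is evaluated; the literal encoding and the example list are assumptions made for the demo.

def eval_k_dl(decision_list, v, default=0):
    # each item is (monomial, b); a monomial is a set of (index, value) literals;
    # return the b of the first monomial satisfied by v (default if none is)
    for monomial, b in decision_list:
        if all(v[j] == val for j, val in monomial):
            return b
    return default

# illustrative 2-DL: if x0 and not-x2 -> 1, else if x1 -> 0, else 1
dl = [({(0, 1), (2, 0)}, 1), ({(1, 1)}, 0), (set(), 1)]
print(eval_k_dl(dl, (1, 1, 0)), eval_k_dl(dl, (0, 1, 1)), eval_k_dl(dl, (0, 0, 0)))  # 1 0 1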

34

If RP ≠ NP (in the distribution-free sense):

1) m-formulas are not learnable

2) Boolean threshold functions are not learnable

3) For K ≥ 2, K-term-DNF formulas are not learnable

Negative results

35

Mistake bound model

So far: how many examples are needed to learn? What about how many mistakes before convergence?

Let's consider a setting similar to PAC learning:

Instances drawn at random from X according to distribution D

Learner must classify each instance before receiving the correct classification from the teacher

Can we bound the number of mistakes the learner makes before converging?

36

Mistake bound model

Learner:
Receives a sequence of training examples x
Predicts the target value f(x)
Receives the correct target value from the trainer
Is evaluated by the total number of mistakes it makes before converging to the correct hypothesis

I.e. learning takes place during the use of the system, not off-line

Ex.: prediction of fraudulent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunctions of boolean literals

FIND-S:

Initialize h to the most specific hypothesis in H: x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ … ∧ xn ∧ ¬xn

For each positive training instance x: remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C ⊆ H and the training data are noise-free, Find-S converges to an exact hypothesis

How many errors to learn c ∈ H? (only positive examples can be misclassified)

The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal ⇒ number of mistakes ≤ n+1 (worst case: the "total" concept ∀x c(x) = 1)
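
A sketch of Find-S run in the mistake-bound setting (predict first, then update); the literal encoding and the short example stream are assumptions made for the demo.

def find_s_online(stream, n):
    # h is a set of literals (index, value); it predicts 1 iff every literal holds on x
    h = {(j, v) for j in range(n) for v in (0, 1)}   # most specific hypothesis
    mistakes = 0
    for x, label in stream:
        prediction = 1 if all(x[j] == v for (j, v) in h) else 0
        if prediction != label:
            mistakes += 1
        if label == 1:                               # keep only literals satisfied by x
            h = {(j, v) for (j, v) in h if x[j] == v}
    return h, mistakes

# assumed target concept: x1 and x3 (0-based indices 0 and 2), n = 3 variables
stream = [((1, 0, 1), 1), ((0, 1, 1), 0), ((1, 1, 1), 1), ((1, 0, 0), 0)]
print(find_s_online(stream, 3))   # at most n + 1 = 4 mistakes; this run makes 2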

39

Mistake bound for Halving

A version space is maintained and refined (e.g. Candidate-elimination)

Prediction is based on majority vote among the hypotheses in the current version space

"Wrong" hypotheses are removed (even if x is correctly classified)

How many errors to exactly learn c ∈ H (H finite)?

Mistake: the majority of the hypotheses misclassifies x; these hypotheses are removed

For each mistake, the version space is at least halved

At most log2(|H|) mistakes before exact learning (e.g. a single hypothesis remaining)

Note: learning without mistakes is possible
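
A runnable sketch of the Halving algorithm just described; the toy hypothesis class (threshold concepts over the integers) and the example stream are assumptions for the demo.

def halving(hypotheses, stream):
    # predict by majority vote over the version space, then drop hypotheses that
    # disagree with the revealed label; returns surviving hypotheses and #mistakes
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in version_space)
        prediction = 1 if 2 * votes > len(version_space) else 0
        if prediction != label:
            mistakes += 1
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes

# toy class: h_k(x) = 1 iff x >= k, for k = 0..7; the labels below come from k = 5
H = [(lambda x, k=k: int(x >= k)) for k in range(8)]
_, m = halving(H, [(3, 0), (6, 1), (4, 0), (5, 1)])
print(m)   # never more than log2(|H|) = 3 mistakes; this run makes 1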

40

Optimal mistake bound

Question: what is the optimal mistake bound (i.e. the lowest worst-case bound over all possible learning algorithms A) for an arbitrary non-empty concept class C, assuming H = C?

Formally, for any learning algorithm A and any target concept c:

MA(c) = max number of mistakes made by A to exactly learn c, over all possible training sequences

MA(C) = max (c ∈ C) MA(c)

Note: MFind-S(C) = n+1, MHalving(C) ≤ log2(|C|)

Opt(C) = min (A) MA(C)

i.e. the number of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987):

VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2(|C|)

There exist concept classes for which VC(C) = Opt(C) = MHalving(C) = log2(|C|), e.g. the power set 2^X of X, for which it holds VC(2^X) = |X| = log2(|2^X|)

There exist concept classes for which VC(C) < Opt(C) < MHalving(C)

42

Weighted majority algorithm

Generalizes Halving: makes predictions by taking a weighted vote among a pool of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (i.e. algorithms) inconsistent with some training examples, but just reduces their weights, so it is able to accommodate inconsistent training data

43

Weighted majority algorithm

∀i: wi = 1

For each training example (x, c(x)):

  q0 = q1 = 0

  For each prediction algorithm ai:
    if ai(x) = 0 then q0 = q0 + wi
    if ai(x) = 1 then q1 = q1 + wi

  if q1 > q0 then predict c(x) = 1
  if q1 < q0 then predict c(x) = 0
  if q1 = q0 then predict c(x) = 0 or 1 at random

  For each prediction algorithm ai do: if ai(x) ≠ c(x) then wi = β·wi   (0 ≤ β < 1)
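
A runnable sketch of the weighted-majority update above; the pool of predictors and the example stream are assumptions made for the demo (β = 1/2, as on the next slide).

import random

def weighted_majority(predictors, stream, beta=0.5):
    # predictors are functions x -> 0/1; predict by weighted vote, then multiply
    # by beta the weight of every predictor that was wrong on the revealed label
    weights = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        q = [0.0, 0.0]
        for w, a in zip(weights, predictors):
            q[a(x)] += w
        prediction = 1 if q[1] > q[0] else 0 if q[1] < q[0] else random.randint(0, 1)
        if prediction != label:
            mistakes += 1
        weights = [w * beta if a(x) != label else w
                   for w, a in zip(weights, predictors)]
    return weights, mistakes

# illustrative pool on integer inputs; the third predictor is always right (k = 0)
pool = [lambda x: 0, lambda x: 1, lambda x: x % 2]
stream = [(i, i % 2) for i in range(10)]
print(weighted_majority(pool, stream))   # mistakes stay within 2.4*(k + log2 n) with k = 0, n = 3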

44

Weighted majority algorithm (WM)

Coincides with Halving for β = 0

Theorem - D any sequence of training examples, A any set of n prediction algorithms, k the minimum number of mistakes made by any aj ∈ A on D, β = 1/2. Then W-M makes at most

2.4 (k + log2 n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof: Since aj makes k mistakes (the best in A), its final weight wj will be (1/2)^k

The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM it is reduced to at most (3/4)W, because the "wrong" algorithms hold at least 1/2 of the total weight, which is then reduced by a factor of 1/2

The final total weight W is at most n(3/4)^M, where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the final total weight W, hence

(1/2)^k ≤ n (3/4)^M

from which

M ≤ (k + log2 n) / (−log2(3/4)) ≤ 2.4 (k + log2 n)

I.e. the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the size of the pool

Page 21: Università di Milano-Bicocca Laurea Magistrale in Informatica

21

Statistical PAC learning

DEF An inference procedure An for the class F n is consistent if

given the target function fF n for every t-sample

S = (ltx1b1gt hellip ltxtbtgt) An(S) is a representation of a function

g ldquoconsistentrdquo with S ie g(x1) = b1 hellip g(xt) = bt

DEF A learning algorithm A is consistent if its inference procedure is consistent

PROBLEM Estimate upper and lower bounds on the sample size

t = t(n 1 1)Upper bounds will be given for consistent algorithms

Lower bounds will be given for arbitrary algorithms

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 22: Università di Milano-Bicocca Laurea Magistrale in Informatica

22

THEOREM t(n 1 1) le -1ln(F n) +ln(1)

PROOF Prob(x1 hellip xt) g (g(x1)=f(x1) hellip g(xt)=f(xt) g -bad) le

le Prob (g(x1) = f(x1) hellip g(xt) = f(xt)) le

Impose F n e-t le

Independent events

g is ε-bad

P(AUB)leP(A)+P(B)

g ε-bad

NOTE - F n must be finite

le Prob (g(xi) = f(xi)) leg ε-bad i=1 hellip t

le (1-)t le F n(1-)t le F ne-t g ε-bad

A simple upper bound

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 23: Università di Milano-Bicocca Laurea Magistrale in Informatica

23

X domainF 2X class of conceptsS = (x1 hellip xt) t-sample

f S g iff f(xi) = g(xi) xi S undistinguishable by S

F (S) = (F S) index of F wrt S

Problem uniform convergence of relative frequencies to their probabilities

Vapnik-Chervonenkis approach (1971)

S1 S2

MF (t) = maxF (S) S is a t-sample growth function

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 24: Università di Milano-Bicocca Laurea Magistrale in Informatica

24

FACT

THEOREM

A general upper bound

Prob(x1 hellip xt) g (g -bad g(x1) = f(x1) hellip g(xt) = f(xt))le 2mF2te-t2

mF (t) le 2t

mF (t) le F (this condition gives immediately

the simple upper bound) mF (t) = 2t and jltt mF (j) = 2j

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 25: Università di Milano-Bicocca Laurea Magistrale in Informatica

25

d t

)(tmF

F

)(infinFm

t2

DEFINITION

FUNDAMENTAL PROPERTY

=)(tmF1

2

0

minusle⎟⎠⎞⎜⎝

⎛le⎟⎠⎞⎜⎝

⎛le

le

=

sum

Kk

t

tt

t

t

BOUNDED BY APOLYNOMIAL IN t

Graph of the growth function

d = VCdim(F ) = max t mF(t) = 2t

26

THEOREMIf dn = VCdim(Fn)

then t(n 1 1) le max (4 log(2) (8dn)log(13)PROOF

Impose 2mFn2te-et2 le

A lower bound on t(n 1 1) Number of examples which are necessary for arbitrary algorithms

THEOREMFor 0lele1 and le1100

t(n 1 1) ge max ((1- ) ln(1) (dn-1)32)

Upper and lower bounds

27

Ie the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F

If F (S) = 2S we say that S is shattered by F

The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S X that is shattered by F

An equivalent definition of VCdim

F (S) = (f-1(1)(x1 hellip xt) | fF )

28

1300log80032000log400)11(

00100103)(

sdotle

===

MAXnt

FVC DIM

δε

δε

Sufficient 24000

⎭⎬⎫

⎩⎨⎧ sdot

minusge 100

32131000ln100)11( MAXnt δε

690 Necessary

Learn the family f of circles contained in the square

Example 1

29

otherwiseXif

XWHSXXfTHATSUCHWWLf

nkkkn

nn

001

)()()(

11

1

ge

minus=

rArrisin

sum=

λ

λ

HS(x)=

22

1)(n

n

nDIM

L

nLVC

le

+=

SIMPLE UPPER BOUND

))1ln((1)11( 2 +le nnt

UPPER BOUND USING

⎭⎬⎫

⎩⎨⎧ sdot

+le

13)1(82log4)11( nMAXnt GROWS LINEARLY WITH n

)( nDIM LVC

Learn the family of linearly separable boolean functions in n variables Ln

Example 2

30

Consider the class L2 of linearly separable functions in two variables

3)(3)(

1)(

2

2

ge=

+=

LVCLVC

nLVC

IM

IM

nIM

4)( 2 ltLVC IM

The green point cannot beseparated from the other three

No straight line can separatethe green from the red points

Example 2

31

Classi di formule booleane

Monomi x1x2 hellip xk

DNF m1m2 hellip mj (mj monomi)

Clausole x1x2 hellip xk

CNF c1c2 hellip cj (cj clausole)

k-DNF le k letterali nei monomi

k-term-DNF le k monomi

k-CNF le k letterali nelle clausole

k-clause-CNF le k clausole

Formule monotone non contengono letterali negati

m-formule ogni variabile appare al piugrave una volta

32

Th (Valiant)I monomi sono apprendibili da esempi positivi con 2(n+log ) esempi ( errore tollerato) ponendo ixix

xxgii 01 ==

sdotequiv ππin tutti gli es in tutti gli es

NB Lrsquoapprendibilitagrave egrave non monotona A B se B appr allora A apprsube

endHdaxcancella

elseHdaxcancella

thenjesifdontojfor

generaesbegin

doBtoiforxxxxxxH

j

j

nn

0)(1

)(

1 2211

==

=

==

Th i monomi non sono apprendibili da esempi negativi

I risultati

33

1) K-CNF apprendibili da soli esempi positivi1b) K-DNF apprendibili da soli esempi negativi2) (K-DNF K-CNF) apprendibili da es (K-DNF K-CNF) positivi e negativiand

or

Kforall

3) la classe delle K-decision lists egrave apprendibile

)0()(

1)min(|min

10||)))((( 11

esistenonisebvCalloravni

booleanovettorevDLKCbkmmonomiom

conbmbmDLK

i

iii

jj

===

minusisinle

equivminus

Th Ogni K-DNF (o K-CNF)-formula egrave rappresentabile da una K-DL piccola

Risultati positivi

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)


single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 34: Università di Milano-Bicocca Laurea Magistrale in Informatica

34

)()(

freeondistributisensoinNPRPse

minusne

1) Le m-formule non sono apprendibili

2) Le funzioni booleane a soglia non sono apprendibili

3) Per K ge 2 le formule K-term-DNF non sono apprendibili

ge

Risultati negativi

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 35: Università di Milano-Bicocca Laurea Magistrale in Informatica

35

Mistake bound model

So far how many examples needed to learn What about how many mistakes before

convergence Letrsquos consider similar setting to PAC learning

Instances drawn at random from X according to

distribution D Learner must classify each instance before receiving

correct classification from teacher Can we bound the number of mistakes learner makes

before converging

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 36: Università di Milano-Bicocca Laurea Magistrale in Informatica

36

Mistake bound model

Learner Receives a sequence of training examples x Predicts the target value f(x) Receives the correct target value from the trainer Is evaluated by the total number of mistakes it makes

before converging to the correct hypothesis

Ie Learning takes place during the use of the system

not off-line Ex prediction of fraudolent use of credit cards

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 37: Università di Milano-Bicocca Laurea Magistrale in Informatica

37

Mistake bound for Find-S

Consider Find-S when H = conjunction of boolean literals

FIND-S

Initialize h to the most specific hypothesis in

Hx1x1x2x2 hellip xnxn

For each positive training instance x Remove from h any literal not satisfied by x

Output h

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 38: Università di Milano-Bicocca Laurea Magistrale in Informatica

38

Mistake bound for Find-S

If C H and training data noise free Find-S converges to an exact hypothesis

How many errors to learn cH (only positive examples can be misclassified)

The first positive example will be misclassified and n literals in the initial hypothesis will be eliminated

Each subsequent error eliminates at least one literal mistakes le n+1 (worst case for the ldquototalrdquo concept x c(x)=1)

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 39: Università di Milano-Bicocca Laurea Magistrale in Informatica

39

Mistake bound for Halving A version space is maintained and refined (eg

Candidate-elimination) Prediction is based on majority vote among the

hypotheses in the current version space ldquoWrongrdquo hypotheses are removed (even if x is

exactly classified) How many errors to exactly learn cH (H finite)

Mistake when the majority of hypotheses misclassifies x These hypotheses are removed For each mistake the version space is at least halved At most log2(|H|) mistakes before exact learning (eg

single hypothesis remaining) Note learning without mistakes possible

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 40: Università di Milano-Bicocca Laurea Magistrale in Informatica

40

Optimal mistake bound

Question what is the optimal mistake bound (ie lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C assuming H=C

Formally for any learning algorithm A and any target concept c

MA(c) = max mistakes made by A to exactly learn c over all

possible training sequences MA(C) = maxcC MA(c)

Note Mfind-S(C) = n+1

MHalving(C) le log2(|C|) Opt(C) = minA MA(C)

ie of mistakes made for the hardest target concept in C using the hardest training sequence by the best algorithm

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 41: Università di Milano-Bicocca Laurea Magistrale in Informatica

41

Optimal mistake bound

Theorem (Littlestone 1987)

VC(C) le Opt(C) le MHalving(C) le log2(|C|) There exist concept classes for which

VC(C) = Opt(C) = MHalving(C) = log2(|C|)

eg the power set 2X of X for which it holds

VC(2X) = |X| = log2(|2X|) There exist concept classes for which VC(C) lt Opt(C) lt MHalving(C)

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 42: Università di Milano-Bicocca Laurea Magistrale in Informatica

42

Weighted majority algorithm

Generalizes Halving Makes predictions by taking a weighted vote among a pole of prediction algorithms

Learns by altering the weight associated with each prediction algorithm

It does not eliminate hypotheses (ie algorithms) inconsistent with some training examples but just reduces its weight so is able to accommodate inconsistent training data

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 43: Università di Milano-Bicocca Laurea Magistrale in Informatica

43

Weighted majority algorithm

i wi = 1

training example (x c(x))

q0 = q1 = 0

prediction algorithm ai

If ai(x)=0 then q0 = q0 + wi

If ai(x)=1 then q1 = q1 + wi

if q1 gt q0 then predict c(x)=1

if q1 lt q0 then predict c(x)=0

if q1 gt q0 then predict c(x)=0 or 1 at random

prediction algorithm ai do If ai(x)nec(x) then wi = wi (0lelt1)

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 44: Università di Milano-Bicocca Laurea Magistrale in Informatica

44

Weighted majority algorithm (WM)

Coincides with Halving for =0 Theorem - D any sequence of training examples A any set of n prediction algorithms k min of mistakes made by any ajA for D =12 Then W-M makes at most

24(k+log2n)

mistakes over D

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 45: Università di Milano-Bicocca Laurea Magistrale in Informatica

45

Weighted majority algorithm (WM)

Proof Since aj makes k mistakes (best in A) its final

weight wj will be (12)k The sum W of the weights associated with all n algorithms in A is initially n and for each mistake made by WM is reduced to at most (34)W because the ldquowrongrdquo algorithms hold at least 12 of total weight that will be reduced by a factor of 12

The final total weight W is at most n(34)M where M is the total number of mistakes made by WM over D

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46
Page 46: Università di Milano-Bicocca Laurea Magistrale in Informatica

46

Weighted majority algorithm (WM)

But the final weight wj cannot be greater than the

final total weight W hence(12)k le n(34)M

from which

M le le 24(k+log2n)

Ie the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool plus a term that grows only logarithmically in the size of the pool

(k+log2 n)-log2 (34)

  • Universitagrave di Milano-Bicocca Laurea Magistrale in Informatica
  • Computational models of cognitive phenomena
  • A theory of the learnable (Valiant lsquo84)
  • A theory of the learnable
  • Probably approximately correct learning
  • What we want to learn
  • Whatrsquos new in pac learning
  • Learning from examples
  • The PAC model
  • The PAC model
  • Slide 11
  • Slide 12
  • Slide 13
  • Learning algorithm
  • The efficiency issue
  • Slide 16
  • Learning boolean functions
  • Boolean functions and circuits
  • Slide 19
  • Slide 20
  • Statistical PAC learning
  • A simple upper bound
  • Vapnik-Chervonenkis approach (1971)
  • A general upper bound
  • Graph of the growth function
  • Upper and lower bounds
  • An equivalent definition of VCdim
  • Example 1
  • Example 2
  • Slide 30
  • Classi di formule booleane
  • I risultati
  • Risultati positivi
  • Risultati negativi
  • Mistake bound model
  • Slide 36
  • Mistake bound for Find-S
  • Slide 38
  • Mistake bound for Halving
  • Optimal mistake bound
  • Slide 41
  • Weighted majority algorithm
  • Slide 43
  • Weighted majority algorithm (WM)
  • Slide 45
  • Slide 46