Computational Learning Theory


Page 1: Computational Learning Theory


Computational Learning Theory (VC Dimension)

1. Difficulty of machine learning problems
2. Capabilities of machine learning algorithms

Page 2: Computational Learning Theory


Version Space with associated errors

[Figure: hypothesis space H containing the version space VS_{H,D}. Each hypothesis is labeled with its true error ("error") and training error ("r"): error = .2, r = 0; error = .3, r = .1; error = .1, r = .2; error = .3, r = .4; error = .2, r = .3; error = .1, r = 0.]

error is the true error; r is the training error

Page 3: Computational Learning Theory


Number of training samples required

$m \ge \frac{1}{\varepsilon}\bigl(\ln|H| + \ln(1/\delta)\bigr)$

• With probability 1−δ, every hypothesis in H having zero training error will have a true error of at most ε
• Sample complexity for PAC learning grows as the logarithm of the size of the hypothesis space
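As a quick numeric illustration of this bound (a sketch, not part of the original slides; the values of |H|, ε, and δ below are arbitrary choices), the following Python snippet evaluates the required m for a finite hypothesis space:

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)) for consistent learners over a finite H."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Illustrative: conjunctions of 10 Boolean literals give |H| = 3^10 hypotheses
print(pac_sample_bound(3 ** 10, epsilon=0.1, delta=0.05))   # 140 samples
```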

Page 4: Computational Learning Theory


Disadvantages of the Sample Complexity bound for finite hypothesis spaces

• The bound in terms of |H| has two disadvantages
• It is a weak bound
• The underlying probability bound grows linearly with |H|
• For large H it forces δ > 1, i.e., the guarantee "probability at least (1−δ)" becomes negative and therefore vacuous
• It can overestimate the number of samples required
• It cannot be applied to infinite hypothesis spaces

$\delta \ge |H|\, e^{-\varepsilon m}$

$m \ge \frac{1}{\varepsilon}\bigl(\ln|H| + \ln(1/\delta)\bigr)$
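To make the weakness concrete (an illustrative sketch, not from the slides): for a fixed sample size m, the failure-probability bound $|H|\, e^{-\varepsilon m}$ grows linearly with |H| and soon exceeds 1, at which point it guarantees nothing.

```python
import math

def failure_bound(h_size, epsilon, m):
    """Bound |H| * exp(-eps*m) on the probability that some hypothesis with
    zero training error nevertheless has true error greater than eps."""
    return h_size * math.exp(-epsilon * m)

# Illustrative values: eps = 0.1, m = 100 samples
print(failure_bound(10 ** 3, 0.1, 100))   # ~0.045  (a meaningful guarantee)
print(failure_bound(10 ** 6, 0.1, 100))   # ~45.4   (> 1, so the bound is vacuous)
```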

Page 5: Computational Learning Theory


Sample Complexity for infinite hypothesis spaces

• Another measure of the complexity of H: the Vapnik-Chervonenkis dimension, or VC(H)
• We will use VC(H) instead of |H|
• Results in tighter bounds
• Allows characterizing the sample complexity of infinite hypothesis spaces, and the resulting bounds are fairly tight

Page 6: Computational Learning Theory


VC Dimension

• VC Dimension is a property of a set of functions { f(α) }
• It can be defined for various classes of functions
• It yields bounds that relate the capacity of a learning machine to its performance

Page 7: Computational Learning Theory


Shattering a Set of Instances

• Complexity of the hypothesis space is measured
• not by the number of distinct hypotheses |H|
• but by the number of distinct instances from X that can be completely discriminated using H

• Definition:
• A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with that dichotomy
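A brute-force rendering of this definition (a sketch, not part of the slides; hypotheses are assumed to be given as Python predicates, and the interval hypotheses below anticipate Example 1):

```python
from itertools import product

def is_shattered(S, H):
    """True iff every dichotomy (labeling) of S is realized by some hypothesis in H.
    S: list of instances; H: list of boolean predicates over instances."""
    for labeling in product([False, True], repeat=len(S)):
        if not any(all(h(x) == y for x, y in zip(S, labeling)) for h in H):
            return False
    return True

# Illustrative hypotheses of the form a < x < b over real-valued instances
intervals = [(1, 2), (1, 4), (4, 7), (1, 7)]
H = [lambda x, a=a, b=b: a < x < b for a, b in intervals]
print(is_shattered([3.1, 5.7], H))   # True: all four dichotomies are realized
```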

Page 8: Computational Learning Theory


Shattering a set of three instances by eight hypotheses

[Figure: three instances in the instance space X]

Page 9: Computational Learning Theory


Shattering a Set of Instances

• S is a subset of X (three instances in the example below); each hypothesis h partitions S into two subsets, the instances h labels positive and those it labels negative
• The ability of H to shatter a set of instances is its capacity to represent target concepts defined over these instances

[Figure: three instances in the instance space X, separated into two subsets by a hypothesis h]

Page 10: Computational Learning Theory


Vapnik-Chervonenkis Dimension

• Definition: VC(H) of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H
• If arbitrarily large finite sets of X can be shattered by H, then VC(H) = ∞

Page 11: Computational Learning Theory


VC Dimension

Page 12: Computational Learning Theory


Examples to illustrate VC Dimension

1. Instance space: X = the set of real numbers R; H is the set of intervals on the real number line
2. Instance space: X = points on the x-y plane; H is the set of all linear decision surfaces in the plane
3. Instance space: X = instances described by three Boolean literals a, b, c; H is the set of conjunctions of these literals

Page 13: Computational Learning Theory


Example 1: VC dimension of 1-dimensional intervals

• X = R (e.g., heights of people)
• H is the set of hypotheses of the form a < x < b
• Consider a subset containing two instances, S = {3.1, 5.7}
• Can S be shattered by H?
• Yes, e.g., (1 < x < 2), (1 < x < 4), (4 < x < 7), (1 < x < 7)
• Since we have found a set of two instances that can be shattered, VC(H) is at least two
• However, no subset of size three can be shattered
• Therefore VC(H) = 2
• Here |H| is infinite but VC(H) is finite
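A small check of the "no three points" claim (a sketch with illustrative points; the full argument is simply that an interval containing the two outer points must also contain the middle one):

```python
# Dichotomy {x1: +, x2: -, x3: +} with x1 < x2 < x3 is never realizable by a < x < b,
# because any interval containing x1 and x3 must also contain x2.
x1, x2, x3 = 3.1, 4.5, 5.7                    # illustrative points
for a in range(0, 6):                          # coarse grid of candidate intervals
    for b in range(2, 8):
        inside = [a < x < b for x in (x1, x2, x3)]
        assert inside != [True, False, True], (a, b)
print("no interval in the grid labels the outer points + and the middle point -")
```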

Page 14: Computational Learning Theory


Example 2: VC Dimension of linear discriminants on a plane

[Figures (pages 14-21): the eight dichotomies of three non-collinear points in the plane, labeled 000 through 111; each dichotomy is realized by a linear decision surface.]

Page 22: Computational Learning Theory


VC Dimension of points in 2-d space and perceptron

Page 23: Computational Learning Theory


VC Dimension of a single perceptron with 2 input units (i.e., points in 2-d space)

• VC(H) is at least 3, since 3 non-collinear points can be shattered
• It is not 4, since we cannot find any set of four points that can be shattered
• Therefore VC(H) = 3
• More generally, the VC dimension of linear decision surfaces in r-dimensional space is r + 1
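A brute-force check of these claims (a sketch, not the slides' method; it assumes NumPy and SciPy are available and uses a small linear-program feasibility test for linear separability):

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True iff some line w.x + b = 0 separates the +1-labeled points from the
    -1-labeled points (LP feasibility: y_i * (w.x_i + b) >= 1 for all i)."""
    A = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    res = linprog(c=[0, 0, 0], A_ub=A, b_ub=-np.ones(len(points)),
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    return all(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))           # True: three non-collinear points
print(shattered([(0, 0), (1, 1), (1, 0), (0, 1)]))   # False: the XOR labeling is not separable
```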


Page 25: Computational Learning Theory


Capacity of a hyperplane

Fraction of dichotomies of n points in d dimensions that are linearly separable

$f(n,d) = \begin{cases} 1 & n \le d+1 \\ \dfrac{2}{2^{n}} \displaystyle\sum_{i=0}^{d} \binom{n-1}{i} & n > d+1 \end{cases}$

At n = 2(d+1), called the capacity of the hyperplane, half of the dichotomies are still linearly separable. The hyperplane is not overdetermined until the number of samples is several times the dimensionality.
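A direct transcription of this formula (a minimal sketch; d below is an arbitrary illustrative choice) confirms that exactly half of the dichotomies remain separable at the capacity n = 2(d+1):

```python
from math import comb

def frac_linearly_separable(n, d):
    """f(n, d): fraction of the 2^n dichotomies of n points in general position
    in d dimensions that are linearly separable, as given on this slide."""
    if n <= d + 1:
        return 1.0
    return 2 * sum(comb(n - 1, i) for i in range(d + 1)) / 2 ** n

d = 3                                            # illustrative dimensionality
print(frac_linearly_separable(2 * (d + 1), d))   # 0.5 at the capacity n = 2(d+1)
print(frac_linearly_separable(10 * (d + 1), d))  # ~2e-08 once n >> d
```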

Page 26: Computational Learning Theory


Example 3: VC Dimension when instances are described by three Boolean literals and each hypothesis in H is a conjunction

• instance 1 = 100
• instance 2 = 010
• instance 3 = 001
• To exclude instance i, include the literal ¬l_i in the conjunction
• Example: to include instance 2 but exclude instances 1 and 3, use the hypothesis ¬l_1 ∧ ¬l_3
• Every dichotomy of the three instances can be realized this way, so the VC dimension is at least 3
• The VC dimension of conjunctions of n Boolean literals is exactly n (the proof is more difficult)
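A brute-force confirmation of this argument (a sketch; the encoding of a conjunction as a triple of require/forbid/ignore choices is an illustrative implementation detail, not notation from the slides):

```python
from itertools import product

instances = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

def conjunction(spec):
    """spec[i] is 1 (require l_i), 0 (require ~l_i), or None (literal absent)."""
    return lambda x: all(s is None or x[i] == s for i, s in enumerate(spec))

# H = all conjunctions over three Boolean literals: 3 choices per literal -> 27 hypotheses
H = [conjunction(spec) for spec in product([None, 0, 1], repeat=3)]

shattered = all(
    any(all(h(x) == want for x, want in zip(instances, labeling)) for h in H)
    for labeling in product([False, True], repeat=3)
)
print(shattered)   # True: every dichotomy of the three instances is realized
```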

Page 27: Computational Learning Theory


Sample Complexity and the VC dimension

• How many randomly drawn samples are sufficient to PAC-learn any target concept C ?

• Using VC(H) as the measure of complexity of H

• Analogous to the finite hypothesis case:

$m \ge \frac{1}{\varepsilon}\Bigl(4\log_2(2/\delta) + 8\,VC(H)\log_2(13/\varepsilon)\Bigr)$

(finite hypothesis space case, for comparison: $m \ge \frac{1}{\varepsilon}\bigl(\ln|H| + \ln(1/\delta)\bigr)$)
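A numeric sketch of this bound (ε and δ are arbitrary illustrative values; VC(H) = 3 is taken from the earlier linear-separator example):

```python
import math

def vc_sample_bound(vc_dim, epsilon, delta):
    """m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc_dim * math.log2(13 / epsilon)) / epsilon)

# Illustrative: linear decision surfaces in the plane have VC(H) = 3
print(vc_sample_bound(3, epsilon=0.1, delta=0.05))   # 1899 samples
```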

Page 28: Computational Learning Theory


Lower bound on sample complexity

If C is a concept class such that VC(C) ≥ 2, then for any learner L there exist a distribution D and a target concept in C such that if L observes fewer than

$\max\Bigl[\frac{1}{\varepsilon}\log(1/\delta),\ \frac{VC(C)-1}{32\,\varepsilon}\Bigr]$

samples, then with probability at least δ, L outputs a hypothesis h having error_D(h) > ε

1. Provides the number of samples necessary to PAC-learn
2. Defined in terms of the concept class C rather than H
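For contrast with the sufficient-sample bound on the previous slide, a sketch of this necessary-sample bound (illustrative ε and δ; the natural logarithm is used here, which only affects a constant factor):

```python
import math

def lower_bound(vc_dim, epsilon, delta):
    """max[(1/eps)*log(1/delta), (VC(C)-1)/(32*eps)]: samples any learner needs
    in the worst case before it can PAC-learn."""
    return max(math.log(1 / delta) / epsilon, (vc_dim - 1) / (32 * epsilon))

# Same illustrative values as the upper bound above (VC = 3, eps = 0.1, delta = 0.05)
print(lower_bound(3, epsilon=0.1, delta=0.05))   # ~30 samples
```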

Page 29: Computational Learning Theory


VC Dimension for Neural Networks

• G-composition of C is the class of all functions that can be implemented on the network G, i.e., it is the hypothesis space representable by network G

• For acyclic layered networks containing s perceptrons, each with r inputs

$VC(C_G^{\text{perceptrons}}) \le 2(r+1)\,s\,\log(es)$

Page 30: Computational Learning Theory


VC Dimension for Neural Networks

• Bound on the number of samples sufficient to learn, with probability at least (1−δ), any target concept from C_G to within error ε:

$m \ge \frac{1}{\varepsilon}\Bigl(4\log_2(2/\delta) + 16(r+1)\,s\,\log(es)\log_2(13/\varepsilon)\Bigr)$

• Does not apply directly to backpropagation networks, since their units are sigmoid rather than linear threshold (perceptron) units
• However, the VC dimension of networks of sigmoid units will be at least as great as that of networks of perceptrons
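A numeric sketch combining this bound with the VC bound on the previous slide (the network size s, input count r, ε, and δ are arbitrary illustrative choices; log is the natural logarithm, as in the slide's formula):

```python
import math

def vc_upper_perceptron_net(r, s):
    """VC(C_G) <= 2*(r+1)*s*log(e*s) for an acyclic layered network of s
    perceptrons, each with r inputs (previous slide)."""
    return 2 * (r + 1) * s * math.log(math.e * s)

def sample_bound_perceptron_net(r, s, epsilon, delta):
    """m >= (1/eps)*(4*log2(2/delta) + 16*(r+1)*s*log(e*s)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta)
                      + 16 * (r + 1) * s * math.log(math.e * s)
                      * math.log2(13 / epsilon)) / epsilon)

# Illustrative network: s = 5 perceptrons, each with r = 4 inputs
print(vc_upper_perceptron_net(4, 5))                  # ~130.5
print(sample_bound_perceptron_net(4, 5, 0.1, 0.05))   # ~73500 samples
```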

Page 31: Computational Learning Theory


VC Dimension of SVMs


Page 34: Computational Learning Theory


Mistake Bound Model of Learning

• How many mistakes will the learner make in its predictions before it learns the target concept?
• Significant in practical settings where learning must be done while the system is in actual use, rather than in an off-line training stage
• Example: a system that learns to approve credit card purchases based on data collected during use

Page 35: Computational Learning Theory


Mistake bound for Find-S algorithm

• Find-S:
• Initialize h to the most specific hypothesis $l_1 \wedge \neg l_1 \wedge l_2 \wedge \neg l_2 \wedge \cdots \wedge l_n \wedge \neg l_n$
• For each positive training instance x, remove from h any literal that is not satisfied by x
• Output hypothesis h
• Total number of mistakes can be at most n + 1
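A runnable sketch of this mistake bound (the target concept, n, and the random example stream below are illustrative choices; representing a conjunction as a set of (index, value) literals is an implementation detail, not notation from the slides):

```python
import random

def find_s_mistakes(n, target, stream):
    """Run Find-S online over conjunctions of up to n Boolean literals and count
    prediction mistakes (mistake-bound model: predict first, then see the label)."""
    # Most specific hypothesis: every literal and its negation (satisfied by nothing)
    h = {(i, v) for i in range(n) for v in (0, 1)}
    mistakes = 0
    for x in stream:
        label = all(x[i] == v for i, v in target)     # the target is itself a conjunction
        pred = all(x[i] == v for i, v in h)
        if pred != label:
            mistakes += 1
        if label:                                     # Find-S generalizes only on positives
            h = {(i, v) for i, v in h if x[i] == v}
    return mistakes

random.seed(0)
n = 6
target = {(0, 1), (2, 0)}                             # illustrative concept: l1 AND NOT l3
stream = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(200)]
print(find_s_mistakes(n, target, stream), "<=", n + 1)   # mistake count never exceeds n + 1
```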

Page 36: Computational Learning Theory


Mistake bound for Halving Algorithm

• Candidate Elimination Algorithm maintains a description of the version space, incrementally refining the version space as each new sample is encountered

• Assume majority vote is used among all hypotheses in the current version space

• Total number of mistakes can be at most log₂|H|, since every mistake eliminates at least half of the hypotheses in the current version space
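A runnable sketch of the halving argument (H is taken to be all conjunctions over n Boolean literals, so |H| = 3^n; n, the target concept, and the example stream are illustrative choices):

```python
import math
import random
from itertools import product

def halving_mistakes(n, target, stream):
    """Halving algorithm over H = all conjunctions of up to n Boolean literals:
    predict by majority vote of the version space, then drop inconsistent hypotheses."""
    H = list(product([None, 0, 1], repeat=n))         # spec[i] in {None, 0, 1}; |H| = 3^n
    covers = lambda spec, x: all(s is None or x[i] == s for i, s in enumerate(spec))
    mistakes = 0
    for x in stream:
        label = covers(target, x)
        votes = sum(covers(spec, x) for spec in H)
        pred = 2 * votes >= len(H)                    # majority vote (ties predict positive)
        if pred != label:
            mistakes += 1
        H = [spec for spec in H if covers(spec, x) == label]   # keep consistent hypotheses
    return mistakes

random.seed(1)
n = 5
target = (1, None, 0, None, None)                     # illustrative concept: l1 AND NOT l3
stream = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(100)]
print(halving_mistakes(n, target, stream), "<=", int(math.log2(3 ** n)))   # at most log2|H|
```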