COMP9444 10s2 Learning Theory 1
Practical Hints and PAC Learning Theory
COMP9444 2010
How good a classifier does a learner produce?
• Training error is the percentage of incorrectly classified data among the training data.
• True error is the probability of misclassifying a randomly drawn data point from the domain (according to the underlying probability distribution).
• How does the training error relate to the true error?
How good a classifier does a learner produce?
• We can estimate the true error by using randomly drawn test data from the domain that was not shown to the classifier during training.
• Test error is the percentage of incorrectly classified data among the test data.
• Validation error is the percentage of incorrectly classified data among the validation data.
Calibrating MLPs for the learning task at hand
• Split the original training data set into two subsets:
– a validation data set
– a training data set
• Use the training data set for resampling to estimate which network configuration (e.g. how many hidden units to use) seems best.
• Once a network configuration is selected, train the network on the training set and estimate its true error using the validation set, which was not shown to the network at any time before.
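The split described above can be sketched as follows. This is a minimal illustration, not the course's own code; the function name `train_val_split` and the 80/20 split ratio are assumptions for the example.

```python
import random

def train_val_split(data, val_fraction=0.2, seed=0):
    """Shuffle, then hold out a validation set that the network never sees
    during training or model selection (illustrative helper)."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    # training set, validation set
    return shuffled[n_val:], shuffled[:n_val]

data = list(range(100))
train, val = train_val_split(data)
```

Model selection (e.g. choosing the number of hidden units) would then use only `train`, with `val` reserved for the final error estimate.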
Resampling methods
• Resampling tries to use the same data multiple times to reduce estimation errors on the true error of a classifier being learned using a particular learning method.
• Frequently used resampling methods are:
– n-fold cross validation (with n being 5 or 10)
– bootstrapping
Cross Validation
• Split available training data into n subsets s1, ..., sn.
• For each i, set apart subset si for testing. Train the learner on all remaining n−1 subsets and test the result on si.
• Average the test errors over all n learning runs.
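The three steps above can be sketched directly. The helper names (`cross_validation_error`, the toy majority-label learner) are assumptions for illustration; any learner/error pair with the same shape would do.

```python
def cross_validation_error(data, labels, train_fn, error_fn, n=5):
    """n-fold cross validation: train on n-1 folds, test on the held-out fold,
    and average the n test errors."""
    folds = [list(range(i, len(data), n)) for i in range(n)]  # index sets s_1..s_n
    errors = []
    for i in range(n):
        train_idx = [j for f in range(n) if f != i for j in folds[f]]
        model = train_fn([data[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        errors.append(error_fn(model,
                               [data[j] for j in folds[i]],
                               [labels[j] for j in folds[i]]))
    return sum(errors) / n

# toy learner that always predicts the majority training label
def majority_learner(xs, ys):
    return max(set(ys), key=ys.count)

def zero_one_error(model, xs, ys):
    return sum(y != model for y in ys) / len(ys)

data = list(range(20))
labels = [0] * 15 + [1] * 5
err = cross_validation_error(data, labels, majority_learner, zero_one_error, n=5)
# each fold holds three 0s and one 1; the learner predicts 0, so each fold's error is 0.25
```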
Bootstrapping
• Assume the available training data represents the actual distribution of data in the underlying domain, i.e. for k data points, assume each data point occurs with probability 1/k.
• Draw a set of t training data points by randomly selecting a data point from the original collection with probability 1/k for each data point (if there are multiple occurrences of the same data point, the probability of that data point occurring is deemed to be proportionally higher).
• Often t is chosen to be equal to k. This will usually result in the sample containing only about 63% of the original data points, while the remaining 37% are made up of duplicates.
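The 63% figure follows because the probability that a given point is never drawn in k draws with replacement is (1 − 1/k)^k, which approaches 1/e ≈ 0.368 for large k. A quick simulation (a sketch; the value of k is arbitrary):

```python
import random

rng = random.Random(0)
k = 10_000
# draw t = k points with replacement from a collection of k points
sample = [rng.randrange(k) for _ in range(k)]
unique_fraction = len(set(sample)) / k
expected = 1 - (1 - 1 / k) ** k   # ≈ 1 - 1/e ≈ 0.632
```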
Bootstrapping
• Train the learner on the randomly drawn training sample.
• Test the learned classifier on the original data set, excluding the points that occur in the training sample.
• Repeat the random selection, training and testing a number of times and average the test error over all runs (a typical number of runs is 30 to 50).
Factors Determining the Difficulty of Learning
• The representation of domain objects, i.e. which attributes and how many.
• The number of training data points.
• The learner, i.e. the class of functions it can potentially learn.
The Vapnik-Chervonenkis dimension
• Definition
• Various function sets relevant to learning systems, and their VC-dimension
• Bounds from probably approximately correct learning (PAC-learning)
• General bounds on the risk
The Vapnik-Chervonenkis dimension
In the following, we consider a boolean classification function f on the input space X to be equivalent to the subset {x ∈ X | f(x) = 1}.

The VC-dimension is a useful combinatorial parameter on sets of subsets of the input space, e.g. on function classes F on the input space X.

Definition We say a set S ⊆ X is shattered by F if for every subset s ⊆ S there is at least one function f ∈ F such that s = f ∩ S. In other words, if {S ∩ f | f ∈ F} = 2^S.

Definition The Vapnik-Chervonenkis dimension of F, VC-dim(F), is the cardinality of the largest set S ⊆ X shattered by F.
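For finite sets of points and finite collections of functions, the shattering condition can be checked by brute force. A minimal sketch (the function names are assumptions; the threshold class f_t(x) = (x ≥ t) is a standard example with VC-dimension 1, not from the slides):

```python
from itertools import combinations

def is_shattered(S, F):
    """S: finite set of points; F: finite collection of functions x -> bool.
    S is shattered by F if every subset of S equals {x in S : f(x)} for some f."""
    S = list(S)
    realized = {frozenset(x for x in S if f(x)) for f in F}
    all_subsets = {frozenset(c) for r in range(len(S) + 1)
                   for c in combinations(S, r)}
    return all_subsets <= realized

# threshold functions on the line: f_t(x) = (x >= t), for t in a small grid
thresholds = [lambda x, t=t: x >= t for t in range(-5, 6)]
```

A single point is shattered by the thresholds, but no pair {a, b} with a < b is: no threshold labels a positive while labelling b negative.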
The Vapnik-Chervonenkis dimension
[Figure: four labellings of three points in the plane, each separable by a line.]
The VC-dimension of the set of linear decision functions in the 2-dimensional Euclidean space is 3.
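The shattering half of this claim can be verified by enumeration: three non-collinear points admit all 2³ labelings under linear decision functions. The specific points and the parameter grid below are assumptions chosen so the enumeration stays small.

```python
from itertools import product

# three non-collinear points in the plane
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# enumerate linear decision functions (x, y) -> (w1*x + w2*y + b > 0)
# over a small grid of parameters, collecting the labelings they realize
dichotomies = set()
for w1, w2, b in product([-2, -1, 0, 1, 2], repeat=3):
    dichotomies.add(tuple(w1 * x + w2 * y + b > 0 for (x, y) in points))
```

Even this coarse grid realizes all 8 labelings, so the three points are shattered. (The upper half of the claim, that no 4 points can be shattered, is not proved by such an enumeration.)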
The Vapnik-Chervonenkis dimension
The VC-dimension of the set of linear decision functions in the n-dimensional Euclidean space is equal to n+1.
VC-dimension and PAC-Learning
Theorem Upper Bound (Blumer et al., 1989)
Let L be a learning algorithm that uses F consistently, i.e. one that finds an f ∈ F that is consistent with all the data. For any 0 < ε, δ < 1, given

(4 ln(2/δ) + 8 VC-dim(F) ln(13/ε)) / ε

random examples, L will with probability of at least 1−δ
either produce a function f ∈ F with error ≤ ε,
or indicate correctly that the target function is not in F.
VC-dimension and PAC-Learning
Theorem Lower Bound (Ehrenfeucht et al., 1992)
Let L be a learning algorithm that uses F consistently. For any 0 < ε < 1/8 and 0 < δ < 1/100, given fewer than

(VC-dim(F) − 1) / (32ε)

random examples, there is some probability distribution for which L will not produce a function f ∈ F with error(f) ≤ ε with probability 1−δ.
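Both bounds are easy to evaluate numerically. The base of the logarithm is not stated on the slide; natural logarithms reproduce the bounds table that follows, so this sketch uses `math.log`.

```python
from math import ceil, log

def pac_upper_bound(eps, delta, vc_dim):
    """Blumer et al. (1989): a number of examples sufficient for a
    consistent learner to be probably approximately correct."""
    return (4 * log(2 / delta) + 8 * vc_dim * log(13 / eps)) / eps

def pac_lower_bound(eps, vc_dim):
    """Ehrenfeucht et al. (1992): with fewer examples than this, some
    distribution defeats every consistent learner."""
    return (vc_dim - 1) / (32 * eps)
```

For example, ε = δ = 5% and VC-dim 10 give an upper bound of about 9192 and a lower bound of 6, matching the table.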
VC-dimension bounds
ε      δ      VC-dim   lower bound   upper bound
5%     5%     10       6             9192
10%    5%     10       3             4040
5%     5%     4        2             3860
10%    5%     4        1             1707
10%    10%    4        1             1677
The Vapnik-Chervonenkis dimension
Let C be the set of all rectangles in the plane.
The Vapnik-Chervonenkis dimension
Let C be the set of all circles in the plane. (The VC dimension of rectangles on the previous slide is 4.)
The Vapnik-Chervonenkis dimension
Let C be the set of all triangles of the same orientation in the plane. (The VC dimension of circles on the previous slide is 3.)
The Vapnik-Chervonenkis dimension
[Plot of f(x,1) and f(x,2.5) over −10 ≤ x ≤ 10.]
The VC-dimension of the set of all functions of the following form is infinite!
f(x,α) = sin(αx), α ∈ ℝ.
(The VC dimension of triangles on the previous slide is 3.)
The Vapnik-Chervonenkis dimension
The following points on the line

x_1 = 10^(−1), ..., x_n = 10^(−n)

can be shattered by functions from this set.
To separate these data into two classes determined by the sequence

δ_1, ..., δ_n ∈ {0,1}

it is sufficient to choose the value of the parameter

α = π( ∑_{i=1}^{n} (1−δ_i) 10^i + 1 )
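The construction can be checked by enumeration: for every labeling δ, the α given by the formula makes sin(αx_i) positive exactly where δ_i = 1 (the sign convention is an assumption, but it is the natural reading and the check below confirms it for n = 5).

```python
from itertools import product
from math import pi, sin

n = 5
xs = [10.0 ** -(i + 1) for i in range(n)]   # x_1 = 10^-1, ..., x_n = 10^-n

ok = True
for deltas in product([0, 1], repeat=n):
    # choose alpha by the formula on the slide
    alpha = pi * (sum((1 - d) * 10 ** (i + 1)
                      for i, d in enumerate(deltas)) + 1)
    # classify x by the sign of sin(alpha * x)
    labels = [1 if sin(alpha * x) > 0 else 0 for x in xs]
    ok = ok and labels == list(deltas)
```

All 2^n labelings are realized, so the n points are shattered for every n, and the VC-dimension is indeed infinite.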
VC-dimension of some (discrete) function sets
• linear attributes
[Figure: a linear scale of attribute values 1 2 3 4 5 6.]
VC-dimension of some (discrete) function sets
• tree-structured attributes
[Figure: attribute tree with "any shape" at the root, "convex shape" and "concave shape" at the next level, and "polygon shape" below.]
VC-dimension of some (discrete) function sets
Theorem Let X be the n-dimensional input space. If F is the set of all functions that are pure conjunctions of attribute-value constraints on X, then
n ≤ VC-dim(F) ≤ 2n
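The lower bound n can be verified constructively for boolean attributes: the n points that are all-ones except in one position are shattered by conjunctions of positive literals, since requiring literal x_j excludes exactly the point with a 0 at position j. A sketch (the choice of points and n = 4 are assumptions for the example):

```python
from itertools import combinations

n = 4
# point p_j has every boolean attribute set to 1 except attribute j
points = [tuple(0 if i == j else 1 for i in range(n)) for j in range(n)]

# pure conjunctions of positive literals: f(p) = all(p[i] == 1 for i in lits);
# the conjunction over lits accepts exactly the points whose index is not in lits
realized = set()
for r in range(n + 1):
    for lits in combinations(range(n), r):
        realized.add(frozenset(j for j in range(n)
                               if all(points[j][i] == 1 for i in lits)))
```

Every one of the 2^n subsets of the points is realized, so VC-dim(F) ≥ n.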
VC-dimension of some (discrete) function sets
Theorem Let X be the n-dimensional input space. If F is the set of all functions that are pure conjunctions of attribute-value constraints on X that contain at most s atoms (attribute-value constraints), then
s⌊log(n/s)⌋ ≤ VC-dim(F) ≤ 4s log(4s√n)
VC-Dimensions of Neural Networks
As a good heuristic (Bartlett): VC-dimension ≈ number of parameters.
• For linear threshold functions on ℜn,
VC-dim = n+1
(the number of parameters is n+1).
• For linear threshold networks, and fixed-depth networks with piecewise polynomial squashing functions,
c1|W| ≤ VC-dim ≤ c2|W| log |W|
where |W| is the number of weights in the network.
VC-Dimensions of Neural Networks
• Some threshold networks have VC-dim ≥ c|W| log |W|.
VC-Dimensions of Neural Networks
Any function class F that can be computed by a program that takes a real input vector x and k real parameters and involves no more than t of the following operations:
• +, −, ×, / on real numbers
• >, ≥, =, ≤, <, ≠ on real numbers
• output value y ∈ {−1,+1}
has VC-dimension of O(kt).
(See work of Peter Bartlett)
VC-Dimensions of Neural Networks
Any function class F that can be computed by a program that takes a real input vector x and k real parameters and involves no more than t of the following operations:
• +, −, ×, /, e^α on real numbers
• >, ≥, =, ≤, <, ≠ on real numbers
• output value y ∈ {−1,+1}
has VC-dimension of O(k2t2).
This includes sigmoid networks, RBF networks, mixtures of experts, etc.
(See work of Peter Bartlett)
VC-dimension Heuristic
For neural networks and decision trees: VC-dimension ≈ size.
Hence, the order of the misclassification probability is no more than

training error + √(size/m)

where m is the number of training examples.
This suggests that the number of training examples should grow roughly linearly with the size of the hypothesis to be produced.
Less data makes it likely that the learned function is not a near-optimal one. (See work of Peter Bartlett)
Structural Risk Minimisation (SRM)
The complexity (or capacity) of the function class from which the learner chooses a function that minimises the empirical risk (i.e. the error on the training data) determines the convergence rate of the learner to the optimal function.
For a given number of independently and identically distributed training examples, there is a trade-off between how far the empirical risk can be minimised and how much the empirical risk will deviate from the true risk (i.e. the true error: the error on unseen data according to the underlying distribution).
Structural Risk Minimisation (SRM)
[Figure: as the complexity degree of F_i increases, the empirical risk decreases while the confidence interval grows; the bound on the overall risk is minimised at an intermediate F_i with its function f_i.]
Consider a nested sequence of subsets of the set F of functions from which a function is chosen: F1 ⊂ F2 ⊂ ... ⊂ Fn ⊂ ...
Structural Risk Minimisation (SRM)
The general SRM principle: choose a complexity parameter d (e.g. the number of hidden units in an MLP, or the size of a decision tree) and a function f ∈ F_d such that the following is minimised:

R_emp(f) + c√(VC-dim(F_d)/m)

where m is the number of training examples.
The higher the VC dimension, the more likely the empirical error will be low.
Structural risk minimisation seeks the right balance.
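The SRM selection rule is a one-line minimisation once the empirical risks and VC-dimensions of the nested classes are known. The numbers below are invented for illustration (empirical risk falling, VC-dimension growing with d); they are not from the slides.

```python
from math import sqrt

def srm_choice(candidates, m, c=1.0):
    """candidates: (d, empirical_risk, vc_dim) triples for nested classes F_d.
    Pick the d minimising empirical risk + c * sqrt(VC-dim(F_d) / m)."""
    return min(candidates, key=lambda t: t[1] + c * sqrt(t[2] / m))

# illustrative values only: risk falls and VC-dimension grows with complexity d
candidates = [(1, 0.40, 5), (2, 0.15, 50), (3, 0.05, 500), (4, 0.01, 5000)]
best = srm_choice(candidates, m=1000)
```

With m = 1000 the bound is minimised at an intermediate complexity (d = 2 here), illustrating the balance SRM seeks: neither the simplest class (high empirical risk) nor the richest (huge confidence term) wins.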
Summary
• The VC-dimension is a useful combinatorial parameter of sets of functions.
• It can be used to estimate the true risk on the basis of the empirical risk and the number of independently and identically distributed training examples.
• It can also be used to determine a sufficient number of training examples to learn probably approximately correctly.
• The VC-dimension can further be applied to choose the most suitable subset of functions for a given number of independently and identically distributed examples.
• Trading empirical risk against confidence in the estimate seems important.
Summary (cont.)
Limitations of the VC Learning Theory
• The probabilistic bounds on the required number of examples are based on a worst-case analysis.
• No preference relation on the functions of the learner is modelled.
• In practice, training examples are not necessarily drawn from the same probability distribution as the test examples.