COMP9444 10s2 Learning Theory 1
Practical Hints and PAC Learning Theory
COMP9444 2010
How good a classifier does a learner produce?
• Training error is the percentage of incorrectly classified data among the training data.
• True error is the probability of misclassifying a randomly drawn data point from the domain (according to the underlying probability distribution).
• How does the training error relate to the true error?
How good a classifier does a learner produce?
• We can estimate the true error by using randomly drawn test data from the domain that was not shown to the classifier during training.
• Test error is the percentage of incorrectly classified data among the test data.
• Validation error is the percentage of incorrectly classified data among the validation data.
Calibrating MLPs for the learning task at hand
• Split the original training data set into two subsets:
– a validation data set
– a training data set
• Use the training data set for resampling to estimate which network configuration (e.g. how many hidden units to use) seems best.
• Once a network configuration is selected, train the network on the training set and estimate its true error using the validation set, which was not shown to the network at any time before.
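The split described above can be sketched as follows. This is a minimal illustration, not the course's own code; the function name `train_val_split` and the 80/20 split ratio are assumptions for the example.

```python
import random

def train_val_split(data, val_fraction=0.2, seed=0):
    """Shuffle, then hold out a validation set that the network never sees
    during training or model selection (illustrative helper)."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    # training set, validation set
    return shuffled[n_val:], shuffled[:n_val]

data = list(range(100))
train, val = train_val_split(data)
```

Model selection (e.g. choosing the number of hidden units) would then use only `train`, with `val` reserved for the final error estimate.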
Resampling methods
• Resampling tries to use the same data multiple times to reduce estimation errors on the true error of a classifier being learned using a particular learning method.
• Frequently used resampling methods are:
– n-fold cross validation (with n being 5 or 10)
– bootstrapping
Cross Validation
• Split available training data into n subsets s1, ..., sn.
• For each i, set apart subset si for testing. Train the learner on all remaining n−1 subsets and test the result on si.
• Average the test errors over all n learning runs.
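The three steps above can be sketched directly. The helper names (`cross_validation_error`, the toy majority-label learner) are assumptions for illustration; any learner/error pair with the same shape would do.

```python
def cross_validation_error(data, labels, train_fn, error_fn, n=5):
    """n-fold cross validation: train on n-1 folds, test on the held-out fold,
    and average the n test errors."""
    folds = [list(range(i, len(data), n)) for i in range(n)]  # index sets s_1..s_n
    errors = []
    for i in range(n):
        train_idx = [j for f in range(n) if f != i for j in folds[f]]
        model = train_fn([data[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        errors.append(error_fn(model,
                               [data[j] for j in folds[i]],
                               [labels[j] for j in folds[i]]))
    return sum(errors) / n

# toy learner that always predicts the majority training label
def majority_learner(xs, ys):
    return max(set(ys), key=ys.count)

def zero_one_error(model, xs, ys):
    return sum(y != model for y in ys) / len(ys)

data = list(range(20))
labels = [0] * 15 + [1] * 5
err = cross_validation_error(data, labels, majority_learner, zero_one_error, n=5)
# each fold holds three 0s and one 1; the learner predicts 0, so each fold's error is 0.25
```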
Bootstrapping
• Assume the available training data represents the actual distribution of data in the underlying domain, i.e. for k data points, assume each data point occurs with probability 1/k.
• Draw a set of t training data points by randomly selecting a data point from the original collection with probability 1/k for each data point (if there are multiple occurrences of the same data point, the probability of that data point occurring is deemed to be proportionally higher).
• Often t is chosen to be equal to k. This will usually result in the sample containing only about 63% of the original data points, while the remaining 37% are made up of duplicates.
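The 63% figure follows because the probability that a given point is never drawn in k draws with replacement is (1 − 1/k)^k, which approaches 1/e ≈ 0.368 for large k. A quick simulation (a sketch; the value of k is arbitrary):

```python
import random

rng = random.Random(0)
k = 10_000
# draw t = k points with replacement from a collection of k points
sample = [rng.randrange(k) for _ in range(k)]
unique_fraction = len(set(sample)) / k
expected = 1 - (1 - 1 / k) ** k   # ≈ 1 - 1/e ≈ 0.632
```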
Bootstrapping
• Train the learner on the randomly drawn training sample.
• Test the learned classifier on the original data set, excluding the points that occur in the training sample.
• Repeat the random selection, training and testing a number of times and average the test error over all runs (a typical number of runs is 30 to 50).
Factors Determining the Difficulty of Learning
• The representation of domain objects, i.e. which attributes and how many.
• The number of training data points.
• The learner, i.e. the class of functions it can potentially learn.
The Vapnik-Chervonenkis dimension
• Definition
• Various function sets relevant to learning systems, and their VC-dimension
• Bounds from probably approximately correct learning (PAC-learning)
• General bounds on the risk
The Vapnik-Chervonenkis dimension
In the following, we consider a boolean classification function f on the input space X to be equivalent to the subset {x ∈ X | f(x) = 1}.

The VC-dimension is a useful combinatorial parameter on sets of subsets of the input space, e.g. on function classes F on the input space X.

Definition We say a set S ⊆ X is shattered by F if for every subset s ⊆ S there is at least one function f ∈ F such that s = f ∩ S. In other words, if {S ∩ f | f ∈ F} = 2^S.

Definition The Vapnik-Chervonenkis dimension of F, VC-dim(F), is the cardinality of the largest set S ⊆ X shattered by F.
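For finite sets of points and finite collections of functions, the shattering condition can be checked by brute force. A minimal sketch (the function names are assumptions; the threshold class f_t(x) = (x ≥ t) is a standard example with VC-dimension 1, not from the slides):

```python
from itertools import combinations

def is_shattered(S, F):
    """S: finite set of points; F: finite collection of functions x -> bool.
    S is shattered by F if every subset of S equals {x in S : f(x)} for some f."""
    S = list(S)
    realized = {frozenset(x for x in S if f(x)) for f in F}
    all_subsets = {frozenset(c) for r in range(len(S) + 1)
                   for c in combinations(S, r)}
    return all_subsets <= realized

# threshold functions on the line: f_t(x) = (x >= t), for t in a small grid
thresholds = [lambda x, t=t: x >= t for t in range(-5, 6)]
```

A single point is shattered by the thresholds, but no pair {a, b} with a < b is: no threshold labels a positive while labelling b negative.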
The Vapnik-Chervonenkis dimension
[Figure: four labellings of three points in the plane, each separable by a line.]
The VC-dimension of the set of linear decision functions in the 2-dimensional Euclidean space is 3.
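The shattering half of this claim can be verified by enumeration: three non-collinear points admit all 2³ labelings under linear decision functions. The specific points and the parameter grid below are assumptions chosen so the enumeration stays small.

```python
from itertools import product

# three non-collinear points in the plane
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# enumerate linear decision functions (x, y) -> (w1*x + w2*y + b > 0)
# over a small grid of parameters, collecting the labelings they realize
dichotomies = set()
for w1, w2, b in product([-2, -1, 0, 1, 2], repeat=3):
    dichotomies.add(tuple(w1 * x + w2 * y + b > 0 for (x, y) in points))
```

Even this coarse grid realizes all 8 labelings, so the three points are shattered. (The upper half of the claim, that no 4 points can be shattered, is not proved by such an enumeration.)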
The Vapnik-Chervonenkis dimension
The VC-dimension of the set of linear decision functions in the n-dimensional Euclidean space is equal to n+1.
VC-dimension and PAC-Learning
Theorem Upper Bound (Blumer et al., 1989)
Let L be a learning algorithm that uses F consistently, i.e. one that finds an f ∈ F that is consistent with all the data. For any 0 < ε, δ < 1, given

(4 ln(2/δ) + 8 VC-dim(F) ln(13/ε)) / ε

random examples, L will with probability of at least 1−δ
either produce a function f ∈ F with error ≤ ε,
or indicate correctly that the target function is not in F.
VC-dimension and PAC-Learning
Theorem Lower Bound (Ehrenfeucht et al., 1992)
Let L be a learning algorithm that uses F consistently. For any 0 < ε < 1/8 and 0 < δ < 1/100, given fewer than

(VC-dim(F) − 1) / (32ε)

random examples, there is some probability distribution for which L will not produce a function f ∈ F with error(f) ≤ ε with probability 1−δ.
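Both bounds are easy to evaluate numerically. The base of the logarithm is not stated on the slide; natural logarithms reproduce the bounds table that follows, so this sketch uses `math.log`.

```python
from math import ceil, log

def pac_upper_bound(eps, delta, vc_dim):
    """Blumer et al. (1989): a number of examples sufficient for a
    consistent learner to be probably approximately correct."""
    return (4 * log(2 / delta) + 8 * vc_dim * log(13 / eps)) / eps

def pac_lower_bound(eps, vc_dim):
    """Ehrenfeucht et al. (1992): with fewer examples than this, some
    distribution defeats every consistent learner."""
    return (vc_dim - 1) / (32 * eps)
```

For example, ε = δ = 5% and VC-dim 10 give an upper bound of about 9192 and a lower bound of 6, matching the table.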
VC-dimension bounds
ε      δ      VC-dim   lower bound   upper bound
5%     5%     10       6             9192
10%    5%     10       3             4040
5%     5%     4        2             3860
10%    5%     4        1             1707
10%    10%    4        1             1677
The Vapnik-Chervonenkis dimension
Let C be the set of all rectangles in the plane.
The Vapnik-Chervonenkis dimension
Let C be the set of all circles in the plane. (The VC dimension of rectangles on the previous slide is 4.)
The Vapnik-Chervonenkis dimension
Let C be the set of all triangles of the same orientation in the plane. (The VC dimension of circles on the previous slide is 3.)
The Vapnik-Chervonenkis dimension
[Plot of f(x,1) and f(x,2.5) over −10 ≤ x ≤ 10.]
The VC-dimension of the set of all functions of the following form is infinite!
f(x,α) = sin(αx), α ∈ ℝ.
(The VC dimension of triangles on the previous slide is 3.)
The Vapnik-Chervonenkis dimension
The following points on the line

x_1 = 10^(−1), ..., x_n = 10^(−n)

can be shattered by functions from this set.
To separate these data into two classes determined by the sequence

δ_1, ..., δ_n ∈ {0,1}

it is sufficient to choose the value of the parameter

α = π( ∑_{i=1}^{n} (1−δ_i) 10^i + 1 )
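The construction can be checked by enumeration: for every labeling δ, the α given by the formula makes sin(αx_i) positive exactly where δ_i = 1 (the sign convention is an assumption, but it is the natural reading and the check below confirms it for n = 5).

```python
from itertools import product
from math import pi, sin

n = 5
xs = [10.0 ** -(i + 1) for i in range(n)]   # x_1 = 10^-1, ..., x_n = 10^-n

ok = True
for deltas in product([0, 1], repeat=n):
    # choose alpha by the formula on the slide
    alpha = pi * (sum((1 - d) * 10 ** (i + 1)
                      for i, d in enumerate(deltas)) + 1)
    # classify x by the sign of sin(alpha * x)
    labels = [1 if sin(alpha * x) > 0 else 0 for x in xs]
    ok = ok and labels == list(deltas)
```

All 2^n labelings are realized, so the n points are shattered for every n, and the VC-dimension is indeed infinite.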
VC-dimension of some (discrete) function sets
• linear attributes
[Figure: a linear scale of attribute values 1 2 3 4 5 6.]
VC-dimension of some (discrete) function sets
• tree-structured attributes
[Figure: attribute tree with "any shape" at the root, "convex shape" and "concave shape" at the next level, and "polygon shape" below.]
VC-dimension of some (discrete) function sets
Theorem Let X be the n-dimensional input space. If F is the set of all functions that are pure conjunctions of attribute-value constraints on X, then
n ≤ VC-dim(F) ≤ 2n
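The lower bound n can be verified constructively for boolean attributes: the n points that are all-ones except in one position are shattered by conjunctions of positive literals, since requiring literal x_j excludes exactly the point with a 0 at position j. A sketch (the choice of points and n = 4 are assumptions for the example):

```python
from itertools import combinations

n = 4
# point p_j has every boolean attribute set to 1 except attribute j
points = [tuple(0 if i == j else 1 for i in range(n)) for j in range(n)]

# pure conjunctions of positive literals: f(p) = all(p[i] == 1 for i in lits);
# the conjunction over lits accepts exactly the points whose index is not in lits
realized = set()
for r in range(n + 1):
    for lits in combinations(range(n), r):
        realized.add(frozenset(j for j in range(n)
                               if all(points[j][i] == 1 for i in lits)))
```

Every one of the 2^n subsets of the points is realized, so VC-dim(F) ≥ n.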
VC-dimension of some (discrete) function sets
Theorem Let X be the n-dimensional input space. If F is the set of all functions that are pure conjunctions of attribute-value constraints on X that contain at most s atoms (attribute-value constraints), then
s⌊log(n/s)⌋ ≤ VC-dim(F) ≤ 4s log(4s√n)
VC-Dimensions of Neural Networks
As a good heuristic (Bartlett): VC-dimension ≈ number of parameters.
• For linear threshold functions on ℜn,
VC-dim = n+1
(the number of parameters is n+1).
• For linear threshold networks, and fixed-depth networks with piecewise polynomial squashing functions,
c1|W| ≤ VC-dim ≤ c2|W| log |W|
where |W| is the number of weights in the network.
VC-Dimensions of Neural Networks
• Some threshold networks have VC-dim ≥ c|W| log |W|.
VC-Dimensions of Neural Networks
Any function class F that can be computed by a program that takes a real input vector x and k real parameters and involves no more than t of the following operations:
• +, −, ×, / on real numbers
• >, ≥, =, ≤, <, ≠ on real numbers
• output value y ∈ {−1,+1}
has VC-dimension of O(kt).
(See work of Peter Bartlett)
VC-Dimensions of Neural Networks
Any function class F that can be computed by a program that takes a real input vector x and k real parameters and involves no more than t of the following operations:
• +, −, ×, /, e^α on real numbers
• >, ≥, =, ≤, <, ≠ on real numbers
• output value y ∈ {−1,+1}
has VC-dimension of O(k2t2).
This includes sigmoid networks, RBF networks, mixtures of experts, etc.
(See work of Peter Bartlett)
VC-dimension Heuristic
For neural networks and decision trees: VC-dimension ≈ size.
Hence, the order of the misclassification probability is no more than

training error + √(size/m)

where m is the number of training examples.
This suggests that the number of training examples should grow roughly linearly with the size of the hypothesis to be produced.
Less data makes it likely that the learned function is not a near-optimal one. (See work of Peter Bartlett)
Structural Risk Minimisation (SRM)
The complexity (or capacity) of the function class from which the learner chooses a function that minimises the empirical risk (i.e. the error on the training data) determines the convergence rate of the learner to the optimal function.
For a given number of independently and identically distributed training examples, there is a trade-off between how far the empirical risk can be minimised and how much the empirical risk will deviate from the true risk (i.e. the true error: the error on unseen data according to the underlying distribution).
Structural Risk Minimisation (SRM)
[Figure: as the complexity degree of F_i increases, the empirical risk decreases while the confidence interval grows; the bound on the overall risk is minimised at an intermediate F_i with its function f_i.]
Consider a nested sequence of subsets of the set F of functions from which a function is chosen: F1 ⊂ F2 ⊂ ... ⊂ Fn ⊂ ...
Structural Risk Minimisation (SRM)
The general SRM principle: choose a complexity parameter d (e.g. the number of hidden units in an MLP, or the size of a decision tree) and a function f ∈ F_d such that the following is minimised:

R_emp(f) + c√(VC-dim(F_d)/m)

where m is the number of training examples.
The higher the VC dimension, the more likely the empirical error will be low.
Structural risk minimisation seeks the right balance.
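The SRM selection rule is a one-line minimisation once the empirical risks and VC-dimensions of the nested classes are known. The numbers below are invented for illustration (empirical risk falling, VC-dimension growing with d); they are not from the slides.

```python
from math import sqrt

def srm_choice(candidates, m, c=1.0):
    """candidates: (d, empirical_risk, vc_dim) triples for nested classes F_d.
    Pick the d minimising empirical risk + c * sqrt(VC-dim(F_d) / m)."""
    return min(candidates, key=lambda t: t[1] + c * sqrt(t[2] / m))

# illustrative values only: risk falls and VC-dimension grows with complexity d
candidates = [(1, 0.40, 5), (2, 0.15, 50), (3, 0.05, 500), (4, 0.01, 5000)]
best = srm_choice(candidates, m=1000)
```

With m = 1000 the bound is minimised at an intermediate complexity (d = 2 here), illustrating the balance SRM seeks: neither the simplest class (high empirical risk) nor the richest (huge confidence term) wins.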
Summary
• The VC-dimension is a useful combinatorial parameter of sets of functions.
• It can be used to estimate the true risk on the basis of the empirical risk and the number of independently and identically distributed training examples.
• It can also be used to determine a sufficient number of training examples to learn probably approximately correctly.
• The VC-dimension can further be applied to choose the most suitable subset of functions for a given number of independently and identically distributed examples.
• Trading empirical risk against confidence in the estimate seems important.
Summary (cont.)
Limitations of the VC Learning Theory
• The probabilistic bounds on the required number of examples are based on a worst-case analysis.
• No preference relation on the functions of the learner is modelled.
• In practice, training examples are not necessarily drawn from the same probability distribution as the test examples.