TRANSCRIPT
Structural risk minimization
Nuno Vasconcelos ECE Department, UCSD
2
Course projects
recall: project deliverables are:
• project report due Tuesday, June 12
• bring to class on presentation day, final, no exceptions!
• June 12, 1:00-5:00 PM, WLH 2114
we have 4h and 15 presentations
• will make it 15 min each
• bring your talk on a mini-disk or email it beforehand
grading:
• project: 70% of total grade
• project grading:
  • final paper: 70% (50% for content, 20% for writing)
  • presentation: 30%
3
Margins, VC dimension, and SRM
we have been talking about techniques that achieve good generalization by maximizing the margin
it turns out that the quantity of interest is the so-called VC dimension of the family of functions implemented by the learning machine
the margin allows us to control this dimension, and this is what makes it important
structural risk minimization tries to achieve the optimal balance between
• the empirical risk and
• a generalization bound that depends on the VC dimension
4
Loss functions and risk
goal of the learning machine: to find the set of parameters that minimizes the risk (the expected value of the loss)

$$R(f) = E\big\{L[y, f(x)]\big\} = \int L[y, f(x)]\, P_{X,Y}(x,y)\, dx\, dy$$

in practice it is impossible to evaluate the risk, because we do not know what P_{X,Y}(x,y) is. all we have is a training set

$$D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$$

we estimate the risk by the empirical risk on this training set

$$R_{emp}(f) = \frac{1}{n}\sum_{i=1}^{n} L[y_i, f(x_i)]$$
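The empirical risk is straightforward to compute. A minimal sketch in Python (the 0/1 loss, the toy sample, and the threshold classifier below are illustrative choices, not from the slides):

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """Average loss of f over the training set (the empirical risk)."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

# 0/1 loss for classification
zero_one = lambda y, yhat: float(y != yhat)

# toy 1-D sample: the label is the sign of x
X = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.sign(X)

# a simple (suboptimal) threshold classifier
f = lambda x: 1.0 if x > 1.0 else -1.0

R_emp = empirical_risk(f, X, y, zero_one)   # misclassifies only x = 0.5
```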
5
The law of large numbers
Theorem: if the x_i are iid with

$$x_i \in [a,b] \quad\text{and}\quad S_n = \frac{1}{n}\sum_i x_i,$$

then for all ε > 0 we have

$$P\big\{|S_n - E[S_n]| \geq \varepsilon\big\} \leq 2\, e^{-2n\varepsilon^2/(b-a)^2}$$

noting that, with ξ_i = L[y_i, f(x_i)],

$$R_{emp}[f] = \frac{1}{n}\sum_i \xi_i = S_n \qquad\qquad R[f] = E[\xi] = E[S_n]$$

the theorem seems to indicate that (for a loss taking values in [0,1])

$$P\big(|R[f] - R_{emp}[f]| \geq \varepsilon\big) \leq 2\, e^{-2n\varepsilon^2}$$

i.e. the empirical risk converges to the risk exponentially fast
this is the best that one could hope for; there seems to be no reason to use anything other than ERM
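The exponential bound can be checked numerically. A sketch, assuming Bernoulli(0.5) variables so that [a,b] = [0,1] and E[S_n] = 0.5 (all constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 20000

# x_i ~ Bernoulli(0.5) in [0, 1], so E[S_n] = 0.5
samples = rng.integers(0, 2, size=(trials, n))
S_n = samples.mean(axis=1)

# empirical frequency of a deviation of at least eps
freq = np.mean(np.abs(S_n - 0.5) >= eps)

# Hoeffding bound with a = 0, b = 1: 2 exp(-2 n eps^2)
bound = 2 * np.exp(-2 * n * eps**2)
```

The observed deviation frequency sits well below the bound, as it should for fixed (non-data-dependent) quantities.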
6
Empirical risk vs risk
since we know that ERM is not that good in practice, something must be wrong
the problem is that the bounds assume independent errors ξ_i
since we are choosing f (by ERM) so that the mean of the ξ_i is as small as possible, this is not the case
we need to look for alternative ways to understand the relationship between the two risks
for them to be equivalent we need the ERM solution f* to converge to the lowest value of R[f]
this turns out not to be possible unless we restrict F
7
Consistency of ERM
it turns out that there is a simple condition under which ERM converges to the minimum risk
Theorem (VC): the condition

$$\lim_{n\to\infty} P\left[\sup_{f\in\mathcal{F}} \big(R[f] - R_{emp}[f]\big) > \varepsilon\right] = 0$$

is necessary and sufficient for consistency of ERM
this shows that consistency depends on the class of functions F, but it is not terribly useful in practice
we next look at properties of F that ensure convergence
the first thing to do is to try to bound this quantity
8
VC Bounds
the probability

$$P\left(\sup_{f\in\mathcal{F}} \big|R_{emp}[f] - R[f]\big| > \varepsilon\right)$$

is easy to bound when F is finite
let F be the set F = {f_1, ..., f_M} and

$$C_\varepsilon^i = \Big\{(x_1,y_1),\ldots,(x_n,y_n) \;\Big|\; \big|R_{emp}[f_i] - R[f_i]\big| > \varepsilon\Big\}$$

the set of samples for which the risks obtained with the i-th function differ by more than ε
then, if M = 2,

$$P\left(\sup_{f} \big|R_{emp}[f] - R[f]\big| > \varepsilon\right) = P\big(C_\varepsilon^1 \cup C_\varepsilon^2\big) = P(C_\varepsilon^1) + P(C_\varepsilon^2) - P(C_\varepsilon^1 \cap C_\varepsilon^2) \leq P(C_\varepsilon^1) + P(C_\varepsilon^2)$$
9
VC Bounds
in general,

$$P\left(\sup_{f\in\mathcal{F}} \big|R_{emp}[f] - R[f]\big| > \varepsilon\right) \leq \sum_{i=1}^{M} P(C_\varepsilon^i)$$

this is called the union bound
recalling that

$$C_\varepsilon^i = \Big\{(x_1,y_1),\ldots,(x_n,y_n) \;\Big|\; \big|R_{emp}[f_i] - R[f_i]\big| > \varepsilon\Big\}$$

and noting that the f_i are fixed, so that the errors are independent, we can now
• apply the LLN bound to each of the P(C_ε^i)
• each of them is bounded by O(e^{-nε²})
• the overall bound is O(M e^{-nε²}) and convergence is exponentially fast
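The union bound for a finite class can be verified by simulation. A sketch with M = 5 fixed threshold classifiers on x ~ U(0,1) and y = 1{x > 0.5}, so that each true risk is |t_i − 0.5| in closed form (the whole setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, trials = 200, 0.1, 5000
thresholds = np.array([0.3, 0.4, 0.5, 0.6, 0.7])   # M = 5 fixed classifiers
M = len(thresholds)

# true risk of "predict 1 iff x > t" when y = 1{x > 0.5}, x ~ U(0,1)
true_risk = np.abs(thresholds - 0.5)

exceed = 0
for _ in range(trials):
    x = rng.uniform(size=n)
    y = (x > 0.5).astype(float)
    # empirical risk of each (fixed) classifier on this sample
    R_emp = np.array([np.mean((x > t).astype(float) != y) for t in thresholds])
    exceed += np.max(np.abs(R_emp - true_risk)) > eps

p_sup = exceed / trials                       # P(sup_i |R_emp - R| > eps)
union_bound = M * 2 * np.exp(-2 * n * eps**2) # M times the per-function bound
```

Because the classifiers are fixed (not chosen from the data), the sup deviation probability is far below the union bound.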
10
VC Bounds
the problem is the case when F contains an infinite number of functions. one of the main VC results is the solution to this:
• the probability that the empirical risk on a sample of n points differs from the risk by more than ε can be bounded by
• twice the probability that it differs from the empirical risk on a second sample of size 2n by more than ε/2

Theorem: for nε² > 2,

$$P\left[\sup_{f\in\mathcal{F}} \big|R[f] - R_{emp}[f]\big| > \varepsilon\right] \leq 2\, P\left[\sup_{f\in\mathcal{F}} \big|R_{emp}[f] - R'_{emp}[f]\big| > \varepsilon/2\right]$$

where
• the 1st P refers to a sample of size n and the 2nd to one of size 2n
• in the latter case, R_emp measures the loss on the first half and R'_emp the loss on the second half
11
VC dimension
this is intuitive:
• if the R_emp on two independent n-samples are close, then they should also be close to the true error rate
the practical significance is that
• this makes F effectively finite
• when we restrict the functions to 2n points there are at most 2^{2n} different elements in the set
• in practice the number can, of course, be smaller
• the VC dimension is determined by this number

[figure: functions restricted to the sample points x_1, x_2, ..., x_{2n}; all functions that agree on these points are the same when restricted to the sample]
12
VC dimension
to formalize this, we denote the 2n-point sample by

$$Z_{2n} = \{(x_1, y_1), \ldots, (x_{2n}, y_{2n})\}$$

and the cardinality of the set of distinct functions (restricted to this sample) by

$$N(\mathcal{F}, Z_{2n})$$

the maximum of this cardinality over all possible samples of size 2n,

$$N(\mathcal{F}, 2n),$$

is the shattering coefficient (or covering number) of F
it is a measure of the complexity of F: the number of ways in which it can separate the two classes
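For small classes, N(F, Z) can be computed by brute force: enumerate the labelings the class induces on the sample. A sketch counting the dichotomies that threshold functions induce on 4 points (far fewer than 2^4 = 16; the specific points and thresholds are illustrative):

```python
def n_dichotomies(points, classifiers):
    """Number of distinct labelings the classifiers induce on the points,
    i.e. N(F, Z) restricted to this sample."""
    labelings = {tuple(f(x) for x in points) for f in classifiers}
    return len(labelings)

points = [0.1, 0.5, 0.9, 1.3]

# one threshold between each pair of consecutive points, plus one on each
# side, is enough to realize every labeling this class can produce
ts = [-1.0, 0.3, 0.7, 1.1, 2.0]
classifiers = [(lambda x, t=t: int(x > t)) for t in ts]

N = n_dichotomies(points, classifiers)   # n + 1 = 5 labelings for 4 points
```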
13
VC bounds
but let’s go back to proving consistency
we have seen that

$$P\left[\sup_{f\in\mathcal{F}} \big|R[f] - R_{emp}[f]\big| > \varepsilon\right] \leq 2\, P\left[\sup_{f\in\mathcal{F}} \big|R_{emp}[f] - R'_{emp}[f]\big| > \varepsilon/2\right] \qquad (*)$$

to bound the right-hand side we
• pick a maximal set of functions {f_1, ..., f_{N(F, Z_{2n})}} that can be distinguished based on Z_{2n}
• use the union bound and apply the LLN bound to each term
• however, because the f_i depend on Z_{2n}, we still do not have independent errors
• VC’s trick was to consider random permutations σ of the sample
• denote by R_emp^σ[f] and R'_emp^σ[f] the empirical risks on the two halves after permutation
14
VC bounds
• since Z_{2n} is iid, the permutation does not affect (*) and we can simply consider

$$P\left[\sup_{f\in\mathcal{F}} \big|R^\sigma_{emp}[f] - R'^\sigma_{emp}[f]\big| > \varepsilon/2\right] = \int_{(\mathcal{X}\times\{1,-1\})^{2n}} P_\sigma\left[\sup_{f\in\mathcal{F}} \big|R^\sigma_{emp}[f] - R'^\sigma_{emp}[f]\big| > \varepsilon/2 \;\middle|\; Z_{2n}\right] P(Z_{2n})\, dZ_{2n}$$

• we next express the event

$$C_\varepsilon^{|Z_{2n}} = \left\{\sigma \;\middle|\; \sup_{f\in\mathcal{F}} \big|R^\sigma_{emp}[f] - R'^\sigma_{emp}[f]\big| > \varepsilon/2\right\}$$

• as

$$C_\varepsilon^{|Z_{2n}} = \bigcup_{k=1}^{N(\mathcal{F}, Z_{2n})} C_\varepsilon(f_k), \qquad C_\varepsilon(f_k) = \left\{\sigma \;\middle|\; \big|R^\sigma_{emp}[f_k] - R'^\sigma_{emp}[f_k]\big| > \varepsilon/2\right\}$$

where the C_ε(f_k) refer to the individual functions
15
VC bounds
• note that, because we are conditioning on Z_{2n}, the functions f_k can now be considered fixed, and the errors iid
• we can apply the LLN to each term to get

$$P\left[C_\varepsilon(f_k) \;\middle|\; Z_{2n}\right] \leq 2\, e^{-n\varepsilon^2/8}$$

• and then use the union bound to get a bound on C_ε

$$P\left[C_\varepsilon^{|Z_{2n}} \;\middle|\; Z_{2n}\right] \leq 2\, N(\mathcal{F}, Z_{2n})\, e^{-n\varepsilon^2/8}$$

• from which

$$P\left[\sup_{f\in\mathcal{F}} \big|R^\sigma_{emp}[f] - R'^\sigma_{emp}[f]\big| > \varepsilon/2\right] = \int P\left[C_\varepsilon^{|Z_{2n}} \;\middle|\; Z_{2n}\right] P(Z_{2n})\, dZ_{2n} \leq \int 2\, N(\mathcal{F}, Z_{2n})\, e^{-n\varepsilon^2/8}\, P(Z_{2n})\, dZ_{2n} = 2\, E\big[N(\mathcal{F}, Z_{2n})\big]\, e^{-n\varepsilon^2/8}$$
16
VC bounds
• we can now go back to (*) to get the bound

$$P\left[\sup_{f\in\mathcal{F}} \big|R[f] - R_{emp}[f]\big| > \varepsilon\right] \leq 4\, E\big[N(\mathcal{F}, Z_{2n})\big]\, e^{-n\varepsilon^2/8}$$

hence, as long as N(F, Z_{2n}) does not grow exponentially with n, it is possible to bound the risk!
making

$$\delta = 4\, E\big[N(\mathcal{F}, Z_{2n})\big]\, e^{-n\varepsilon^2/8} \quad\Longleftrightarrow\quad \varepsilon = \sqrt{\frac{8}{n}\,\ln\frac{4\, E\big[N(\mathcal{F}, Z_{2n})\big]}{\delta}}$$

we have that, with probability at least 1 − δ,

$$R[f] \leq R_{emp}[f] + \sqrt{\frac{8}{n}\left(\ln E\big[N(\mathcal{F}, Z_{2n})\big] + \ln\frac{4}{\delta}\right)} \qquad (**)$$
17
VC bounds
the quantity

$$\ln E\big[N(\mathcal{F}, Z_{2n})\big]$$

is called the “annealed entropy”
it is difficult to evaluate, since we do not have a distribution for Z_{2n}
it is usually upper bounded by the sup over all samples

$$G^{\mathcal{F}}(n) = \max_{Z_n} \ln N(\mathcal{F}, Z_n) = \ln N(\mathcal{F}, n)$$

this is called the “growth function” and is the log of the shattering coefficient
if F is as rich as possible, then every Z_n can be separated in 2^n ways and

$$G^{\mathcal{F}}(n) = n \ln 2$$
18
VC dimension
note that, in this case, there is no convergence, since (**) does not go to zero
VC showed that either
• G^F(n) = n ln 2 holds for all n, or
• there is one n above which

$$G^{\mathcal{F}}(n) < n \ln 2$$

• this n is the VC dimension:

VC(F) = maximum number n of points that can be separated in 2^n ways by functions of F

• if such an n does not exist, the VC dimension is infinite
in summary, a finite VC dimension is a sufficient condition for (exponentially fast) convergence to the true risk
19
VC dimension
note that if N(F, n) = 2^n, all possible separations can be implemented by functions in F
in this case the functions in F are said to shatter n points
note:
• this means that there are n points that can be separated in all possible ways
• it does not mean that this applies to all sets of n points
example:
• let F be the set of lines in R^2. For 3 points in general position, every one of the 2^3 = 8 labelings can be realized by a line
• the set of lines shatters three points in R^2!

[figure: the 8 possible labelings of 3 points with x’s and o’s, each separable by a line]
20
VC dimension
example:
• find a set of four points that is shattered (that is, can be separated in all possible ways) by a line in R^2
• the following configuration is never possible (xor): two opposite corners labeled x, the other two labeled o
• the set of lines does not shatter four points in R^2!
hence

VC(F) = maximum number of points that can be shattered by functions of F

example:
• the VC dimension of the set of hyperplanes in R^d is d+1
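Both shattering claims can be checked exactly: a labeling is linearly separable iff the system y_i(w·x_i + b) ≥ 1 is feasible, which is a linear program. A sketch, assuming scipy is available (the point configurations are illustrative):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Exact separability test: feasible w, b with y_i (w.x_i + b) >= 1."""
    A = -(y[:, None] * np.hstack([X, np.ones((len(X), 1))]))
    b = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A, b_ub=b,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0            # 0 = feasible solution found

def shattered(X):
    """True if lines realize all 2^n dichotomies of the points X."""
    return all(linearly_separable(X, np.array(labels))
               for labels in itertools.product([-1.0, 1.0], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # general position
xor4 = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])   # xor corners
```

`shattered(three)` succeeds and `shattered(xor4)` fails on the xor labeling, matching VC = 3 for lines in R^2.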
21
VC dimension
VC have also characterized what happens to the growth function beyond the VC dimension
in particular, they have shown that for n > VC(F)

$$G^{\mathcal{F}}(n) \leq VC(\mathcal{F})\left[\ln\frac{n}{VC(\mathcal{F})} + 1\right]$$

i.e.
• up to VC(F) the growth function is linear in n
• after that it increases logarithmically, i.e. much more slowly

[figure: G^F(n) vs n, linear up to VC(F), logarithmic beyond VC(F)]
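The two regimes can be compared numerically; a sketch with an illustrative VC dimension h = 10:

```python
import numpy as np

def growth_bound(n, h):
    """Bound on the growth function G_F(n) for n > h = VC(F)."""
    return h * (np.log(n / h) + 1)

h = 10                                   # illustrative VC dimension
ns = np.array([100, 1000, 10000])
bounds = growth_bound(ns, h)             # logarithmic regime beyond VC(F)
linear = ns * np.log(2)                  # n ln 2: the fully shattering regime
```

Beyond the VC dimension the bound falls far below n ln 2, which is what makes (**) non-vacuous.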
22
VC dimension
the bound on the risk can be expressed in terms of VC(F):

$$R[f] \leq R_{emp}[f] + \sqrt{\frac{8}{n}\left(VC(\mathcal{F})\left[\ln\frac{n}{VC(\mathcal{F})} + 1\right] + \ln\frac{4}{\delta}\right)}$$

it is easy to show that the second term is monotonically increasing with VC(F)
• e.g. for n = 10,000, δ = 0.05
• note that the bound can be quite loose
• there is a lot of more recent work
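The slide's example (n = 10,000, δ = 0.05) can be reproduced; a sketch showing how the confidence term grows with VC(F), and how loose it gets:

```python
import numpy as np

def vc_confidence(n, h, delta):
    """Second term of the VC bound, using G(n) <= h (ln(n/h) + 1)."""
    return np.sqrt(8.0 / n * (h * (np.log(n / h) + 1) + np.log(4 / delta)))

n, delta = 10_000, 0.05
vals = {h: vc_confidence(n, h, delta) for h in (10, 100, 1000)}
# the term increases with VC(F); for large VC(F) it exceeds 1, so the
# bound on a [0,1] risk becomes vacuous -- "the bound can be quite loose"
```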
23
The connection to the margin
in summary:
• a finite VC dimension is a sufficient condition for convergence to the true risk
• the smaller the VC dimension, the smaller the sample we need to have the same convergence guarantees
• but we are already using hyperplanes, and their VC dimension is linear in the dimension of the space
• can we do better than this?
• so far, we have not considered how to choose the hyperplane
• we can get an additional improvement when we consider the margin
• we proved this last time
24
The connection to the margin
Theorem: consider hyperplanes of the form w^T x = 0, where w is normalized with respect to a sample D = {(x_1,y_1),...,(x_n,y_n)} in the usual way:

$$\min_i |w^T x_i| = 1$$

Then, the set of functions

$$\mathcal{F} = \big\{\mathrm{sgn}(w^T x)\big\}$$

defined on X* = {x_1, ..., x_n} and satisfying

$$\|w\| \leq \lambda$$

has VC dimension such that

$$VC(\mathcal{F}) \leq R^2 \lambda^2$$

where R is the radius of the smallest sphere centered at the origin and containing X*.
25
The role of the margin
notes:
• the theorem basically says that if ||w|| ≤ λ then VC(F) ≤ R²λ²
• when we maximize the margin (minimize ||w||) we are decreasing the value of λ and, therefore, decreasing the upper bound on the VC dimension
• note that, as long as ||w|| is finite, the VC dimension is finite even if the dimension of the space is infinite
• since γ = 1/||w||, this is always true if the margin is strictly greater than zero
in summary, we can decrease the VC dimension below the d+1 limit of R^d
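A tiny numeric check of the theorem's quantities; the sample and the w (chosen so the canonical-form condition holds) are illustrative:

```python
import numpy as np

# toy sample in R^2, with w in canonical form: min_i |w.x_i| = 1
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
w = np.array([0.5, 0.0])

canonical = np.min(np.abs(X @ w))          # = 1: canonical-form check
R = np.max(np.linalg.norm(X, axis=1))      # radius of enclosing sphere
margin = 1.0 / np.linalg.norm(w)           # gamma = 1 / ||w|| = 2
vc_bound = R**2 * np.linalg.norm(w)**2     # R^2 lambda^2, with lambda = ||w||
```

Here R² λ² = 10 · 0.25 = 2.5, below the d+1 = 3 limit for hyperplanes in R²: a large margin buys a smaller VC bound.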
26
Structural risk minimization
this suggests an alternative principle to ERM, which is called SRM
note that the 2nd term in the bound does not depend on f itself but on the class of functions F
SRM advocates imposing some structure on F, and minimizing over the elements of the structure
the structure is obtained by decomposing the set F into a collection of nested subsets of increasing capacity (and size), e.g. polynomials of increasing order
functions in larger sets achieve smaller R_emp but have higher VC dimension and are more penalized by the 2nd term
27
Structural risk minimization
we have something like this

[figure: bound on risk = empirical risk + capacity term, plotted against VC(F); the empirical risk decreases and the capacity term increases with the VC dimension, so the bound is minimized at an intermediate f*, over the nested structure S_{n-1} ⊂ S_n ⊂ S_{n+1} of increasing VC dimension]
28
In summary
the SRM principle:
• start from a nested collection of families of functions

$$S_1 \subset \cdots \subset S_k$$

where S_i = {h_i(x, α), for all α}
• for each S_i, find the function (set of parameters) that minimizes the empirical risk

$$R_{emp}^i = \min_\alpha \frac{1}{n}\sum_{k=1}^{n} L[y_k, h_i(x_k, \alpha)]$$

• select the function class such that

$$i^* = \arg\min_i \left\{ R_{emp}^i + \sqrt{\frac{8}{n}\left(VC(S_i)\left[\ln\frac{n}{VC(S_i)} + 1\right] + \ln\frac{4}{\delta}\right)} \right\}$$
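The three steps above can be sketched in code. This is only an illustration: least squares stands in for exact empirical risk minimization within each class, the data are synthetic, and VC(S_d) = d + 1 (the sign of a degree-d polynomial on the line) is an assumed capacity for the penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 200, 0.05

# synthetic 1-D problem: the true boundary is quadratic, labels slightly noisy
x = rng.uniform(-1, 1, n)
y = np.sign(x**2 - 0.3 + 0.1 * rng.standard_normal(n))

def fit_emp_risk(x, y, degree):
    """Least-squares polynomial score with its sign as the label --
    a stand-in for exact ERM within the class S_degree."""
    coef = np.polyfit(x, y, degree)
    return np.mean(np.sign(np.polyval(coef, x)) != y)

def penalty(n, h, delta):
    """Capacity term of the VC bound, with G(n) <= h (ln(n/h) + 1)."""
    return np.sqrt(8.0 / n * (h * (np.log(n / h) + 1) + np.log(4 / delta)))

# SRM: minimize empirical risk + capacity penalty over the nested classes
best = min(range(1, 8),
           key=lambda d: fit_emp_risk(x, y, d) + penalty(n, d + 1, delta))
```

On this data the linear class underfits badly, while high degrees pay a growing penalty, so SRM lands on a low degree that matches the quadratic boundary.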
29
In practice
this leads to the methods that we have studied
SVMs:
• the empirical risk is determined by the number of outliers ξ_i (each at distance ξ_i/||w*|| from its margin)
• the margin 1/||w*|| is maximized when we minimize ||w||
• this leads to the problem

$$\min_{w, b, \xi}\; \|w\|^2 + \sum_i \xi_i \quad\text{subject to}\quad y_i(w^T x_i + b) \geq 1 - \xi_i,\;\; \xi_i \geq 0,\;\; \forall i$$

[figure: maximum-margin hyperplane with margin 1/||w*|| on each side; an outlier x_i sits at distance ξ_i/||w*|| inside the margin]
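The soft-margin problem can be sketched with (sub)gradient descent on the primal (hinge-loss form); this is an illustration, not the QP solver one would use in practice, and the learning rate, C, and toy data are arbitrary choices:

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
    by subgradient descent -- a sketch of the soft-margin primal."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points with xi_i > 0
        grad_w = 2 * w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_subgradient(X, y)
```

On separable data the iterates settle near the maximum-margin hyperplane, with the closest points at margin approximately 1.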
30
In practice
boosting:
• does greedy descent on the loss

$$L(y, x) = \exp\big(-y\, g(x)\big)$$

• this is inversely proportional to the margin
• leads to a maximal-margin solution
• and the algorithm explicitly minimizes the training error, which is the empirical risk

[figure: the “0-1”, SVM (hinge), and boosting (exponential) losses as a function of the margin y g(x)]
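Greedy descent on the exponential loss is what AdaBoost implements; a small sketch with threshold stumps as weak learners (the toy data are illustrative):

```python
import numpy as np

def stump_predict(x, thr, sign):
    return sign * np.where(x > thr, 1.0, -1.0)

def adaboost(x, y, rounds=3):
    """AdaBoost with threshold stumps: greedy coordinate descent on
    the exponential loss mean_i exp(-y_i g(x_i))."""
    n = len(x)
    w = np.full(n, 1.0 / n)                      # sample weights
    learners = []
    for _ in range(rounds):
        best = None                              # (weighted error, thr, sign)
        for thr in (x[:-1] + x[1:]) / 2:
            for sign in (1.0, -1.0):
                err = np.sum(w * (stump_predict(x, thr, sign) != y))
                if best is None or err < best[0]:
                    best = (err, thr, sign)
        err, thr, sign = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # weak-learner weight
        w *= np.exp(-alpha * y * stump_predict(x, thr, sign))
        w /= w.sum()                             # reweight hard examples
        learners.append((alpha, thr, sign))
    def g(xq):
        return sum(a * stump_predict(xq, t, s) for a, t, s in learners)
    return g

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([-1.0, -1.0, 1.0, 1.0, -1.0, 1.0])  # no single stump separates this
g = adaboost(x, y)
exp_loss = np.mean(np.exp(-y * g(x)))            # decreases each round
train_err = np.mean(np.sign(g(x)) != y)
```

After three rounds the combined g classifies all six points correctly, even though no individual stump can, and the exponential loss has dropped well below its initial value of 1.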
31