TRANSCRIPT
Structural risk minimization
Nuno Vasconcelos ECE Department, UCSD
2
Course projects
recall: project deliverables are:
• project report due Tuesday, June 12
• bring to class on presentation day, final, no exceptions!
• June 12, 1:00-5:00 PM, WLH 2114
we have 4h and 15 presentations
• will make it 15 min each
• bring your talk on a mini-disk or email it beforehand
grading:
• project: 70% of total grade
• project grading:
  • final paper: 70% (50% for content, 20% for writing)
  • presentation: 30%
3
Margins, VC dimension, and SRM
we have been talking about techniques that achieve good generalization by maximizing the margin
it turns out that the quantity of interest is the so-called VC dimension of the family of functions implemented by the learning machine
the margin allows us to control this dimension, and this is what makes it important
structural risk minimization tries to achieve the optimal balance between
• the empirical risk and
• a generalization bound that depends on the VC dimension
4
Loss functions and risk
goal of the learning machine: to find the set of parameters that minimizes the risk (the expected value of the loss)

$$R(f) = E\big\{L[y, f(x)]\big\} = \int L[y, f(x)]\, P_{X,Y}(x,y)\, dx\, dy$$

in practice it is impossible to evaluate the risk, because we do not know what P_{X,Y}(x,y) is. all we have is a training set

$$D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$$

we estimate the risk by the empirical risk on this training set

$$R_{emp}(f) = \frac{1}{n}\sum_{i=1}^{n} L[y_i, f(x_i)]$$
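The empirical risk is straightforward to compute. A minimal sketch in Python (the 0/1 loss, the toy sample, and the threshold classifier below are illustrative choices, not from the slides):

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """Average loss of f over the training set (the empirical risk)."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

# 0/1 loss for classification
zero_one = lambda y, yhat: float(y != yhat)

# toy 1-D sample: the label is the sign of x
X = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.sign(X)

# a simple (suboptimal) threshold classifier
f = lambda x: 1.0 if x > 1.0 else -1.0

R_emp = empirical_risk(f, X, y, zero_one)   # misclassifies only x = 0.5
```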
5
The law of large numbers
Theorem: if the x_i are iid with

$$x_i \in [a,b] \quad\text{and}\quad S_n = \frac{1}{n}\sum_i x_i,$$

then for all ε > 0 we have

$$P\big\{|S_n - E[S_n]| \geq \varepsilon\big\} \leq 2\, e^{-2n\varepsilon^2/(b-a)^2}$$

noting that, with ξ_i = L[y_i, f(x_i)],

$$R_{emp}[f] = \frac{1}{n}\sum_i \xi_i = S_n \qquad\qquad R[f] = E[\xi] = E[S_n]$$

the theorem seems to indicate that (for a loss taking values in [0,1])

$$P\big(|R[f] - R_{emp}[f]| \geq \varepsilon\big) \leq 2\, e^{-2n\varepsilon^2}$$

i.e. the empirical risk converges to the risk exponentially fast
this is the best that one could hope for; there seems to be no reason to use anything other than ERM
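The exponential bound can be checked numerically. A sketch, assuming Bernoulli(0.5) variables so that [a,b] = [0,1] and E[S_n] = 0.5 (all constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 20000

# x_i ~ Bernoulli(0.5) in [0, 1], so E[S_n] = 0.5
samples = rng.integers(0, 2, size=(trials, n))
S_n = samples.mean(axis=1)

# empirical frequency of a deviation of at least eps
freq = np.mean(np.abs(S_n - 0.5) >= eps)

# Hoeffding bound with a = 0, b = 1: 2 exp(-2 n eps^2)
bound = 2 * np.exp(-2 * n * eps**2)
```

The observed deviation frequency sits well below the bound, as it should for fixed (non-data-dependent) quantities.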
6
Empirical risk vs risk
since we know that ERM is not that good in practice, something must be wrong
the problem is that the bounds assume independent errors ξ_i
since we are choosing f (by ERM) so that the mean of the ξ_i is as small as possible, this is not the case
we need to look for alternative ways to understand the relationship between the two risks
for them to be equivalent we need the ERM solution f* to converge to the lowest value of R[f]
this turns out not to be possible unless we restrict F
7
Consistency of ERM
it turns out that there is a simple condition under which ERM converges to the minimum risk
Theorem (VC): the condition

$$\lim_{n\to\infty} P\left[\sup_{f\in\mathcal{F}} \big(R[f] - R_{emp}[f]\big) > \varepsilon\right] = 0$$

is necessary and sufficient for consistency of ERM
this shows that consistency depends on the class of functions F, but it is not terribly useful in practice
we next look at properties of F that ensure convergence
the first thing to do is to try to bound this quantity
8
VC Bounds
the probability

$$P\left(\sup_{f\in\mathcal{F}} \big|R_{emp}[f] - R[f]\big| > \varepsilon\right)$$

is easy to bound when F is finite
let F be the set F = {f_1, ..., f_M} and

$$C_\varepsilon^i = \Big\{(x_1,y_1),\ldots,(x_n,y_n) \;\Big|\; \big|R_{emp}[f_i] - R[f_i]\big| > \varepsilon\Big\}$$

the set of samples for which the risks obtained with the i-th function differ by more than ε
then, if M = 2,

$$P\left(\sup_{f} \big|R_{emp}[f] - R[f]\big| > \varepsilon\right) = P\big(C_\varepsilon^1 \cup C_\varepsilon^2\big) = P(C_\varepsilon^1) + P(C_\varepsilon^2) - P(C_\varepsilon^1 \cap C_\varepsilon^2) \leq P(C_\varepsilon^1) + P(C_\varepsilon^2)$$
9
VC Bounds
in general,

$$P\left(\sup_{f\in\mathcal{F}} \big|R_{emp}[f] - R[f]\big| > \varepsilon\right) \leq \sum_{i=1}^{M} P(C_\varepsilon^i)$$

this is called the union bound
recalling that

$$C_\varepsilon^i = \Big\{(x_1,y_1),\ldots,(x_n,y_n) \;\Big|\; \big|R_{emp}[f_i] - R[f_i]\big| > \varepsilon\Big\}$$

and noting that the f_i are fixed, so that the errors are independent, we can now
• apply the LLN bound to each of the P(C_ε^i)
• each of them is bounded by O(e^{-nε²})
• the overall bound is O(M e^{-nε²}) and convergence is exponentially fast
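The union bound for a finite class can be verified by simulation. A sketch with M = 5 fixed threshold classifiers on x ~ U(0,1) and y = 1{x > 0.5}, so that each true risk is |t_i − 0.5| in closed form (the whole setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, trials = 200, 0.1, 5000
thresholds = np.array([0.3, 0.4, 0.5, 0.6, 0.7])   # M = 5 fixed classifiers
M = len(thresholds)

# true risk of "predict 1 iff x > t" when y = 1{x > 0.5}, x ~ U(0,1)
true_risk = np.abs(thresholds - 0.5)

exceed = 0
for _ in range(trials):
    x = rng.uniform(size=n)
    y = (x > 0.5).astype(float)
    # empirical risk of each (fixed) classifier on this sample
    R_emp = np.array([np.mean((x > t).astype(float) != y) for t in thresholds])
    exceed += np.max(np.abs(R_emp - true_risk)) > eps

p_sup = exceed / trials                       # P(sup_i |R_emp - R| > eps)
union_bound = M * 2 * np.exp(-2 * n * eps**2) # M times the per-function bound
```

Because the classifiers are fixed (not chosen from the data), the sup deviation probability is far below the union bound.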
10
VC Bounds
the problem is the case when F contains an infinite number of functions. one of the main VC results is the solution to this:
• the probability that the empirical risk on a sample of n points differs from the risk by more than ε can be bounded by
• twice the probability that it differs from the empirical risk on a second sample of size 2n by more than ε/2

Theorem: for nε² > 2,

$$P\left[\sup_{f\in\mathcal{F}} \big|R[f] - R_{emp}[f]\big| > \varepsilon\right] \leq 2\, P\left[\sup_{f\in\mathcal{F}} \big|R_{emp}[f] - R'_{emp}[f]\big| > \varepsilon/2\right]$$

where
• the 1st P refers to a sample of size n and the 2nd to one of size 2n
• in the latter case, R_emp measures the loss on the first half and R'_emp the loss on the second half
11
VC dimension
this is intuitive:
• if the R_emp on two independent n-samples are close, then they should also be close to the true error rate
the practical significance is that
• this makes F effectively finite
• when we restrict the functions to 2n points there are at most 2^{2n} different elements in the set
• in practice the number can, of course, be smaller
• the VC dimension is determined by this number

[figure: functions restricted to the sample points x_1, x_2, ..., x_{2n}; all functions that agree on these points are the same when restricted to the sample]
12
VC dimension
to formalize this, we denote the 2n-point sample by

$$Z_{2n} = \{(x_1, y_1), \ldots, (x_{2n}, y_{2n})\}$$

and the cardinality of the set of distinct functions (restricted to this sample) by

$$N(\mathcal{F}, Z_{2n})$$

the maximum of this cardinality over all possible samples of size 2n,

$$N(\mathcal{F}, 2n),$$

is the shattering coefficient (or covering number) of F
it is a measure of the complexity of F: the number of ways in which it can separate the two classes
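For small classes, N(F, Z) can be computed by brute force: enumerate the labelings the class induces on the sample. A sketch counting the dichotomies that threshold functions induce on 4 points (far fewer than 2^4 = 16; the specific points and thresholds are illustrative):

```python
def n_dichotomies(points, classifiers):
    """Number of distinct labelings the classifiers induce on the points,
    i.e. N(F, Z) restricted to this sample."""
    labelings = {tuple(f(x) for x in points) for f in classifiers}
    return len(labelings)

points = [0.1, 0.5, 0.9, 1.3]

# one threshold between each pair of consecutive points, plus one on each
# side, is enough to realize every labeling this class can produce
ts = [-1.0, 0.3, 0.7, 1.1, 2.0]
classifiers = [(lambda x, t=t: int(x > t)) for t in ts]

N = n_dichotomies(points, classifiers)   # n + 1 = 5 labelings for 4 points
```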
13
VC bounds
but let’s go back to proving consistency
we have seen that

$$P\left[\sup_{f\in\mathcal{F}} \big|R[f] - R_{emp}[f]\big| > \varepsilon\right] \leq 2\, P\left[\sup_{f\in\mathcal{F}} \big|R_{emp}[f] - R'_{emp}[f]\big| > \varepsilon/2\right] \qquad (*)$$

to bound the right-hand side we
• pick a maximal set of functions {f_1, ..., f_{N(F, Z_{2n})}} that can be distinguished based on Z_{2n}
• use the union bound and apply the LLN bound to each term
• however, because the f_i depend on Z_{2n}, we still do not have independent errors
• VC’s trick was to consider random permutations σ of the sample
• denote by R_emp^σ[f] and R'_emp^σ[f] the empirical risks on the two halves after permutation
14
VC bounds
• since Z_{2n} is iid, the permutation does not affect (*) and we can simply consider

$$P\left[\sup_{f\in\mathcal{F}} \big|R^\sigma_{emp}[f] - R'^\sigma_{emp}[f]\big| > \varepsilon/2\right] = \int_{(\mathcal{X}\times\{1,-1\})^{2n}} P_\sigma\left[\sup_{f\in\mathcal{F}} \big|R^\sigma_{emp}[f] - R'^\sigma_{emp}[f]\big| > \varepsilon/2 \;\middle|\; Z_{2n}\right] P(Z_{2n})\, dZ_{2n}$$

• we next express the event

$$C_\varepsilon^{|Z_{2n}} = \left\{\sigma \;\middle|\; \sup_{f\in\mathcal{F}} \big|R^\sigma_{emp}[f] - R'^\sigma_{emp}[f]\big| > \varepsilon/2\right\}$$

• as

$$C_\varepsilon^{|Z_{2n}} = \bigcup_{k=1}^{N(\mathcal{F}, Z_{2n})} C_\varepsilon(f_k), \qquad C_\varepsilon(f_k) = \left\{\sigma \;\middle|\; \big|R^\sigma_{emp}[f_k] - R'^\sigma_{emp}[f_k]\big| > \varepsilon/2\right\}$$

where the C_ε(f_k) refer to the individual functions
15
VC bounds
• note that, because we are conditioning on Z_{2n}, the functions f_k can now be considered fixed, and the errors iid
• we can apply the LLN to each term to get

$$P\left[C_\varepsilon(f_k) \;\middle|\; Z_{2n}\right] \leq 2\, e^{-n\varepsilon^2/8}$$

• and then use the union bound to get a bound on C_ε

$$P\left[C_\varepsilon^{|Z_{2n}} \;\middle|\; Z_{2n}\right] \leq 2\, N(\mathcal{F}, Z_{2n})\, e^{-n\varepsilon^2/8}$$

• from which

$$P\left[\sup_{f\in\mathcal{F}} \big|R^\sigma_{emp}[f] - R'^\sigma_{emp}[f]\big| > \varepsilon/2\right] = \int P\left[C_\varepsilon^{|Z_{2n}} \;\middle|\; Z_{2n}\right] P(Z_{2n})\, dZ_{2n} \leq \int 2\, N(\mathcal{F}, Z_{2n})\, e^{-n\varepsilon^2/8}\, P(Z_{2n})\, dZ_{2n} = 2\, E\big[N(\mathcal{F}, Z_{2n})\big]\, e^{-n\varepsilon^2/8}$$
16
VC bounds
• we can now go back to (*) to get the bound

$$P\left[\sup_{f\in\mathcal{F}} \big|R[f] - R_{emp}[f]\big| > \varepsilon\right] \leq 4\, E\big[N(\mathcal{F}, Z_{2n})\big]\, e^{-n\varepsilon^2/8}$$

hence, as long as N(F, Z_{2n}) does not grow exponentially with n, it is possible to bound the risk!
making

$$\delta = 4\, E\big[N(\mathcal{F}, Z_{2n})\big]\, e^{-n\varepsilon^2/8} \quad\Longleftrightarrow\quad \varepsilon = \sqrt{\frac{8}{n}\,\ln\frac{4\, E\big[N(\mathcal{F}, Z_{2n})\big]}{\delta}}$$

we have that, with probability at least 1 − δ,

$$R[f] \leq R_{emp}[f] + \sqrt{\frac{8}{n}\left(\ln E\big[N(\mathcal{F}, Z_{2n})\big] + \ln\frac{4}{\delta}\right)} \qquad (**)$$
17
VC bounds
the quantity

$$\ln E\big[N(\mathcal{F}, Z_{2n})\big]$$

is called the “annealed entropy”
it is difficult to evaluate, since we do not have a distribution for Z_{2n}
it is usually upper bounded by the sup over all samples

$$G^{\mathcal{F}}(n) = \max_{Z_n} \ln N(\mathcal{F}, Z_n) = \ln N(\mathcal{F}, n)$$

this is called the “growth function” and is the log of the shattering coefficient
if F is as rich as possible, then every Z_n can be separated in 2^n ways and

$$G^{\mathcal{F}}(n) = n \ln 2$$
18
VC dimension
note that, in this case, there is no convergence, since (**) does not go to zero
VC showed that either
• G^F(n) = n ln 2 holds for all n, or
• there is one n above which

$$G^{\mathcal{F}}(n) < n \ln 2$$

• this n is the VC dimension:

VC(F) = maximum number n of points that can be separated in 2^n ways by functions of F

• if such an n does not exist, the VC dimension is infinite
in summary, a finite VC dimension is a sufficient condition for (exponentially fast) convergence to the true risk
19
VC dimension
note that if N(F, n) = 2^n, all possible separations can be implemented by functions in F
in this case the functions in F are said to shatter n points
note:
• this means that there are n points that can be separated in all possible ways
• it does not mean that this applies to all sets of n points
example:
• let F be the set of lines in R^2. For 3 points in general position, every one of the 2^3 = 8 labelings can be realized by a line
• the set of lines shatters three points in R^2!

[figure: the 8 possible labelings of 3 points with x’s and o’s, each separable by a line]
20
VC dimension
example:
• find a set of four points that is shattered (that is, can be separated in all possible ways) by a line in R^2
• the following configuration is never possible (xor): two opposite corners labeled x, the other two labeled o
• the set of lines does not shatter four points in R^2!
hence

VC(F) = maximum number of points that can be shattered by functions of F

example:
• the VC dimension of the set of hyperplanes in R^d is d+1
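Both shattering claims can be checked exactly: a labeling is linearly separable iff the system y_i(w·x_i + b) ≥ 1 is feasible, which is a linear program. A sketch, assuming scipy is available (the point configurations are illustrative):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Exact separability test: feasible w, b with y_i (w.x_i + b) >= 1."""
    A = -(y[:, None] * np.hstack([X, np.ones((len(X), 1))]))
    b = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A, b_ub=b,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0            # 0 = feasible solution found

def shattered(X):
    """True if lines realize all 2^n dichotomies of the points X."""
    return all(linearly_separable(X, np.array(labels))
               for labels in itertools.product([-1.0, 1.0], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # general position
xor4 = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])   # xor corners
```

`shattered(three)` succeeds and `shattered(xor4)` fails on the xor labeling, matching VC = 3 for lines in R^2.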
21
VC dimension
VC have also characterized what happens to the growth function beyond the VC dimension
in particular, they have shown that for n > VC(F)

$$G^{\mathcal{F}}(n) \leq VC(\mathcal{F})\left[\ln\frac{n}{VC(\mathcal{F})} + 1\right]$$

i.e.
• up to VC(F) the growth function is linear in n
• after that it increases logarithmically, i.e. much more slowly

[figure: G^F(n) vs n, linear up to VC(F), logarithmic beyond VC(F)]
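The two regimes can be compared numerically; a sketch with an illustrative VC dimension h = 10:

```python
import numpy as np

def growth_bound(n, h):
    """Bound on the growth function G_F(n) for n > h = VC(F)."""
    return h * (np.log(n / h) + 1)

h = 10                                   # illustrative VC dimension
ns = np.array([100, 1000, 10000])
bounds = growth_bound(ns, h)             # logarithmic regime beyond VC(F)
linear = ns * np.log(2)                  # n ln 2: the fully shattering regime
```

Beyond the VC dimension the bound falls far below n ln 2, which is what makes (**) non-vacuous.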
22
VC dimension
the bound on the risk can be expressed in terms of VC(F):

$$R[f] \leq R_{emp}[f] + \sqrt{\frac{8}{n}\left(VC(\mathcal{F})\left[\ln\frac{n}{VC(\mathcal{F})} + 1\right] + \ln\frac{4}{\delta}\right)}$$

it is easy to show that the second term is monotonically increasing with VC(F)
• e.g. for n = 10,000, δ = 0.05
• note that the bound can be quite loose
• there is a lot of more recent work
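The slide's example (n = 10,000, δ = 0.05) can be reproduced; a sketch showing how the confidence term grows with VC(F), and how loose it gets:

```python
import numpy as np

def vc_confidence(n, h, delta):
    """Second term of the VC bound, using G(n) <= h (ln(n/h) + 1)."""
    return np.sqrt(8.0 / n * (h * (np.log(n / h) + 1) + np.log(4 / delta)))

n, delta = 10_000, 0.05
vals = {h: vc_confidence(n, h, delta) for h in (10, 100, 1000)}
# the term increases with VC(F); for large VC(F) it exceeds 1, so the
# bound on a [0,1] risk becomes vacuous -- "the bound can be quite loose"
```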
23
The connection to the margin
in summary:
• a finite VC dimension is a sufficient condition for convergence to the true risk
• the smaller the VC dimension, the smaller the sample we need to have the same convergence guarantees
• but we are already using hyperplanes, and their VC dimension is linear in the dimension of the space
• can we do better than this?
• so far, we have not considered how to choose the hyperplane
• we can get an additional improvement when we consider the margin
• we proved this last time
24
The connection to the margin
Theorem: consider hyperplanes of the form w^T x = 0, where w is normalized with respect to a sample D = {(x_1,y_1),...,(x_n,y_n)} in the usual way:

$$\min_i |w^T x_i| = 1$$

Then, the set of functions

$$\mathcal{F} = \big\{\mathrm{sgn}(w^T x)\big\}$$

defined on X* = {x_1, ..., x_n} and satisfying

$$\|w\| \leq \lambda$$

has VC dimension such that

$$VC(\mathcal{F}) \leq R^2 \lambda^2$$

where R is the radius of the smallest sphere centered at the origin and containing X*.
25
The role of the margin
notes:
• the theorem basically says that if ||w|| ≤ λ then VC(F) ≤ R²λ²
• when we maximize the margin (minimize ||w||) we are decreasing the value of λ and, therefore, decreasing the upper bound on the VC dimension
• note that, as long as ||w|| is finite, the VC dimension is finite even if the dimension of the space is infinite
• since γ = 1/||w||, this is always true if the margin is strictly greater than zero
in summary, we can decrease the VC dimension below the d+1 limit of R^d
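A tiny numeric check of the theorem's quantities; the sample and the w (chosen so the canonical-form condition holds) are illustrative:

```python
import numpy as np

# toy sample in R^2, with w in canonical form: min_i |w.x_i| = 1
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
w = np.array([0.5, 0.0])

canonical = np.min(np.abs(X @ w))          # = 1: canonical-form check
R = np.max(np.linalg.norm(X, axis=1))      # radius of enclosing sphere
margin = 1.0 / np.linalg.norm(w)           # gamma = 1 / ||w|| = 2
vc_bound = R**2 * np.linalg.norm(w)**2     # R^2 lambda^2, with lambda = ||w||
```

Here R² λ² = 10 · 0.25 = 2.5, below the d+1 = 3 limit for hyperplanes in R²: a large margin buys a smaller VC bound.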
26
Structural risk minimization
this suggests an alternative principle to ERM, which is called SRM
note that the 2nd term in the bound does not depend on f itself but on the class of functions F
SRM advocates imposing some structure on F, and minimizing over the elements of the structure
the structure is obtained by decomposing the set F into a collection of nested subsets of increasing capacity (and size), e.g. polynomials of increasing order
functions in larger sets achieve smaller R_emp but have higher VC dimension and are more penalized by the 2nd term
27
Structural risk minimization
we have something like this

[figure: bound on risk = empirical risk + capacity term, plotted against VC(F); the empirical risk decreases and the capacity term increases with the VC dimension, so the bound is minimized at an intermediate f*, over the nested structure S_{n-1} ⊂ S_n ⊂ S_{n+1} of increasing VC dimension]
28
In summary
the SRM principle:
• start from a nested collection of families of functions

$$S_1 \subset \cdots \subset S_k$$

where S_i = {h_i(x, α), for all α}
• for each S_i, find the function (set of parameters) that minimizes the empirical risk

$$R_{emp}^i = \min_\alpha \frac{1}{n}\sum_{k=1}^{n} L[y_k, h_i(x_k, \alpha)]$$

• select the function class such that

$$i^* = \arg\min_i \left\{ R_{emp}^i + \sqrt{\frac{8}{n}\left(VC(S_i)\left[\ln\frac{n}{VC(S_i)} + 1\right] + \ln\frac{4}{\delta}\right)} \right\}$$
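The three steps above can be sketched in code. This is only an illustration: least squares stands in for exact empirical risk minimization within each class, the data are synthetic, and VC(S_d) = d + 1 (the sign of a degree-d polynomial on the line) is an assumed capacity for the penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 200, 0.05

# synthetic 1-D problem: the true boundary is quadratic, labels slightly noisy
x = rng.uniform(-1, 1, n)
y = np.sign(x**2 - 0.3 + 0.1 * rng.standard_normal(n))

def fit_emp_risk(x, y, degree):
    """Least-squares polynomial score with its sign as the label --
    a stand-in for exact ERM within the class S_degree."""
    coef = np.polyfit(x, y, degree)
    return np.mean(np.sign(np.polyval(coef, x)) != y)

def penalty(n, h, delta):
    """Capacity term of the VC bound, with G(n) <= h (ln(n/h) + 1)."""
    return np.sqrt(8.0 / n * (h * (np.log(n / h) + 1) + np.log(4 / delta)))

# SRM: minimize empirical risk + capacity penalty over the nested classes
best = min(range(1, 8),
           key=lambda d: fit_emp_risk(x, y, d) + penalty(n, d + 1, delta))
```

On this data the linear class underfits badly, while high degrees pay a growing penalty, so SRM lands on a low degree that matches the quadratic boundary.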
29
In practice
this leads to the methods that we have studied
SVMs:
• the empirical risk is determined by the number of outliers ξ_i (each at distance ξ_i/||w*|| from its margin)
• the margin 1/||w*|| is maximized when we minimize ||w||
• this leads to the problem

$$\min_{w, b, \xi}\; \|w\|^2 + \sum_i \xi_i \quad\text{subject to}\quad y_i(w^T x_i + b) \geq 1 - \xi_i,\;\; \xi_i \geq 0,\;\; \forall i$$

[figure: maximum-margin hyperplane with margin 1/||w*|| on each side; an outlier x_i sits at distance ξ_i/||w*|| inside the margin]
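The soft-margin problem can be sketched with (sub)gradient descent on the primal (hinge-loss form); this is an illustration, not the QP solver one would use in practice, and the learning rate, C, and toy data are arbitrary choices:

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
    by subgradient descent -- a sketch of the soft-margin primal."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points with xi_i > 0
        grad_w = 2 * w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_subgradient(X, y)
```

On separable data the iterates settle near the maximum-margin hyperplane, with the closest points at margin approximately 1.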
30
In practice
boosting:
• does greedy descent on the loss

$$L(y, x) = \exp\big(-y\, g(x)\big)$$

• this is inversely proportional to the margin
• leads to a maximal-margin solution
• and the algorithm explicitly minimizes the training error, which is the empirical risk

[figure: the “0-1”, SVM (hinge), and boosting (exponential) losses as a function of the margin y g(x)]
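Greedy descent on the exponential loss is what AdaBoost implements; a small sketch with threshold stumps as weak learners (the toy data are illustrative):

```python
import numpy as np

def stump_predict(x, thr, sign):
    return sign * np.where(x > thr, 1.0, -1.0)

def adaboost(x, y, rounds=3):
    """AdaBoost with threshold stumps: greedy coordinate descent on
    the exponential loss mean_i exp(-y_i g(x_i))."""
    n = len(x)
    w = np.full(n, 1.0 / n)                      # sample weights
    learners = []
    for _ in range(rounds):
        best = None                              # (weighted error, thr, sign)
        for thr in (x[:-1] + x[1:]) / 2:
            for sign in (1.0, -1.0):
                err = np.sum(w * (stump_predict(x, thr, sign) != y))
                if best is None or err < best[0]:
                    best = (err, thr, sign)
        err, thr, sign = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # weak-learner weight
        w *= np.exp(-alpha * y * stump_predict(x, thr, sign))
        w /= w.sum()                             # reweight hard examples
        learners.append((alpha, thr, sign))
    def g(xq):
        return sum(a * stump_predict(xq, t, s) for a, t, s in learners)
    return g

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([-1.0, -1.0, 1.0, 1.0, -1.0, 1.0])  # no single stump separates this
g = adaboost(x, y)
exp_loss = np.mean(np.exp(-y * g(x)))            # decreases each round
train_err = np.mean(np.sign(g(x)) != y)
```

After three rounds the combined g classifies all six points correctly, even though no individual stump can, and the exponential loss has dropped well below its initial value of 1.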
31