An Experimental and Theoretical Comparison of Model Selection Methods

Michael Kearns, AT&T Bell Laboratories, Murray Hill, New Jersey
Yishay Mansour, Tel Aviv University, Tel Aviv, Israel
Andrew Y. Ng, Carnegie Mellon University, Pittsburgh, Pennsylvania
Dana Ron, Hebrew University, Jerusalem, Israel

[This research was done while Y. Mansour, A. Ng and D. Ron were visiting AT&T Bell Laboratories. Supported in part by The Israel Science Foundation, administered by The Israel Academy of Science and Humanities, and by a grant of the Israeli Ministry of Science and Technology. Supported by the Eshkol Fellowship, sponsored by the Israeli Ministry of Science.]

1 Introduction

In the model selection problem, we must balance the complexity of a statistical model with its goodness of fit to the training data. This problem arises repeatedly in statistical estimation, machine learning, and scientific inquiry in general. Instances of the model selection problem include choosing the best number of hidden nodes in a neural network, determining the right amount of pruning to be performed on a decision tree, and choosing the degree of a polynomial fit to a set of points. In each of these cases, the goal is not to minimize the error on the training data, but to minimize the resulting generalization error.

Many model selection algorithms have been proposed in the literature of several different research communities, too many to productively survey here. (A more detailed history of the problem will be given in the full paper.) Perhaps surprisingly, despite the many proposed solutions for model selection and the diverse methods of analysis, direct comparisons between the different proposals (either experimental or theoretical) are rare. The goal of this paper is to provide such a comparison, and more importantly, to describe the general conclusions to which it has led. Relying on evidence that is divided between controlled experimental results and related formal analysis, we compare three well-known model selection algorithms. We attempt to identify their relative and absolute strengths and weaknesses, and we provide some general methods for analyzing the behavior and performance of model selection algorithms. Our hope is that these results may aid the informed practitioner in making an educated choice of model selection algorithm (perhaps based in part on some known properties of the model selection problem being confronted).

The summary of the paper follows. In Section 2, we provide a formalization of the model selection problem. In this formalization, we isolate the problem of choosing the appropriate complexity for a hypothesis or model. We also introduce the specific model selection problem that will be the basis for our experimental results, and describe an initial experiment demonstrating that the problem is nontrivial. In Section 3, we introduce the three model selection algorithms we examine in the experiments: Vapnik's Guaranteed Risk Minimization (GRM) [11], an instantiation of Rissanen's Minimum Description Length Principle (MDL) [7], and Cross Validation (CV).

Section 4 describes our controlled experimental comparison of the three algorithms. Using artificially generated data from a known target function allows us to plot complete learning curves for the three algorithms over a wide range of sample sizes, and to directly compare the resulting generalization error to the hypothesis complexity selected by each algorithm.
It also allows us to investigate the effects of varying other natural parameters of the problem, such as the amount of noise in the data. These experiments support the following assertions: the behavior of the algorithms examined can be complex and incomparable, even on simple problems, and there are fundamental difficulties in identifying a "best" algorithm; there is a strong connection between hypothesis complexity and generalization error; and it may be impossible to uniformly improve the performance of the algorithms by slight modifications (such as introducing constant multipliers on the complexity penalty terms).

In Sections 5, 6 and 7 we turn our efforts to formal results providing explanation and support for the experimental findings. We begin in Section 5 by upper bounding the error of any model selection algorithm falling into a wide class (called penalty-based algorithms) that includes both GRM and MDL (but not cross validation). The form of this bound highlights the competing desires for powerful hypotheses and controlled complexity. In Section 6, we upper bound the additional error suffered by cross validation compared to any other model selection algorithm. The quality of this bound depends on the extent to which the function classes have learning curves obeying a classical power law. Finally, in Section 7, we give an impossibility result demonstrating a fundamental handicap suffered by the entire class of penalty-based algorithms that does not afflict cross validation.


...of width 1/100 and alternating label. In Figure 1 we plot $\epsilon(d)$ (which we can calculate exactly, since we have chosen the target function) when $S$ consists of $m = 2000$ random examples (drawn from the uniform input distribution) corrupted by noise at the rate $\eta = 0.2$. For our current discussion it suffices to note that $\epsilon(d)$ does indeed experience a nontrivial minimum. Not surprisingly, this minimum occurs near (but not exactly at) the target complexity of 100.
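As a concrete illustration of this experimental setup, here is a minimal sketch (ours, in Python with NumPy; the function names are hypothetical, and the paper's actual code is deferred to the full version) of the data-generating process just described:

```python
import numpy as np

rng = np.random.default_rng(0)

def intervals_target(x, s=100):
    """Target with s equal-width intervals of alternating label on [0, 1]:
    the label of x is the parity of the interval containing it."""
    return (np.floor(x * s) % 2).astype(int)

def noisy_sample(m=2000, s=100, eta=0.2):
    """Draw m uniform inputs, label by the target, flip each label w.p. eta."""
    x = np.sort(rng.uniform(size=m))
    y = intervals_target(x, s)
    flips = rng.uniform(size=m) < eta
    return x, np.where(flips, 1 - y, y)
```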

3 Three Algorithms for Model Selection

The first two model selection algorithms we consider are members of a general class that we shall informally refer to as penalty-based algorithms (and shall formally define shortly). The common theme behind these algorithms is their attempt to construct an approximation to $\epsilon(d)$ solely on the basis of the training error $\hat{\epsilon}(d)$ and the complexity $d$, often by trying to "correct" $\hat{\epsilon}(d)$ by the amount that it underestimates $\epsilon(d)$ through the addition of a "complexity penalty" term.

In Vapnik's Guaranteed Risk Minimization (GRM) [11], $\tilde{d}$ is chosen according to the rule¹

$\tilde{d} = \mathrm{argmin}_d \{\hat{\epsilon}(d) + (d/m)(1 + \sqrt{1 + \hat{\epsilon}(d)m/d})\}$   (1)

where for convenience but without loss of generality we have assumed that $d$ is the Vapnik-Chervonenkis dimension [11, 12] of the class $F_d$; this assumption holds in the intervals model selection problem. The origin of this rule can be summarized as follows: it has been shown [11] (ignoring logarithmic factors) that for every $d$ and for every $h \in F_d$, $\sqrt{d/m}$ is an upper bound on $|\hat{\epsilon}(h) - \epsilon(h)|$, and hence $|\hat{\epsilon}(d) - \epsilon(d)| \le \sqrt{d/m}$. Thus, by simply adding $\sqrt{d/m}$ to $\hat{\epsilon}(d)$, we ensure that the resulting sum upper bounds $\epsilon(d)$, and if we are optimistic we might further hope that the sum is in fact a close approximation to $\epsilon(d)$, and that its minimization is therefore tantamount to the minimization of $\epsilon(d)$. The actual rule given in Equation (1) is slightly more complex than this, and reflects a refined bound on $|\hat{\epsilon}(d) - \epsilon(d)|$ that varies from $d/m$ for $\hat{\epsilon}(d)$ close to 0 to $\sqrt{d/m}$ otherwise.

¹ Vapnik's original GRM actually multiplies the second term inside the argmin{·} above by a logarithmic factor intended to guard against worst-case choices from the version space $VS(d)$. Since this factor renders GRM uncompetitive on the ensuing experiments, we consider this modified and quite competitive rule whose spirit is the same.
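A direct transcription of the rule in Equation (1) is straightforward; the sketch below (ours, not the authors' code) assumes a precomputed array train_errs in which train_errs[d-1] holds the minimized training error $\hat{\epsilon}(d)$ within $F_d$:

```python
import numpy as np

def grm_score(train_err, d, m):
    """Total GRM penalty from Equation (1): training error plus the
    refined complexity penalty (d/m)(1 + sqrt(1 + train_err * m / d))."""
    return train_err + (d / m) * (1.0 + np.sqrt(1.0 + train_err * m / d))

def select_grm(train_errs, m):
    """Return the complexity d~ minimizing the GRM score."""
    scores = [grm_score(e, d, m) for d, e in enumerate(train_errs, start=1)]
    return int(np.argmin(scores)) + 1
```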

The next algorithm we consider, the Minimum Description Length Principle (MDL) [5, 6, 7, 1, 4], has rather different origins than GRM. MDL is actually a broad class of algorithms with a common information-theoretic motivation, each algorithm determined by the choice of a specific coding scheme for both functions and their training errors; this two-part code is then used to describe the training sample $S$. To illustrate the method, we give a coding scheme for the intervals model selection problem². Let $h$ be a function with exactly $d$ alternations of label (thus, $h \in F_d$). To describe the behavior of $h$ on the sample $S = \{\langle x_i, b_i \rangle\}$, we can simply specify the $d$ inputs where $h$ switches value (that is, the indices $i$ such that $h(x_i) \ne h(x_{i+1})$)³. This takes $\log \binom{m}{d}$ bits; dividing by $m$ to normalize, we obtain $(1/m)\log\binom{m}{d} \approx \mathcal{H}(d/m)$ [2], where $\mathcal{H}(\cdot)$ is the binary entropy function. Now given $h$, the labels in $S$ can be described simply by coding the mistakes of $h$ (that is, those indices $i$ where $h(x_i) \ne f(x_i)$), at a normalized cost of $\mathcal{H}(\hat{\epsilon}(h))$. Technically, in the coding scheme just described we also need to specify the values of $d$ and $\hat{\epsilon}(h) \cdot m$, but the cost of these is negligible. Thus, the version of MDL that we shall examine for the intervals model selection problem dictates the following choice of $\tilde{d}$:

$\tilde{d} = \mathrm{argmin}_d \{\mathcal{H}(\hat{\epsilon}(d)) + \mathcal{H}(d/m)\}.$   (2)

² Our goal here is simply to give one reasonable instantiation of MDL. Other coding schemes are obviously possible; however, several of our formal results will hold for essentially all MDL instantiations.

In the context of model selection, GRM and MDL can both be interpreted as attempts to model $\epsilon(d)$ by transforming $\hat{\epsilon}(d)$ and $d$. More formally, a model selection algorithm of the form

$\tilde{d} = \mathrm{argmin}_d \{G(\hat{\epsilon}(d), d/m)\}$   (3)

shall be called a penalty-based algorithm⁴. Notice that an ideal penalty-based algorithm would obey $G(\hat{\epsilon}(d), d/m) \approx \epsilon(d)$ (or at least that $G(\hat{\epsilon}(d), d/m)$ and $\epsilon(d)$ would be minimized by the same value of $d$).
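Equation (3) abstracts both of the preceding rules into a single form; a minimal sketch (ours), where G is any function of the training error and the normalized complexity $d/m$:

```python
import numpy as np

def select_penalty_based(G, train_errs, m):
    """Generic penalty-based rule (3): d~ = argmin_d G(train_err(d), d/m).
    GRM and MDL are recovered by the appropriate choices of G."""
    scores = [G(e, d / m) for d, e in enumerate(train_errs, start=1)]
    return int(np.argmin(scores)) + 1

# e.g. GRM as a special case, writing r for d/m:
#   select_penalty_based(lambda e, r: e + r * (1 + np.sqrt(1 + e / r)),
#                        train_errs, m)
```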

The third model selection algorithm that we examine has a different spirit than the penalty-based algorithms. In cross validation (CV) [9, 10], we use only a fraction $1 - \gamma$ of the examples in $S$ to obtain the hypothesis sequence $\tilde{h}_1 \in F_1, \ldots, \tilde{h}_d \in F_d, \ldots$; that is, $\tilde{h}_d$ is now $L(S', d)$, where $S'$ consists of the first $(1-\gamma)m$ examples in $S$. Here $\gamma \in [0, 1]$ is a parameter of the CV algorithm whose tuning we discuss briefly later. CV chooses $\tilde{d}$ according to the rule

$\tilde{d} = \mathrm{argmin}_d \{\hat{\epsilon}_{S''}(\tilde{h}_d)\}$   (4)

where $\hat{\epsilon}_{S''}(\tilde{h}_d)$ is the error of $\tilde{h}_d$ on $S''$, the last $\gamma m$ examples of $S$ that were withheld in selecting $\tilde{h}_d$. Notice that for CV, we expect the quantity $\epsilon(\tilde{d}) = \epsilon(\tilde{h}_{\tilde{d}})$ to be (perhaps considerably) larger than in the case of GRM and MDL, because now $\tilde{h}_{\tilde{d}}$ was chosen on the basis of only $(1-\gamma)m$ examples rather than all $m$ examples. For this reason we wish to introduce the more general notation $\epsilon_\gamma(d) = \epsilon(\tilde{h}_d)$ to indicate the fraction of the sample withheld from training. CV settles for $\epsilon_\gamma(d)$ instead of $\epsilon_0(d)$ in order to have an independent test set with which to directly estimate $\epsilon(d)$.
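Unlike the penalty-based rules, the CV rule (4) needs the data itself, not just the training error curve. A sketch (ours; train is a hypothetical stand-in for the underlying learner L, assumed to return the training-error-minimizing hypothesis in $F_d$ as a callable):

```python
import numpy as np

def select_cv(x, y, train, gamma=0.1, d_max=None):
    """CV rule (4): fit each F_d on the first (1-gamma)m examples and
    choose the d whose hypothesis errs least on the held-out gamma*m."""
    m = len(x)
    split = int((1.0 - gamma) * m)
    x_tr, y_tr = x[:split], y[:split]
    x_te, y_te = x[split:], y[split:]
    d_max = d_max if d_max is not None else split
    test_errs = [np.mean(train(x_tr, y_tr, d)(x_te) != y_te)
                 for d in range(1, d_max + 1)]
    return int(np.argmin(test_errs)) + 1
```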

³ In the full paper we justify our use of the sample points to describe $h$; it is quite similar to representing $h$ using a grid of resolution $1/p(m)$ for some polynomial $p$.

⁴ With appropriately modified assumptions, all of the formal results in the paper hold for the more general form $G(\hat{\epsilon}(d), d, m)$, where we decouple the dependence on $d$ and $m$. However, the simpler coupled form will suffice for our purposes.

4 A Controlled Experimental Comparison

Our results begin with a comparison of the performance and properties of the three model selection algorithms in a carefully controlled experimental setting: the intervals model selection problem. Among the advantages of such controlled experiments, at least in comparison to empirical results on data of unknown origin, are our ability to exactly measure generalization error (since we know the target function and the distribution generating the data), and our ability to precisely study the effects of varying parameters of the data (such as noise rate, target function complexity, and sample size) on the performance of model selection algorithms. The experimental behavior we observe foreshadows a number of important themes that we shall revisit in our formal results.

We begin with Figure 2. To obtain this figure, a training sample was generated from the uniform input distribution and labeled according to an intervals function over $[0,1]$ consisting of 100 intervals of alternating label and equal width⁵; the sample was corrupted with noise at rate $\eta = 0.2$. In Figure 2, we have plotted the true generalization errors (measured with respect to the noise-free source of examples) $\epsilon_{GRM}(m)$, $\epsilon_{MDL}(m)$ and $\epsilon_{CV}(m)$ (using test fraction $\gamma = 0.1$ for CV) of the hypotheses selected from the sequence $\tilde{h}_1, \ldots, \tilde{h}_d, \ldots$ by each of the three algorithms as a function of sample size $m$, which ranged from 1 to 3000 examples. As described in Section 2, the hypotheses $\tilde{h}_d$ were obtained by minimizing the training error within each class $F_d$. Details of the code used to perform these experiments will be provided in the full paper.
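For the intervals problem, training error minimization within every $F_d$ can be carried out exactly by dynamic programming over the sorted sample. The sketch below is our reconstruction (the paper's code is deferred to the full version); it returns the whole curve $\hat{\epsilon}(d)$ in $O(m \cdot d_{max})$ time:

```python
def train_error_curve(y, d_max):
    """y: labels of the sample sorted by input. Returns errs[d-1] =
    (min mistakes / m) over hypotheses with at most d label alternations."""
    m = len(y)
    INF = m + 1
    # best[j][b] = fewest mistakes so far with j alternations used and
    # current hypothesis label b on the most recent point
    best = [[0, 0]] + [[INF, INF] for _ in range(d_max)]
    for yi in y:
        new = [[INF, INF] for _ in range(d_max + 1)]
        for j in range(d_max + 1):
            for b in (0, 1):
                keep = best[j][b]
                switch = best[j - 1][1 - b] if j > 0 else INF
                new[j][b] = min(keep, switch) + (1 if yi != b else 0)
        best = new
    errs, running = [], INF
    for j in range(d_max + 1):
        running = min(running, best[j][0], best[j][1])
        if j >= 1:
            errs.append(running / m)
    return errs
```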

Figure 2 demonstrates the subtlety involved in comparing the three algorithms: in particular, we see that none of the three algorithms outperforms the others for all sample sizes. Thus we can immediately dismiss the notion that one of the algorithms examined can be said to be optimal for this problem in any standard sense. Getting into the details, we see that there is an initial regime (for $m$ from 1 to slightly less than 1000) in which $\epsilon_{MDL}$ is the lowest of the three errors, sometimes outperforming $\epsilon_{GRM}$ by a considerable margin. Then there is a second regime (for $m$ from about 1000 to about 2500) where an interesting reversal of relative performance occurs, since now $\epsilon_{GRM}$ is the lowest error, considerably outperforming $\epsilon_{MDL}$, which has temporarily leveled off. In both of these first two regimes, $\epsilon_{CV}$ remains the intermediate performer. In the third and final regime, $\epsilon_{MDL}$ decreases rapidly to match $\epsilon_{GRM}$ and the slightly larger $\epsilon_{CV}$, and the performance of all three algorithms remains quite similar for all larger sample sizes.

Insight into the causes of Figure 2 is given by Figure 3, where for the same runs used to obtain Figure 2, we instead plot the quantities $\tilde{d}_{GRM}$, $\tilde{d}_{MDL}$ and $\tilde{d}_{CV}$, the value of $\tilde{d}$ chosen by GRM, MDL and CV respectively (thus, the "correct" value, in the sense of simply having the same number of intervals as the target function, is 100). Here we see that for small sample sizes, corresponding to the first regime discussed for Figure 2 above, $\tilde{d}_{GRM}$ is slowly approaching 100 from below, reaching and remaining at the target value for about $m = 1500$. Although we have not shown it explicitly, GRM is incurring nonzero training error throughout the entire range of $m$. In comparison, for a long initial period (corresponding to the first two regimes of $m$), MDL is simply choosing the shortest hypothesis that incurs no training error (and thus encodes both "legitimate" intervals and noise), and consequently $\tilde{d}_{MDL}$ grows in an uncontrolled fashion. More precisely, it can be shown that during this period $\tilde{d}_{MDL}$ is obeying $\tilde{d}_{MDL} \approx d_0 \approx 2\eta(1-\eta)m + (1-2\eta)^2 s$, where $s$ is the number of (equally spaced) intervals in the target function and $\eta$ is the noise rate (so for the current experiment $s = 100$ and $\eta = 0.2$). This "overcoding" behavior of MDL is actually preferable, in terms of generalization error, to the initial "undercoding" behavior of GRM, as verified by Figure 2. Once $\tilde{d}_{GRM}$ approaches 100, however, the overcoding of MDL is a relative liability, resulting in the second regime. Figure 3 clearly shows that the transition from the second to the third regime (where approximate parity is achieved) is the direct result of a dramatic correction to $\tilde{d}_{MDL}$ from $d_0$ (defined above) to the target value of 100. Finally, $\tilde{d}_{CV}$ makes a more rapid but noisier approach to 100 than $\tilde{d}_{GRM}$, and in fact also overshoots 100, but much less dramatically than $\tilde{d}_{MDL}$. This more rapid initial increase again results in superior generalization error compared to GRM for small $m$, but the inability of $\tilde{d}_{CV}$ to settle at 100 results in slightly higher error for larger $m$. In the full paper, we examine the same plots of generalization error and hypothesis complexity for different values of the noise rate; here it must suffice to say that for $\eta = 0$, all three algorithms have comparable performance for all sample sizes, and as $\eta$ increases so do the qualitative effects discussed here for the $\eta = 0.2$ case (for instance, the duration of the second regime, where MDL is vastly inferior, increases with the noise rate).

⁵ Similar results hold for a randomly chosen target function.
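The overcoding estimate above is easy to check numerically; a quick arithmetic sketch (ours) of the formula against the figures:

```python
def mdl_overcoding_length(m, s=100, eta=0.2):
    """Approximate shortest consistent hypothesis length during MDL's
    overcoding regime: d_0 ~ 2*eta*(1-eta)*m + (1-2*eta)**2 * s."""
    return 2 * eta * (1 - eta) * m + (1 - 2 * eta) ** 2 * s

# m = 2000 gives d_0 ~ 676, consistent with the global penalty minimum
# near 650 visible in Figure 4.
```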

The behavior of $\tilde{d}_{GRM}$ and $\tilde{d}_{MDL}$ in Figure 3 can be traced to the form of the total penalty functions for the two algorithms. For instance, in Figures 4 and 5, we plot the total MDL penalty $\mathcal{H}(\hat{\epsilon}(d)) + \mathcal{H}(d/m)$ as a function of $d$ for the fixed sample sizes $m = 2000$ and $4000$ respectively, again using noise rate $\eta = 0.20$. At $m = 2000$, we see that the total penalty has its global minimum at approximately 650, which is roughly the zero training error value $d_0$ discussed above (we are still in the MDL overcoding regime at this sample size; see Figures 2 and 3). However, by this sample size, a significant local minimum has developed near the target value of $d = 100$. At $m = 4000$, this local minimum at $d = 100$ has become the global minimum. The rapid transition of $\tilde{d}_{MDL}$ that marks the start of the final regime of generalization error is thus explained by the switching of the global total penalty minimum from $d_0$ to 100. In Figure 6, we plot the total GRM penalty, just for the sample size $m = 2000$. The behavior of the GRM penalty is much more controlled: for each sample size, the total penalty has a single-minimum bowl shape, with the minimum lying to the left of $d = 100$ for small sample sizes and gradually moving over $d = 100$ and sharpening there for large $m$; as Figure 6 shows, the minimum already lies at $d = 100$ by $m = 2000$, as confirmed by Figure 3.

A natural question to pose after examining Figures 2 and 3 is the following: is there a penalty-based algorithm that enjoys the best properties of both GRM and MDL? By this we would mean an algorithm that approaches the "correct" $d$ value (whatever it may be for the problem in hand) more rapidly than GRM, but does so without suffering the long, uncontrolled "overcoding" period of MDL. An obvious candidate for such an algorithm is simply a modified version of GRM or MDL, in which we reason (for example) that perhaps the GRM penalty for complexity is too large for this problem (resulting in the initial reluctance to code), and we thus multiply the complexity penalty term in the GRM rule (the second term inside the argmin{·} in Equation (1)) by a


constant less than 1 (or analogously, multiply the MDL complexity penalty term by a constant greater than 1 to reduce overcoding). The results of an experiment on such a modified version of GRM are shown in Figures 7 and 8, where the original GRM performance is compared to a modified version in which the complexity penalty is multiplied by 0.5. Interestingly and perhaps unfortunately, we see that there is no free lunch: while the modified version does indeed code more rapidly and thus reduce the small-$m$ generalization error, this comes at the cost of a subsequent overcoding regime with a corresponding degradation in generalization error (and in fact a considerably slower return to $d = 100$ than MDL under the same conditions)⁶. The reverse phenomenon (reluctance to code) is experienced for MDL with an increased complexity penalty multiplier (details in the full paper).

⁶ Similar results are obtained in experiments in which every occurrence of $d$ in the GRM rule is replaced by an "effective dimension" $c_0 d$ for any constant $c_0 < 1$.
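In code, the modification amounts to one extra constant; a sketch (ours) of the multiplier variant used in Figures 7 and 8:

```python
import numpy as np

def grm_score_scaled(train_err, d, m, c=0.5):
    """GRM with its complexity penalty term multiplied by c; c < 1 makes
    the rule code more readily, c > 1 more reluctantly."""
    return train_err + c * (d / m) * (1.0 + np.sqrt(1.0 + train_err * m / d))
```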

Let us summarize the key points demonstrated by these experiments. First, none of the three algorithms dominates the others for all sample sizes. Second, the two penalty-based algorithms seem to have a bias either towards or against coding that is overcome by the inherent properties of the data asymptotically, but that can have a large effect on generalization error for small to moderate sample sizes. Third, this bias cannot be overcome simply by adjusting the relative weight of error and complexity penalties, without reversing the bias of the resulting rule and suffering increased generalization error for some range of $m$. Fourth, while CV is not the best of the algorithms for any value of $m$, it does manage to fairly closely track the best penalty-based algorithm for each value of $m$, and considerably beats both GRM and MDL in their regimes of weakness. We now turn our attention to our formal results, where each of these key points will be developed further.

5 A Bound on Generalization Error for Penalty-Based Algorithms

We begin our formal results with a bound on the generalization error for penalty-based algorithms that enjoys three features. First, it is general: it applies to practically any penalty-based algorithm, and holds for any model selection problem (of course, there is a price to pay for such generality, as discussed below). Second, for certain algorithms and certain problems the bound can give rapid rates of convergence to small error. Third, the form of the bound is suggestive of some of the behavior seen in the experimental results. We state the bound for the special but natural case in which the underlying learning algorithm $L$ is training error minimization; in the full paper, we will present a straightforward analogue for more general $L$. Both this theorem and Theorem 2 in the following section are stated for the noise-free case; but again, straightforward generalizations to the noisy case will be included in the full paper.

Theorem 1 Let $(\{F_d\}, f, D, L)$ be an instance of the model selection problem in which $L$ performs training error minimization, and assume for convenience that $d$ is the VC dimension of $F_d$. Let $G$ be a function that is continuous and increasing in both its arguments, and let $\epsilon_G(m)$ denote the expected generalization error of the penalty-based model selection algorithm $\tilde{d} = \mathrm{argmin}_d\{G(\hat{\epsilon}(d), d/m)\}$ on a training sample of size $m$. Then⁷

$\epsilon_G(m) \le R_G(m) + \sqrt{\tilde{d}/m}$   (5)

where $R_G(m)$ approaches $\min_d\{\epsilon_{opt}(d)\}$ (which is the best generalization error achievable in any of the classes $F_d$) as $m \to \infty$. The rate of this approach will depend on properties of $G$.

⁷ We remind the reader that our bounds contain hidden logarithmic factors that we specify in the full paper.

Proof: For any value of $d$, we have the inequality

$G(\hat{\epsilon}(\tilde{d}), \tilde{d}/m) \le G(\hat{\epsilon}(d), d/m)$   (6)

because $\tilde{d}$ is chosen to minimize $G(\hat{\epsilon}(d), d/m)$. Using the uniform convergence bound $|\epsilon(h) - \hat{\epsilon}(h)| \le \sqrt{d/m}$ for all $h \in F_d$ and the fact that $G(\cdot,\cdot)$ is increasing in its first argument, we can replace the occurrence of $\hat{\epsilon}(\tilde{d})$ on the left-hand side of Equation (6) by $\epsilon(\tilde{d}) - \sqrt{\tilde{d}/m}$ to obtain a smaller quantity, and we can replace the occurrence of $\hat{\epsilon}(d)$ on the right-hand side by $\epsilon_{opt}(d) + \sqrt{d/m}$ to obtain a larger quantity. This gives

$G(\epsilon(\tilde{d}) - \sqrt{\tilde{d}/m}, \tilde{d}/m) \le G(\epsilon_{opt}(d) + \sqrt{d/m}, d/m).$   (7)

Now because $G(\cdot,\cdot)$ is an increasing function of its second argument, we can further weaken Equation (7) to obtain

$G(\epsilon(\tilde{d}) - \sqrt{\tilde{d}/m}, 0) \le G(\epsilon_{opt}(d) + \sqrt{d/m}, d/m).$   (8)

If we define $G_0(x) = G(x, 0)$, then since $G(\cdot,\cdot)$ is increasing in its first argument, $G_0^{-1}$ is well-defined, and we may write

$\epsilon(\tilde{d}) \le G_0^{-1}(G(\epsilon_{opt}(d) + \sqrt{d/m}, d/m)) + \sqrt{\tilde{d}/m}.$   (9)

Now fix any small value $\epsilon > 0$. For this $\epsilon$, let $d_0$ be the smallest value satisfying $\epsilon_{opt}(d_0) \le \min_d\{\epsilon_{opt}(d)\} + \epsilon$; thus, $d_0$ is sufficient complexity to almost match the approximative power of arbitrarily large complexity. Examining the behavior of $G_0^{-1}(G(\epsilon_{opt}(d_0) + \sqrt{d_0/m}, d_0/m))$ as $m \to \infty$, we see that the arguments approach the point $(\epsilon_{opt}(d_0), 0)$, and so $G_0^{-1}(G(\epsilon_{opt}(d_0) + \sqrt{d_0/m}, d_0/m))$ approaches $G_0^{-1}(G(\epsilon_{opt}(d_0), 0)) = \epsilon_{opt}(d_0) \le \min_d\{\epsilon_{opt}(d)\} + \epsilon$ by continuity of $G(\cdot,\cdot)$, as desired. By defining

$R_G(m) = \min_d\{G_0^{-1}(G(\epsilon_{opt}(d) + \sqrt{d/m}, d/m))\}$   (10)

we obtain the statement of the theorem.

Let us now discuss the form of the bound given in Theorem 1. The first term $R_G(m)$ approaches the optimal generalization error within $\bigcup_d F_d$ in the limit of large $m$, and the second term directly penalizes large complexity. These terms may be thought of as competing. In order for $R_G(m)$ to approach $\min_d\{\epsilon_{opt}(d)\}$ rapidly and not just asymptotically (that is, in order to have a fast rate of convergence), $G(\cdot,\cdot)$ should not penalize complexity too strongly, which is obviously at odds with the optimization of the term $\sqrt{\tilde{d}/m}$. For example, consider $G(\hat{\epsilon}(d), d/m) = \hat{\epsilon}(d) + (d/m)^\alpha$ for some power $\alpha > 0$. Assuming $d \le m$, this rule is conservative (large penalty for complexity) for small $\alpha$, and liberal (small penalty for complexity) for large $\alpha$. Thus, to make the term $\sqrt{\tilde{d}/m}$ small we would like $\alpha$ to be small, to prevent the choice of large $\tilde{d}$. However, $R_G(m) = \min_d\{\epsilon_{opt}(d) + \sqrt{d/m} + (d/m)^\alpha\}$, which increases as $\alpha$ decreases, thus encouraging large $\alpha$ (liberal coding).

Ideally, we might want $G(\cdot,\cdot)$ to balance the two terms of the bound, which implicitly involves finding an appropriately controlled but sufficiently rapid rate of increase in $\tilde{d}$. The tension between these two criteria in the bound echoes the same tension that was seen experimentally: for MDL, there was a long period of essentially uncontrolled growth of $\tilde{d}$ (linear in $m$), and this uncontrolled growth prevented any significant decay of generalization error (Figures 2 and 3). GRM had controlled growth of $\tilde{d}$, and thus would incur negligible error from our second term; but perhaps this growth was too controlled, as it results in the initially slow (small $m$) decrease in generalization error.

To examine these issues further, we now apply the bound of Theorem 1 to several penalty-based algorithms. In some cases the final form of the bound given in the theorem statement, while easy to interpret, is unnecessarily coarse, and better rates of convergence can be obtained by directly appealing to the proof of the theorem.

We begin with a simplified GRM variant (SGRM), defined by $G(\hat{\epsilon}(d), d/m) = \hat{\epsilon}(d) + \sqrt{d/m}$. For this algorithm, we observe that we can avoid weakening Equation (7) to Equation (8), because here $G(\epsilon(\tilde{d}) - \sqrt{\tilde{d}/m}, \tilde{d}/m) = \epsilon(\tilde{d})$. Thus the dependence on $\tilde{d}$ in the bound disappears entirely, resulting in

$\epsilon_{SGRM}(m) \le \min_d\{\epsilon_{opt}(d) + 2\sqrt{d/m}\}.$   (11)

This is not so mysterious, since SGRM penalizes strongly for complexity (even more so than GRM). This bound expresses the generalization error as the minimum of the sum of the best possible error within each class $F_d$ and a penalty for complexity. Such a bound seems entirely reasonable, given that it is essentially the expected value of the empirical quantity we minimized to choose $\tilde{d}$ in the first place. Furthermore, if $\epsilon_{opt}(d) + \sqrt{d/m}$ approximates $\epsilon(d)$ well, then such a bound is about the best we could hope for. However, there is no reason in general to expect this to be the case. Bounds of this type were first given by Barron and Cover [1] in the context of density estimation.
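SGRM is the simplest rule of all; a one-function sketch (ours), with train_errs as in the earlier sketches:

```python
import numpy as np

def select_sgrm(train_errs, m):
    """SGRM: d~ = argmin_d { train_err(d) + sqrt(d/m) }, the variant
    analyzed in Equation (11)."""
    scores = [e + np.sqrt(d / m) for d, e in enumerate(train_errs, start=1)]
    return int(np.argmin(scores)) + 1
```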

As an example of the application of Theorem 1 to MDL we can derive the following bound on $\epsilon_{MDL}(m)$:

$\epsilon_{MDL}(m) \le \min_d\{\mathcal{H}^{-1}(\mathcal{H}(\epsilon_{opt}(d) + \sqrt{d/m}) + \mathcal{H}(d/m))\} + \sqrt{\tilde{d}_{MDL}/m}$   (12)

$\le \min_d\{\mathcal{H}(\epsilon_{opt}(d)) + 2\mathcal{H}(\sqrt{d/m})\} + \sqrt{\tilde{d}_{MDL}/m}$   (13)

where we have used $\mathcal{H}^{-1}(y) \le y$ and $\mathcal{H}(x + y) \le \mathcal{H}(x) + \mathcal{H}(y)$. Again, we emphasize that the bound given by Equation (13) is vacuous without a bound on $\tilde{d}_{MDL}$, which we know from the experiments can be of order $m$. However, by combining this bound with an analysis of the behavior of $\tilde{d}_{MDL}$ for the intervals problem, we can give an accurate theoretical explanation for the experimental findings for MDL (details in the full paper).

As a final example, we apply Theorem 1 to a variant of MDL in which the penalty for coding is increased over the original, namely $G(\hat{\epsilon}(d), d/m) = \mathcal{H}(\hat{\epsilon}(d)) + (1/\lambda^2)\mathcal{H}(d/m)$, where $\lambda$ is a parameter that may depend on $d$ and $m$. Assuming that we never choose a $\tilde{d}$ whose total penalty is larger than 1 (which holds if we simply add the "fair coin hypothesis" to $F_1$), we have that $\mathcal{H}(\tilde{d}/m) \le \lambda^2$. Since $\mathcal{H}(x) \ge x$ for all $x$, it follows that $\sqrt{\tilde{d}/m} \le \lambda$. If $\lambda$ is some decreasing function of $m$ (say, $m^{-\kappa}$ for some $0 < \kappa < 1$), then the resulting bound on $\sqrt{\tilde{d}/m}$ in Theorem 1 decreases at a reasonable rate.

6 A Bound on the Additional Error of CV

In this section we state a general theorem bounding the additional generalization error suffered by cross validation compared to any polynomial complexity model selection algorithm $M$. By this we mean that given a sample of size $m$, algorithm $M$ will never choose a value of $\tilde{d}$ larger than $m^k$ for some fixed exponent $k \ge 1$. We emphasize that this is a mild condition that is met in practically every realistic model selection problem: although there are many documented circumstances in which we may wish to choose a model whose complexity is on the order of the sample size, we do not imagine wanting to choose, for instance, a neural network with a number of nodes exponential in the sample size. In any case, more general but more complicated assumptions may be substituted for the notion of polynomial complexity, and we discuss these in the full paper.

Theorem 2 Let $M$ be any polynomial complexity model selection algorithm, and let $(\{F_d\}, f, D, L)$ be any instance of model selection. Let $\epsilon_M(m)$ and $\epsilon_{CV}(m)$ denote the expected generalization error of the hypotheses chosen by $M$ and CV respectively. Then

$\epsilon_{CV}(m) \le \epsilon_M((1-\gamma)m) + O(\sqrt{\log m/(\gamma m)}).$   (14)

In other words, the generalization error of CV on $m$ examples is at most the generalization error of $M$ on $(1-\gamma)m$ examples, plus the "test penalty term" $O(\sqrt{\log m/(\gamma m)})$.

Proof Sketch: Let $S = S' S''$ be a random sample of $m$ examples, where $|S'| = (1-\gamma)m$ and $|S''| = \gamma m$. Let $d_{max} = ((1-\gamma)m)^k$ be the polynomial bound on the complexity selected by $M$, and let $\tilde{h}'_1 \in F_1, \ldots, \tilde{h}'_{d_{max}} \in F_{d_{max}}$ be determined by $\tilde{h}'_d = L(S', d)$. By definition of CV, $\tilde{d}$ is chosen according to $\tilde{d} = \mathrm{argmin}_d\{\hat{\epsilon}_{S''}(\tilde{h}'_d)\}$.

By standard uniform convergence arguments we have that $|\epsilon(\tilde{h}'_d) - \hat{\epsilon}_{S''}(\tilde{h}'_d)| = O(\sqrt{\log m/(\gamma m)})$ for all $d \le d_{max}$ with high probability over the draw of $S''$. Therefore with high probability

$\epsilon_{CV} \le \min_d\{\epsilon(\tilde{h}'_d)\} + O(\sqrt{\log m/(\gamma m)}).$   (15)

But as we have previously observed, the generalization error of any model selection algorithm (including $M$) on input $S'$ is lower bounded by $\min_d\{\epsilon(\tilde{h}'_d)\}$, and our claim directly follows.

Note that the bound of Theorem 2 does not claim $\epsilon_{CV}(m) \le \epsilon_M(m)$ for all $M$ (which would mean that cross validation is an optimal model selection algorithm). The bound given is weaker than this ideal in two important ways. First, and perhaps most importantly, $\epsilon_M((1-\gamma)m)$ may be considerably larger than $\epsilon_M(m)$. This could either be due to properties of the underlying learning algorithm $L$, or due to inherent phase transitions (sudden decreases) in the optimal information-theoretic learning curve [8, 3]; thus, in an extreme case, it could be that the generalization error that can be achieved within some class $F_d$ by training on $m$ examples is close to 0, but that the optimal generalization error that can be achieved in $F_d$ by training on a slightly smaller sample is near $1/2$. This is intuitively the worst case for cross validation, in which the small fraction of the sample saved for testing was critically needed for training in order to achieve nontrivial performance, and it is reflected in the first term of our bound. Obviously the risk of "missing" phase transitions can be minimized by decreasing the test fraction $\gamma$, but only at the expense of increasing the test penalty term, which is the second way in which our bound falls short of the ideal. However, unlike the potentially unbounded difference $\epsilon_M((1-\gamma)m) - \epsilon_M(m)$, our bound on the test penalty can be decreased without any problem-specific knowledge by simply increasing the test fraction $\gamma$.

Despite these two competing sources of additional CV error, the bound has some strengths that are worth discussing. First of all, the bound holds for any model selection problem instance $(\{F_d\}, f, D, L)$. We believe that giving similarly general bounds for any penalty-based algorithm would be extremely difficult, if not impossible. The reason for this belief arises from the diversity of learning curve behavior documented by the statistical mechanics approach [8, 3], among other sources. In the same way that there is no universal learning curve behavior, there is no universal behavior for the relationship between the functions $\hat{\epsilon}(d)$ and $\epsilon(d)$; the relationship between these quantities may depend critically on the target function and the input distribution (this point is made more formally in Section 7). CV is sensitive to this dependence by virtue of its target function-dependent and distribution-dependent estimate of $\epsilon(d)$. In contrast, by their very nature, penalty-based algorithms propose a universal penalty to be assigned to the observation of error $\hat{\epsilon}(h)$ for a hypothesis $h$ of complexity $d$.

A more technical feature of Theorem 2 is that it can be combined with bounds derived for penalty-based algorithms using Theorem 1 to suggest how the parameter $\gamma$ should be tuned. For example, letting $M$ be the SGRM algorithm described in Section 5, and combining Equation (11) with Theorem 2, yields

$\epsilon_{CV}(m) \le \epsilon_{SGRM}((1-\gamma)m) + O(\sqrt{\log d_{max}(m)/(\gamma m)})$   (16)

$\le \min_d\{\epsilon_{opt}(d) + 2\sqrt{d/((1-\gamma)m)}\} + O(\sqrt{\log d_{max}(m)/(\gamma m)}).$   (17)

If we knew the form of $\epsilon_{opt}(d)$ (or even had bounds on it), then in principle we could minimize the bound of Equation (17) as a function of $\gamma$ to derive a recommended training/test split. Such a program is feasible for many specific problems (such as the intervals problem), or by investigating general but plausible bounds on the approximation rate $\epsilon_{opt}(d)$, such as $\epsilon_{opt}(d) \le c_0/d$ for some constant $c_0 > 0$. We pursue this line of inquiry in some detail in the full paper. For now, we simply note that Equation (17) tells us that in cases for which the power law decay of generalization error within each $F_d$ holds approximately, the performance of CV will be competitive with GRM or any other algorithm. This makes perfect sense in light of the preceding analysis of the two sources for additional CV error: in problems with power law learning curve behavior, we have a power law bound on $\epsilon_M((1-\gamma)m) - \epsilon_M(m)$, and thus CV "tracks" any other algorithm closely in terms of generalization error. This is exactly the behavior observed in the experiments described in Section 4, for which the power law is known to hold approximately.
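To make the tuning suggestion concrete, the following sketch (ours) minimizes the right-hand side of Equation (17) over $\gamma$ on a grid, under the purely illustrative assumption $\epsilon_{opt}(d) = c_0/d$; the constants c0 and k are hypothetical placeholders, not values derived in the text:

```python
import numpy as np

def eq17_bound(gamma, m, c0=1.0, k=2.0):
    """Right-hand side of Equation (17), assuming eps_opt(d) = c0 / d and
    the polynomial complexity bound d_max = ((1-gamma)*m)**k."""
    n_tr, n_te = (1.0 - gamma) * m, gamma * m
    d = np.arange(1, int(n_tr) + 1)
    first = np.min(c0 / d + 2.0 * np.sqrt(d / n_tr))
    return first + np.sqrt(k * np.log(n_tr) / n_te)

def tune_gamma(m, grid=np.linspace(0.01, 0.5, 50)):
    """Grid-search the test fraction minimizing the assumed bound."""
    return float(grid[np.argmin([eq17_bound(g, m) for g in grid])])
```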

7 Limitations on Penalty-Based Algorithms

Recall that our experimental findings suggested that it may sometimes be fair to think of penalty-based algorithms as being either conservative or liberal in the amount of coding they are willing to allow in their hypothesis, and that bias in either direction can result in suboptimal generalization that is not easily overcome by tinkering with the form of the rule. In this section we treat this intuition more formally, by giving a theorem demonstrating some fundamental limitations on the diversity of problems that can be effectively handled by any fixed penalty-based algorithm. Briefly, we show that there are (at least) two very different forms that the relationship between $\hat{\epsilon}(d)$ and $\epsilon(d)$ can assume, and that any penalty-based algorithm can perform well on only one of these. Furthermore, for the problems we choose, CV can in fact succeed on both. Thus we are doing more than simply demonstrating that no model selection algorithm can succeed universally for all target functions, a statement that is intuitively obvious. We are in fact identifying a weakness that is special to penalty-based algorithms. However, as we have discussed previously, the use of CV is not without pitfalls of its own. We therefore conclude the paper in Section 8 with a summary of the different risks involved with each type of algorithm, and a discussion of our belief that in the absence of detailed problem-specific knowledge, our overall analysis favors the use of CV.

Theorem 3 For any sample size $m$, there are model selection problem instances $(\{F^1_d\}, f_1, D_1, L)$ and $(\{F^2_d\}, f_2, D_2, L)$ (where $L$ performs empirical error minimization in both instances) and a constant $\alpha$ independent of $m$ such that for any penalty-based model selection algorithm $G$, either $\epsilon^1_G(m) \ge \min_d\{\epsilon_1(d)\} + \alpha$ or $\epsilon^2_G(m) \ge \min_d\{\epsilon_2(d)\} + \alpha$. Here $\epsilon_i(d)$ is the function $\epsilon(d)$ for instance $i \in \{1, 2\}$, and $\epsilon^i_G(m)$ is the expected generalization error of algorithm $G$ for instance $i$. Thus, on at least one of the two model selection problems, the generalization error of $G$ is lower bounded away from the optimal value $\min_d\{\epsilon_i(d)\}$ by a constant independent of $m$.

Proof Sketch: For notational convenience, in the proof we use $\hat{\epsilon}_i(d)$ and $\epsilon_i(d)$ ($i \in \{1, 2\}$) to refer to the expected values of these functions. We start with a rough description of the properties of the two problems (see Figure 9): in Problem 1, the "right" choice of $d$ is 0, any additional coding directly results in larger generalization error, and the training error $\hat{\epsilon}_1(d)$ decays gradually with $d$. In Problem 2, a large amount of coding is required to achieve nontrivial generalization error, and the training error remains large as $d$ increases until $d = m/2$, where the training error drops rapidly.

More precisely, we will arrange things so that the first model selection problem (Problem 1) has the following properties: (1) the function $\hat{\epsilon}_1(d)$ lies between two linear functions with $y$-intercepts $\eta_1$ and $\eta_1(1 - \eta_1)$ and common $x$-intercept $2\eta_1(1-\eta_1)m \le m/2$; and (2) $\epsilon_1(d)$ is minimized at $d = 0$, and furthermore, for any constant $c$ we have $\epsilon_1(cm) \ge c/2$. We will next arrange that the second model selection problem (Problem 2) will obey: (1) the function $\hat{\epsilon}_2(d) = a_1$ for $0 \le d \le 2\eta_1(1-\eta_1)m \le m/2$, where $\eta_1(1-\eta_1) \le a_1$; and (2) the function $\epsilon_2(d)$ is lower bounded by $a_1$ for $0 \le d < m/2$, but $\epsilon_2(m/2) = 0$. In Figure 9 we illustrate the conditions on $\hat{\epsilon}(d)$ for the two problems, and also include hypothetical instances of $\hat{\epsilon}_1(d)$ and $\hat{\epsilon}_2(d)$ that are consistent with these conditions (and are furthermore representative of the "true" behavior of the $\hat{\epsilon}(d)$ functions actually obtained for the two problems we define momentarily).

We can now give the underlying logic of the proof using the hypothetical $\hat{\epsilon}_1(d)$ and $\hat{\epsilon}_2(d)$. Let $\tilde{d}_1$ denote the complexity chosen by $G$ for Problem 1, and let $\tilde{d}_2$ be defined similarly. First consider the behavior of $G$ on Problem 2. In this problem we know by our assumptions on $\epsilon_2(d)$ that if $G$ fails to choose $\tilde{d}_2 \ge m/2$, then $\epsilon_G \ge a_1$, already giving a constant lower bound on $\epsilon_G$ for this problem. This is the easier case; thus let us assume that $\tilde{d}_2 \ge m/2$, and consider the behavior of $G$ on Problem 1. Referring to Figure 9, we see that for $0 \le d \le D_0$, $\hat{\epsilon}_1(d) \ge \hat{\epsilon}_2(d)$, and thus

For $0 \le d \le D_0$: $G(\hat{\epsilon}_1(d), d/m) \ge G(\hat{\epsilon}_2(d), d/m)$   (18)

(because penalty-based algorithms assign greater penalties for greater training error or greater complexity). Since we have assumed that $\tilde{d}_2 \ge m/2$, we know that

For $d \le m/2$: $G(\hat{\epsilon}_2(d), d/m) \ge G(\hat{\epsilon}_2(\tilde{d}_2), \tilde{d}_2/m)$   (19)

and in particular, this inequality holds for $0 \le d \le D_0$. On the other hand, by our choice of $\hat{\epsilon}_1(d)$ and $\hat{\epsilon}_2(d)$, $\hat{\epsilon}_1(\tilde{d}_2) = \hat{\epsilon}_2(\tilde{d}_2) = 0$. Therefore,

$G(\hat{\epsilon}_1(\tilde{d}_2), \tilde{d}_2/m) = G(\hat{\epsilon}_2(\tilde{d}_2), \tilde{d}_2/m).$   (20)

Combining the two inequalities above (Equation 18 and Equation 19) with Equation 20, we have that

For $0 \le d \le D_0$: $G(\hat{\epsilon}_1(d), d/m) \ge G(\hat{\epsilon}_1(\tilde{d}_2), \tilde{d}_2/m)$   (21)

from which it directly follows that in Problem 1, $G$ cannot choose $0 \le \tilde{d}_1 \le D_0$. By the second condition on Problem 1 above, this implies that $\epsilon_G \ge \epsilon_1(D_0)$; if we arrange that $D_0 = cm$ for some constant $c$, then we have a constant lower bound on $\epsilon_G$ for Problem 1.

Due to space limitations, we defer the precise descriptions of Problems 1 and 2 for the full paper. However, in Problem 1 the classes $F_d$ are essentially those for the intervals model selection problem, and in Problem 2 the $F_d$ are based on parity functions.

We note that although Theorem 3 was designed to create two model selection problems with the most disparate behavior possible, the proof technique can be used to give lower bounds on the generalization error of penalty-based algorithms under more general settings. In the full paper we will also argue that for the two problems considered, the generalization error of CV is in fact close to $\min_d\{\epsilon_i(d)\}$ (that is, within a small additive term that decreases rapidly with $m$) for both problems. Finally, we remark that Theorem 3 can be strengthened to hold for a single model selection problem (that is, a single function class sequence and distribution), with only the target function changing to obtain the two different behaviors. This rules out the salvation of the penalty-based algorithms via problem-specific parameters to be tuned, such as "effective dimension".

8 Conclusions

Based on both our experimental and theoretical results, we offer the following conclusions:

Model selection algorithms that attempt to reconstruct the curve $\epsilon(d)$ solely by examining the curve $\hat{\epsilon}(d)$ often have a tendency to overcode or undercode in their hypothesis for small sample sizes, which is exactly the sample size regime in which model selection is an issue. Such tendencies are not easily eliminated without suffering the reverse tendency.

There exist model selection problems in which a hypothesis whose complexity is close to the sample size should be chosen, and in which a hypothesis whose complexity is close to 0 should be chosen, but that generate $\hat{\epsilon}(d)$ curves with insufficient information to distinguish which is the case. The penalty-based algorithms cannot succeed in both cases, whereas CV can.

The error of CV can be bounded in terms of the error of any other algorithm. The only cases in which the CV error may be dramatically worse are those in which phase transitions occur in the underlying learning curves at a sample size larger than that held out for training by CV.

Thus we see that both types of algorithms considered have their own Achilles' heel. For penalty-based algorithms, it is an inability to distinguish two types of problems that call for drastically different hypothesis complexities. For CV, it is phase transitions that unluckily fall between $(1-\gamma)m$ examples and $m$ examples. On balance, we feel that the evidence we have gathered favors use of CV in most common circumstances. Perhaps the best way of stating our position is as follows: given the general upper bound on CV error we have obtained, and the limited applicability of any fixed penalty-based rule demonstrated by Theorem 3 and the experimental results, the burden of proof lies with the practitioner who favors a penalty-based algorithm over CV. In other words, such a practitioner should have concrete evidence (experimental or theoretical) that their algorithm will outperform CV on the problem of interest. Such evidence must arise from detailed problem-specific knowledge, since we have demonstrated here the diversity of behavior that is possible in natural model selection problems.

Acknowledgements

We give warm thanks to Yoav Freund and Ronitt Rubinfeld for their collaboration on various portions of the work presented here, and for their insightful comments. Thanks to Sebastian Seung and Vladimir Vapnik for interesting and helpful conversations.

References

[1] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37:1034-1054, 1991.
[2] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
[3] D. Haussler, M. Kearns, H. S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 76-87, 1994.
[4] J. R. Quinlan and R. L. Rivest. Inferring decision trees using the minimum description length principle. Information and Computation, 80(3):227-248, 1989.
[5] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.
[6] J. Rissanen. Stochastic complexity and modeling. Annals of Statistics, 14(3):1080-1100, 1986.
[7] J. Rissanen. Stochastic Complexity in Statistical Inquiry, volume 15 of Series in Computer Science. World Scientific, 1989.
[8] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review, A45:6056-6091, 1992.
[9] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36:111-147, 1974.
[10] M. Stone. Asymptotics for and against cross-validation. Biometrika, 64(1):29-35, 1977.
[11] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
[12] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.

Figure 1: Experimental plots of the functions $\epsilon(d)$ (lower curve with local minimum), $\epsilon_\gamma(d)$ (upper curve with local minimum) and $\hat{\epsilon}(d)$ (monotonically decreasing curve) versus complexity $d$ for a target function of 100 alternating intervals, sample size 2000 and noise rate $\eta = 0.2$. Each data point represents an average over 10 trials. The flattening of $\epsilon(d)$ and $\epsilon_\gamma(d)$ occurs at the point where the noisy sample can be realized with no training error.

Figure 2: Experimental plots of generalization errors $\epsilon_{MDL}(m)$ (most rapid initial decrease), $\epsilon_{CV}(m)$ (intermediate initial decrease) and $\epsilon_{GRM}(m)$ (least rapid initial decrease) versus sample size $m$ for a target function of 100 alternating intervals and noise rate $\eta = 0.20$. Each data point represents an average over 10 trials.

Figure 3: Experimental plots of hypothesis lengths $\tilde{d}_{MDL}(m)$ (most rapid initial increase), $\tilde{d}_{CV}(m)$ (intermediate initial increase) and $\tilde{d}_{GRM}(m)$ (least rapid initial increase) versus sample size $m$ for a target function of 100 alternating intervals and noise rate $\eta = 0.20$. Each data point represents an average over 10 trials.


Figure 4: MDL total penalty $\mathcal{H}(\hat{\epsilon}(d)) + \mathcal{H}(d/m)$ versus complexity $d$ for a single run on 2000 examples of a target function of 100 alternating intervals and noise rate $\eta = 0.20$. There is a local minimum at approximately $d = 100$, and the global minimum at the point of consistency with the noisy sample.

Figure 5: MDL total penalty $\mathcal{H}(\hat{\epsilon}(d)) + \mathcal{H}(d/m)$ versus complexity $d$ for a single run on 4000 examples of a target function of 100 alternating intervals and noise rate $\eta = 0.20$. The global minimum has now switched from the point of consistency to the target value of 100.

Figure 6: GRM total penalty $\hat{\epsilon}(d) + (d/m)(1 + \sqrt{1 + \hat{\epsilon}(d)m/d})$ versus complexity $d$ for a single run on 2000 examples of a target function of 100 alternating intervals and noise rate $\eta = 0.20$.

Figure 7: Experimental plots of generalization error $\epsilon_{GRM}(m)$ using complexity penalty multipliers 1.0 (slow initial decrease) and 0.5 (rapid initial decrease) on the complexity penalty term $(d/m)(1 + \sqrt{1 + \hat{\epsilon}(d)m/d})$, versus sample size $m$, on a target of 100 alternating intervals and noise rate $\eta = 0.20$. Each data point represents an average over 10 trials.

Figure 8: Experimental plots of hypothesis length $\tilde{d}_{GRM}(m)$ using complexity penalty multipliers 1.0 (slow initial increase) and 0.5 (rapid initial increase) on the complexity penalty term $(d/m)(1 + \sqrt{1 + \hat{\epsilon}(d)m/d})$, versus sample size $m$, on a target of 100 alternating intervals and noise rate $\eta = 0.20$. Each data point represents an average over 10 trials.

Figure 9: Illustration of the proof of Theorem 3 (the labeled features are $D_0$, the best complexity and the training error curve for Problem 1, and the best complexity and the training error curve for Problem 2). The dark lines indicate typical behavior for the two training error curves $\hat{\epsilon}_1(d)$ and $\hat{\epsilon}_2(d)$, and the dashed lines indicate the provable bounds on $\hat{\epsilon}_1(d)$.