
The Apparent Tradeoff between Computational Complexity and Generalization of Learning: A Biased Survey of our Current Knowledge

Shai Ben-David

Technion

Haifa, Israel

Introduction

The complexity of learning is measured mainly along two axes: information and computation.

Information complexity is concerned with the generalization performance of learning. Namely, how many training examples are needed? What is the convergence rate of a learner’s estimate to the true population parameters?

The computational complexity of learning is concerned with the computation applied to the data in order to deduce from it the learner’s hypothesis.

It seems that when an algorithm improves with respect to one of these measures, it deteriorates with respect to the other.

Outline of this Talk

1. Some background.

2. Survey of recent pessimistic computational hardness results.

3. A discussion of three different directions for solutions:

a. The Support Vector Machines approach.

b. The Boosting approach (an agnostic learning variant).

c. Algorithms that are efficient for ‘well-behaved’ inputs.

The Label Prediction Problem

Formal Definition:

Given some domain set X.

A sample S of labeled members of X is generated by some (unknown) distribution.

For a next point x, predict its label.

Example:

Data extracted from grant applications.

Should the current application be funded?

Applications in the sample are labeled by the success/failure of the resulting projects.

Two Basic Competing Models

PAC framework:

Sample labels are consistent with some h in H.

The learner's hypothesis is required to meet an absolute upper bound on its error.

Agnostic framework:

No prior restriction on the sample labels.

The required upper bound on the hypothesis error is only relative (to the best hypothesis in the class).
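In symbols, the two requirements can be written as follows (a standard formulation; ε and δ are the usual accuracy and confidence parameters, not displayed on the slide):

\[
\text{PAC:}\quad \Pr_{S\sim P^m}\Big[\mathrm{Er}_P\big(A(S)\big)\le \epsilon\Big]\ \ge\ 1-\delta
\qquad
\text{Agnostic:}\quad \Pr_{S\sim P^m}\Big[\mathrm{Er}_P\big(A(S)\big)\le \min_{h\in H}\mathrm{Er}_P(h)+\epsilon\Big]\ \ge\ 1-\delta
\]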

Basic Agnostic Learning Paradigm

Choose a hypothesis class H of subsets of X.

For an input sample S, find some h in H that fits S well.

For a new point x, predict a label according to its membership in h.
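A minimal sketch of this paradigm in Python, using one-dimensional threshold functions as a stand-in for H (the class and all names here are illustrative, not from the talk):

import numpy as np

def erm(sample_x, sample_y, hypothesis_class):
    # "Fit S well": pick the hypothesis with the highest agreement rate on S.
    def agreement(h):
        return np.mean([h(x) == y for x, y in zip(sample_x, sample_y)])
    return max(hypothesis_class, key=agreement)

# Illustrative H: threshold functions x -> 1[x > t].
H = [lambda x, t=t: int(x > t) for t in np.linspace(0, 1, 21)]

S_x = np.random.rand(50)
S_y = (S_x > 0.4).astype(int) ^ (np.random.rand(50) < 0.1)  # noisy labels
best_h = erm(S_x, S_y, H)
print(best_h(0.7))  # predict the label of a new point by its membership in h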

The Mathematical Justification

Assume both the training sample and the test point are generated by the same distribution over X × {0,1}. Then, if H is not too rich (e.g., has small VC-dimension), for every h in H the agreement ratio of h on the sample S is a good estimate of its probability of success on a new x.
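One standard way to quantify “good estimate” (an illustrative uniform-convergence bound; the talk does not display the constants): with d = VCdim(H) and m = |S|, with probability at least 1 − δ over the sample,

\[
\sup_{h\in H}\big|\mathrm{Er}_S(h)-\mathrm{Er}_P(h)\big|
\ \le\ O\!\left(\sqrt{\frac{d\log(m/d)+\log(1/\delta)}{m}}\right).
\]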

The Computational Problem

Input: A finite set S of {0,1}-labeled points in R^n.

Output: Some ‘hypothesis’ function h in H that maximizes the number of correctly classified points of S.

We shall focus on the class of linear half-spaces:

Find the best hyperplane for arbitrary samples S: NP-hard.

Find a hyperplane approximating the optimal one for arbitrary S: ?

Find the best hyperplane for separable S: feasible (Perceptron algorithms).
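For the separable case, a minimal sketch of one standard Perceptron variant in plain NumPy (illustrative; not the talk's pseudocode):

import numpy as np

def perceptron(X, y, max_epochs=1000):
    # X: m x n array, y: labels in {-1, +1}; assumes S is linearly separable.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # standard Perceptron update
                b += yi
                mistakes += 1
        if mistakes == 0:                # a separating hyperplane was found
            return w, b
    return w, b                          # data may not be separable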

Hardness-of-Approximation Results

For each of the following classes, approximating the best agreement rate for h in H (on a given input sample S) up to some constant ratio is NP-hard:

Monomials, constant-width monotone monomials, half-spaces, balls, axis-aligned rectangles, and threshold NNs with constant first-layer width.

[BD-Eiron-Long], [Bartlett-BD]

Gaps in Our Knowledge

The additive constants in the hardness-of-approximation results are 1%-2%. They do not rule out efficient algorithms achieving, say, 90% of the optimal success rate.

However, currently there are no efficient algorithms performing significantly above 50% of the optimal success rate.

We shall discuss three solution paradigms:

Kernel-based methods (including Support Vector Machines).

Boosting (adapted to the agnostic setting).

Data-Dependent Success Approximation algorithms.

The Types of Errors to be Considered

The errors are measured relative to the class H and the underlying distribution D:

Best regressor for D: the best possible predictor, not necessarily in H.

Arg min{ Er_D(h) : h ∈ H }: the best hypothesis in the class; its gap from the best regressor is the approximation error.

Arg min{ Er_S(h) : h ∈ H }: the empirical (sample) error minimizer; its gap from the best in H is the estimation error.

Output of the learning algorithm: its gap from the empirical minimizer is the computational error.
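Written as a decomposition (a standard way to express the picture; h* denotes the best regressor, h_H the best hypothesis in H, h_S the empirical minimizer, and A(S) the algorithm's output):

\[
\mathrm{Er}_D\big(A(S)\big)-\mathrm{Er}_D(h^*)
=\underbrace{\mathrm{Er}_D(h_H)-\mathrm{Er}_D(h^*)}_{\text{approximation error}}
+\underbrace{\mathrm{Er}_D(h_S)-\mathrm{Er}_D(h_H)}_{\text{estimation error}}
+\underbrace{\mathrm{Er}_D\big(A(S)\big)-\mathrm{Er}_D(h_S)}_{\text{computational error}}
\]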

The Boosting Solution: Basic Idea

“Extend the concept class as much as it can

be done without hurting generalizability”

The Boosting Idea

Given a hypothesis class H and a labeled sample S, rather than searching for a good hypothesis in H, search in a larger class Co(H) (convex combinations of members of H).

Important gains:

1) A fine approximation can be found in Co(H) in time polynomial in the time of finding a coarse approximation in H.

2) The generalization bounds do not deteriorate when moving from H to Co(H).

Boosting Solution: Weak Learners

An algorithm is a weak learner for a class H if, on every H-labeled weighted sample S, it outputs some h in H such that Er_S(h) < ½ − γ (for some fixed γ > 0).

Boosting Solution: the Basic Result

Theorem [Schapire ’89, Freund ’90]:

There is an algorithm that, having access to an efficient weak learner, for a P-random H-sample S and parameters ε, δ, finds some h in Co(H) such that Er_P(h) < ε, with probability ≥ 1 − δ, in time polynomial in 1/ε, 1/δ (and |S|).
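A minimal sketch of the boosting loop (AdaBoost-style reweighting; this is one concrete instance, not necessarily the algorithm the theorem refers to). The weak_learner argument is assumed to take a weighted sample and return some h in H with weighted error below ½ − γ:

import numpy as np

def boost(X, y, weak_learner, rounds):
    # y and the weak hypotheses' outputs are in {-1, +1}.
    # Returns a weighted vote over weak hypotheses, i.e. an element of Co(H).
    m = len(y)
    D = np.full(m, 1.0 / m)               # distribution over the sample
    hs, alphas = [], []
    for _ in range(rounds):
        h = weak_learner(X, y, D)          # assumed weighted error < 1/2 - gamma
        pred = np.array([h(x) for x in X])
        err = np.sum(D[pred != y])
        if err >= 0.5:                     # weak-learning assumption violated
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        D *= np.exp(-alpha * y * pred)     # reweight: focus on the mistakes
        D /= D.sum()
        hs.append(h); alphas.append(alpha)
    return lambda x: int(np.sign(sum(a * h(x) for a, h in zip(alphas, hs))))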

The Boosting Solution in Practice

The boosting approach was embraced by practitioners of Machine Learning and applied, quite successfully, to a wide variety of real-life problems.

Theoretical Problems with the Boosting Solution

The boosting results assume that the input sample labeling is consistent with some function in H (the PAC framework assumption).

In practice this is never the case.

The boosting algorithm’s success is based on having access to an efficient weak learner – no such learner exists.

Boosting Theory: Attempt to Recover

Can one settle for weaker, more realistic, assumptions?

Agnostic weak learners:

An algorithm is a weak agnostic learner for H if, for every labeled sample S, it finds h in H s.t. Er_S(h) < Er_S(Opt(H)) + γ.

Revised Boosting Solution

Theorem [B-D, Long, Mansour]:

There is an algorithm that, having access to a weak agnostic learner, computes an h such that

Er_P(h) < c · (Er_P(Opt(H)))^c'

(where c and c' are constants depending on γ, and h is in Co(H)).

Problems with the Boosting Solution

Only for a restricted family of classes are there known efficient agnostic weak learners.

The generalization bound we currently have contains an annoying exponentiation of the optimal error.

Can this be improved?

The SVM Solution

“Extend the hypothesis class to guarantee computational feasibility.”

Rather than bothering with non-separable data, make the data separable by embedding it into some high-dimensional R^n.


The SVM Paradigm

Choose an embedding of the domain X into some high-dimensional Euclidean space, so that the data sample becomes (almost) linearly separable.

Find a large-margin data-separating hyperplane in this image space, and use it for prediction.

Important gain: when the data is separable, finding such a hyperplane is computationally feasible.
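A minimal sketch of this paradigm using scikit-learn's SVC (assuming scikit-learn is available; the RBF kernel and the parameter values are illustrative choices, not the talk's):

import numpy as np
from sklearn.svm import SVC

# Toy data that is not linearly separable in the original space.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)   # label by distance from origin

# The kernel implicitly embeds X into a high-dimensional space,
# where a large-margin separating hyperplane is sought.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.predict([[0.2, 0.1], [2.0, 1.5]]))          # predict labels of new points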

The SVM Solution in Practice

The SVM approach was embraced by practitioners of Machine Learning and applied, very successfully, to a wide variety of real-life problems.

A Potential Problem: Generalization

VC-dimension bounds: The VC-dimension of the class of half-spaces in R^n is n+1.
Can we guarantee a low dimension for the embedding’s range?

Margin bounds: Regardless of the Euclidean dimension, generalization can be bounded as a function of the margins of the hypothesis hyperplane.
Can one guarantee the existence of a large-margin separation?
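One common form of such a margin bound (illustrative; the talk does not display the exact expression): with probability at least 1 − δ over an m-point sample whose points have norm at most R, for every hyperplane h,

\[
\mathrm{Er}_P(h)\ \le\ \mathrm{Er}^{\gamma}_S(h)\ +\ O\!\left(\sqrt{\frac{(R/\gamma)^2\log^2 m + \log(1/\delta)}{m}}\right),
\]

where Er^γ_S(h) is the fraction of sample points h fails to classify correctly with margin at least γ.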

An Inherent Limitation of SVMs

In “most” cases the data cannot be made separable unless the mapping is into dimension Ω(|X|).

This happens even for classes of small VC-dimension.

For “most” classes, no mapping for which concept-classified data becomes separable has large margins.

In both cases the generalization bounds are lost!

A Third Proposal for a Solution: Data-Dependent Success Approximations

Note that the definition of success for agnostic learning is data-dependent; the success rate of the learner on S is compared to that of the best h in H.

We extend this approach to a data-dependent success definition for approximations; the required success rate is a function of the input data.

Data-Dependent Success Approximations

While Boosting and kernel-based methods extend the class from which the algorithm picks its hypothesis, there is a natural alternative for circumventing the hardness-of-approximation results: shrinking the comparison class.

Our DDSA algorithms do this by imposing margins on the comparison-class hypotheses.

Data-Dependent Success Definition for Half-spaces

A learning algorithm A is μ-margin successful if, for every input S ⊆ R^n × {0,1},

|{(x,y) ∈ S : A(S)(x) = y}| > |{(x,y) ∈ S : h(x) = y and d(h, x) > μ}|

for every half-space h.
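A small sketch of the right-hand-side count for a half-space h given by (w, b), using Euclidean distance to the hyperplane (all names here are illustrative):

import numpy as np

def margin_correct_count(S_x, S_y, w, b, mu):
    # Count points that h = sign(w.x + b) labels correctly AND that lie
    # at distance greater than mu from the hyperplane {x : w.x + b = 0}.
    w = np.asarray(w, dtype=float)
    margins = (S_x @ w + b) / np.linalg.norm(w)       # signed distance of each point
    correct = (np.sign(margins) == np.where(S_y == 1, 1, -1))
    return int(np.sum(correct & (np.abs(margins) > mu)))

# A mu-margin-successful learner must classify at least this many points of S
# correctly, for every half-space (w, b).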

Some Intuition

If there exists some optimal h which separates with generous margins, then a μ-margin algorithm must produce an optimal separator.

On the other hand, if every good separator can be degraded by small perturbations, then a μ-margin algorithm can settle for a hypothesis that is far from optimal.

The positive result

For every positive μ there is a μ-margin algorithm whose running time is polynomial in |S| and n.

A Complementing Hardness Result

Unless P = NP, no algorithm can do this in time polynomial in 1/μ (as well as in |S| and n).

Some Obvious Open Questions

Is there a parameter that can be used to ensure good generalization for kernel-based (SVM-like) methods?

Are there efficient Agnostic Weak Learners

for potent hypothesis classes?

Is there an inherent trade-off between

the generalization ability and the computational

complexity of algorithms?

THE END

“Old” Work

Hardness results: Blum and Rivest showed that it is NP-hard to optimize the weights of a 3-node NN.

Similar hardness-of-optimization results for other classes followed.

But learning can settle for less than optimization.

Efficient algorithms: known perceptron algorithms are efficient for linearly separable input data (or the image of such data under ‘tamed’ noise).

But natural data sets are usually not separable.

A μ-margin Perceptron Algorithm

On input S, consider all k-size sub-samples.

For each such sub-sample, find its largest-margin separating hyperplane.

Among all the (~|S|^k) resulting hyperplanes, choose the one with the best performance on S.

(The choice of k is a function of the desired margin μ.)
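A rough sketch of this enumeration (brute force, so exponential in k; the max-margin step uses a hard-margin linear SVM from scikit-learn as a stand-in for “largest-margin separating hyperplane”, and all names are illustrative):

import itertools
import numpy as np
from sklearn.svm import SVC

def mu_margin_perceptron(X, y, k):
    # X: m x n numpy array, y: labels in {0, 1}. Enumerate all k-point
    # sub-samples, fit a max-margin hyperplane to each one that contains
    # both labels, and keep the hyperplane that performs best on all of S.
    best_acc, best_clf = -1.0, None
    for idx in itertools.combinations(range(len(y)), k):
        idx = list(idx)
        if len(set(y[idx])) < 2:           # need both labels to fit a separator
            continue
        clf = SVC(kernel="linear", C=1e6)  # very large C ~ (almost) hard margin
        clf.fit(X[idx], y[idx])
        acc = np.mean(clf.predict(X) == y)
        if acc > best_acc:
            best_acc, best_clf = acc, clf
    return best_clf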

Other Margin Algorithms

Each of the following algorithms can replace the “find the largest-margin separating hyperplane” step:

The usual “Perceptron Algorithm”.

“Find a point of equal distance from x1, …, xk”.

Phil Long’s ROMMA algorithm.

These are all very fast online algorithms.