
Page 1: Credentials

Our understanding of this topic is based on the work of many researchers. In particular:

Rosa Arriaga, Peter Bartlett, Avrim Blum, Bhaskar DasGupta, Nadav Eiron, Barbara Hammer, David Haussler, Klaus Höffgen, Lee Jones, Michael Kearns, Christian Kuhlmann, Phil Long, Ron Rivest, Hava Siegelmann, Hans Ulrich Simon, Eduardo Sontag, Leslie Valiant, Kevin Van Horn, Santosh Vempala, Van Vu.

Page 2: Introduction

Neural nets are the most popular, effective, practical … learning tools. Yet, after almost 40 years of research, there are no efficient algorithms for learning with NNs.

WHY?

Page 3: Outline of this Talk

1. Some background.

2. Survey of recent strong hardness results.

3. New efficient learning algorithms for some basic NN architectures.

Page 4: The Label Prediction Problem

Formal definition: Given some domain set X, a sample S of labeled members of X is generated by some (unknown) distribution. For a next point x, predict its label.

Example: data files of drivers. Drivers in a sample are labeled according to whether they filed an insurance claim. Will the customer you interview file a claim?

Page 5: The Agnostic Learning Paradigm

Choose a hypothesis class H of subsets of X.

For an input sample S, find some h in H that fits S well.

For a new point x, predict a label according to its membership in h.
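To make the paradigm concrete, here is a minimal sketch for a toy hypothesis class of 1D threshold functions; the class choice and all names are illustrative, not something the slides prescribe.

```python
# A minimal sketch of the agnostic paradigm for a toy class:
# 1D thresholds H = { h_t : h_t(x) = 1 iff x >= t }.

def fit_threshold(sample):
    """Empirical-agreement maximization: only thresholds at data points
    (plus the two trivial hypotheses) need to be tried."""
    candidates = [x for x, _ in sample] + [float("-inf"), float("inf")]
    best_t, best_agree = None, -1
    for t in candidates:
        agree = sum(1 for x, y in sample if (x >= t) == (y == 1))
        if agree > best_agree:
            best_t, best_agree = t, agree
    return best_t

def predict(t, x):
    """Label a new point by its membership in the chosen hypothesis h_t."""
    return 1 if x >= t else 0

# Usage: the labels need not be consistent with any h in H (agnostic setting).
S = [(0.1, 0), (0.4, 0), (0.5, 1), (0.9, 1), (0.7, 0)]
t = fit_threshold(S)
print(t, predict(t, 0.8))
```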

Page 6: The Mathematical Justification

If H is not too rich (has small VC-dimension), then, for every h in H, the agreement ratio of h on the sample S is a good estimate of its probability of success on a new x.

Page 7: The Mathematical Justification, Formally

If S is sampled i.i.d. by some distribution D over X × {0, 1}, then, with probability > 1 − δ, for all h ∈ H:

$$\left|\;\underbrace{\Pr_{(x,y)\sim D}\bigl[h(x)=y\bigr]}_{\text{probability of success}}\;-\;\underbrace{\frac{|\{(x,y)\in S \,:\, h(x)=y\}|}{|S|}}_{\text{agreement ratio}}\;\right|\;\le\;\sqrt{\frac{c\,\bigl(\mathrm{VCdim}(H)+\ln(1/\delta)\bigr)}{|S|}}$$
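Plugging numbers into the bound gives a feel for the sample sizes involved. The constant c is left unspecified on the slide, so c = 1 below is purely illustrative:

```python
import math

def vc_bound(vcdim, m, delta, c=1.0):
    """Upper bound on |true success probability - sample agreement ratio|
    from the uniform-convergence bound above. The constant c is not
    specified on the slide; c=1 here is an illustrative assumption."""
    return math.sqrt(c * (vcdim + math.log(1.0 / delta)) / m)

# Example: half-spaces in R^10 (VC-dimension 11), 10,000 samples,
# confidence 95% (delta = 0.05) -> deviation of roughly 0.037.
print(vc_bound(vcdim=11, m=10_000, delta=0.05))
```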

Page 8: A Comparison to 'Classic' PAC

PAC framework:

Sample labels are consistent with some h in H. The learner's hypothesis is required to meet an absolute upper bound on its error.

Agnostic framework:

No prior restriction on the sample labels. The required upper bound on the hypothesis error is only relative (to the best hypothesis in the class).

Page 9: The Model Selection Issue

The output of the learning algorithm is compared against a chain of benchmarks:

Best regressor for P
(approximation error)
Argmin{ Er(h) : h ∈ H }, the best hypothesis in the class H
(estimation error)
Argmin{ Er_S(h) : h ∈ H }, the empirical-error minimizer
(computational error)
Output of the learning algorithm

Here Er(h) is the true error of h under the distribution P, and Er_S(h) is its empirical error on the sample S.
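One way to formalize the diagram, writing h* = Argmin{ Er(h) : h ∈ H } and ĥ_S = Argmin{ Er_S(h) : h ∈ H }; the bracketing into three additive gaps is the standard reading, not spelled out on the slide:

$$Er\bigl(A(S)\bigr) - Er(\text{best regressor for } P) \;=\; \underbrace{Er\bigl(A(S)\bigr) - Er(\hat h_S)}_{\text{computational error}} \;+\; \underbrace{Er(\hat h_S) - Er(h^{*})}_{\text{estimation error}} \;+\; \underbrace{Er(h^{*}) - Er(\text{best regressor for } P)}_{\text{approximation error}}$$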

Page 10: The Big Question

Are there hypothesis classes that:

1. Are expressive (small approximation error)?
2. Have small VC-dimension (small generalization error)?
3. Have efficient good-approximation algorithms?

NNs are quite successful as approximators (property 1). If they are small (relative to the data size), then they also satisfy property 2. We investigate property 3 for such NNs.

Page 11: The Computational Problem

For some class H of domain subsets:

Input: a finite set S of {0, 1}-labeled points in R^n.

Output: some h in H that maximizes the number of correctly classified points of S.

Page 12: "Old" Work

Hardness results: Blum and Rivest showed that it is NP-hard to optimize the weights of a three-node NN. Similar hardness-of-optimization results for other classes followed. But learning can settle for less than optimization.

Efficient algorithms: the known perceptron algorithms are efficient for linearly separable input data (or the image of such data under 'tamed' noise). But natural data sets are usually not separable.

Page 13: The Focus of this Tutorial

The results mentioned above (Blum and Rivest, etc.) show that for many "natural" NNs, finding such an S-optimal h in H is NP-hard.

Are there efficient algorithms that output good approximations to the S-optimal hypothesis?

Page 14: Hardness-of-Approximation Results

For each of the following classes there exists some constant such that approximating the best agreement rate for h in H (on a given input sample S) up to this constant ratio is NP-hard [BD-Eiron-Long; Bartlett-BD]:

Monomials; constant-width monotone monomials; half-spaces; balls; axis-aligned rectangles; threshold NNs with constant first-layer width.

Page 15: How Significant are Such Hardness Results?

All the above results are proved via reductions from some known-to-be-hard problem.

Page 16: Relevant Questions

1. Samples that are hard for one H are easy for another (a model selection issue).

2. Where do ‘naturally generated’ samples fall?

Page 17: Data-Dependent Success

Note that the definition of success for agnostic learning is data-dependent: the success rate of the learner on S is compared to that of the best h in H.

We extend this approach to a data-dependent success definition for approximations: the required success rate is a function of the input data.

Page 18: A New Success Criterion

A learning algorithm A is μ-margin successful if, for every input S ⊆ R^n × {0, 1},

|{(x, y) ∈ S : A(S)(x) = y}| ≥ |{(x, y) ∈ S : h(x) = y and d(h, x) > μ}|

for every h ∈ H. (Here d(h, x) is the distance of x from the decision boundary of h.)
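To make the criterion concrete for half-spaces, the sketch below takes d(h, x) to be the Euclidean distance from x to the separating hyperplane and represents each h as a pair (w, b); both choices, and all names, are illustrative assumptions:

```python
import numpy as np

def margin_correct_count(w, b, S, mu):
    """Count points that the half-space h = {x : w.x + b >= 0} labels
    correctly AND that lie at distance > mu from its boundary
    (d(h, x) = |w.x + b| / ||w||, the geometric margin)."""
    w = np.asarray(w, dtype=float)
    count = 0
    for x, y in S:
        score = w @ np.asarray(x, dtype=float) + b
        pred = 1 if score >= 0 else 0
        dist = abs(score) / np.linalg.norm(w)
        if pred == y and dist > mu:
            count += 1
    return count

def is_mu_margin_successful(A_of_S, H, S, mu):
    """Check the criterion on one input S: the algorithm's correct count
    must match or beat the mu-margin-correct count of every h = (w, b)
    in the (finite, for this sketch) class H."""
    algo_correct = sum(1 for x, y in S if A_of_S(x) == y)
    return all(algo_correct >= margin_correct_count(w, b, S, mu)
               for (w, b) in H)
```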

Page 19: Some Intuition

If there exists some optimal h which separates with generous margins, then a margin algorithm must produce an optimal separator.

On the other hand,

If every good separator can be degraded by small perturbations, then a margin algorithm can settle for a hypothesis that is far from optimal.

Page 20: First Virtue of the New Measure

The margin requirement is a rigorous performance guarantee that can be achieved by efficient algorithms (unlike the common approximate-optimization criterion).

Page 21: Another Appealing Feature of the New Criterion

It turns out that for each of the three classes analysed so far (half-spaces, balls, and hyper-rectangles), there exists a critical margin value μ₀ such that:

μ-margin learnability is NP-hard for all μ < μ₀,

while, on the other hand,

for any μ > μ₀, there exists a poly-time μ-margin learning algorithm.

Page 22: A New Positive Result [B-D, Simon]

For every positive μ, there is a poly-time algorithm that classifies correctly as many input points as any half-space can classify correctly with margin > μ.

Page 23: The Positive Result

For every positive μ, there is a poly-time algorithm that classifies correctly as many input points as any half-space can classify correctly with margin > μ.

A Complementing Hardness Result

Unless P = NP, no algorithm can do this in time polynomial in 1/μ (and in |S| and n).

Page 24: Proof of the Positive Result (Outline)

We apply the following chain of reductions:

Best Separating Hyperplane
→ Best Separating Homogeneous Hyperplane
→ Densest Hemisphere (unsupervised input)
→ Densest Open Ball

Page 25: The Densest Open Ball Problem

Input: a finite set P of points on the unit sphere S^(n-1).

Output: an open ball B of radius 1 such that |B ∩ P| is maximized.

[Figure: an open unit ball B intersecting the sphere S^(n-1)]

Page 26: Algorithms for the Densest Open Ball Problem

Alg. 1. For every x1, …, xn ∈ P:

• find the center Z(x1, …, xn) of their minimal enclosing ball;

• check |B[Z(x1, …, xn), 1] ∩ P|.

Output the ball with maximum intersection with P.

Running time: ~|P|^n, exponential in n!

Page 27: Another Algorithm for the Densest Open Ball Problem

Fix a parameter k << n.

Alg. 2. Apply Alg. 1 only to subsets of size ≤ k, i.e., for every x1, …, xk ∈ P:

• find the center Z(x1, …, xk) of their minimal enclosing ball;

• check |B[Z(x1, …, xk), 1] ∩ P|.

Output the ball with maximum intersection with P.

Running time: ~|P|^k.

But does it output a good hypothesis? (A sketch of both algorithms appears below.)
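A sketch of both algorithms; Alg. 1 is the special case k = |P|. The slides do not say how the minimal-enclosing-ball center Z is computed, so the simple Bădoiu-Clarkson iteration below is an assumed stand-in, and all names are illustrative:

```python
import itertools
import numpy as np

def meb_center(points, iters=200):
    """Approximate center Z of the minimal enclosing ball of `points`
    via the Badoiu-Clarkson iteration: repeatedly step toward the
    farthest point with shrinking step size. (An assumption; the
    slides leave the computation of Z unspecified.)"""
    pts = np.asarray(points, dtype=float)
    c = pts[0].copy()
    for t in range(1, iters + 1):
        far = pts[np.argmax(np.linalg.norm(pts - c, axis=1))]
        c += (far - c) / (t + 1)
    return c

def densest_ball(P, k):
    """Alg. 2: for every subset of size <= k, place a unit ball at the
    (approximate) MEB center of the subset and count the points of P it
    contains; return the best ball. k = len(P) recovers Alg. 1."""
    P = np.asarray(P, dtype=float)
    best_center, best_count = None, -1
    for size in range(1, k + 1):
        for subset in itertools.combinations(range(len(P)), size):
            z = meb_center(P[list(subset)])
            # strict inequality: the ball is open
            count = int(np.sum(np.linalg.norm(P - z, axis=1) < 1.0))
            if count > best_count:
                best_center, best_count = z, count
    return best_center, best_count
```

The nested loop visits ~|P|^k subsets, matching the running time stated on the slide.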

Page 28: Our Core Mathematical Result

The following is a local approximation result. It shows that computations from local data (k-size subsets) can approximate global computations, with a precision guarantee depending only on the local parameter k.

Theorem: For every k < n and every x1, …, xn on the unit sphere S^(n-1), there exists a subset {x_{i_1}, …, x_{i_k}} ⊆ {x1, …, xn} such that

$$\bigl\|\, Z(x_{i_1}, \dots, x_{i_k}) - Z(x_1, \dots, x_n) \,\bigr\| \;\le\; \frac{1}{\sqrt{k}}$$

Page 29: The Resulting Perceptron Algorithm

On input S, consider all k-size sub-samples. For each such sub-sample, find its largest-margin separating hyperplane. Among all the (~|S|^k) resulting hyperplanes, choose the one with the best performance on S.

(The choice of k is a function of the desired margin μ: by the theorem above, k ~ 1/μ².)
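A sketch of this procedure. Using scikit-learn's SVC with a large C as an approximate hard-margin (largest-margin) solver is my assumption; the slides do not fix an implementation:

```python
import itertools
import numpy as np
from sklearn.svm import SVC  # hard margin approximated with a large C

def subsample_perceptron(X, y, k):
    """For every k-size sub-sample of (X, y), fit its (approximately)
    largest-margin separating hyperplane, then return the hyperplane
    that classifies the most points of the full sample."""
    best_clf, best_correct = None, -1
    for idx in itertools.combinations(range(len(X)), k):
        xs, ys = X[list(idx)], y[list(idx)]
        if len(set(ys)) < 2:   # SVC needs both labels in the sub-sample
            continue
        clf = SVC(kernel="linear", C=1e6).fit(xs, ys)
        correct = int(np.sum(clf.predict(X) == y))
        if correct > best_correct:
            best_clf, best_correct = clf, correct
    return best_clf, best_correct
```

The loop over sub-samples is the ~|S|^k factor from the slide, so this is polynomial only once k is fixed by the desired margin.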

Page 30: A Different, Randomized Algorithm

Avrim Blum noticed that the 'randomized projection' algorithm of Rosa Arriaga and Santosh Vempala ('99) achieves, with high probability, a similar performance to our algorithm.
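For reference, the core step of the Arriaga-Vempala approach is a random Gaussian projection; the dimension choice and scaling below are standard assumptions, not details taken from the slide:

```python
import numpy as np

def random_projection(X, d, seed=0):
    """Map n-dimensional points to d dimensions with a random Gaussian
    matrix. For d ~ O(log|S| / mu^2), norms and margins are preserved
    up to small distortion with high probability, so a half-space
    learner can be run in the low-dimensional image instead of R^n."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], d)) / np.sqrt(d)  # keeps norms ~unchanged
    return X @ R
```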

Page 31: Directions for Further Research

Can similar efficient algorithms be derived for more complex NN architectures?

How well do the new algorithms perform on real data sets?

Can the ‘local approximation’ results be extended to more geometric functions?