More Data, Less Work: Runtime as a decreasing function of data set size
Nati Srebro, Toyota Technological Institute—Chicago
TTI UChicago June09 Workshop
Source: serializedweb.cse.ohio-state.edu/mlss09/mlss09_talks/6.june-SAT/... · 2009-06-29

TRANSCRIPT

Page 1:

More Data, Less Work:
Runtime as a decreasing function of data set size

Nati Srebro
Toyota Technological Institute—Chicago

Page 2:

Outline

• SVM  ← we are here
• Clustering
• Speculations, other problems
• Wild speculations, other problems

• SVM Optimization: Inverse Dependence on Training Set Size
  Shai Shalev-Shwartz (TTI), N. Srebro

• Fast Rates for Regularized Objectives
  Karthik Sridharan (TTI), Shai Shalev-Shwartz (TTI), N. Srebro, NIPS'08

• Pegasos: Primal Estimated sub-GrAdient SOlver for SVM
  Shai Shalev-Shwartz (TTI), Yoram Singer (Google), N. Srebro, ICML'07

• An Investigation of Computational and Informational Limits in Gaussian Mixture Clustering
  N. Srebro, Greg Shakhnarovich (TTI), Sam Roweis (Google/Toronto), ICML'06

Page 3:

Large Margin Linear Classification
aka L2-regularized Linear Classification
aka Support Vector Machines

[Figure: separating direction w with |w| = 1 and margin M; points classified by
⟨w,x⟩ ≥ M on one side and ⟨w,x⟩ ≤ -M on the other; hinge error [1 - y⟨w,x⟩]₊;
a new point "?" to be classified.]

Page 4:

Large Margin Linear Classification
aka L2-regularized Linear Classification
aka Support Vector Machines

[Figure: the same picture, rescaled so the margin is M = 1/|w|; points satisfy
⟨w,x⟩ ≥ 1 or ⟨w,x⟩ ≤ -1; hinge error [1 - y⟨w,x⟩]₊; a new point "?" to be
classified.]

Page 5:

SVM Training as an Optimization Problem

Runtime to get f(w) ≤ min f(w) + ε:

• IP method on dual (standard QP solver): O(n^3.5 log log(1/ε))
• Dual decomposition methods (e.g. SMO): O(n² d log(1/ε))  [Platt 98][Joachims 98][Lin 02]
• Primal cutting plane method (SVMperf): O(nd / (λε))  [Joachims 06][Smola et al 08]

Page 6:

More Data ⇒ More Work?

10k training examples → 1 hour → 2.3% error (when using the predictor)
1M training examples → 1 week (or more…) → 2.29% error

"But I really care about that 0.01% gain."
"My problem is so hard, I have to crunch 1M examples."

Can always sample down and get the same runtime:
1M examples → sample 10k → 1 hour → 2.3% error

Can we leverage the excess data to reduce runtime?
1M examples → 10 minutes → 2.3% error

Study runtime increase as a function of target accuracy.
Study runtime increase as a function of problem difficulty (e.g. small margin).

Page 7:

SVM Training

• Optimization objective (regularized average hinge loss):
  f(w) = λ|w|² + (1/n) Σᵢ [1 - yᵢ⟨w,xᵢ⟩]₊

• True objective: prediction error on future examples
  err(w) = E_{x,y}[error of ⟨w,x⟩ vs. y] ≈ E[[1 - y⟨w,x⟩]₊]

• Would like to understand the computational cost as:
  – an increasing function of:
    · desired generalization performance (i.e. as err(w) decreases)
    · hardness of the problem: margin, noise (unavoidable error)
  – a decreasing function of the available data set size

Page 8:

Error Decomposition

• Approximation error:
  – Best error achievable by a large-margin predictor
  – Error of the population minimizer
    w₀ = arg min E[f(w)] = arg min λ|w|² + E[loss(w)]

• Estimation error:
  – Extra error due to replacing E[loss] with the empirical loss
    w* = arg min f_n(w) = arg min λ|w|² + (loss of w on the training set)

• Optimization error:
  – Extra error due to optimizing only to within finite precision

[Figure: prediction error err(w) drawn as a bar split at err(w₀) and err(w*).]
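
Written out, the decomposition the figure illustrates is just a telescoping of
the three gaps:

```latex
\mathrm{err}(w) \;=\;
\underbrace{\mathrm{err}(w_0)}_{\text{approximation}}
\;+\;
\underbrace{\big(\mathrm{err}(w^\ast)-\mathrm{err}(w_0)\big)}_{\text{estimation}}
\;+\;
\underbrace{\big(\mathrm{err}(w)-\mathrm{err}(w^\ast)\big)}_{\text{optimization}}
```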

Page 9:

The Double-Edged Sword

• When the data set size increases:
  – Estimation error decreases
  – Can increase the optimization error, i.e. optimize to within lesser accuracy
    ⇒ fewer iterations
  – But handling more data is expensive, e.g. the runtime of each iteration increases

• PEGASOS (Primal Estimated sub-GrAdient SOlver for SVM) [Shalev-Shwartz Singer S 07]
  – Fixed runtime per iteration
  – Runtime to get fixed accuracy does not increase with n

[Figure: error decomposition err(w₀) ≤ err(w*) ≤ err(w) as a function of the
data set size n.]

Page 10:

PEGASOS: Stochastic (sub-)Gradient Descent

• Initialize w = 0

• At each iteration t, pick a random data point (xᵢ, yᵢ) and step along a
  subgradient of λ|w|² + [1 - yᵢ⟨w,xᵢ⟩]₊

• Theorem: after at most Õ(1/(λδε)) iterations, f(w_PEGASOS) ≤ min_w f(w) + ε,
  with probability ≥ 1 - δ

• With d-dimensional (or d-sparse) features, each iteration takes time O(d)

• Conclusion: runtime required for PEGASOS to find an ε-accurate solution with
  constant probability: Õ(d/(λε))

• Runtime does not depend on the number of examples
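
A minimal sketch of the update the slide describes (my own illustration, not the
authors' code); the step size 1/(λt) and the optional projection follow the
ICML'07 Pegasos paper:

```python
import numpy as np

def pegasos(X, y, lam, T, seed=0):
    """Minimal Pegasos sketch: stochastic subgradient descent on
        (lam/2)*|w|^2 + (1/n) * sum_i [1 - y_i <w, x_i>]_+
    (the paper uses the lam/2 convention; the slides write lam*|w|^2).
    X: (n, d) feature matrix; y: (n,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                 # one random training example
        eta = 1.0 / (lam * t)               # step size 1/(lambda * t)
        if y[i] * (X[i] @ w) < 1:           # hinge term is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                               # only the regularizer contributes
            w = (1 - eta * lam) * w
        # Optional projection onto the ball of radius 1/sqrt(lambda),
        # as in the ICML'07 paper.
        norm = np.linalg.norm(w)
        if norm > 1.0 / np.sqrt(lam):
            w *= 1.0 / (np.sqrt(lam) * norm)
    return w
```

Each iteration touches a single example, so with d-dimensional (or d-sparse)
features the per-iteration cost is O(d), independent of n — which is the point
of the slide.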

Page 11:

Training Time (in seconds)

                                               SVM-Light    SVM-Perf    Pegasos
                                               [Joachims]   [Joachims06]
  Reuters CCAT (800K examples, 47k features)      20,075          77         2
  Covertype (581k examples, 54 features)          25,514          85         6
  Physics ArXiv (62k examples, 100k features)         80           5         2

Page 12:

Runtime Analysis

If there is some predictor w₀ with low |w₀| and low err(w₀), how much time does
it take to find a predictor w with err(w) ≤ err(w₀) + ε?  (Large margin
M = 1/|w₀|.)

Data-laden analysis: restricted by computation, not by data. Unlimited data is
available; we can choose the working data-set size:
  λ = O(ε/|w₀|²),   ε_acc = O(ε),   n = Ω(1/(λε)) = Ω(|w₀|²/ε²)

Runtime to get err(w) ≤ err(w₀) + O(ε), ignoring log-factors:

                    Traditional:                Data laden:
                    f(w) < f(w*) + ε_acc        err(w) ≤ err(w₀) + ε
  Interior Point    n^3.5 · log log(1/ε_acc)    |w₀|⁷/ε⁷
  SMO               n² d · log(1/ε_acc)         d|w₀|⁴/ε⁴
  SVM-Perf          n d / (λ ε_acc)             d|w₀|⁴/ε⁴
  PEGASOS           d / (λ ε_acc)               d|w₀|²/ε²
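
The data-laden column follows by substituting the choices above into each
traditional bound; for example, with n = |w₀|²/ε², λ = ε/|w₀|², ε_acc = ε:

```latex
\text{SVM-Perf:}\quad
\frac{nd}{\lambda\,\epsilon_{\mathrm{acc}}}
  = \frac{(|w_0|^2/\epsilon^2)\,d}{(\epsilon/|w_0|^2)\,\epsilon}
  = \frac{d\,|w_0|^4}{\epsilon^4},
\qquad
\text{PEGASOS:}\quad
\frac{d}{\lambda\,\epsilon_{\mathrm{acc}}}
  = \frac{d}{(\epsilon/|w_0|^2)\,\epsilon}
  = \frac{d\,|w_0|^2}{\epsilon^2}.
```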

Page 13:

Dependence on Data Set Size

PEGASOS guaranteed runtime to get error err(w₀) + ε with n training points
(margin M = 1/|w₀|):

  T = Ω( d / (εM - O(1/√n))² )

• Increases for smaller target error
• Increases for smaller margin
• Decreases for larger data set

[Figures: prediction error vs. training set size, split into approximation,
estimation and optimization error; and runtime-to-target-error vs. training set
size, falling from the Minimal Training Size (statistical learning theory) down
to the Minimal Runtime (data laden).]
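
Plugging illustrative numbers into the reconstructed bound makes the shape
concrete (a sketch of my own; d, ε, M and the O(1/√n) constant c are arbitrary
choices, not values from the talk):

```python
import math

def pegasos_runtime_bound(n, d, eps, M, c=1.0):
    """T = d / (eps*M - c/sqrt(n))^2: the slide's guarantee with the
    O(1/sqrt(n)) constant set to c (an assumption). Infinite below the
    statistical minimum n ~ (c/(eps*M))^2."""
    slack = eps * M - c / math.sqrt(n)
    return d / slack**2 if slack > 0 else math.inf

d, eps, M = 100, 0.05, 0.1          # illustrative values only
n_min = (1.0 / (eps * M)) ** 2      # minimal training size: 40,000 here
for mult in (1.1, 2, 10, 100):
    n = mult * n_min
    print(f"n = {n:>12,.0f}   T <= {pegasos_runtime_bound(n, d, eps, M):,.0f}")
# Runtime falls monotonically toward the data-laden floor d/(eps*M)^2.
```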

Page 14:

Dependence on Data Set Size

[Figures: the schematics from the previous slide, alongside an experiment on
Reuters CCAT: millions of iterations (∝ runtime) needed to reach err < 5.25%,
plotted against training set size (300,000 to 700,000 examples); the iteration
count decreases as the training set grows.]

Page 15:

Dependence on Data Set Size

err(w) ≤ err(w₀) + λ|w₀|² + O(1/(λn)) + O(d/(λT))

Increase λ as the training set size increases!
• More regularization: fewer predictors allowed
• Larger approximation error err(w₀) + λ|w₀|²
• But faster runtime: T ∝ 1/λ  (see the short derivation below)
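
Treating the bound as an equality with a total error budget of ε and solving
for the runtime (a back-of-the-envelope sketch, ignoring constants and
log-factors):

```latex
\epsilon \;=\; \lambda|w_0|^2 + \frac{1}{\lambda n} + \frac{d}{\lambda T}
\quad\Longrightarrow\quad
T \;=\; \frac{d}{\lambda\epsilon \;-\; \lambda^2|w_0|^2 \;-\; 1/n}
```

For any admissible λ the denominator grows with n: more data buys a strictly
smaller runtime at the same target error, and only in the unlimited-data limit
does T reach its data-laden floor.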

[Figures: same error-decomposition schematic and Reuters CCAT plot
(err < 5.25%) as the previous slides.]

Page 16:

Dependence on Data Set Size: Traditional Optimization Approaches

[Figure: runtime vs. training set size for SMO and for SVM-Perf; for both,
runtime grows with the training set size.]

Page 17:

Dependence on Data Set Size: Traditional Optimization Approaches

[Figure: repeat of the previous slide.]

Page 18:

Beyond PEGASOS

• Other machine learning problems:
  – Kernelized SVMs
  – L1-regularization, e.g. LASSO
  – Matrix/factor models (e.g. with trace-norm regularization)
  – Multilayer / deep networks…

• Can we more explicitly leverage excess data?
  – Playing only on the error decomposition, a constant times the minimum
    sample complexity is enough to get to a constant times the minimum
    data-laden runtime.

Page 19:

Clustering (by fitting a Gaussian mixture model)

• Find centers (µ₁, …, µ_k) minimizing an objective:
  – Negative log-likelihood under a Gaussian mixture model:
    -Σᵢ log( Σⱼ exp( -(xᵢ - µⱼ)²/2 ) )
  – k-means objective ≈ negative log-likelihood of the best assignment:
    Σᵢ minⱼ (xᵢ - µⱼ)²
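
A small numpy sketch (my own illustration) of the two objectives, up to the
additive and multiplicative constants that do not affect the minimizing centers:

```python
import numpy as np

def _sq_dists(X, mu):
    """(n, k) matrix of squared distances |x_i - mu_j|^2."""
    return ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)

def gmm_neg_log_likelihood(X, mu):
    """-sum_i log sum_j exp(-|x_i - mu_j|^2 / 2): negative log-likelihood
    under an equal-weight, unit-variance Gaussian mixture, constants dropped.
    X: (n, d) points; mu: (k, d) centers."""
    sq = _sq_dists(X, mu)
    m = (-sq / 2).max(axis=1, keepdims=True)   # stabilized log-sum-exp
    return -(m.squeeze(1) + np.log(np.exp(-sq / 2 - m).sum(axis=1))).sum()

def kmeans_objective(X, mu):
    """sum_i min_j |x_i - mu_j|^2: the hard-assignment counterpart (up to 1/2)."""
    return _sq_dists(X, mu).min(axis=1).sum()
```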

Page 20:

Clustering (by fitting a Gaussian mixture model)

• Clustering is hard in the worst case
• Given LOTS of data and HUGE separation:
  – Can efficiently recover the true clustering
    [Dasgupta 99][Dasgupta Schulman 00][Arora Kannan 01][Vempala Wang 04]
    [Achlioptas McSherry 05][Kannan Salmasian Vempala 05]
  – EM works (empirically)
• With too little data, clustering is meaningless:
  – Even if we find the ML clustering, it has nothing to do with the
    underlying distribution

"Clustering isn't hard—it's either easy, or not interesting"

Page 21:

Effect of "Signal Strength"

• Not enough data: the "optimal" solution is meaningless.
• Just enough data: the optimal solution is meaningful, but hard to find?
• Lots of data: the true solution creates a distinct peak; easy to find.

[Figure: objective landscapes from smaller to larger data sets, with the
informational limit at one end and the computational limit at the other.]

Page 22:

Computational and Information Limits in Clustering

[Figure: k=16 clusters in d=512 dimensions, separation 4σ. Top panel:
clustering error (0 to 0.8) of what EM finds vs. the ML solution, as a
function of data set size (600 to 4000). Bottom panel: ML runtime vs. data set
size, from exponential down to polynomial.]

Page 23:

Dependence on dimensionality (d) and number of clusters (k)

[Figure: four panels plotting threshold sample sizes against d (for k=16 at
s=4 and s=6) and against k (for d=512 at s=4 and s=6). Each panel shows three
regimes: EM finds ML; a gap where EM fails but ML is better; and ML no good.]

Page 24:

Dependence on the cluster separation

[Figure: sample size / (dim × #clusters), log scale from 1/1000 to 10, vs.
separation s in units of σ (2.8 to 8). Three regimes: EM finds ML; a gap where
EM fails but ML is better; and ML no good. Fitted thresholds:
n = 9.7 kd/s^2.2 and n = 131 kd/s^4.8.]

Page 25:

Conclusions from Empirical Study

• With enough samples, EM does find the global ML solution, even with low
  separation
• There is an informational cost to tractability (at least when using known
  methods)
• Cost of tractability: PCA+EM+pruning (the best known method) may require
  about s² times as much data as is statistically necessary
• The cost increases as the separation increases
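
For concreteness, a small sketch evaluating the two empirical fits from the
preceding figure. Which fit plays which role is my reading of the conclusions
above (the shallower curve as the data requirement of the tractable method,
the steeper one as the informational limit); treat that assignment as an
assumption.

```python
def tractable_n(k, d, s):
    # Fitted threshold n = 9.7 * k*d / s**2.2, read here as the data needed
    # by the tractable PCA+EM+pruning pipeline (assumption, see lead-in).
    return 9.7 * k * d / s**2.2

def informational_n(k, d, s):
    # Fitted threshold n = 131 * k*d / s**4.8, read here as the data below
    # which even the ML solution is meaningless (assumption, see lead-in).
    return 131 * k * d / s**4.8

k, d = 16, 512
for s in (4, 6, 8):
    ratio = tractable_n(k, d, s) / informational_n(k, d, s)  # = (9.7/131)*s**2.6
    print(f"s={s}: tractable ~{tractable_n(k, d, s):,.0f}, "
          f"informational ~{informational_n(k, d, s):,.0f}, ratio ~{ratio:.1f}")
# The ratio grows like s**2.6 -- roughly the quadratic "cost of tractability"
# described above.
```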

Page 26:

Hardness as a Function of Dataset Size

[Figure: the dataset-size axis n, marked with regimes: "not interesting" (the
optimum does not correspond to the true solution) at small n; provably
polytime at large n; and, in between, EM* (* = empirically polytime).]

Page 27:

Hardness as a Function of Dataset Size

[Figure: the same axis, with a further region marked polytime.]

Page 28:

Hardness as a Function of Dataset Size

[Figure: the same axis, now also marking a region just past the informational
limit as hard: provably no polytime algorithm.]

Page 29:

Informational Cost of Tractability?

• Gaussian mixture clustering
• Learning the structure of dependency networks
  – Hard to find the optimal (ML) structure in the worst case [Srebro 01]
  – Polynomial-time algorithms in the large-sample limit [Chechetka Guestrin 07]
• Graph partitioning (correlation clustering)
  – Hard in the worst case
  – Easy for large graphs with a "nice" partition [McSherry 03]
• Finding cliques in random graphs
• Planted Noisy MAX-SAT

Page 30:

More Data ⇒ Less Work

• Required runtime:
  – increases with the complexity of the answer (separation, decision boundary)
  – increases with the desired accuracy
  – decreases with the amount of available data

• PEGASOS (stochastic sub-gradient descent for SVMs):
  – Runtime to get a fixed optimization accuracy doesn't depend on n
    → Best performance in the data-laden regime
    → Runtime decreases as more data is available

• Clustering:
  – Past the informational limit, extra data is needed to make the problem
    tractable
  – The cost of tractability increases roughly quadratically with the cluster
    separation

[Figure: runtime as a decreasing function of data set size.]