Why does averaging work? Yoav Freund AT&T Labs



Why does averaging work?

Yoav Freund AT&T Labs


Plan of talk

• Generative vs. non-generative modeling

• Boosting

• Boosting and over-fitting

• Bagging and over-fitting

• Applications


Toy Example

• Computer receives telephone call

• Measures pitch of voice

• Decides gender of caller

[Diagram: Human Voice → Male / Female]


Generative modeling

[Plot: probability density vs. voice pitch; two fitted Gaussians with parameters (mean1, var1) and (mean2, var2)]


Discriminative approach

[Plot: number of mistakes vs. voice pitch threshold]


Ill-behaved data

[Plot: probability density vs. voice pitch, with the fitted means mean1 and mean2, overlaid with the number of mistakes at each threshold]


Traditional Statistics vs. Machine Learning

[Diagram: Data → Estimated world state → Predictions/Actions; Statistics covers the first arrow, Decision Theory the second, and Machine Learning maps data directly to predictions and actions]


Another example

• Two dimensional data (example: pitch, volume)

• Generative model: logistic regression

• Discriminative model: separating line (“perceptron”)


Model: $P(y=1 \mid x) = \dfrac{e^{w \cdot x}}{e^{w \cdot x} + e^{-w \cdot x}}$

Find $w$ to maximize $\sum_i \log P(y_i \mid x_i)$
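A minimal sketch of this maximization by plain gradient ascent on the log-likelihood (the function name, learning rate, and step count are illustrative, not from the talk):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=1000):
    """Maximize sum_i log P(y_i | x_i) for the model
    P(y=1|x) = e^{w.x} / (e^{w.x} + e^{-w.x}), with y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # P(y_i | x_i) = sigmoid(2 y_i (w.x_i))
        p_correct = 1.0 / (1.0 + np.exp(-2.0 * y * (X @ w)))
        # d/dw log P(y_i | x_i) = 2 y_i x_i (1 - P(y_i | x_i))
        w += lr * ((2.0 * y * (1.0 - p_correct)) @ X)
    return w
```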




Comparison of methodologies

                       Generative               Discriminative
Goal                   Probability estimates    Classification rule
Performance measure    Likelihood               Misclassification rate
Mismatch problems      Outliers                 Misclassifications


Adaboost as gradient descent

• Discriminator class: a linear discriminator in the space of “weak hypotheses”

• Original goal: find the hyperplane with the smallest number of mistakes – known to be an NP-hard problem (so, assuming P ≠ NP, no algorithm runs in time polynomial in d, the dimension of the space)

• Computational method: Use exponential loss as a surrogate, perform gradient descent.

• Unforeseen benefit: Out-of-sample error improves even after in-sample error reaches zero.
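A minimal sketch of this procedure with decision stumps as the weak learners (the stump representation and helper names are illustrative, not from the talk):

```python
import numpy as np

def stump_predict(stump, X):
    """Predict in {-1, +1} with a single-feature threshold rule."""
    feat, thresh, sign = stump
    return np.where(X[:, feat] > thresh, sign, -sign)

def best_stump(X, y, D):
    """Exhaustive search for the stump with the smallest weighted error."""
    best, best_err = None, np.inf
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[:, feat]):
            for sign in (+1, -1):
                pred = stump_predict((feat, thresh, sign), X)
                err = np.sum(D * (pred != y))
                if err < best_err:
                    best, best_err = (feat, thresh, sign), err
    return best

def adaboost(X, y, T=50):
    """Coordinate-wise gradient descent on the exponential loss."""
    n = len(y)
    D = np.full(n, 1.0 / n)                        # example weights
    ensemble = []
    for _ in range(T):
        stump = best_stump(X, y, D)                # weak learner sees D
        pred = stump_predict(stump, X)
        eps = max(np.sum(D * (pred != y)), 1e-12)  # weighted error
        alpha = 0.5 * np.log((1.0 - eps) / eps)    # step size on this coordinate
        D = D * np.exp(-alpha * y * pred)          # up-weight the mistakes
        D /= D.sum()                               # D encodes the new gradient direction
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.sign(score)
```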


Margins view

$x, w \in \mathbb{R}^n$; $y \in \{-1, +1\}$; Prediction $= \mathrm{sign}(w \cdot x)$

Margin $= \dfrac{y\,(w \cdot x)}{\|w\|\,\|x\|}$

[Figure: labeled examples projected onto the direction w; cumulative number of examples vs. margin, with mistakes at negative margins and correct predictions at positive margins]
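A one-liner for these normalized margins (a sketch; the function name is illustrative):

```python
import numpy as np

def margins(w, X, y):
    """Normalized margins y (w.x) / (||w|| ||x||); negative means a mistake."""
    return y * (X @ w) / (np.linalg.norm(w) * np.linalg.norm(X, axis=1))
```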


Adaboost et al.

[Plot: loss vs. margin for the 0-1 loss and the surrogate losses of Adaboost, Logitboost, and Brownboost; mistakes lie at negative margins, correct predictions at positive margins]

Adaboost loss $= e^{-y\,(w \cdot x)}$
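For reference, the 0-1 loss and two of these surrogates as functions of the margin $z = y\,(w \cdot x)$, in their common textbook forms (the exact normalizations are assumptions, and Brownboost's time-dependent loss is omitted):

```python
import numpy as np

# Losses as functions of the margin z = y (w.x); the surrogates upper-bound 0-1.
def zero_one(z):   return (z <= 0).astype(float)    # the true objective
def adaboost(z):   return np.exp(-z)                # exponential surrogate
def logitboost(z): return np.log2(1 + np.exp(-z))   # logistic surrogate
```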


One coordinate at a time

• Adaboost performs gradient descent on exponential loss

• Adds one coordinate (“weak learner”) at each iteration.

• Weak learning in binary classification = slightly better than random guessing.

• Weak learning in regression – unclear.

• Uses example-weights to communicate the gradient direction to the weak learner

• Solves a computational problem
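To make the weights/gradient connection explicit, here is the standard derivation (not spelled out on the slide): differentiating the exponential loss with respect to a candidate coordinate shows that the example weights carry exactly the information the weak learner needs.

$$L(w) = \sum_i e^{-y_i (w \cdot x_i)}, \qquad -\frac{\partial L}{\partial w_j} = \sum_i y_i\, h_j(x_i)\, e^{-y_i (w \cdot x_i)} \;\propto\; \sum_i D(i)\, y_i\, h_j(x_i),$$

so the weak rule $h_j$ that minimizes the weighted error under $D(i) \propto e^{-y_i (w \cdot x_i)}$ gives the steepest-descent coordinate.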


Curious phenomenon: boosting decision trees

Using <10,000 training examples we fit >2,000,000 parameters


Explanation using margins

[Plot: 0-1 loss as a function of the margin]


Explanation using margins

[Plot: 0-1 loss as a function of the margin]

No examples with small margins!


Experimental Evidence


Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998)

For any convex combination $f$ of weak rules and any threshold $\theta > 0$:

$$\underbrace{P\big[y f(x) \le 0\big]}_{\text{probability of mistake}} \;\le\; \underbrace{\hat{P}\big[y f(x) \le \theta\big]}_{\text{fraction of training examples with small margin}} \;+\; \tilde{O}\!\left(\frac{1}{\theta}\sqrt{\frac{d}{m}}\right)$$

where $m$ is the size of the training sample and $d$ is the VC dimension of the weak rules.

No dependence on the number of weak rules that are combined!


Suggested optimization problem

[Figure: training margin distribution, annotated with the sample size m and VC dimension d from the theorem]


Idea of Proof


A metric space of classifiers

[Diagram: classifier space and example space; classifiers f and g at distance d]

$d(f,g) = P\big(f(x) \neq g(x)\big)$

Neighboring models make similar predictions.
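An empirical estimate of this distance on a sample (a sketch; the names are illustrative):

```python
import numpy as np

def classifier_distance(f, g, X):
    """Empirical d(f,g) = P(f(x) != g(x)), estimated on a sample X."""
    return np.mean(f(X) != g(X))
```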


Confidence zones

[Plot: number of mistakes vs. voice pitch for many bagged variants; the majority vote is certain in the zones where all variants agree and uncertain in between]

Bagging

• Averaging over all good models increases stability (decreases dependence on the particular sample)

• If there is a clear majority, the prediction is very reliable (unlike Florida)

• Bagging is a randomized algorithm for sampling the best model’s neighborhood

• Ignoring computation, averaging of the best models can be done deterministically and yields provable bounds (a sketch follows)
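A minimal sketch of bagging with a majority vote (the fit interface and function names are illustrative, not from the talk):

```python
import numpy as np

def bag(X, y, fit, n_models=50, seed=0):
    """Train n_models copies of a base learner on bootstrap resamples.
    fit(X, y) is assumed to return a callable predictor X -> {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # sample n examples with replacement
        models.append(fit(X[idx], y[idx]))
    return models

def majority_vote(models, X):
    """Mean of the {-1, +1} votes: the sign gives the prediction,
    and |mean| measures how clear the majority is."""
    votes = np.mean([m(X) for m in models], axis=0)
    return np.sign(votes), np.abs(votes)
```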


A pseudo-Bayesian method

Freund, Mansour, Schapire 2000

Define a prior distribution over rules $p(h)$.

Get $m$ training examples; for each $h$: $\mathrm{err}(h) = (\text{number of mistakes } h \text{ makes})/m$.

Weights: $w(h) = p(h)\, e^{-\eta\, \mathrm{err}(h)}$

Log odds: $l(x) = \frac{1}{\eta} \ln \frac{\sum_{h:\,h(x)=+1} w(h)}{\sum_{h:\,h(x)=-1} w(h)}$

Set the scale parameter $\eta$ and the abstention threshold $\Delta$ as functions of $m$.

$$\mathrm{Prediction}(x) = \begin{cases} +1 & \text{if } l(x) > +\Delta \\ -1 & \text{if } l(x) < -\Delta \\ \phantom{+}0 & \text{otherwise} \end{cases}$$
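A minimal sketch of this method over a finite rule class (the interface is an assumption; the exact settings of η and Δ as functions of m are garbled in the source, so they are left as parameters):

```python
import numpy as np

def pseudo_bayes(rules, X, y, prior, eta, delta):
    """rules: list of callables h(X) -> array of {-1, +1};
    prior: array of p(h). Returns a predictor with abstention."""
    preds = np.array([h(X) for h in rules])      # shape (n_rules, m)
    err = np.mean(preds != y, axis=1)            # err(h) = mistakes / m
    w = prior * np.exp(-eta * err)               # w(h) = p(h) exp(-eta err(h))

    def predict(Xnew):
        hx = np.array([h(Xnew) for h in rules])  # (n_rules, n_new)
        w_plus = (w[:, None] * (hx == +1)).sum(axis=0)
        w_minus = (w[:, None] * (hx == -1)).sum(axis=0)
        l = np.log((w_plus + 1e-300) / (w_minus + 1e-300)) / eta  # log odds
        return np.where(l > delta, +1, np.where(l < -delta, -1, 0))

    return predict
```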


Applications of Boosting

• Academic research

• Applied research

• Commercial deployment


Academic research

Database     Other (% test error)    Boosting   Error reduction
Cleveland    27.2 (DT)               16.5       39%
Promoters    22.0 (DT)               11.8       46%
Letter       13.8 (DT)                3.5       74%
Reuters 4    5.8, 6.0, 9.8            2.95     ~60%
Reuters 8    11.3, 12.1, 13.4         7.4      ~40%

(Entries are % test error rates; DT = decision tree)


Applied research

• “AT&T, How may I help you?”

• Classify voice requests

• Voice → text → category

• Fourteen categories: area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge, time

Schapire, Singer, Gorin 98


Examples

• “Yes I’d like to place a collect call long distance please” → collect

• “Operator I need to make a call but I need to bill it to my office” → third party

• “Yes I’d like to place a call on my master card please” → calling card

• “I just called a number in Sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill” → billing credit


Weak rules generated by “boostexter”

[Table: each weak rule tests whether a word occurs; for each category (calling card, collect call, third party) the rule outputs one prediction when the word occurs and another when it does not]


Results

• 7844 training examples (hand transcribed)

• 1000 test examples (hand / machine transcribed)

• Accuracy with 20% rejected:
  – Machine transcribed: 75%
  – Hand transcribed: 90%


Commercial deployment

• Distinguish business/residence customers

• Using statistics from call-detail records

• Alternating decision trees (sketched below)
  – Similar to boosting decision trees, but more flexible
  – Combines very simple rules
  – Can over-fit; cross-validation is used to stop

Freund, Mason, Rogers, Pregibon, Cortes 2000
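A minimal sketch of how an alternating decision tree scores an instance, assuming a flat list-of-rules representation (the representation and names are illustrative, not from the paper):

```python
from typing import Callable, List, Tuple

# Each rule: (precondition, condition, value_if_true, value_if_false).
Rule = Tuple[Callable, Callable, float, float]

def adt_score(rules: List[Rule], base_score: float, x) -> float:
    """Sum the contributions of every rule whose precondition holds;
    sign(score) gives the class, |score| a confidence."""
    score = base_score
    for pre, cond, v_true, v_false in rules:
        if pre(x):
            score += v_true if cond(x) else v_false
    return score
```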


Massive datasets

• 260M calls / day

• 230M telephone numbers

• Label unknown for ~30%

• Hancock: software for computing statistical signatures.

• 100K randomly selected training examples; ~10K is enough

• Training takes about 2 hours.

• Generated classifier has to be both accurate and efficient


Alternating tree for “buizocity”


Alternating Tree (Detail)


Precision/recall graphs

[Plot: accuracy as a function of score]


Business impact

• Increased coverage from 44% to 56%

• Accuracy ~94%

• Saved AT&T $15M in the year 2000 in operations costs and missed opportunities


Summary

• Boosting is a computational method for learning accurate classifiers

• Resistance to over-fitting is explained by margins

• Underlying explanation: large “neighborhoods” of good classifiers

• Boosting has been applied successfully to a variety of classification problems


Please come talk with me

• Questions welcome!

• Data, even better!!

• Potential collaborations, yes!!!

[email protected]

• www.predictiontools.com
