Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009


Page 1: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Oznur Tastan

10-601 Machine Learning
Recitation 3
Sep 16, 2009

Page 2: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Outline

• A text classification example
  – Multinomial distribution
  – Dirichlet distribution

• Model selection
  – Miro will continue with this topic

Page 3: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Text classification example

Page 4: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Text classification

• We have not covered classification yet.
• For the sake of this example, I'll briefly go over what it is.

Classification task: given an input x, predict which label y it has from some fixed set of labels {y1, ..., yk}.

Page 5: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Text classification: spam filtering
Input: document D
Output: the predicted class y from {y1, ..., yk}

Spam filtering:

Classify email as ‘Spam’, ‘Other’.

P(Y = spam | X)

Page 6: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Text classification
Input: document D
Output: the predicted class y from {y1, ..., yk}

Text classification examples:

Classify email as ‘Spam’, ‘Other’.

What other text classification applications can you think of?

Page 7: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Text classification
Input: document x
Output: the predicted class y, where y is from {y1, ..., yk}

Text classification examples:

Classify email as ‘Spam’, ‘Other’.

Classify web pages as ‘Student’, ‘Faculty’, ‘Other’

Classify news stories into topics: ‘Sports’, ‘Politics’, ...

Classify business names by industry.

Classify movie reviews as ‘Favorable’, ‘Unfavorable’, ‘Neutral’ … and many more.

Page 8: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Text Classification: Examples
Classify shipment articles into one of 93 categories. An example category: ‘wheat’.

ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
BUENOS AIRES, Feb 26
Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:

Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0). Maize Mar 48.0, total 48.0 (nil). Sorghum nil (nil). Oilseed export registrations were: Sunflowerseed total 15.0 (7.9), Soybean May 20.0, total 20.0 (nil).
The board also detailed export registrations for subproducts, as follows...

Page 9: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Representing text for classification

(The same ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS article as on the previous slide.)

How would you represent the document?

Page 10: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Representing text: a list of words

argentine, 1986, 1987, grain, oilseed, registrations, buenos, aires, feb, 26, argentine, grain, board, figures, show, crop, registrations, of, grains, oilseeds, and, their, products, to, february, 11, in, …

Common refinements: remove stopwords, stemming, collapsing multiple occurrences of words into one….


Page 11: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Representing text for classification

(The same article as above.)

How would you represent the document?

Page 12: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

‘Bag of words’ representation of text

(The same ARGENTINE GRAIN/OILSEED REGISTRATIONS article as above.)

word         frequency
grain(s)     3
oilseed(s)   2
total        3
wheat        1
maize        1
soybean      1
tonnes       1
...          ...

Bag-of-words representation:

Represent text as a vector of word frequencies.
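
As a rough illustration (my own sketch, not from the slides), a word-frequency vector can be built in a few lines of Python; the regex tokenizer and the tiny stopword list here are simplifying assumptions.

import re
from collections import Counter

STOPWORDS = {"of", "to", "and", "their", "in", "the", "for"}   # illustrative stopword list

def bag_of_words(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

doc = "Argentine grain board figures show crop registrations of grains, oilseeds and their products"
print(bag_of_words(doc))
# Counter({'argentine': 1, 'grain': 1, 'grains': 1, ...}); stemming would merge grain/grains.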

Page 13: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Bag of words representation

A collection of documents can be arranged as a matrix in which entry Frequency(i, j) is the count of word j in document i.

Page 14: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Bag of words

What simplifying assumption are we making?

Page 15: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Bag of words

What simplifying assumption are we making?

We assumed word order is not important.

Page 16: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

‘Bag of words’ representation of text

(The same article and word-frequency table as above.)

Pr(D | Y = y) = ?

Pr(W1 = n1, ..., Wk = nk | Y = y)

Page 17: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Multinomial distribution
• The multinomial distribution is a generalization of the binomial distribution.

• The binomial distribution counts successes of an event (for example, heads in coin tosses).

• The parameters:
  – N (the number of trials)
  – θ (the probability of success of the event)

• The multinomial counts the occurrences of a set of events (for example, how many times each side of a die comes up in a set of rolls).
  – The parameters:
  – N (the number of trials)
  – θ1, ..., θk (the probability of each category)

Page 18: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Multinomial Distribution
A box contains balls of k possible colors. You select N balls at random and put them into your bag.

For each color i = 1, ..., k, let θi be the probability of picking a ball of color i.

Let Wi be the random variable denoting the number of selected balls of color i; it can take values in {0, ..., N}.

Page 19: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Multinomial Distribution

W1, W2, ..., Wk are random variables.

P(W1 = n1, W2 = n2, ..., Wk = nk | N, θ1, θ2, ..., θk) = [ N! / (n1! n2! ... nk!) ] θ1^n1 θ2^n2 ... θk^nk

where Σi ni = N and Σi θi = 1.

N! is the number of possible orderings of the N balls; dividing by n1! n2! ... nk! gives the number of order-invariant selections.

Note that the draws are independent.

A binomial distribution is the multinomial distribution with k = 2 and θ1 = θ, θ2 = 1 - θ.
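
A minimal sketch (not part of the recitation) that evaluates this pmf directly from factorials, assuming the thetas sum to 1:

from math import factorial

def multinomial_pmf(counts, thetas):
    """P(W1=n1, ..., Wk=nk | N, theta1..thetak), where N = sum(counts)."""
    coeff = factorial(sum(counts))
    for n in counts:
        coeff //= factorial(n)                    # N! / (n1! n2! ... nk!)
    prob = 1.0
    for n, theta in zip(counts, thetas):
        prob *= theta ** n                        # theta1^n1 * ... * thetak^nk
    return coeff * prob

# A fair die rolled 6 times, each face coming up exactly once:
print(multinomial_pmf([1, 1, 1, 1, 1, 1], [1 / 6] * 6))   # 6!/6**6, about 0.0154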

Page 20: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

‘Bag of words’ representation of text

(The same article and word-frequency table as above.)

Page 21: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

‘Bag of words’ representation of text

(The same word-frequency table as above.)

The word-frequency vector can be represented with a multinomial distribution.

Words = colored balls; there are k possible types of them.

Document = contains N words; word i occurs ni times.

The multinomial distribution of words is going to be different for each document class.

In the ‘wheat’ document class, ‘grain’ is more likely, whereas in a hard-drive shipment class the parameter for ‘wheat’ is going to be smaller.

Page 22: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Multinomial distribution and bag of words
Represent document D as a list of words w1, w2, ...

For each category y, build a probabilistic model Pr(D|Y=y)

Pr(D = {argentine, grain, ...} | Y = wheat) = ...
Pr(D = {stocks, rose, in, heavy, ...} | Y = nonWheat) = ...

To classify, find the y which was most likely to generate D.
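
A hedged sketch of this classification rule with invented toy parameters (the words, probabilities, and class priors below are purely illustrative, not estimates from the Reuters data). Log-probabilities are used to avoid underflow; the multinomial coefficient is omitted because it is the same for every class.

import math

theta = {   # hypothetical per-class word probabilities
    "wheat":    {"grain": 0.30, "wheat": 0.30, "stocks": 0.05, "rose": 0.05, "other": 0.30},
    "nonWheat": {"grain": 0.05, "wheat": 0.05, "stocks": 0.30, "rose": 0.30, "other": 0.30},
}
prior = {"wheat": 0.5, "nonWheat": 0.5}

def classify(word_counts):
    """Return argmax_y [ log Pr(Y=y) + sum_w n_w * log theta_{y,w} ]."""
    def score(y):
        s = math.log(prior[y])
        for w, n in word_counts.items():
            s += n * math.log(theta[y].get(w, theta[y]["other"]))
        return s
    return max(theta, key=score)

print(classify({"grain": 3, "wheat": 1}))    # picks 'wheat'
print(classify({"stocks": 2, "rose": 1}))    # picks 'nonWheat'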

Page 23: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Conjugate distribution

• If the prior and the posterior belong to the same family of distributions, the prior is called a conjugate prior for the likelihood.

• The Dirichlet distribution is the conjugate prior for the multinomial, just as the beta is the conjugate prior for the binomial.

Page 24: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Dirichlet distribution

The Dirichlet parameter αi can be thought of as a prior count of the ith class.

The Dirichlet distribution generalizes the beta distribution, just as the multinomial distribution generalizes the binomial distribution.

The Dirichlet density involves the gamma function Γ:
p(θ1, ..., θk | α1, ..., αk) = [ Γ(α1 + ... + αk) / (Γ(α1) ... Γ(αk)) ] θ1^(α1 - 1) ... θk^(αk - 1)

Page 25: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Dirichlet Distribution

Let's say the prior for θ1, ..., θk is Dir(α1, ..., αk).

From observations we have the counts n1, ..., nk.

The posterior distribution for θ1, ..., θk given the data is Dir(α1 + n1, ..., αk + nk).

So the prior works like pseudo-counts.
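
A tiny worked example (the numbers are invented for illustration, not from the slides):

% Prior Dir(\alpha_1,\alpha_2,\alpha_3) = Dir(1,1,1), observed counts (n_1,n_2,n_3) = (3,0,1).
\mathrm{Dir}(\alpha_1+n_1,\ \alpha_2+n_2,\ \alpha_3+n_3) = \mathrm{Dir}(4,\,1,\,2),
\qquad
\mathbb{E}[\theta_i \mid \text{data}] = \frac{\alpha_i+n_i}{\sum_j (\alpha_j+n_j)}
  = \left(\tfrac{4}{7},\ \tfrac{1}{7},\ \tfrac{2}{7}\right).

Even the class with zero observed counts keeps probability 1/7; this is the pseudo-count effect discussed on the next slide.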

Page 26: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Pseudo Count and prior
• Let's say you estimated the probabilities from a collection of documents without using a prior.

• For all words unobserved in your document collection, you would assign zero probability to that word occurring in that document class. So whenever a document containing that word comes in, the probability of that document belonging to that class will be zero, which is probably wrong when you have only limited data.

• Using priors is a way of smoothing the probability distributions, reserving some probability mass for the events unobserved in your data.
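
A minimal sketch of this smoothing effect, assuming a simple add-alpha (Dirichlet pseudo-count) estimate; the vocabulary and counts are made up:

def smoothed_word_probs(counts, vocab, alpha=1.0):
    """Posterior-mean estimate: (n_w + alpha) / (N + alpha * |vocab|) for each word w."""
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}

vocab = ["grain", "wheat", "stocks"]
wheat_class_counts = {"grain": 3, "wheat": 1}            # 'stocks' never seen in this class
print(smoothed_word_probs(wheat_class_counts, vocab))
# {'grain': 0.571..., 'wheat': 0.285..., 'stocks': 0.142...}; the unseen word keeps nonzero mass.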

Page 27: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Generative model

(Plate diagram of the generative model.)

W: word
C: document class generating the word
N: number of words in the document
D: collection of documents
A matrix of multinomial parameters, one vector per document class, with a Dirichlet prior.

Page 28: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Model Selection

Page 29: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Polynomial Curve Fitting. Blue: observed data; Green: true distribution.

Page 30: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Sum-of-Squares Error Function
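
A rough sketch of the experiment behind these figures (my own code with synthetic data, not the original slides' code): fit polynomials of increasing order by least squares and report the sum-of-squares training error.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy samples of a smooth 'true' curve

for order in (0, 1, 3, 9):
    w = np.polyfit(x, t, deg=order)                              # least-squares polynomial fit
    sse = 0.5 * np.sum((np.polyval(w, x) - t) ** 2)              # sum-of-squares error on training data
    print(f"order {order}: training SSE = {sse:.4f}")
# The training error keeps shrinking as the order grows, but the order-9 fit will usually overfit.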

Page 31: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

0th Order Polynomial. Blue: observed data; Red: predicted curve; Green: true distribution.

Page 32: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

1st Order Polynomial. Blue: observed data; Red: predicted curve; Green: true distribution.

Page 33: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

3rd Order Polynomial. Blue: observed data; Red: predicted curve; Green: true distribution.

Page 34: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

9th Order Polynomial. Blue: observed data; Red: predicted curve; Green: true distribution.

Page 35: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Which of the predicted curves is better?

Blue: observed data; Red: predicted curve; Green: true distribution.

Page 36: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

What do we really want?

Why not choose the method with the best fit to the data?

Page 37: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

What do we really want?

Why not choose the method with the best fit to the data?

If we were to ask you the homework questions in the midterm, would we have a good estimate of how well you learned the concepts?

Page 38: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

What do we really want?

Why not choose the method with the best fit to the data?

How well are you going to predict future data drawn from the same distribution?

Page 39: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Example

Page 40: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

General strategy
You try to simulate the real-world scenario. Test data is your future data. Put it away as far as possible; don't look at it.

The validation set is like your test set. You use it to select your model. The whole aim is to estimate the model's true error using the sample data you have.

!!! For the rest of the slides, assume we have already put the test data away. Consider it the validation data whenever it says test set.

Page 41: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Test set method
• Randomly split off some portion of your data and leave it aside as the test set
• The remaining data is the training data

Page 42: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Test set method
• Randomly split off some portion of your data and leave it aside as the test set
• The remaining data is the training data
• Learn a model from the training set

This is the model you learned.

Page 43: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

How good is the prediction?
• Randomly split off some portion of your data and leave it aside as the test set
• The remaining data is the training data
• Learn a model from the training set
• Estimate your future performance with the test data
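
A small sketch of the test set method (an assumed setup, reusing the synthetic polynomial data from before): split at random, learn only on the training part, and estimate future performance on the held-out part.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

idx = rng.permutation(x.size)                     # random split
test_idx, train_idx = idx[:9], idx[9:]            # roughly 30% test, 70% train

w = np.polyfit(x[train_idx], t[train_idx], deg=3)                     # learn from training data only
test_mse = np.mean((np.polyval(w, x[test_idx]) - t[test_idx]) ** 2)   # estimate of future error
print(f"held-out MSE: {test_mse:.4f}")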

Page 44: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Train test set split

It is simple. What is the downside?

Page 45: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

More data is better. With more data you can learn better.

Blue: observed data; Red: predicted curve; Green: true distribution.

Compare the predicted curves

Page 46: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Train test set split

It is simple. What is the downside?

1. You waste some portion of your data.

Page 47: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Train test set split

It is simple. What is the downside?

1. You waste some portion of your data.

What else?

Page 48: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Train test set split

It is simple. What is the downside?

1. You waste some portion of your data.
2. You might get lucky or unlucky with your test data.

Page 49: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Train test set split

It is simple. What is the downside?

1. You waste some portion of your data.
2. If you don't have much data, you might get lucky or unlucky with your test data.

How does it translate to statistics? Your estimator of performance has …?

Page 50: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Train/test set split

It is simple. What is the downside?

1. You waste some portion of your data.
2. If you don't have much data, you might get lucky or unlucky with your test data.

How does it translate to statistics? Your estimator of performance has high variance

Page 51: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Cross Validation

Recycle the data!

Page 52: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

LOOCV (Leave-one-out Cross Validation)

Your single test data

Let's say we have N data points; let k be the index for data points, k = 1..N.

Let (xk,yk) be the kth record

Temporarily remove (xk,yk) from the dataset

Train on the remaining N-1 data points.

Test your error on (xk,yk)

Do this for each k=1..N and report the mean error.
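
A compact sketch of the LOOCV loop (my own code on synthetic data), following the procedure above:

import numpy as np

def loocv_mse(x, t, degree):
    """Leave one point out, train on the remaining N-1, test on the held-out point, average."""
    errors = []
    for k in range(x.size):
        x_train, t_train = np.delete(x, k), np.delete(t, k)
        w = np.polyfit(x_train, t_train, deg=degree)
        errors.append((np.polyval(w, x[k]) - t[k]) ** 2)
    return np.mean(errors)

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 15)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
for degree in (1, 3, 9):
    print(f"degree {degree}: LOOCV MSE = {loocv_mse(x, t, degree):.4f}")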

Page 53: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

LOOCV (Leave-one-out Cross Validation)

There are N data points. Do this N times. Notice the test data is changing each time.

Page 54: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

LOOCV (Leave-one-out Cross Validation)

There are N data points. Do this N times. Notice the test data is changing each time.

MSE=3.33

Page 55: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

LOOCV (Leave-one-out Cross Validation)

Let's say we have N data points; let k be the index for data points, k = 1..N.

Let (xk,yk) be the kth record

Temporarily remove (xk,yk) from the dataset

Train on the remaining N-1 data points.

Test your error on (xk,yk)

Do this for each k=1..N and report the mean error.

Page 56: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

K-fold cross validation

k-fold: the data is split into k parts (train / test); each part is used as the test set once.

Train on (k - 1) splits; test on the remaining split.

In 3-fold cross validation, there are 3 runs.
In 5-fold cross validation, there are 5 runs.
In 10-fold cross validation, there are 10 runs.

The error is averaged over all runs.
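
A short sketch of k-fold cross validation (again on synthetic data; the fold assignment and the polynomial model are assumptions for illustration):

import numpy as np

def kfold_mse(x, t, degree, k=5, seed=0):
    """Split indices into k folds; each fold is the test set once; average the k errors."""
    folds = np.array_split(np.random.default_rng(seed).permutation(x.size), k)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(x.size), test_idx)
        w = np.polyfit(x[train_idx], t[train_idx], deg=degree)
        errors.append(np.mean((np.polyval(w, x[test_idx]) - t[test_idx]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
print(f"3-fold MSE: {kfold_mse(x, t, degree=3, k=3):.4f}")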

Page 57: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

Model Selection
In-sample error estimates:
Akaike Information Criterion (AIC)
Bayesian Information Criterion (BIC)
Minimum Description Length Principle (MDL)
Structural Risk Minimization (SRM)

Extra-sample error estimates:
Cross-Validation (the most used method)
Bootstrap

Page 58: Oznur Tastan 10601 Machine Learning Recitation 3 Sep 16 2009

References

• http://videolectures.net/mlas06_cohen_tc/
• http://www.autonlab.org/tutorials/overfit.html