TRANSCRIPT
Empirical Research Methods in Computer Science
Lecture 7, November 30, 2005, Noah Smith
Using Data

Data → Model: estimation; regression; learning; training
Model → Action: classification; decision

Other names for this: pattern classification, machine learning, statistical inference, ...
Probabilistic Models
Let X and Y be random variables (continuous, discrete, structured, ...).
Goal: predict Y from X.
A model defines P(Y = y | X = x).
1. Where do models come from?
2. If we have a model, how do we use it?
Using a Model
We want to classify a message, x, as spam or mail: y ∈ {spam, mail}.

The model maps x to P(spam | x) and P(mail | x).

Decision rule: ŷ = spam if P(spam | x) ≥ P(mail | x); otherwise ŷ = mail.
Bayes’ Rule
P(y | x) = P(x | y) P(y) / P(x)

P(y | x): what we said the model must define
P(x | y): likelihood; one distribution over complex observations per y
P(y): prior
P(x) = Σ_{y'} P(y') P(x | y'): normalizes into a distribution
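As a quick sanity check, Bayes' rule can be computed directly. The prior below uses the multinomial example from later in the lecture (P(mail) = 0.545, P(spam) = 0.455); the likelihood values for a particular message x are invented for illustration.

```python
# Bayes' rule for the spam example: P(y | x) = P(x | y) P(y) / P(x).
# Prior from the lecture's multinomial example; likelihoods are made up.

prior = {"spam": 0.455, "mail": 0.545}          # P(y)
likelihood = {"spam": 0.008, "mail": 0.001}     # P(x | y) for one message x (hypothetical)

# P(x) = sum over y' of P(y') P(x | y') -- the normalizer
p_x = sum(prior[y] * likelihood[y] for y in prior)

# P(y | x) = P(x | y) P(y) / P(x)
posterior = {y: likelihood[y] * prior[y] / p_x for y in prior}

print(posterior)  # a proper distribution over {spam, mail}
```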
Naive Bayes Models
Suppose X = (X1, X2, X3, ..., Xm).
Let P(x | y) = Π_{i=1..m} P(xi | y).
Naive Bayes: Graphical Model
Y → X1, X2, X3, ..., Xm
(each feature Xi depends only on the class Y)
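The factored model can be sketched as a tiny classifier: ŷ = argmax_y P(y) Π_i P(xi | y), computed in log space. The prior reuses the lecture's multinomial example; the conditional tables and words below are invented so the sketch runs.

```python
import math

# Naive Bayes: P(x | y) = prod_i P(x_i | y); classify by the class with the
# largest joint P(y) P(x | y). Tables below are hypothetical toy numbers.

prior = {"spam": 0.455, "mail": 0.545}
cond = {  # P(x_i | y) for each observed word, per class (invented)
    "spam": {"free": 0.30, "meeting": 0.05, "offer": 0.20},
    "mail": {"free": 0.05, "meeting": 0.40, "offer": 0.01},
}

def log_joint(x, y):
    # log P(y) + sum_i log P(x_i | y); logs avoid underflow for long messages
    return math.log(prior[y]) + sum(math.log(cond[y][xi]) for xi in x)

def classify(x):
    # y-hat = argmax_y P(y) P(x | y), equivalent to argmax of the posterior
    return max(prior, key=lambda y: log_joint(x, y))

print(classify(["free", "offer"]))   # leans spam
print(classify(["meeting"]))         # leans mail
```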
Part II
Where do the model parameters come from?
Using Data
Data → Model: estimation; regression; learning; training
Model → Action
Warning
This is a HUGE topic. We will barely scratch the surface.
Forms of Models
Recall that a model defines P(x | y) and P(y).
These can have a simple multinomial form, like
P(mail) = 0.545, P(spam) = 0.455
Or they can take on some other form, like a binomial, Gaussian, etc.
Example: Gaussian
Suppose y is {male, female}, and one observed variable is H, height.
P(H | male) ~ N(μm, σm²)
P(H | female) ~ N(μf, σf²)

How do we estimate μm, σm², μf, σf²?
Maximum Likelihood
Pick the model that makes the data as likely as possible:

max_model P(data | model)
Maximum Likelihood (Gaussian)
Estimating the parameters μm, σm², μf, σf² can be seen as:
fitting the data
estimating an underlying statistic (point estimate)

μ̂m = (1 / #males) Σ_{i=1..n} [yi = male] hi

σ̂m² = (1 / #males) Σ_{i=1..n} [yi = male] (hi − μ̂m)²
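The two estimators above are just the per-class sample mean and the (biased, divide-by-n) sample variance. A sketch with invented heights:

```python
# MLE for the class-conditional Gaussians, following the slide's estimators:
# mu-hat = mean of heights within the class; sigma-hat^2 = mean squared
# deviation within the class. The heights below are invented toy data.

heights = [(1.80, "male"), (1.75, "male"), (1.68, "male"),
           (1.62, "female"), (1.70, "female"), (1.58, "female")]

def gaussian_mle(data, label):
    vals = [h for h, y in data if y == label]
    mu = sum(vals) / len(vals)
    var = sum((h - mu) ** 2 for h in vals) / len(vals)  # MLE divides by n, not n-1
    return mu, var

mu_m, var_m = gaussian_mle(heights, "male")
mu_f, var_f = gaussian_mle(heights, "female")
print(mu_m, var_m, mu_f, var_f)
```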
Using the model
[Figure: the two fitted densities p(H | male) and p(H | female) plotted against H.]
Using the model
[Figure: the posteriors P(male | H) and P(female | H) plotted against H.]
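Putting the pieces together, Bayes' rule turns the fitted class-conditional Gaussians into the posteriors P(male | H) and P(female | H). The parameters and priors below are invented for illustration.

```python
import math

# From class-conditional Gaussians to posteriors via Bayes' rule.
# Parameters (mu, sigma^2) and the uniform prior are hypothetical.

def gaussian_pdf(h, mu, var):
    return math.exp(-(h - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

params = {"male": (1.78, 0.005), "female": (1.63, 0.005)}  # invented
prior = {"male": 0.5, "female": 0.5}

def posterior(h):
    joint = {y: prior[y] * gaussian_pdf(h, *params[y]) for y in prior}
    z = sum(joint.values())  # p(h), the normalizer
    return {y: joint[y] / z for y in joint}

print(posterior(1.80))  # mostly "male"
print(posterior(1.60))  # mostly "female"
```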
Example: Regression
Suppose y is actual runtime, and x is input length.
Regression tries to predict some continuous variables from others.
Regression
Linear: assume linear relationship, fit a line.
We can turn this into a model!
Linear Model
Given x, predict y.
y = β1 x + β0 + ε,  where ε ~ N(0, σ²)

β1 x + β0 is the true regression line; ε is a random deviation.
Principle of Least Squares
Minimize the sum of squared vertical deviations.
Unique, closed form solution!
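The closed-form least-squares solution can be written down directly. A sketch with invented data whose true relationship is roughly y = 2x:

```python
# Closed-form least-squares fit for y = b1*x + b0: the unique minimizer of
# the summed squared vertical deviations. Toy data, invented.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2); b0 = mean_y - b1*mean_x
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x

print(b1, b0)  # close to slope 2, intercept 0
```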
Other kinds of regression
transform one or both variables (e.g., take a log)
polynomial regression (least squares still reduces to a linear system)
multivariate regression
logistic regression
Example: text categorization
Bag-of-words model:
x is a histogram of counts for all words
y is a topic

P(x | y) = Π_w p_uni(w | y)^count(w; x)
MLE for Multinomials
“Count and Normalize”
p̂_uni(w | y) = count(w; training) / count(*; training)
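"Count and normalize" in code, on an invented two-class toy corpus; counts are taken over the training documents of each class separately:

```python
from collections import Counter

# MLE for a multinomial over words, per class:
# p-hat(w | y) = count(w; class-y training docs) / count(*; class-y training docs).
# The tiny corpus below is invented.

training = {
    "spam": ["free money free offer".split(), "free offer now".split()],
    "mail": ["meeting at noon".split(), "lunch meeting tomorrow".split()],
}

def mle_unigram(docs):
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())           # count(*): all tokens
    return {w: c / total for w, c in counts.items()}

p_spam = mle_unigram(training["spam"])
print(p_spam["free"])  # 3 of the 7 spam tokens
```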
The Truth about MLE
You will never see all the words.
For many models, MLE isn’t safe.
To understand why, consider a typical evaluation scenario.
Evaluation
Train your model on some data.
How good is the model?
Test on different data that the system never saw before. Why?
Tradeoff
Overfit the training data, and the model doesn't generalize; constrain the model for low variance, and it may achieve low accuracy on the training data.
Text categorization again
Suppose ‘v1@gra’ never appeared in any document in training, ever.
What is the above probability for a new document containing ‘v1@gra’ at test time?
P(x | y) = Π_w p_uni(w | y)^count(w; x)

Since p_uni(‘v1@gra’ | y) = 0 under the MLE, the entire product is zero.
Solutions
Regularization: prefer less extreme parameters.
Smoothing: "flatten out" the distribution.
Bayesian estimation: construct a prior over model parameters, then train to maximize P(data | model) × P(model).
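As one concrete instance of smoothing, here is add-one (Laplace) smoothing; this is a sketch of the general idea, not necessarily the specific method prescribed in the course:

```python
from collections import Counter

# Add-one (Laplace) smoothing: p(w | y) = (count(w) + 1) / (total + |V|).
# Unseen words get nonzero probability instead of zeroing out the whole
# product. Corpus and vocabulary are invented.

docs = ["free money free offer".split(), "free offer now".split()]
vocab = {"free", "money", "offer", "now", "v1@gra"}  # includes an unseen word

counts = Counter(w for doc in docs for w in doc)
total = sum(counts.values())

def p_smoothed(w):
    return (counts[w] + 1) / (total + len(vocab))

print(p_smoothed("free"))    # (3 + 1) / (7 + 5)
print(p_smoothed("v1@gra"))  # (0 + 1) / (7 + 5), no longer zero
```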
One More Point
Building models is not the only way to be empirical: neural networks, SVMs, instance-based learning.
MLE and smoothed/Bayesian estimation are not the only ways to estimate: minimize error, for example ("discriminative" estimation).
Assignment 3
Spam detection.
We provide a few thousand examples.
Perform EDA and pick features.
Estimate probabilities.
Build a Naive Bayes classifier.