TRANSCRIPT
Empirical Research Methods in Computer Science
Lecture 7, November 30, 2005, Noah Smith
Using Data

Data → Model: estimation; regression; learning; training
Model → Action: classification; decision

Other names for this: pattern classification, machine learning, statistical inference, ...
Probabilistic Models
Let X and Y be random variables (continuous, discrete, structured, ...).
Goal: predict Y from X.
A model defines P(Y = y | X = x).
1. Where do models come from?
2. If we have a model, how do we use it?
Using a Model
We want to classify a message, x, as spam or mail: y ∈ {spam, mail}.

The model maps x to P(spam | x) and P(mail | x).

Decision rule: ŷ = spam if P(spam | x) ≥ P(mail | x); otherwise ŷ = mail.
Bayes’ Rule
P(y | x) = P(x | y) P(y) / P(x)

P(y | x): what we said the model must define
P(x | y): likelihood; one distribution over complex observations per y
P(y): prior
P(x) = Σ_{y'} P(y') P(x | y'): normalizes into a distribution
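As a quick sanity check, Bayes' rule can be computed directly. The prior below uses the multinomial example from later in the lecture (P(mail) = 0.545, P(spam) = 0.455); the likelihood values for a particular message x are invented for illustration.

```python
# Bayes' rule for the spam example: P(y | x) = P(x | y) P(y) / P(x).
# Prior from the lecture's multinomial example; likelihoods are made up.

prior = {"spam": 0.455, "mail": 0.545}          # P(y)
likelihood = {"spam": 0.008, "mail": 0.001}     # P(x | y) for one message x (hypothetical)

# P(x) = sum over y' of P(y') P(x | y') -- the normalizer
p_x = sum(prior[y] * likelihood[y] for y in prior)

# P(y | x) = P(x | y) P(y) / P(x)
posterior = {y: likelihood[y] * prior[y] / p_x for y in prior}

print(posterior)  # a proper distribution over {spam, mail}
```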
Naive Bayes Models
Suppose X = (X1, X2, X3, ..., Xm).
Let P(x | y) = Π_{i=1..m} P(xi | y).
Naive Bayes: Graphical Model
Y → X1, X2, X3, ..., Xm
(each feature Xi depends only on the class Y)
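The factored model can be sketched as a tiny classifier: ŷ = argmax_y P(y) Π_i P(xi | y), computed in log space. The prior reuses the lecture's multinomial example; the conditional tables and words below are invented so the sketch runs.

```python
import math

# Naive Bayes: P(x | y) = prod_i P(x_i | y); classify by the class with the
# largest joint P(y) P(x | y). Tables below are hypothetical toy numbers.

prior = {"spam": 0.455, "mail": 0.545}
cond = {  # P(x_i | y) for each observed word, per class (invented)
    "spam": {"free": 0.30, "meeting": 0.05, "offer": 0.20},
    "mail": {"free": 0.05, "meeting": 0.40, "offer": 0.01},
}

def log_joint(x, y):
    # log P(y) + sum_i log P(x_i | y); logs avoid underflow for long messages
    return math.log(prior[y]) + sum(math.log(cond[y][xi]) for xi in x)

def classify(x):
    # y-hat = argmax_y P(y) P(x | y), equivalent to argmax of the posterior
    return max(prior, key=lambda y: log_joint(x, y))

print(classify(["free", "offer"]))   # leans spam
print(classify(["meeting"]))         # leans mail
```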
Part II
Where do the model parameters come from?
Using Data
Data → Model: estimation; regression; learning; training
Model → Action
Warning
This is a HUGE topic. We will barely scratch the surface.
Forms of Models
Recall that a model defines P(x | y) and P(y).
These can have a simple multinomial form, like
P(mail) = 0.545, P(spam) = 0.455
Or they can take on some other form, like a binomial, Gaussian, etc.
Example: Gaussian
Suppose y is {male, female}, and one observed variable is H, height.
P(H | male) ~ N(μm, σm²)
P(H | female) ~ N(μf, σf²)

How do we estimate μm, σm², μf, σf²?
Maximum Likelihood
Pick the model that makes the data as likely as possible:

max_model P(data | model)
Maximum Likelihood (Gaussian)
Estimating the parameters μm, σm², μf, σf² can be seen as:
fitting the data
estimating an underlying statistic (point estimate)

μ̂m = (1 / #males) Σ_{i=1..n} [yi = male] hi

σ̂m² = (1 / #males) Σ_{i=1..n} [yi = male] (hi − μ̂m)²
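The two estimators above are just the per-class sample mean and the (biased, divide-by-n) sample variance. A sketch with invented heights:

```python
# MLE for the class-conditional Gaussians, following the slide's estimators:
# mu-hat = mean of heights within the class; sigma-hat^2 = mean squared
# deviation within the class. The heights below are invented toy data.

heights = [(1.80, "male"), (1.75, "male"), (1.68, "male"),
           (1.62, "female"), (1.70, "female"), (1.58, "female")]

def gaussian_mle(data, label):
    vals = [h for h, y in data if y == label]
    mu = sum(vals) / len(vals)
    var = sum((h - mu) ** 2 for h in vals) / len(vals)  # MLE divides by n, not n-1
    return mu, var

mu_m, var_m = gaussian_mle(heights, "male")
mu_f, var_f = gaussian_mle(heights, "female")
print(mu_m, var_m, mu_f, var_f)
```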
Using the model
[Figure: the two fitted densities p(H | male) and p(H | female) plotted against H.]
Using the model
[Figure: the posteriors P(male | H) and P(female | H) plotted against H.]
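Putting the pieces together, Bayes' rule turns the fitted class-conditional Gaussians into the posteriors P(male | H) and P(female | H). The parameters and priors below are invented for illustration.

```python
import math

# From class-conditional Gaussians to posteriors via Bayes' rule.
# Parameters (mu, sigma^2) and the uniform prior are hypothetical.

def gaussian_pdf(h, mu, var):
    return math.exp(-(h - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

params = {"male": (1.78, 0.005), "female": (1.63, 0.005)}  # invented
prior = {"male": 0.5, "female": 0.5}

def posterior(h):
    joint = {y: prior[y] * gaussian_pdf(h, *params[y]) for y in prior}
    z = sum(joint.values())  # p(h), the normalizer
    return {y: joint[y] / z for y in joint}

print(posterior(1.80))  # mostly "male"
print(posterior(1.60))  # mostly "female"
```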
Example: Regression
Suppose y is actual runtime, and x is input length.
Regression tries to predict some continuous variables from others.
Regression
Linear: assume linear relationship, fit a line.
We can turn this into a model!
Linear Model
Given x, predict y.
y = β1 x + β0 + ε,  where ε ~ N(0, σ²)

β1 x + β0 is the true regression line; ε is a random deviation.
Principle of Least Squares
Minimize the sum of squared vertical deviations.
Unique, closed form solution!
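The closed-form least-squares solution can be written down directly. A sketch with invented data whose true relationship is roughly y = 2x:

```python
# Closed-form least-squares fit for y = b1*x + b0: the unique minimizer of
# the summed squared vertical deviations. Toy data, invented.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2); b0 = mean_y - b1*mean_x
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x

print(b1, b0)  # close to slope 2, intercept 0
```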
Other kinds of regression
transform one or both variables (e.g., take a log)
polynomial regression (least squares still reduces to a linear system)
multivariate regression
logistic regression
Example: text categorization
Bag-of-words model:
x is a histogram of counts for all words
y is a topic

P(x | y) = Π_w p_uni(w | y)^count(w; x)
MLE for Multinomials
“Count and Normalize”
p̂_uni(w | y) = count(w; training) / count(*; training)
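"Count and normalize" in code, on an invented two-class toy corpus; counts are taken over the training documents of each class separately:

```python
from collections import Counter

# MLE for a multinomial over words, per class:
# p-hat(w | y) = count(w; class-y training docs) / count(*; class-y training docs).
# The tiny corpus below is invented.

training = {
    "spam": ["free money free offer".split(), "free offer now".split()],
    "mail": ["meeting at noon".split(), "lunch meeting tomorrow".split()],
}

def mle_unigram(docs):
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())           # count(*): all tokens
    return {w: c / total for w, c in counts.items()}

p_spam = mle_unigram(training["spam"])
print(p_spam["free"])  # 3 of the 7 spam tokens
```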
The Truth about MLE
You will never see all the words.
For many models, MLE isn’t safe.
To understand why, consider a typical evaluation scenario.
Evaluation
Train your model on some data.
How good is the model?
Test on different data that the system never saw before. Why?
Tradeoff
Overfit the training data, and the model doesn't generalize; constrain the model for low variance, and it may achieve low accuracy on the training data.
Text categorization again
Suppose ‘v1@gra’ never appeared in any document in training, ever.
What is the above probability for a new document containing ‘v1@gra’ at test time?
P(x | y) = Π_w p_uni(w | y)^count(w; x)

Since p_uni(‘v1@gra’ | y) = 0 under the MLE, the entire product is zero.
Solutions
Regularization: prefer less extreme parameters.
Smoothing: "flatten out" the distribution.
Bayesian estimation: construct a prior over model parameters, then train to maximize P(data | model) × P(model).
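As one concrete instance of smoothing, here is add-one (Laplace) smoothing; this is a sketch of the general idea, not necessarily the specific method prescribed in the course:

```python
from collections import Counter

# Add-one (Laplace) smoothing: p(w | y) = (count(w) + 1) / (total + |V|).
# Unseen words get nonzero probability instead of zeroing out the whole
# product. Corpus and vocabulary are invented.

docs = ["free money free offer".split(), "free offer now".split()]
vocab = {"free", "money", "offer", "now", "v1@gra"}  # includes an unseen word

counts = Counter(w for doc in docs for w in doc)
total = sum(counts.values())

def p_smoothed(w):
    return (counts[w] + 1) / (total + len(vocab))

print(p_smoothed("free"))    # (3 + 1) / (7 + 5)
print(p_smoothed("v1@gra"))  # (0 + 1) / (7 + 5), no longer zero
```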
One More Point
Building models is not the only way to be empirical: neural networks, SVMs, instance-based learning.
MLE and smoothed/Bayesian estimation are not the only ways to estimate: minimize error, for example ("discriminative" estimation).
Assignment 3
Spam detection.
We provide a few thousand examples.
Perform EDA and pick features.
Estimate probabilities.
Build a Naive Bayes classifier.