TRANSCRIPT
Naïve Bayes Text Classification
9 March 2021
cmpu 366 · Computational Linguistics
Machine learning is the area of computer science focused on the development and implementation of systems that improve as they encounter more data.
Machine learning has been central to advances in NLP for approximately the last 25 years.
Text classification
From: "Fabian Starr“ <[email protected]> Subject: Hey! Sofware for the funny prices!
Get the great discounts on popular software today for PC and Macintosh http://iiled.org/Cj4Lmx 70-90% Discounts from retail price!!! All sofware is instantly available to download - No Need Wait!
Is this spam?
Who wrote each of the Federalist papers?
Mad dog A. Ham.
What’s the subject of this medical article? Antagonists and inhibitors
Blood supply
Chemistry
Drug therapy
Embryology
Epidemiology
…
…zany characters and richly applied satire, and some great plot twists
It was pathetic. The worst part about it was the boxing scenes…
…awesome caramel sauce and sweet toasty almonds. I love this place!
…awful pizza and ridiculously overpriced…
Are these reviews positive or negative?
Many problems take the form of text classification: Assigning subject categories, topics, or genres
Spam detection
Authorship identification
Age/gender identification
Language identification
Sentiment analysis
Part-of-speech tagging
Automatic essay grading
…
Text classification problems take this form: Input:
A document d
A fixed set of classes C = {c1, c2, …, cj}
Output:
A predicted class c ∈ C
We can build a classifier by writing rules. Look for combinations of words or other features
Spam: black-list-address OR (“dollars” AND “have been selected”)
Accuracy can be high if the rules are carefully refined by an expert.
But building and maintaining these rules is expensive.
Instead, just as humans learn from experience, we can make computers learn from data.
A supervised machine-learning text classification problem takes this form:
Input:
A document d
A fixed set of classes C = {c1, c2, …, cj}
A training set of m hand-labeled documents (d1, c1), …, (dm, cm)
Output:
A learned classifier γ : d → c
Supervised machine learning
[Figure from the NLTK book: during training, a feature extractor converts labeled documents into feature sets that the machine-learning algorithm uses to build a classifier model; during prediction, the same feature extractor produces features that the model maps to a label.]
Features
A classification decision must rely on some observable evidence, which we encode as features.
Typical features include: Words (n-grams) present in the text
Frequency of words
Capitalization
Presence of named entities
Syntactic relations
Semantic relations
The simplest and most common features are Boolean, e.g., is the word present or not?
However, we can also have integer features like the number of times a word occurs.
The features we select depend on the task. Is a name masculine or feminine?
Last letter = …
What part-of-speech is a word, e.g., park or carbingly?
Is the word preceded by the? to?
Does the word end with -ly? -ness?
Is an email spam?
Does it contain "generic Viagra"?
Is the subject in all capital letters?
See features that were used by SpamAssassin: http://spamassassin.apache.org/old/tests_3_3_x.html
Feature engineering is the problem of deciding what features are relevant.
Approaches: Hand-crafted
Use expert knowledge to determine a small set of features that are likely to be relevant.
Kitchen sink
Give lots of features to the machine-learning algorithm and see which features are given greater weight and which are ignored.
E.g., use each word in the document as a feature: has-cash: True
has-the: True
has-linguistics: False
…
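To make the kitchen-sink idea concrete, here is a minimal sketch (in Python) of Boolean word-presence features in the style of the NLTK classifier interface; the function name and the tiny vocabulary are only illustrative, not from the lecture.

def document_features(document_words, vocabulary):
    """Map a document (list of tokens) to Boolean has-<word> features."""
    word_set = set(w.lower() for w in document_words)
    return {f"has-{w}": (w in word_set) for w in vocabulary}

vocab = ["cash", "the", "linguistics"]      # tiny illustrative vocabulary
doc = "Send the cash today".split()
print(document_features(doc, vocab))
# {'has-cash': True, 'has-the': True, 'has-linguistics': False}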
Weighting the evidence
A classification decision involves reconciling multiple features with different levels of predictive power.
Different types of classifiers use different algorithms to:
Determine the weights of individual features to maximize correct predictions for the training data and
Compute the likelihood of a label for an input, using the feature weights.
Popular machine learning methods: Naïve Bayes
Decision tree
Maximum entropy (ME)
Hidden Markov model (HMM)
Neural networks, including deep learning
Support vector machine (SVM)
Naïve Bayes
Naïve Bayes is a simple classification method based on Bayes rule.
For text classification, we can use it with a simple representation of a document as a bag of words.
The bag of words representation
[Figure from J&M, 3rd ed. draft, sec. 6.1: the document is reduced to a bag of word counts – seen 2, sweet 1, whimsical 1, recommend 1, happy 1, … – and the classifier γ maps that bag of counts to a class c.]
Bayes rule and classification
Bayes rule relates conditional probabilities.
For a document d and a class c,
P(c ∣ d) = P(d ∣ c) · P(c) / P(d)

where P(c ∣ d) is the posterior, P(d ∣ c) the likelihood, P(c) the prior, and P(d) the evidence.
To choose the most likely class, cMAP from the set of classes C, for a document d:
cMAP = argmax_{c ∈ C} P(c ∣ d)
       (MAP is "maximum a posteriori" – the most likely class)

     = argmax_{c ∈ C} P(d ∣ c) P(c) / P(d)
       (Bayes rule)

     = argmax_{c ∈ C} P(d ∣ c) P(c)
       (dropping the denominator, which is the same for each class)
Document d is represented as features f1, …, fn, so

cMAP = argmax_{c ∈ C} P(d ∣ c) P(c)
     = argmax_{c ∈ C} P(f1, f2, …, fn ∣ c) P(c)
       (likelihood × prior)

But estimating P(f1, f2, …, fn ∣ c) directly requires O(|F|^n · |C|) parameters – exponential in the number of features – far more than any realistic amount of training data can support.
Fortunately, the “naïve” in “naive Bayes” isn’t (just) a value judgment; it’s a functional design choice.
The naïve Bayes assumption is that the features f1, …, fn are conditionally independent (of one another) given the class c.
This simplifies combining contributions of features; you just multiply their probabilities:
P(f1, …, fn | c) = P(f1 | c) · P(f2 | c) ⋯ P(fn | c)
cMAP = argmax_{c ∈ C} P(f1, f2, …, fn ∣ c) P(c)

cNB = argmax_{c ∈ C} P(c) ∏_{i=1}^{n} P(fi ∣ c)
Returning to our bag-of-words model: let positions = all word positions in the document, where wi is the word at position i. Then

cNB = argmax_{c ∈ C} P(c) ∏_{i ∈ positions} P(wi ∣ c)

This class is the one our naïve Bayes text classifier returns.
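A minimal sketch of this decision rule in Python, assuming the priors and per-word likelihoods have already been estimated (next slides). Summing log probabilities – rather than multiplying the probabilities directly – is an implementation detail to avoid floating-point underflow; the argmax is unchanged.

import math

def predict(doc_words, classes, log_prior, log_likelihood):
    """Return argmax_c of log P(c) + sum_i log P(w_i | c) for the document."""
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c]
        for w in doc_words:
            if (w, c) in log_likelihood:   # words unseen in training are skipped
                score += log_likelihood[(w, c)]
        if score > best_score:
            best_class, best_score = c, score
    return best_class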
Naïve Bayes: Learning
We need to estimate the prior probability of each category, P(c), for each c ∈ C.
The maximum-likelihood estimate from the training corpus is

P(c) = doccount(C = c) / Ndoc

We also need the probability of each word (feature) given each category – the fraction of times word wi appears among all the words in documents of topic c:

P(wi ∣ c) = count(wi, c) / ∑_{w ∈ V} count(w, c)
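A small sketch of these maximum-likelihood estimates in Python, assuming the training data is a list of (list-of-words, class) pairs; the counting scheme and names are illustrative.

from collections import Counter

def train_mle(training_data):
    """Estimate P(c) and P(w | c) by relative frequency (no smoothing yet)."""
    doc_count, word_count, class_total = Counter(), Counter(), Counter()
    for words, c in training_data:
        doc_count[c] += 1
        for w in words:
            word_count[(w, c)] += 1     # count of word w in documents of class c
            class_total[c] += 1         # total word tokens in class c
    n_doc = sum(doc_count.values())
    prior = {c: doc_count[c] / n_doc for c in doc_count}
    likelihood = {wc: word_count[wc] / class_total[wc[1]] for wc in word_count}
    return prior, likelihood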
In general, the more training data we can give the classifier, the better it will do.
[Figure: learning curve – error rate (y-axis) falls as training set size (x-axis) grows]
Note that we have a big problem with zero counts! If we never saw the word fantastic in a document labeled as positive, then

P(fantastic ∣ positive) = count(fantastic, positive) / ∑_{w ∈ V} count(w, positive) = 0

When we calculate

cMAP = argmax_{c ∈ C} P(c) ∏_{i=1}^{n} P(fi ∣ c)

this one 0 will turn the whole estimate to 0!
As we did with n-grams, we can use Laplace (add-1) smoothing:
P(wi ∣ c) = count(wi, c) / ∑_{w ∈ V} count(w, c)

→ P(wi ∣ c) = (count(wi, c) + 1) / ((∑_{w ∈ V} count(w, c)) + |V|)
What about the unknown words – those that appear in the test data but not in the training data?
Ignore them! Just remove them from the test document.
We could build an unknown word model, but it wouldn’t generally help; it’s unlikely to help us to know which class has more unknown words.
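Continuing the counting sketch above, add-1 smoothing only changes the likelihood estimate. A hypothetical helper, assuming word_count and class_total are Counters like those built in train_mle and vocab is the training vocabulary:

def smoothed_likelihood(word_count, class_total, vocab):
    """Add-1 (Laplace) smoothed P(w | c) for every vocabulary word and class."""
    return {(w, c): (word_count[(w, c)] + 1) / (class_total[c] + len(vocab))
            for w in vocab for c in class_total}

# Every in-vocabulary word now has a nonzero probability in every class.
# Words that appear only in the test data (outside vocab) are simply dropped.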
Naïve Bayes and language modeling
Generative model for multinomial naïve Bayes
[Figure: the class c = spam generates the words w1 = Dear, w2 = sir, w3 = SEEKING, w4 = YOUR, …]
Naïve Bayes classifiers can use any sort of feature: URLs, email addresses, dictionaries, network features.
But if we have a feature corresponding to each word in the text, then each class in our naïve Bayes model is a unigram language model.
The probability of each word: P(word | c)
The probability of each sentence: P(s | c) = ∏ P(word | c)

Class positive:  P(I) = 0.1,  P(love) = 0.1,    P(this) = 0.01,  P(fun) = 0.05,   P(film) = 0.1
Class negative:  P(I) = 0.2,  P(love) = 0.001,  P(this) = 0.01,  P(fun) = 0.005,  P(film) = 0.1

For s = "I love this fun film":

P(s | positive) = 0.1 · 0.1 · 0.01 · 0.05 · 0.1 = 0.000 000 5
P(s | negative) = 0.2 · 0.001 · 0.01 · 0.005 · 0.1 = 0.000 000 001

P(s | positive) > P(s | negative)
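Checking the arithmetic of this example in a few lines of Python (the probabilities are the ones in the tables above):

import math

p_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
p_neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

sentence = "I love this fun film".split()
p_s_pos = math.prod(p_pos[w] for w in sentence)   # 5e-07
p_s_neg = math.prod(p_neg[w] for w in sentence)   # 1e-09
print(p_s_pos > p_s_neg)                          # True: classify as positive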
Oh, to be Bayesian and naïve!
Strengths of naïve Bayes classification:
The model is easy to understand and easy to implement
(compared with other classifiers!)
Training and classification are both fast
Requires modest storage space
Relatively robust to irrelevant features
If we include features – e.g., words – that don’t help us classify, they cancel out without affecting the results
Works well for many tasks
It’s a good, dependable baseline for classification that’s widely used in practice – but it’s not the best!
Weakness of naïve Bayes classification:
The bag-of-words representation ignores the sequential ordering of words
The independence assumption is inappropriate if there are strong conditional dependencies between the variables.
The model may not be “right”, but often we’re interested in the accuracy of the classification, not of the probability estimates.
Evaluation
After choosing the parameters for the classifier – i.e., training it – we test how well it does on a test set of examples that weren’t used for training.
Precision, recall, and F-measure
Jurafsky & Martin asks us to imagine we’re the CEO of Delicious Pie Company and we want to know what people are tweeting about our pies (which are delicious).
We build a classifier to identify which tweets are about Delicious Pie Company.
2×2 confusion matrix:

                                      It's really about us   It's really NOT about us
Classifier says it's about us         true positive          false positive
Classifier says it's not about us     false negative         true negative
What percent of the tweets about us did we identify?
What percent of the tweets that we said were about us really were?
What percent of the tweets were identified correctly either way?
Accuracy sounds great – it considers how the classifier does on all inputs!
Well, it depends on the base (prior) probabilities: 99.99% accuracy might be terrible.
If we see 1 million tweets and only 100 of them are about Delicious Pie Company, we could just label every tweet “not about us”!
60% accuracy might be pretty good.
If we’re labeling documents with 20 different topics and the largest category only accounts for 10% of the data, that’s a much more difficult problem.
Instead, we measure precision and recall.
Precision is the percent of items the system detected (labeled as positive for a class) that are actually positive:
true positives / (true positives + false positives)
Recall is the percent of items actually present in the input that were correctly identified by the system:
true positives / (true positives + false negatives)
The classifier that says no tweets are about pie would have 99.99% accuracy – but 0% recall!
It doesn’t identify any of the 100 tweets we wanted.
There’s a trade-off between precision and recall. A highly precise classifier will ignore cases where it’s less confident, leading to more false negatives → lower recall
A high-recall classifier will flag things it’s unsure about, leading to more false positives → lower precision
In developing a real application, picking the right trade-off point between precision and recall is an important usability issue.
Think about a grammar checker: Too many false positives will irritate lots of users.
But if you’re designing a system to detect hate speech online, you might want to err on the side of high recall to avoid abuse slipping through the cracks.
Any balance of precision and recall can be encoded as a single measure called an F-score:
The most common F-score is F1, which is the harmonic mean of precision and recall:
Fβ = (β² + 1)PR / (β²P + R)

F1 = 2PR / (P + R)
Why do we use the harmonic mean rather than the mean?
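A quick illustration (with made-up counts) of why: the harmonic mean punishes a large gap between precision and recall, while the arithmetic mean would reward a degenerate high-precision, low-recall classifier.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

p, r = precision(tp=30, fp=10), recall(tp=30, fn=70)   # P = 0.75, R = 0.30
print(f_beta(p, r))       # F1 ≈ 0.43 (harmonic mean)
print((p + r) / 2)        # 0.525  (the arithmetic mean is much more forgiving)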
Development test sets
We train on a training set and test on a test set.
But sometimes we also want a development test set. This avoids overfitting – “tuning to the test set” – and offers a more conservative estimate of performance.
[Figure: the data divided into a training set, a development test set, and a test set]
Problem: We want as much data as possible for training and as much as possible for dev. How should we split it?
Cross-validation: multiple splits
We can pool results over the splits and compute the pooled development-set performance.
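A minimal sketch of making the splits, assuming the labeled documents are in a Python list; the fold count is illustrative.

def k_fold_splits(data, k=3):
    """Yield (train, dev) pairs for k roughly equal folds."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        dev = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, dev

# for train, dev in k_fold_splits(labeled_documents, k=3):
#     ...train a classifier on train, evaluate on dev, pool the predictions...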
3×3 confusion matrix
How can we combine the precision or recall scores from three (or more) classes to get one metric?
Macroaveraging
Compute the performance for each class and then average over classes
Microaveraging
Collect decisions for all classes into one confusion matrix
Compute precision and recall from that table
Macroaveraging and microaveraging
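A small sketch of the difference, using made-up per-class true-positive and false-positive counts:

# per_class maps each class to its (true positive, false positive) counts
per_class = {"urgent": (8, 10), "normal": (60, 55), "spam": (200, 33)}

# Macroaveraging: compute precision for each class, then average the class scores.
macro_p = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)

# Microaveraging: pool all counts into one table, then compute precision once.
total_tp = sum(tp for tp, _ in per_class.values())
total_fp = sum(fp for _, fp in per_class.values())
micro_p = total_tp / (total_tp + total_fp)

print(round(macro_p, 3), round(micro_p, 3))  # microaveraging favors the large classes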
Assignment 2: Who Said It?
Jane Austen or Herman Melville? I never met with a disposition more truly amiable.
But Queequeg, do you see, was a creature in the transition stage – neither caterpillar nor butterfly.
Oh, my sweet cardinals!
Task: build a Naïve Bayes classifier and explore it
Do three-way partition of data: test data
development-test data
training data
Acknowledgments
The lecture incorporates material from: Na-Rae Han, University of Pittsburgh
Nancy Ide, Vassar College
Daniel Jurafsky, Stanford University
Daniel Jurafsky and James Martin, Speech and Language Processing