TRANSCRIPT
Naïve Bayes Text Classification
9 March 2021
cmpu 366 · Computational Linguistics
Machine learning is the area of computer science focused on the development and implementation of systems that improve as they encounter more data.
Machine learning has been central to advances in NLP for approximately the last 25 years.
Text classification
From: "Fabian Starr“ <[email protected]> Subject: Hey! Sofware for the funny prices!
Get the great discounts on popular software today for PC and Macintosh http://iiled.org/Cj4Lmx 70-90% Discounts from retail price!!! All sofware is instantly available to download - No Need Wait!
Is this spam?
Who wrote each of the Federalist papers?
Mad dog A. Ham.
What’s the subject of this medical article? Antagonists and inhibitors
Blood supply
Chemistry
Drug therapy
Embryology
Epidemiology
…
…zany characters and richly applied satire, and some great plot twists
It was pathetic. The worst part about it was the boxing scenes…
…awesome caramel sauce and sweet toasty almonds. I love this place!
…awful pizza and ridiculously overpriced…
Are these reviews positive or negative?
Many problems take the form of text classification: Assigning subject categories, topics, or genres
Spam detection
Authorship identification
Age/gender identification
Language identification
Sentiment analysis
Part-of-speech tagging
Automatic essay grading
…
Text classification problems take this form: Input:
A document d
A fixed set of classes C = {c1, c2, …, cj}
Output:
A predicted class c ∈ C
We can build a classifier by writing rules. Look for combinations of words or other features
Spam: black-list-address OR (“dollars” AND “have been selected”)
Accuracy can be high if the rules are carefully refined by an expert.
But building and maintaining these rules is expensive.
Instead, just as humans learn from experience, we can make computers learn from data.
A supervised machine-learning text classification problem takes this form:
Input:
A document d
A fixed set of classes C = {c1, c2, …, cj}
A training set of m hand-labeled documents (d1, c1), …, (dm, cm)
Output:
A learned classifier γ : d → c
Supervised machine learning
[Figure from the NLTK book: during training, a feature extractor converts labeled documents into feature sets that the machine-learning algorithm uses to build a classifier model; during prediction, the same feature extractor produces features that the model maps to a label.]
Features
A classification decision must rely on some observable evidence, which we encode as features.
Typical features include: Words (n-grams) present in the text
Frequency of words
Capitalization
Presence of named entities
Syntactic relations
Semantic relations
The simplest and most common features are Boolean, e.g., is the word present or not?
However, we can also have integer features like the number of times a word occurs.
The features we select depend on the task. Is a name masculine or feminine?
Last letter = …
What part-of-speech is a word, e.g., park or carbingly?
Is the word preceded by the? to?
Does the word end with -ly? -ness?
Is an email spam?
Does it contain "generic Viagra"?
Is the subject in all capital letters?
See features that were used by SpamAssassin: http://spamassassin.apache.org/old/tests_3_3_x.html
Feature engineering is the problem of deciding what features are relevant.
Approaches: Hand-crafted
Use expert knowledge to determine a small set of features that are likely to be relevant.
Kitchen sink
Give lots of features to the machine-learning algorithm and see which features are given greater weight and which are ignored.
E.g., use each word in the document as a feature: has-cash: True
has-the: True
has-linguistics: False
…
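To make the kitchen-sink idea concrete, here is a minimal sketch (in Python) of Boolean word-presence features in the style of the NLTK classifier interface; the function name and the tiny vocabulary are only illustrative, not from the lecture.

def document_features(document_words, vocabulary):
    """Map a document (list of tokens) to Boolean has-<word> features."""
    word_set = set(w.lower() for w in document_words)
    return {f"has-{w}": (w in word_set) for w in vocabulary}

vocab = ["cash", "the", "linguistics"]      # tiny illustrative vocabulary
doc = "Send the cash today".split()
print(document_features(doc, vocab))
# {'has-cash': True, 'has-the': True, 'has-linguistics': False}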
Weighting the evidence
A classification decision involves reconciling multiple features with different levels of predictive power.
Different types of classifiers use different algorithms to:
Determine the weights of individual features to maximize correct predictions for the training data and
Compute the likelihood of a label for an input, using the feature weights.
Popular machine learning methods: Naïve Bayes
Decision tree
Maximum entropy (ME)
Hidden Markov model (HMM)
Neural networks, including deep learning
Support vector machine (SVM)
Naïve Bayes
Naïve Bayes is a simple classification method based on Bayes rule.
For text classification, we can use it with a simple representation of a document as a bag of words.
The bag of words representation
[Figure from J&M, 3rd ed. draft, sec. 6.1: the document is reduced to a bag of word counts – seen 2, sweet 1, whimsical 1, recommend 1, happy 1, … – and the classifier γ maps that bag of counts to a class c.]
Bayes rule and classification
Bayes rule relates conditional probabilities.
For a document d and a class c,
P(c ∣ d) = P(d ∣ c) · P(c) / P(d)

where P(c ∣ d) is the posterior, P(d ∣ c) the likelihood, P(c) the prior, and P(d) the evidence.
To choose the most likely class, cMAP from the set of classes C, for a document d:
cMAP = argmax_{c ∈ C} P(c ∣ d)
       (MAP is "maximum a posteriori" – the most likely class)

     = argmax_{c ∈ C} P(d ∣ c) P(c) / P(d)
       (Bayes rule)

     = argmax_{c ∈ C} P(d ∣ c) P(c)
       (dropping the denominator, which is the same for each class)
Document d is represented as features f1, …, fn, so

cMAP = argmax_{c ∈ C} P(d ∣ c) P(c)
     = argmax_{c ∈ C} P(f1, f2, …, fn ∣ c) P(c)
       (likelihood × prior)

But estimating P(f1, f2, …, fn ∣ c) directly requires O(|F|^n · |C|) parameters – exponential in the number of features – far more than any realistic amount of training data can support.
Fortunately, the “naïve” in “naive Bayes” isn’t (just) a value judgment; it’s a functional design choice.
The naïve Bayes assumption is that the features f1, …, fn are conditionally independent (of one another) given the class c.
This simplifies combining contributions of features; you just multiply their probabilities:
P(f1, …, fn | c) = P(f1 | c) · P(f2 | c) ⋯ P(fn | c)
cMAP = argmax_{c ∈ C} P(f1, f2, …, fn ∣ c) P(c)

cNB = argmax_{c ∈ C} P(c) ∏_{i=1}^{n} P(fi ∣ c)
Returning to our bag-of-words model: let positions = all word positions in the document, where wi is the word at position i. Then

cNB = argmax_{c ∈ C} P(c) ∏_{i ∈ positions} P(wi ∣ c)

This class is the one our naïve Bayes text classifier returns.
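A minimal sketch of this decision rule in Python, assuming the priors and per-word likelihoods have already been estimated (next slides). Summing log probabilities – rather than multiplying the probabilities directly – is an implementation detail to avoid floating-point underflow; the argmax is unchanged.

import math

def predict(doc_words, classes, log_prior, log_likelihood):
    """Return argmax_c of log P(c) + sum_i log P(w_i | c) for the document."""
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c]
        for w in doc_words:
            if (w, c) in log_likelihood:   # words unseen in training are skipped
                score += log_likelihood[(w, c)]
        if score > best_score:
            best_class, best_score = c, score
    return best_class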
Naïve Bayes: Learning
We need to estimate the prior probability of each category, P(c), for each c ∈ C.
The maximum-likelihood estimate from the training corpus is

P(c) = doccount(C = c) / Ndoc

We also need the probability of each word (feature) given each category – the fraction of times word wi appears among all the words in documents of topic c:

P(wi ∣ c) = count(wi, c) / ∑_{w ∈ V} count(w, c)
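A small sketch of these maximum-likelihood estimates in Python, assuming the training data is a list of (list-of-words, class) pairs; the counting scheme and names are illustrative.

from collections import Counter

def train_mle(training_data):
    """Estimate P(c) and P(w | c) by relative frequency (no smoothing yet)."""
    doc_count, word_count, class_total = Counter(), Counter(), Counter()
    for words, c in training_data:
        doc_count[c] += 1
        for w in words:
            word_count[(w, c)] += 1     # count of word w in documents of class c
            class_total[c] += 1         # total word tokens in class c
    n_doc = sum(doc_count.values())
    prior = {c: doc_count[c] / n_doc for c in doc_count}
    likelihood = {wc: word_count[wc] / class_total[wc[1]] for wc in word_count}
    return prior, likelihood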
In general, the more training data we can give the classifier, the better it will do.
[Figure: learning curve – error rate (y-axis) falls as training set size (x-axis) grows]
Note that we have a big problem with zero counts! If we never saw the word fantastic in a document labeled as positive, then

P(fantastic ∣ positive) = count(fantastic, positive) / ∑_{w ∈ V} count(w, positive) = 0

When we calculate

cMAP = argmax_{c ∈ C} P(c) ∏_{i=1}^{n} P(fi ∣ c)

this one 0 will turn the whole estimate to 0!
As we did with n-grams, we can use Laplace (add-1) smoothing:
P(wi ∣ c) = count(wi, c) / ∑_{w ∈ V} count(w, c)

→ P(wi ∣ c) = (count(wi, c) + 1) / ((∑_{w ∈ V} count(w, c)) + |V|)
What about the unknown words – those that appear in the test data but not in the training data?
Ignore them! Just remove them from the test document.
We could build an unknown word model, but it wouldn’t generally help; it’s unlikely to help us to know which class has more unknown words.
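Continuing the counting sketch above, add-1 smoothing only changes the likelihood estimate. A hypothetical helper, assuming word_count and class_total are Counters like those built in train_mle and vocab is the training vocabulary:

def smoothed_likelihood(word_count, class_total, vocab):
    """Add-1 (Laplace) smoothed P(w | c) for every vocabulary word and class."""
    return {(w, c): (word_count[(w, c)] + 1) / (class_total[c] + len(vocab))
            for w in vocab for c in class_total}

# Every in-vocabulary word now has a nonzero probability in every class.
# Words that appear only in the test data (outside vocab) are simply dropped.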
Naïve Bayes and language modeling
Generative model for multinomial naïve Bayes
[Figure: the class c = spam generates the words w1 = Dear, w2 = sir, w3 = SEEKING, w4 = YOUR, …]
Naïve Bayes classifiers can use any sort of feature: URLs, email addresses, dictionaries, network features.
But if we have a feature corresponding to each word in the text, then each class in our naïve Bayes model is a unigram language model.
The probability of each word: P(word | c)
The probability of each sentence: P(s | c) = ∏ P(word | c)

Class positive:  P(I) = 0.1,  P(love) = 0.1,    P(this) = 0.01,  P(fun) = 0.05,   P(film) = 0.1
Class negative:  P(I) = 0.2,  P(love) = 0.001,  P(this) = 0.01,  P(fun) = 0.005,  P(film) = 0.1

For s = "I love this fun film":

P(s | positive) = 0.1 · 0.1 · 0.01 · 0.05 · 0.1 = 0.000 000 5
P(s | negative) = 0.2 · 0.001 · 0.01 · 0.005 · 0.1 = 0.000 000 001

P(s | positive) > P(s | negative)
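Checking the arithmetic of this example in a few lines of Python (the probabilities are the ones in the tables above):

import math

p_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
p_neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

sentence = "I love this fun film".split()
p_s_pos = math.prod(p_pos[w] for w in sentence)   # 5e-07
p_s_neg = math.prod(p_neg[w] for w in sentence)   # 1e-09
print(p_s_pos > p_s_neg)                          # True: classify as positive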
Oh, to be Bayesian and naïve!
Strengths of naïve Bayes classification:
The model is easy to understand and easy to implement
(compared with other classifiers!)
Training and classification are both fast
Requires modest storage space
Relatively robust to irrelevant features
If we include features – e.g., words – that don’t help us classify, they cancel out without affecting the results
Works well for many tasks
It’s a good, dependable baseline for classification that’s widely used in practice – but it’s not the best!
Weakness of naïve Bayes classification:
The bag-of-words representation ignores the sequential ordering of words
The independence assumption is inappropriate if there are strong conditional dependencies between the variables.
The model may not be “right”, but often we’re interested in the accuracy of the classification, not of the probability estimates.
Evaluation
After choosing the parameters for the classifier – i.e., training it – we test how well it does on a test set of examples that weren’t used for training.
Precision, recall, and F-measure
Jurafsky & Martin asks us to imagine we’re the CEO of Delicious Pie Company and we want to know what people are tweeting about our pies (which are delicious).
We build a classifier to identify which tweets are about Delicious Pie Company.
2×2 confusion matrix:

                                      It's really about us   It's really NOT about us
Classifier says it's about us         true positive          false positive
Classifier says it's not about us     false negative         true negative
What percent of the tweets about us did we identify?
What percent of the tweets that we said were about us really were?
What percent of the tweets were identified correctly either way?
Accuracy sounds great – it considers how the classifier does on all inputs!
Well, it depends on the base (prior) probabilities: 99.99% accuracy might be terrible.
If we see 1 million tweets and only 100 of them are about Delicious Pie Company, we could just label every tweet “not about us”!
60% accuracy might be pretty good.
If we’re labeling documents with 20 different topics and the largest category only accounts for 10% of the data, that’s a much more difficult problem.
Instead, we measure precision and recall.
Precision is the percent of items the system detected (labeled as positive for a class) that are actually positive:
true positives / (true positives + false positives)
Recall is the percent of items actually present in the input that were correctly identified by the system:
true positives / (true positives + false negatives)
The classifier that says no tweets are about pie would have 99.99% accuracy – but 0% recall!
It doesn’t identify any of the 100 tweets we wanted.
There’s a trade-off between precision and recall. A highly precise classifier will ignore cases where it’s less confident, leading to more false negatives → lower recall
A high-recall classifier will flag things it’s unsure about, leading to more false positives → lower precision
In developing a real application, picking the right trade-off point between precision and recall is an important usability issue.
Think about a grammar checker: Too many false positives will irritate lots of users.
But if you’re designing a system to detect hate speech online, you might want to err on the side of high recall to avoid abuse slipping through the cracks.
Any balance of precision and recall can be encoded as a single measure called an F-score:
The most common F-score is F1, which is the harmonic mean of precision and recall:
Fβ = (β² + 1)PR / (β²P + R)

F1 = 2PR / (P + R)
Why do we use the harmonic mean rather than the mean?
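A quick illustration (with made-up counts) of why: the harmonic mean punishes a large gap between precision and recall, while the arithmetic mean would reward a degenerate high-precision, low-recall classifier.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

p, r = precision(tp=30, fp=10), recall(tp=30, fn=70)   # P = 0.75, R = 0.30
print(f_beta(p, r))       # F1 ≈ 0.43 (harmonic mean)
print((p + r) / 2)        # 0.525  (the arithmetic mean is much more forgiving)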
Development test sets
We train on a training set and test on a test set.
But sometimes we also want a development test set. This avoids overfitting – “tuning to the test set” – and offers a more conservative estimate of performance.
[Figure: the data divided into a training set, a development test set, and a test set]
Problem: We want as much data as possible for training and as much as possible for dev. How should we split it?
Cross-validation: multiple splits
We can pool results over the splits and compute the pooled development-set performance.
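A minimal sketch of making the splits, assuming the labeled documents are in a Python list; the fold count is illustrative.

def k_fold_splits(data, k=3):
    """Yield (train, dev) pairs for k roughly equal folds."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        dev = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, dev

# for train, dev in k_fold_splits(labeled_documents, k=3):
#     ...train a classifier on train, evaluate on dev, pool the predictions...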
3×3 confusion matrix
How can we combine the precision or recall scores from three (or more) classes to get one metric?
Macroaveraging
Compute the performance for each class and then average over classes
Microaveraging
Collect decisions for all classes into one confusion matrix
Compute precision and recall from that table
Macroaveraging and microaveraging
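A small sketch of the difference, using made-up per-class true-positive and false-positive counts:

# per_class maps each class to its (true positive, false positive) counts
per_class = {"urgent": (8, 10), "normal": (60, 55), "spam": (200, 33)}

# Macroaveraging: compute precision for each class, then average the class scores.
macro_p = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)

# Microaveraging: pool all counts into one table, then compute precision once.
total_tp = sum(tp for tp, _ in per_class.values())
total_fp = sum(fp for _, fp in per_class.values())
micro_p = total_tp / (total_tp + total_fp)

print(round(macro_p, 3), round(micro_p, 3))  # microaveraging favors the large classes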
Assignment 2: Who Said It?
Jane Austen or Herman Melville? I never met with a disposition more truly amiable.
But Queequeg, do you see, was a creature in the transition stage – neither caterpillar nor butterfly.
Oh, my sweet cardinals!
Task: build a Naïve Bayes classifier and explore it
Do three-way partition of data: test data
development-test data
training data
Acknowledgments
The lecture incorporates material from: Na-Rae Han, University of Pittsburgh
Nancy Ide, Vassar College
Daniel Jurafsky, Stanford University
Daniel Jurafsky and James Martin, Speech and Language Processing