Text Classification

Elnaz Delpisheh. Introduction to Computational Linguistics, York University, Department of Computer Science and Engineering. August 24, 2014.
![Page 1: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/1.jpg)
Elnaz Delpisheh
Introduction to Computational Linguistics
York University, Department of Computer Science and Engineering
April 24, 2023

Text Classification
![Page 2: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/2.jpg)
Outline
- Definition and applications
- Representing texts
- Pre-processing the text
- Text classification methods
  - Naïve Bayes
  - Voted Perceptron
  - Support Vector Machines
  - Decision Trees
  - K-nearest neighbor
  - Rocchio's algorithm
  - Neural Networks
- Performance evaluation
- Subjective text classification
![Page 3: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/3.jpg)
![Page 4: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/4.jpg)
Text Classification - Definition
Text classification is the assignment of text documents to one or more predefined categories based on their content.
The classifier:
- Input: a set of m hand-labeled documents (x1,y1), ..., (xm,ym)
- Output: a learned classifier f: x → y

[Diagram: a text document is fed to the classifier, which assigns it to Class A, Class B, or Class C.]
![Page 5: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/5.jpg)
Text Classification - Applications
- Classify news stories as World, US, Business, SciTech, Sports, Entertainment, Health, Other.
- Classify business names by industry.
- Classify student essays as A, B, C, D, or F.
- Classify email as Spam, Other.
- Classify email to tech staff as Mac, Windows, ..., Other.
- Classify pdf files as ResearchPaper, Other.
- Classify documents as WrittenByReagan, GhostWritten.
- Classify movie reviews as Favorable, Unfavorable, Neutral.
- Classify technical papers as Interesting, Uninteresting.
- Classify jokes as Funny, NotFunny.
- Classify web sites of companies by Standard Industrial Classification (SIC) code.
![Page 6: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/6.jpg)
Text Classification - Example
Best-studied benchmark: Reuters-21578 newswire stories. 9603 train, 3299 test documents, 80-100 words each, 93 classes.

ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
BUENOS AIRES, Feb 26
Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
• Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
• Maize Mar 48.0, total 48.0 (nil).
• Sorghum nil (nil)
• Oilseed export registrations were:
• Sunflowerseed total 15.0 (7.9)
• Soybean May 20.0, total 20.0 (nil)
The board also detailed export registrations for subproducts, as follows....

Categories: grain, wheat (of 93 binary choices)
![Page 7: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/7.jpg)
![Page 8: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/8.jpg)
Text Classification - Representing Texts

[The ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS document shown above.]

f(x) = y: what is the best representation for the document x being classified? (the simplest useful one)
![Page 9: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/9.jpg)
Pre-processing the Text
- Removing stop words: punctuation, prepositions, pronouns, etc.
- Stemming: walk, walker, walked, walking all reduce to the stem "walk".
- Indexing
- Dimensionality reduction
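A minimal sketch of the first two steps in Python; the stop-word list and suffix rules below are illustrative stand-ins, not the ones used in the lecture:

```python
# Minimal pre-processing sketch: stop-word removal plus crude suffix stemming.
# The stop-word list and suffix rules are illustrative only.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "it", "for"}

def strip_stop_words(tokens):
    """Drop common function words such as prepositions and articles."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Map walk, walker, walked, walking to the shared stem 'walk'."""
    for suffix in ("ing", "ed", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["the", "walker", "walked", "in", "the", "park"]
print([crude_stem(t) for t in strip_stop_words(tokens)])  # ['walk', 'walk', 'park']
```

A real system would use a curated stop list and a proper stemmer (e.g. Porter) instead of these toy rules.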
![Page 10: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/10.jpg)
Representing text: a list of words

[The ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS document shown above.]

f(x) = y, where x = (argentine, 1986, 1987, grain, oilseed, registrations, buenos, aires, feb, 26, argentine, grain, board, figures, show, crop, registrations, of, grains, oilseeds, and, their, products, to, february, 11, in, …
![Page 11: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/11.jpg)
Pre-processing the Text - Indexing
Using the vector space model.
![Page 12: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/12.jpg)
Indexing (Cont.)
![Page 13: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/13.jpg)
Indexing (Cont.)
- tfc-weighting: considers the normalized length of documents (M).
- ltc-weighting: considers the logarithm of the word frequency, to reduce the effect of large differences in frequencies.
- Entropy weighting
![Page 14: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/14.jpg)
Indexing - Word Frequency Weighting

[The ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS document shown above.]

Categories: grain, wheat

| word       | freq |
|------------|------|
| grain(s)   | 3    |
| oilseed(s) | 2    |
| total      | 3    |
| wheat      | 1    |
| maize      | 1    |
| soybean    | 1    |
| tonnes     | 1    |
| ...        | ...  |

If the order of words doesn't matter, x can be a vector of word frequencies. "Bag of words": a long sparse vector x = (…, fi, …) where fi is the frequency of the i-th word in the vocabulary.
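The bag-of-words mapping can be sketched in a few lines; the regex tokenizer here is an illustrative simplification:

```python
import re
from collections import Counter

# Bag-of-words sketch: a document becomes a sparse mapping from word to
# frequency. The tokenizer (lowercase alphabetic runs) is deliberately simple.
def bag_of_words(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

doc = "Argentine grain board figures show crop registrations of grains, oilseeds"
freqs = bag_of_words(doc)
print(freqs["grain"], freqs["grains"])  # 1 1
```

A `Counter` is already a sparse vector: words that never occur are simply absent, and lookups for them return 0.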
![Page 15: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/15.jpg)
Pre-processing the Text - Dimensionality Reduction
- Feature selection: attempts to remove non-informative words.
  - Document frequency thresholding
  - Information gain
- Latent Semantic Indexing
  - Singular Value Decomposition
![Page 16: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/16.jpg)
![Page 17: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/17.jpg)
Text Classification - Methods
Naïve Bayes, Voted Perceptron, Support Vector Machines, Decision Trees, K-nearest neighbor, Rocchio's algorithm, Neural Networks.
![Page 18: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/18.jpg)
![Page 19: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/19.jpg)
Text Classification - Naïve Bayes
Represent document x as a list of words w1, w2, …
For each class y, build a probabilistic model Pr(X|Y=y) of "documents":
- Pr(X={argentine, grain, ...} | Y=wheat) = ....
- Pr(X={stocks, rose, in, heavy, ...} | Y=nonWheat) = ....
To classify, find the y which was most likely to generate x, i.e., which gives x the best score according to Pr(x|y):
f(x) = argmax_y Pr(x|y) Pr(y)
![Page 20: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/20.jpg)
Text Classification - Naïve Bayes
How to estimate Pr(X|Y)? Simplest useful process to generate a bag of words:
- pick word 1 according to Pr(W|Y)
- repeat for word 2, 3, ....
- each word is generated independently of the others (which is clearly not true), but this means

Pr(w1, ..., wn | Y=y) = ∏_{i=1}^{n} Pr(wi | Y=y)

How to estimate Pr(W|Y)?
![Page 21: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/21.jpg)
Text Classification - Naïve Bayes
How to estimate Pr(X|Y)?

Pr(w1, ..., wn | Y=y) = ∏_{i=1}^{n} Pr(wi | Y=y)

Estimate Pr(w|y) by looking at the data:

Pr(W=w | Y=y) = (count(W=w and Y=y) + mp) / (count(Y=y) + m)

Terms:
- This Pr(W|Y) is a multinomial distribution.
- This use of m and p is a Dirichlet prior for the multinomial.
![Page 22: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/22.jpg)
Text Classification - Naïve Bayes
How to estimate Pr(X|Y)?

Pr(w1, ..., wn | Y=y) = ∏_{i=1}^{n} Pr(wi | Y=y)

Pr(W=w | Y=y) = (count(W=w and Y=y) + 0.5) / (count(Y=y) + 1)

This Pr(W|Y) is a multinomial distribution. This use of m and p is a Dirichlet prior for the multinomial; for instance, m=1, p=0.5.
![Page 23: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/23.jpg)
Text Classification - Naïve Bayes
Putting this together:

for each document xi with label yi:
    for each word wij in xi:
        count[wij][yi]++
        count[yi]++
        count++

To classify a new x = w1...wn, pick the y with the top score:

score(w1...wn, y) = lg(count[y] / count) + ∑_{i=1}^{n} lg((count[wi][y] + 0.5) / (count[y] + 1))

Key point: we only need counts for words that actually appear in document x.
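The counting-and-scoring procedure above can be sketched directly in Python; the two-document toy corpus is an illustrative stand-in, and the 0.5 and 1 in the score are the slides' m=1, p=0.5 smoothing:

```python
import math
from collections import defaultdict

# count[w][y] = occurrences of word w in class y; count[y] = word tokens in
# class y; count = word tokens overall, as in the slide's pseudocode.
word_count = defaultdict(int)
class_count = defaultdict(int)
total = 0

def train(docs):
    """docs: list of (list_of_words, label) pairs."""
    global total
    for words, y in docs:
        for w in words:
            word_count[w, y] += 1
            class_count[y] += 1
            total += 1

def score(words, y):
    """lg(count[y]/count) + sum_i lg((count[wi][y]+0.5)/(count[y]+1))."""
    s = math.log2(class_count[y] / total)
    for w in words:
        s += math.log2((word_count[w, y] + 0.5) / (class_count[y] + 1))
    return s

def classify(words):
    return max(class_count, key=lambda y: score(words, y))

train([("wheat grain harvest".split(), "grain"),
       ("stocks rose in heavy trading".split(), "other")])
print(classify(["grain"]))  # grain
```

Note that a word never seen with a class still gets the nonzero smoothed probability 0.5/(count[y]+1), so a single unseen word cannot zero out a class.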
![Page 24: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/24.jpg)
Naïve Bayes for SPAM filtering
![Page 25: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/25.jpg)
Naïve Bayes - Summary
Pros:
- Very fast and easy to implement
- Well-understood formally and experimentally
Cons:
- Seldom gives the very best performance
- "Probabilities" Pr(y|x) are not accurate; e.g., Pr(y|x) decreases with the length of x, and probabilities tend to be close to zero or one
![Page 26: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/26.jpg)
![Page 27: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/27.jpg)
The Curse of Dimensionality
How can you learn with so many features?
- For efficiency (time and memory), use sparse vectors.
- Use simple classifiers (linear or loglinear).
- Rely on wide margins.
![Page 28: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/28.jpg)
Margin-based Learning

[Figure: two classes of points (+ and −) separated by a line with a wide margin.]

The number of features matters: but not if the margin is sufficiently wide and examples are sufficiently close to the origin (!!)
![Page 29: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/29.jpg)
Text Classification - Voted Perceptron
- Text documents: x = <x1, x2, ..., xk>; k = number of features
- Two classes = {yes, no}
- w = <w1, w2, ..., wk>
Objective: learn a weight vector w and a threshold θ. Predict "yes" if

∑_{i=1}^{k} w_i x_i > θ

and "no" otherwise.
If the answer is correct: c_k++ (the current weight vector collects one more vote).
If incorrect, there is a mistake, and a correction is made:
w_{k+1} = w_k + y_i x_i;  c_{k+1} = 1;  k = k + 1
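A minimal sketch of the underlying perceptron rule (the plain, unvoted variant; the ±1 labels, toy data, and epoch count are illustrative — the voted variant would additionally keep the vote counts c_k and combine all intermediate weight vectors at prediction time):

```python
# Perceptron sketch: predict +1 ("yes") when the weighted sum exceeds the
# threshold, and move the weights toward y*x on each mistake.
def predict(w, x, theta=0.0):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else -1

def train_perceptron(examples, n_features, epochs=10):
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in examples:
            if predict(w, x) != y:       # mistake: correct the weights
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Toy linearly separable data: feature vectors for two "documents".
data = [([2.0, 0.0], 1), ([0.0, 2.0], -1)]
w = train_perceptron(data, 2)
print(predict(w, [3.0, 0.0]))  # 1
```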
![Page 30: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/30.jpg)
Voted Perceptron-Error Correction
![Page 31: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/31.jpg)
![Page 32: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/32.jpg)
Lessons of the Voted Perceptron

The Voted Perceptron shows that you can make few mistakes while incrementally learning as you pass over the data, if the examples x are small (bounded by R) and some u exists that has a large margin.

Why not look for this line directly?

Support vector machines: find u to maximize the margin.
![Page 33: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/33.jpg)
Text Classification - Support Vector Machines
Facts about support vector machines:
- The "support vectors" are the xi's that touch the margin.
- The classifier can be written as f(x) = sign(∑_i α_i y_i (x_i · x)), where the xi's are the support vectors.
- Support vector machines often give very good results on topical text classification.

[Figure: + and − examples with the maximum-margin separator; every example satisfies the margin constraint y_i (u · x_i) ≥ γ.]
![Page 34: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/34.jpg)
Support Vector Machine Results
![Page 35: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/35.jpg)
![Page 36: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/36.jpg)
Text Classification - Decision Trees
Objective of decision tree learning:
- Learn a decision tree from a set of training data.
- The decision tree can then be used to classify new examples.
Decision tree learning algorithms:
- ID3 (Quinlan, 1986)
- C4.5 (Quinlan, 1993)
- CART (Breiman, Friedman, et al., 1983)
- etc.
![Page 37: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/37.jpg)
Decision Trees - CART
Creating the tree:
- Split the set of training vectors. The best splitter has the purest children; the diversity measure is entropy:

Entropy(S) = −∑_{i=1}^{k} P(Ci) log P(Ci)

where S is the set of examples, k is the number of classes, and P(Ci) is the proportion of examples in S that belong to Ci.
- To find the best splitter, each component of the document vector is considered in turn.
- This process is repeated until no sets can be partitioned any further.
- Each leaf is then assigned a class.
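The entropy measure above is easy to compute from the class counts at a node; a pure node scores 0 and an even two-class split scores 1:

```python
import math

# Entropy of a node's class distribution: -sum_i P(Ci) * log2 P(Ci).
# Lower entropy means purer children, which is what the best splitter gives.
def entropy(class_counts):
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(entropy([5, 5]))  # 1.0 (maximally impure two-class node)
```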
![Page 38: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/38.jpg)
Decision Trees-Example
![Page 39: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/39.jpg)
![Page 40: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/40.jpg)
TF-IDF Representation
The results above use a particular way to represent documents: bag of words with TF-IDF weighting.
- "Bag of words": a long sparse vector x = (…, fi, …) where fi is the "weight" of the i-th word in the vocabulary.
- For a word w that appears in DF(w) documents out of N in a collection, and appears TF(w) times in the document being represented, use the weight:

f_i = log(TF(w) + 1) × log(N / DF(w))

- Also normalize all vector lengths (||x||) to 1.
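The weighting and normalization steps can be sketched as follows (the word counts in the example are made up):

```python
import math

# TF-IDF weighting as on the slide: f_i = log(TF(w)+1) * log(N / DF(w)),
# followed by normalizing the document vector to unit length.
def tfidf_vector(term_freqs, doc_freqs, n_docs):
    """term_freqs: {word: TF in this doc}; doc_freqs: {word: DF in collection}."""
    weights = {w: math.log(tf + 1) * math.log(n_docs / doc_freqs[w])
               for w, tf in term_freqs.items()}
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {w: v / norm for w, v in weights.items()}

vec = tfidf_vector({"grain": 3, "the": 10}, {"grain": 5, "the": 100}, 100)
print(vec["the"])  # 0.0: a word that occurs in every document gets zero weight
```

The IDF factor is what pushes uninformative high-DF words toward zero, doing some of the work of stop-word removal automatically.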
![Page 41: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/41.jpg)
TF-IDF Representation
The TF-IDF representation is an old trick from the information retrieval community, and often improves the performance of other algorithms, such as:
- the K-nearest neighbor algorithm
- Rocchio's algorithm
![Page 42: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/42.jpg)
Text Classification - K-nearest Neighbor
The nearest neighbor algorithm.
Objective: to classify a new object x̃, find the object in the training set that is most similar, then assign the category of this nearest neighbor.
- Determine the largest similarity with any element in the training set:
  sim_max = max_{x ∈ X} sim(x, x̃)
- Collect the subset of X that has the highest similarity with x̃:
  N = { x ∈ X | sim(x, x̃) = sim_max }
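A 1-nearest-neighbor sketch over bag-of-words vectors, using cosine similarity as sim (the two training documents are illustrative):

```python
import math

# Nearest-neighbor sketch: cosine similarity between sparse word-count
# vectors, then assign the label of the most similar training document.
def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_neighbor(train, x):
    """train: list of (vector, label); return the label of the closest vector."""
    return max(train, key=lambda vl: cosine(vl[0], x))[1]

train = [({"grain": 2, "wheat": 1}, "grain"),
         ({"stocks": 3, "rose": 1}, "finance")]
print(nearest_neighbor(train, {"wheat": 1, "grain": 1}))  # grain
```

The k > 1 variant would take the majority label among the k most similar training documents instead of just the single best one.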
![Page 43: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/43.jpg)
![Page 44: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/44.jpg)
Text Classification - Rocchio's Algorithm
Classify using the distance to the centroid of the documents from each class; assign a test document to the class with maximum similarity. The prototype vector for a class is

w = α ∑_{z ∈ (+)} z − β ∑_{z ∈ (−)} z

where (+) and (−) are the positive and negative training documents for the class.
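A simplified sketch of the centroid classifier (positive examples only, i.e. β = 0; the vocabulary indexing and toy vectors are illustrative):

```python
# Rocchio sketch: one centroid per class (the mean of its training vectors);
# classify by maximum dot-product similarity to a centroid.
def centroid(vectors, vocab_size):
    c = [0.0] * vocab_size
    for v in vectors:
        for i, x in enumerate(v):
            c[i] += x / len(vectors)
    return c

def rocchio_classify(centroids, x):
    """centroids: {label: centroid vector}; return the most similar class."""
    sim = lambda c: sum(ci * xi for ci, xi in zip(c, x))
    return max(centroids, key=lambda y: sim(centroids[y]))

centroids = {
    "grain": centroid([[2, 0, 1], [1, 0, 0]], 3),
    "finance": centroid([[0, 3, 0], [0, 1, 1]], 3),
}
print(rocchio_classify(centroids, [1, 0, 1]))  # grain
```

With TF-IDF weights and unit-length vectors, the dot product here is exactly cosine similarity.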
![Page 45: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/45.jpg)
Support Vector Machine Results
![Page 46: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/46.jpg)
![Page 47: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/47.jpg)
Text Classification - Neural Networks
A neural network text classifier is a collection of interconnected neurons, representing documents, that incrementally learns from its environment to categorize the documents.

[Diagram: a single neuron with inputs X1, X2, ..., Xn and weights W1, W2, ..., Wn computes u = ∑ Xi Wi and outputs F(u), where F(u) = 1 / (1 + e^(−u)); the outputs y1, y2, y3 are trained against the training set.]
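The single neuron in the diagram is just a weighted sum passed through the logistic activation:

```python
import math

# One neuron: u = sum_i x_i * w_i, then the logistic activation
# F(u) = 1 / (1 + e^(-u)), which squashes u into (0, 1).
def neuron(xs, ws):
    u = sum(x * w for x, w in zip(xs, ws))
    return 1.0 / (1.0 + math.exp(-u))

print(neuron([1.0, 2.0], [0.0, 0.0]))  # 0.5: zero weights give no preference
```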
![Page 48: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/48.jpg)
Text Classification - Neural Networks

[Diagram: a multilayer network with input units IM1, IM2, ..., IMn, hidden units H1, H2, ..., Hm (weights u_ji into the hidden layer, v_j out of it), and output units OS1(r), OS2(r), OS3(r); training alternates forward propagation with backward propagation of errors.]
![Page 49: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/49.jpg)
![Page 50: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/50.jpg)
Performance Evaluation
Performance of a classification algorithm can be evaluated in the following aspects:
- Predictive performance: how accurate is the learned model in prediction?
- Complexity of the learned model
- Run time: time to build the model, and time to classify examples using the model
Here we focus on predictive performance.
![Page 51: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/51.jpg)
Predictive Performance Evaluation
How predictive is the model? The most common performance measures:
- Classification accuracy on a dataset
- Recall, precision, F-measure, fallout, accuracy, error, ...
- Classification error rate (1 − accuracy)
Error on the training data is not a good indicator of performance on future data.
![Page 52: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/52.jpg)
Performance Measures

|          | Correct | Incorrect |
|----------|---------|-----------|
| Assigned | a       | b         |
| Rejected | c       | d         |

Accuracy = (a + d) / (a + b + c + d)
Error = (b + c) / (a + b + c + d)
Precision = a / (a + b)
Recall = a / (a + c)
F1 = 2 × Precision × Recall / (Precision + Recall)
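Computing the measures from the contingency counts is direct (the counts in the example call are made up):

```python
# Performance measures from the contingency table: a = assigned & correct,
# b = assigned & incorrect, c = rejected & correct, d = rejected & incorrect.
def measures(a, b, c, d):
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (a + d) / (a + b + c + d)
    return precision, recall, f1, accuracy

p, r, f1, acc = measures(a=90, b=10, c=10, d=890)
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 2))  # 0.9 0.9 0.9 0.98
```

Note how accuracy (0.98) looks far better than precision and recall when the negative class dominates, which is why precision/recall/F1 are preferred for skewed text categories.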
![Page 53: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/53.jpg)
Performance Measures (Cont.)
Macro-averaging:
- Gives equal weight to each category
- Computes the performance measures per category
- Computes the global means by averaging the above results
Micro-averaging:
- Gives equal weight to each document
- Constructs a global contingency table, then computes the performance measure
![Page 54: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/54.jpg)
Macro/Micro-averaging Example

Class 1:
|           | Yes | No  |
|-----------|-----|-----|
| Call: yes | 10  | 10  |
| Call: no  | 10  | 970 |

Class 2:
|           | Yes | No  |
|-----------|-----|-----|
| Call: yes | 90  | 10  |
| Call: no  | 10  | 890 |

Pooled table:
|           | Yes | No   |
|-----------|-----|------|
| Call: yes | 100 | 20   |
| Call: no  | 20  | 1860 |

Macro-averaged precision: [10/(10+10) + 90/(90+10)] / 2 = (0.5 + 0.9)/2 = 0.7
Micro-averaged precision: 100/(100+20) = 0.83
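The same example, computed both ways:

```python
# Macro- vs micro-averaged precision for the two per-class tables above:
# (true positives, false positives) for class 1 and class 2.
tables = [(10, 10),   # class 1: precision 10/20 = 0.5
          (90, 10)]   # class 2: precision 90/100 = 0.9

# Macro: average the per-class precisions (equal weight per category).
macro = sum(tp / (tp + fp) for tp, fp in tables) / len(tables)
# Micro: pool the counts first (equal weight per document).
micro = sum(tp for tp, _ in tables) / sum(tp + fp for tp, fp in tables)

print(round(macro, 2), round(micro, 2))  # 0.7 0.83
```

The gap between the two (0.7 vs 0.83) shows how micro-averaging lets the large, easy class 2 dominate, while macro-averaging exposes the weak performance on class 1.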
![Page 55: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/55.jpg)
Estimating Error Rate
Hold-out estimation:
- Randomly split the data set into a training set and a test set, e.g., training set (2/3), test set (1/3).
- Build the model from the training set and estimate the error on the test set.
Repeated hold-out estimation:
- The hold-out estimate can be made more reliable by repeating the process with different random splits.
- For each split, an error rate is collected; an overall error rate is obtained by averaging the error rates over the different splits.
![Page 56: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/56.jpg)
Estimating Error Rate (Cont.)
Cross-validation:
- Randomly divide the data set into k subsets of equal size.
- Use k−1 subsets as training data and one subset as test data; do this k times, using each subset in turn for testing.
- The error rates are averaged to yield an overall error estimate.
This is called k-fold cross-validation. The standard method for evaluation is 10-fold cross-validation. Why 10? Extensive experiments have shown that this is the best choice for getting an accurate estimate.
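The fold construction and averaging can be sketched as follows; the `error_fn` callback stands in for whatever train-and-evaluate step the learner provides:

```python
import random

# k-fold cross-validation sketch: split the example indices into k folds,
# train on k-1 folds, test on the held-out fold, and average the k error rates.
def kfold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]   # every index lands in exactly one fold

def cross_validate(error_fn, n, k=10):
    """error_fn(train_idx, test_idx) -> error rate on the held-out fold."""
    folds = kfold_indices(n, k)
    errors = []
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        errors.append(error_fn(train, test))
    return sum(errors) / k

folds = kfold_indices(20, 10)
print(sorted(sum(folds, [])) == list(range(20)))  # True: a proper partition
```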
![Page 57: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/57.jpg)
Overfitting
Overfitting occurs when a model begins to memorize the training data rather than learning to generalize from it. The model can learn to perfectly predict the training data, but it fails to predict unseen data.
Why?
- Insufficient examples
- An overly complex model
- Too many features
- Noisy data
![Page 58: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/58.jpg)
Underfitting
Underfitting occurs when a model is too simple to capture the relationships in the input data. It fails to work well on both the training data and unseen data.
![Page 59: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/59.jpg)
![Page 60: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/60.jpg)
Text Classification: Examples
- Classify news stories as World, US, Business, SciTech, Sports, Entertainment, Health, Other: topical classification, few classes.
- Classify email to tech staff as Mac, Windows, ..., Other: topical classification, few classes.
- Classify email as Spam, Other: topical classification, few classes; an adversary may try to defeat your categorization scheme.
- Add MeSH terms to Medline abstracts, e.g. "Conscious Sedation" [E03.250]: topical classification, many classes.
- Classify web sites of companies by Standard Industrial Classification (SIC) code: topical classification, many classes.
- Classify business names by industry.
- Classify student essays as A, B, C, D, or F.
- Classify pdf files as ResearchPaper, Other.
- Classify documents as WrittenByReagan, GhostWritten.
- Classify movie reviews as Favorable, Unfavorable, Neutral.
- Classify technical papers as Interesting, Uninteresting.
- Classify jokes as Funny, NotFunny.
![Page 61: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/61.jpg)
Classifying Reviews as Favorable or Not

Dataset: 410 reviews from Epinions (Autos, Banks, Movies, Travel Destinations).
Learning method:
- Extract two-word phrases containing an adverb or adjective (e.g. "unpredictable plot").
- Classify reviews based on the average Semantic Orientation (SO) of the phrases found, computed using queries to a web search engine.
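The slides do not spell out the SO computation; the following is a hedged sketch of the Turney-style PMI formulation it refers to, where the hit counts would come from search-engine queries (all numbers here are made up):

```python
import math

# Semantic Orientation sketch: SO(phrase) = PMI(phrase, "excellent") -
# PMI(phrase, "poor"), which reduces to a log-ratio of co-occurrence counts.
# In the original work the hits(...) values are search-engine result counts.
def so(hits_near_excellent, hits_near_poor, hits_excellent, hits_poor):
    return math.log2((hits_near_excellent * hits_poor) /
                     (hits_near_poor * hits_excellent))

# A review is called Favorable when the average SO of its phrases is positive.
phrase_sos = [so(200, 10, 1000, 1000),   # a phrase seen near "excellent"
              so(5, 50, 1000, 1000)]     # a phrase seen near "poor"
avg = sum(phrase_sos) / len(phrase_sos)
print("Favorable" if avg > 0 else "Unfavorable")  # Favorable
```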
![Page 62: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/62.jpg)
Classifying Reviews as Favorable or Not
![Page 63: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/63.jpg)
Classifying Movie Reviews
Idea: like Turney, focus on “polar” sections: subjective sentences
![Page 64: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/64.jpg)
"Fearless" allegedly marks Li's last turn as a martial arts movie star--at 42, the ex-wushu champion-turned-actor is seeking a less strenuous on-camera life--and it's based on the life story of one of China's historical sports heroes, Huo Yuanjia. Huo, a genuine legend, lived from 1868-1910, and his exploits as a master of wushu (the general Chinese term for martial arts) raised national morale during the period when beleaguered China was derided as "The Sick Man of the East."
"Fearless" shows Huo's life story in highly fictionalized terms, though the movie's most dramatic sequence--at the final Shanghai tournament, where Huo takes on four international champs, one by one--is based on fact. It's a real old-fashioned movie epic, done in director Ronny Yu's ("The Bride with White Hair") usual flashy, Hong Kong-and-Hollywood style, laced with spectacular no-wires fights choreographed by that Bob Fosse of kung fu moves, Yuen Wo Ping ("Crouching Tiger" and "The Matrix"). Dramatically, it's on a simplistic level. But you can forgive any historical transgressions as long as the movie keeps roaring right along.
![Page 65: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/65.jpg)
![Page 66: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/66.jpg)
Classifying Movie Reviews [Pang et al., ACL 2004]

Dataset: Rotten Tomatoes snippets (+), IMDB plot summaries (−). Apply machine learning to build a sentence-level subjectivity classifier, and try to force nearby sentences to have similar subjectivity labels by finding a minimum cut on a constructed graph.
![Page 67: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/67.jpg)
Classifying Movie Reviews
- One vertex for each sentence, plus source and sink vertices for "subjective" and "non-subjective".
- Edges to the source and sink encode confidence in the individual sentence classifications.
- Edges between sentence vertices indicate proximity.
![Page 68: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/68.jpg)
Classifying Movie Reviews [Pang et al., ACL 2004]

[Figure: a minimum cut picks class − vs + for v2 and v3, and class + vs − for v1; the cut retains the constraint f(v2)=f(v3), but not f(v2)=f(v1).]
![Page 69: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/69.jpg)
Summary & Conclusions
- There are many, many applications of text classification.
- Topical classification is fairly well understood: most of the information is in individual words, and very fast, simple methods work well.
- In many applications, classes are not topics: sentiment detection, subjectivity/opinion detection.
- In many applications, distinct classification decisions are interdependent, e.g. the subjectivity of nearby sentences in reviews.
- There is lots of prior work to build on, and lots of prior experimentation to consider.
![Page 70: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/70.jpg)
References
- K. Aas, L. Eikvil, Text Categorization: A Survey, June 1999.
- W. W. Cohen, Text Classification: An Advanced Tutorial, Machine Learning Department, Carnegie Mellon, 2010.
- C. D. Manning, H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, 1999.
- A. C. Cachopo, A. L. Oliveira, An Empirical Comparison of Text Categorization Methods, String Processing and Information Retrieval, 10th International Symposium, Springer Verlag, 2003.
- C. Donghui, L. Zhijing, A New Text Categorization Method based on HMM and SVM, International Symposium on Communications and Information Technologies, 2007.
- W. Xue, X. Xu, Three New Feature Weighting Methods for Text Categorization, Springer-Verlag Berlin Heidelberg, 2010.
![Page 71: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/71.jpg)
References (Cont.)
- F. Xia, T. Jicun, L. Zhihui, A Text Categorization Method Based on Local Document Frequency, IEEE Computer Society, 2009.
- F. Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, 2002.
- K. Nigam, A. K. McCallum, S. Thrun, T. Mitchell, Text Classification from Labeled and Unlabeled Documents using EM, special issue on information retrieval, ACM, 2000.
- M. E. Ruiz, P. Srinivasan, Hierarchical Text Categorization using Neural Networks, Kluwer Academic Publishers, Springer, 2002.
![Page 72: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/72.jpg)
References (Cont.)
- J. Meng, H. Lin, A Two-stage Feature Selection Method for Text Categorization, Seventh International Conference on Fuzzy Systems and Knowledge Discovery, IEEE, 2010.
![Page 73: Text Classification](https://reader035.vdocuments.us/reader035/viewer/2022062400/568155b7550346895dc38d88/html5/thumbnails/73.jpg)
Thanks!