
Page 1: Ai group-seminar-2013 nbc

Automated Classification of Short Message Service (SMS) Using Naïve Bayes Algorithm

ALOYSIUS OCHOLA

[email protected]

MAKERERE UNIVERSITY, ARTIFICIAL INTELLIGENCE GROUP

Artificial Intelligence Seminar . May 30 . 2013

Page 2: Ai group-seminar-2013 nbc


Classification

• A supervised learning technique that involves assigning a label to each member of a set of unlabeled input objects.

• Based on the number of classes present, there are two types of classification:

– Binary classification: classifies the members of a given set of objects into one of two classes.

– Multi-class classification: classifies instances into more than two classes.

• Compared with the better-understood binary case, multi-class classification is more complex and less researched.

Page 3: Ai group-seminar-2013 nbc


Text Classification/Categorization

• Text documents are one of the several areas where classification can be applied.

• TC (text classification/categorization) is the application of classification algorithms to text documents in order to automatically group them into predefined categories.

• How to represent text documents

– Preprocessing and feature selection

• How to build the classifier: compute a classification function.

– Training the classifier and classifying

Page 4: Ai group-seminar-2013 nbc


Short Text Documents

• Normal documents like emails, journals, etc. are typically large and rich in content (natural language).

– This makes it easy to apply traditional classification approaches, which rely on word frequencies.

• Not so for short text documents like SMS and Twitter messages, forum posts, etc., where word occurrences are too sparse.

– Dealing with short text therefore requires a little more than the traditional techniques.

• Especially during preprocessing and feature selection.

Page 5: Ai group-seminar-2013 nbc


Applications of TC

• Spam filtering: a process which tries to discern e-mail spam messages from legitimate emails.

• Email routing: sending an email sent to a general address on to a specific address or mailbox depending on its topic.

• Language identification: automatically determining the language of a text.

• Genre classification: automatically determining the genre of a text.

• Movie reviewing: automatically classifying reviews as good, bad, or neutral.

• Etc . . .

Page 6: Ai group-seminar-2013 nbc


Data Preprocessing

• Data captured in the real world is noisy, inconsistent, and of poor quality, so some cleaning and transformation are required.

• To get quality results from short text, most of the major text-preprocessing steps are skipped and some selected ones are modified.

• Tokenization and lowercasing: splitting text streams into tokens and forcing lowercase.

– Word boundary detection, using whitespace and punctuation.

– Note: the prepared corpus was lowercased.
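A minimal sketch of this step in Python with NLTK (assuming the punkt tokenizer data is installed; the helper name tokenize_sms is ours):

    from nltk.tokenize import word_tokenize

    def tokenize_sms(text):
        # Split the text stream into tokens (word boundaries detected via
        # whitespace and punctuation), then force lowercasing.
        return [token.lower() for token in word_tokenize(text)]

    print(tokenize_sms("Pls send me d report B4 5pm!"))
    # ['pls', 'send', 'me', 'd', 'report', 'b4', '5pm', '!']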

• Minor spell-correction: although there is a growing culture of using informal shorthand in SMS texts, some spell corrections can still be done.

Page 7: Ai group-seminar-2013 nbc


Data Preprocessing (cont)

– Regular expression replacer: replacing apostrophe (contracted) words with their expansions, using matching regular expressions.

• A list of (apostrophe-word RE, correction) pairs, e.g. willn’t : will not, didn’t : did not, . . .

– Repeat replacer: people are often not strictly grammatical. They may write "I looooove it" to emphasize the word "love".

• Before replacing any characters from the supplied word:

– The module reduces any run of more than two repeating characters to just two, as no such words exist in the English vocabulary, for example "goooooooose" to "goose".

» RE: (\w*)(\w)\2(\w*)

– It then looks up whether WordNet (a lexical database for the English natural language) recognizes the supplied word.

Page 8: Ai group-seminar-2013 nbc


Data Preprocessing (cont)

• If not, the regular expression (RE) (\w*)(\w)\2(\w*) is used to remove extra repeated characters from the word:

– (\w*) matches 0 or more starting characters

– (\w) matches a single character, followed by another instance of that same character, \2

– the final (\w*) matches 0 or more ending characters
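A minimal sketch of the repeat replacer (assuming NLTK with the WordNet corpus downloaded; the function name is ours):

    import re
    from nltk.corpus import wordnet

    repeat_re = re.compile(r'(\w*)(\w)\2(\w*)')

    def replace_repeats(word):
        # If WordNet already recognizes the word, keep it as-is
        # (protects legitimate doubles such as "goose").
        if wordnet.synsets(word):
            return word
        # Otherwise drop one repeated character and try again.
        reduced = repeat_re.sub(r'\1\2\3', word)
        return replace_repeats(reduced) if reduced != word else reduced

    print(replace_repeats('looooove'))  # 'love'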

• Stop-words filtering: the process of removing the most frequent words that exist in a document.

– Looking up words in a file containing stop words and returning only the words not in that file/dictionary.
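A minimal sketch of stop-word filtering (here using NLTK's bundled English stop-word list in place of a custom file):

    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))

    def remove_stopwords(tokens):
        # Return only the tokens that are not in the stop-word list.
        return [t for t in tokens if t not in stop_words]

    print(remove_stopwords(['please', 'send', 'me', 'the', 'report']))
    # ['please', 'send', 'report']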

Page 9: Ai group-seminar-2013 nbc


A Classifier

• A classifier is built on a function f which determines the category of an input feature vector x, given a fixed set of classes C = {c1, c2, …, cn} and a description of features x ∈ X,

– where X is the feature space and C the set of output class labels.

• In simple terms: f(x) ∈ C,


– where f is the classification function whose domain is X and whose range is C. The class labels in C can be ordered or unordered (categorical).

• A classifier is expected to learn from a set of N input-output pairs, or simply a training data set, and predict the class of unseen inputs. That is to say, it learns the mapping f : X → C.

Page 10: Ai group-seminar-2013 nbc


Building the Text Classifier

• For this particular case, we will deal with a probabilistic text classifier ft based on the Naïve Bayes classification (NBC) theorem.

• Building the classifier will therefore involve a recursive process of creating a functional classifier by training it with an example data set (NB learning) and running the trained classifier on unknown content to determine the class membership of that content (Bayesian classification).

• A probabilistic classifier, to predict the class membership of a certain new document X, calculates the probability of a class C given that document, that is: P(C|X).

Page 11: Ai group-seminar-2013 nbc


Naïve Bayes Algorithm

• It is a simple probabilistic learning and classification method built upon Bayes’ probability theory.

• It assumes that the presence (or absence) of a particular feature of a class is not related to the presence (or absence) of any other feature (the naïve assumption).

• Uses the prior probability P(C) of each category given no information about an item.

• Categorization produces a posterior probability distribution P(C|X) over the possible categories given a description of an item.

Page 12: Ai group-seminar-2013 nbc


Naïve Bayes (NB) Probability Theorem

• Derived from the definition of conditional probability

– probability that an event will occur, when another event is known to occur or to have occurred.

• The conditional probability of C given X is given as:

P(C|X) = P(C ∩ X) / P(X),   P(X) ≠ 0

• From the product rule, given events C and X:

P(C ∩ X) = P(C|X) · P(X) = P(X|C) · P(C)

• Bayes Rule:

P(C|X) = P(X|C) · P(C) / P(X),   P(X) ≠ 0        Equation (1)

– P(C): Prior probability, the initial probability that C holds before seeing any evidence.

– P(X): Probability that X is observed.

– P(X|C): Likelihood, the probability of observing X given that C holds.

– P(C|X): Posterior probability, the probability that C holds given that X is observed.
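As a quick worked example of Equation (1) (with made-up numbers): suppose 20% of incoming SMS are spam, so P(C = spam) = 0.2; the word "win" appears in 30% of spam messages, so P(X|C = spam) = 0.3; and "win" appears in 8% of all messages, so P(X) = 0.08. Then P(C = spam | X) = (0.3 × 0.2) / 0.08 = 0.75.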

Page 13: Ai group-seminar-2013 nbc


Deriving NB Classification Algorithm

• Given a set of feature vectors for each possible class in C, the task of the NBC (NB classification) algorithm is to approximate the probability that new input features X are present in C, that is, the class posterior, or simply the greatest P(c|X), c ∈ C.


• Assume a boolean random variable C and a vector space X containing n boolean attributes:

– If ci is the i-th possible value of C and xk denotes the k-th attribute of X,

– applying the NB probability theorem (Equation (1)):

P(C=ci | X=xk) = P(X=xk | C=ci) · P(C=ci) / Σj P(X=xk | C=cj) · P(C=cj)        Equation (2)

Page 14: Ai group-seminar-2013 nbc


Deriving NBC Algorithm

• NB conditional independence assumption: features (term presence) are independent of each other given the class. A new document of n features can therefore be classified into one of the classes in C using Equation (2), with:

P(X|C) = ∏ (k = 1..n) P(xk | C)

• The aim of the classifier is to return the maximum posterior probability of c, thus:

c = argmax (over ci) [ P(C=ci) · ∏k P(xk | C=ci) / Σj P(C=cj) · ∏k P(xk | C=cj) ]

• Further, because the sample space (the denominator) is always constant for all the classes and does not depend on any class ci of C, the NBC theorem is given as:

c = argmax (over ci) P(C=ci) · ∏k P(xk | C=ci)        Equation (3)

Page 15: Ai group-seminar-2013 nbc


Training Naïve Bayes Text Classifier

• During the training process, the classification function ft extracts and selects the most useful features from the example corpus and labels them with their appropriate class.

– Construct and store a mapping of feature-set:label pair sets (the training data set), which ft will learn from.

• feature-set is a list of preprocessed and unique term occurrences from the document samples.

• label is the known class of that feature-set.
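A minimal sketch of this training step with NLTK (the toy feature-set:label pairs below are invented for illustration):

    import nltk

    # Hypothetical training data set: each feature-set marks the unique,
    # preprocessed terms present in one SMS, paired with its known label.
    train_set = [
        ({'win': True, 'free': True, 'prize': True}, 'spam'),
        ({'free': True, 'airtime': True, 'call': True}, 'spam'),
        ({'meeting': True, 'tomorrow': True, 'campus': True}, 'ham'),
        ({'send': True, 'report': True, 'tomorrow': True}, 'ham'),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify({'free': True, 'prize': True}))  # 'spam'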

Page 16: Ai group-seminar-2013 nbc


Feature Representation

• Features describe and represent texts in a format suitable for further machine processing.

• Final performance depends on how descriptive the features used for text description are.

• Supervised learning classifiers can use any sort of feature: URLs, email addresses, punctuation, capitalization, dictionaries, network features.

• Word-based features (Bag of Words): a feature extraction process that transforms the plain documents, which are merely strings of text, into a feature set containing the (frequency of) occurrence of each word, usable by a classifier.
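A minimal sketch of such a word-based feature extractor, reusing the tokenize_sms and remove_stopwords helpers sketched earlier (this is the presence-of-word variant, matching the unique term occurrences described above):

    def bag_of_words(text):
        # Transform a plain text string into a {word: True} feature set,
        # keeping only alphanumeric, non-stop-word tokens.
        tokens = remove_stopwords(tokenize_sms(text))
        return {t: True for t in tokens if t.isalnum()}

    print(bag_of_words("You have WON a free prize, call now!"))
    # {'won': True, 'free': True, 'prize': True, 'call': True}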

Page 17: Ai group-seminar-2013 nbc


Feature Selection

• Text collections have a large number of features, yet some classifiers cannot deal with a very large number of features. Performing feature selection therefore reduces training time and improves performance, as it eliminates noise from the features and avoids overfitting.

• Term weighting: each term in a document vector must be associated with a value (weight) which measures the importance of that term and denotes how much it contributes to the categorization task of the document.

– Based on information theory: frequency count of every word.

– Chi-squared statistical distribution: score measure of the bigrams of each word per label.
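A minimal sketch of chi-squared word scoring with NLTK (one common way to set up the per-label counts, not necessarily the project's exact code; the toy labeled_messages list is invented):

    from nltk.metrics import BigramAssocMeasures
    from nltk.probability import ConditionalFreqDist, FreqDist

    labeled_messages = [
        (['win', 'free', 'prize'], 'spam'),
        (['free', 'airtime', 'call'], 'spam'),
        (['meeting', 'tomorrow', 'campus'], 'ham'),
        (['send', 'report', 'tomorrow'], 'ham'),
    ]

    word_fd = FreqDist()                    # overall word frequencies
    label_word_fd = ConditionalFreqDist()   # per-label word frequencies
    for tokens, label in labeled_messages:
        for token in tokens:
            word_fd[token] += 1
            label_word_fd[label][token] += 1

    total_count = word_fd.N()
    spam_count = label_word_fd['spam'].N()

    # Chi-squared score: how strongly each word is associated with 'spam'.
    scores = {
        word: BigramAssocMeasures.chi_sq(
            label_word_fd['spam'][word], (freq, spam_count), total_count)
        for word, freq in word_fd.items()
    }
    print(sorted(scores, key=scores.get, reverse=True)[:5])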

Page 18: Ai group-seminar-2013 nbc


Text Classification

• A one-step classifier testing process of taking the built text classifier ft and running it on unknown content to determine class membership for that content.

• A new input (test) SMS stream is passed to the classifier.

• It preprocesses the stream and compares it with the set of pre-classified examples (the training set).


Numerical underflow

• In Equation (3), many conditional probabilities are multiplied, one for each position of X.

• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

• Since log(x·c) = log(x) + log(c), it is better to perform all computations by summing natural logs of the probabilities rather than multiplying them. Therefore, during text classification, the normalized NBC equation given below is used.

c = argmax (over ci) [ log P(C=ci) + Σ (k = 1..n) log P(xk | C=ci) ]
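A quick sketch illustrating the underflow problem (numbers invented):

    import math

    probs = [1e-5] * 100              # 100 small conditional probabilities

    product = 1.0
    for p in probs:
        product *= p
    print(product)                    # 0.0 -- floating-point underflow

    log_sum = sum(math.log(p) for p in probs)
    print(log_sum)                    # about -1151.3 -- still usable for argmax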

Page 19: Ai group-seminar-2013 nbc


Implementation Pseudo Algorithm

for a given unknown input document:

• break the input stream into word tokens

• preprocess the tokens

• for a given training set:

– count the number of documents in each class

– for every training document:

• for each class:

– if a preprocessed token appears in the document: increment the count for that token

• for each class:

– for each preprocessed token: divide the token count by the total token count to get the conditional probabilities

• return the log conditional probabilities for each class

for all the individual class log conditional probabilities:

• compute a comparison of the probability values

• return the class with the greatest probability (the maximum likelihood hypothesis).
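A runnable sketch of this outline in plain Python (an illustration, not the project's exact code; add-one smoothing is added so unseen tokens do not zero out a class):

    import math
    from collections import Counter, defaultdict

    def train_nb(training_set):
        # training_set: list of (tokens, label) pairs.
        class_docs = Counter()                  # documents per class
        token_counts = defaultdict(Counter)     # per-class token counts
        vocabulary = set()
        for tokens, label in training_set:
            class_docs[label] += 1
            for token in tokens:
                token_counts[label][token] += 1
                vocabulary.add(token)
        return class_docs, token_counts, vocabulary

    def classify_nb(tokens, class_docs, token_counts, vocabulary):
        total_docs = sum(class_docs.values())
        scores = {}
        for label in class_docs:
            # log prior plus the sum of log conditional probabilities.
            score = math.log(class_docs[label] / total_docs)
            total_tokens = sum(token_counts[label].values())
            for token in tokens:
                count = token_counts[label][token] + 1   # add-one smoothing
                score += math.log(count / (total_tokens + len(vocabulary)))
            scores[label] = score
        # Return the class with the greatest log probability.
        return max(scores, key=scores.get)

    train = [(['win', 'free', 'prize'], 'spam'),
             (['meeting', 'tomorrow', 'campus'], 'ham')]
    model = train_nb(train)
    print(classify_nb(['free', 'prize'], *model))  # 'spam'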

Page 20: Ai group-seminar-2013 nbc


Evaluation and Implementation Approach

• Evaluation: test SMS text documents are used to assess classifier success at predicting the class:

Accuracy = Correct_predictions / Total_number_of_tests

• Implementation: a complete text classification application with a user-interactive interface.


– Natural Language Processing approach

• The Natural Language Toolkit (NLTK) is used with the Python programming language.

– NLTK is entirely self-contained and provides convenient functions and wrappers that can be used as building blocks for common NLP tasks.
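A minimal sketch of the evaluation step with NLTK, reusing the train_set from the training sketch above (the held-out test_set here is invented):

    import nltk

    test_set = [
        ({'win': True, 'airtime': True}, 'spam'),
        ({'meeting': True, 'report': True}, 'ham'),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    # nltk.classify.accuracy computes correct predictions / total tests.
    print(nltk.classify.accuracy(classifier, test_set))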

Page 21: Ai group-seminar-2013 nbc


BIBLIOGRAPHY

Aloysius Ochola. Automated Classification of Short Messaging Services (SMS) Messages for Optimized Handling. MSc. Computer Science Project, Makerere University Kampala (2013). [email protected]

Page 22: Ai group-seminar-2013 nbc


DEMO . . .

• Training samples were collected from manually categorized SMS messages compiled by Ureport, an SMS-based opinion forum.

• Problem: they receive up to 10,000 SMS messages in a day and are supposed to reply to every message that is relevant and worthy.

smsTextClassificationApplication