
Page 1: Ai group-seminar-2013 nbc

Automated Classification of Short Message Service (SMS) Using Naïve Bayes Algorithm

ALOYSIUS OCHOLA

[email protected]

MAKERERE UNIVERSITY, ARTIFICIAL INTELLIGENCE GROUP

Artificial Intelligence Seminar . May 30 . 2013

Page 2: Ai group-seminar-2013 nbc


Classification

• A supervised learning technique that involves assigning a label to each member of a set of unlabeled input objects.

• Based on the number of classes present, there are two types of classification:

– Binary classification: classifies the members of a given set of objects into one of two classes.

– Multi-class classification: classifies instances into more than two classes.

• Compared with the better-understood binary case, multi-class classification is more complex and less researched.

Page 3: Ai group-seminar-2013 nbc


Text Classification/Categorization

• Text documents are one of the several areas where classification can be applied.

• TC (text classification/categorization) is the application of classification algorithms to text documents in order to automatically group them into predefined categories.

• How to represent text documents

– Preprocessing and feature selection

• How to build the classifier: compute a classification function.

– Training the classifier and classifying

Page 4: Ai group-seminar-2013 nbc


Short Text Documents

• Normal documents like emails, journals, etc. are typically large and rich in content (natural language).

– This makes it easy to apply traditional classification approaches, which rely on word frequencies.

• Not so for short text documents like SMS and Twitter messages, forum posts, etc., where word occurrences are too sparse.

– Dealing with short text therefore requires a little more than the traditional techniques.

• Especially during preprocessing and feature selection.

Page 5: Ai group-seminar-2013 nbc


Applications of TC

• Spam filtering: a process which tries to discern e-mail spam messages from legitimate emails.

• Email routing: sending an email sent to a general address on to a specific address or mailbox depending on its topic.

• Language identification: automatically determining the language of a text.

• Genre classification: automatically determining the genre of a text.

• Movie reviewing: automatically classifying reviews as good, bad, or neutral.

• Etc . . .

Page 6: Ai group-seminar-2013 nbc


Data Preprocessing

• Data captured in the real world is noisy, inconsistent, and of poor quality, so some cleaning and transformation are required.

• To get quality results from short text, most of the major text-preprocessing steps are skipped and some selected ones are modified.

• Tokenization and lowercasing: splitting text streams into tokens and forcing lowercase.

– Word boundary detection, using whitespace and punctuation.

– Note: the prepared corpus was lowercased.
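A minimal sketch of this step in Python with NLTK (assuming the punkt tokenizer data is installed; the helper name tokenize_sms is ours):

    from nltk.tokenize import word_tokenize

    def tokenize_sms(text):
        # Split the text stream into tokens (word boundaries detected via
        # whitespace and punctuation), then force lowercasing.
        return [token.lower() for token in word_tokenize(text)]

    print(tokenize_sms("Pls send me d report B4 5pm!"))
    # ['pls', 'send', 'me', 'd', 'report', 'b4', '5pm', '!']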

• Minor spell-correction: although there is a growing culture of using informal shorthand in SMS texts, some spell corrections can still be done.

Page 7: Ai group-seminar-2013 nbc


Data Preprocessing (cont)

– Regular expression replacer: replacing apostrophe (contracted) words with their expansions, using matching regular expressions.

• A list of (apostrophe-word RE, correction) pairs, e.g. willn’t : will not, didn’t : did not, . . .

– Repeat replacer: people are often not strictly grammatical. They may write "I looooove it" to emphasize the word "love".

• Before replacing any characters from the supplied word:

– The module reduces any run of more than two repeating characters to just two, as no such words exist in the English vocabulary, for example "goooooooose" to "goose".

» RE: (\w*)(\w)\2(\w*)

– It then looks up whether WordNet (a lexical database for the English natural language) recognizes the supplied word.

Page 8: Ai group-seminar-2013 nbc


Data Preprocessing (cont)

• If not, the regular expression (RE) (\w*)(\w)\2(\w*) is used to remove extra repeated characters from the word:

– (\w*) matches 0 or more starting characters

– (\w) matches a single character, followed by another instance of that same character, \2

– the final (\w*) matches 0 or more ending characters
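A minimal sketch of the repeat replacer (assuming NLTK with the WordNet corpus downloaded; the function name is ours):

    import re
    from nltk.corpus import wordnet

    repeat_re = re.compile(r'(\w*)(\w)\2(\w*)')

    def replace_repeats(word):
        # If WordNet already recognizes the word, keep it as-is
        # (protects legitimate doubles such as "goose").
        if wordnet.synsets(word):
            return word
        # Otherwise drop one repeated character and try again.
        reduced = repeat_re.sub(r'\1\2\3', word)
        return replace_repeats(reduced) if reduced != word else reduced

    print(replace_repeats('looooove'))  # 'love'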

• Stop-words filtering: the process of removing the most frequent words that exist in a document.

– Looking up words in a file containing stop words and returning only the words not in that file/dictionary.
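A minimal sketch of stop-word filtering (here using NLTK's bundled English stop-word list in place of a custom file):

    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))

    def remove_stopwords(tokens):
        # Return only the tokens that are not in the stop-word list.
        return [t for t in tokens if t not in stop_words]

    print(remove_stopwords(['please', 'send', 'me', 'the', 'report']))
    # ['please', 'send', 'report']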

Page 9: Ai group-seminar-2013 nbc


A Classifier

• A classifier is built on a function f which determines the category of an input feature vector x, given a fixed set of classes C = {c1, c2, …, cn} and a description of features x ∈ X,

– where X is the feature space and C the set of output class labels.

• In simple terms: f(x) ∈ C,


– where f is the classification function whose domain is X and whose range is C. The class labels in C can be ordered or unordered (categorical).

• A classifier is expected to learn from a set of N input-output pairs, or simply a training data set, and predict the class of unseen inputs. That is to say, it learns the mapping f : X → C.

Page 10: Ai group-seminar-2013 nbc


Building the Text Classifier

• For this particular case, we will deal with a probabilistic text classifier ft based on the Naïve Bayes classification (NBC) theorem.

• Building the classifier will therefore involve a recursive process of creating a functional classifier by training it with an example data set (NB learning) and running the trained classifier on unknown content to determine the class membership of that content (Bayesian classification).

• A probabilistic classifier, to predict the class membership of a certain new document X, calculates the probability of a class C given that document, that is: P(C|X).

Page 11: Ai group-seminar-2013 nbc


Naïve Bayes Algorithm

• It is a simple probabilistic learning and classification method built upon Bayes’ probability theory.

• It assumes that the presence (or absence) of a particular feature of a class is not related to the presence (or absence) of any other feature (the naïve assumption).

• Uses the prior probability P(C) of each category given no information about an item.

• Categorization produces a posterior probability distribution P(C|X) over the possible categories given a description of an item.

Page 12: Ai group-seminar-2013 nbc


Naïve Bayes (NB) Probability Theorem

• Derived from the definition of conditional probability

– probability that an event will occur, when another event is known to occur or to have occurred.

• The conditional probability of C given X is given as:

P(C|X) = P(C ∩ X) / P(X),   P(X) ≠ 0

• From the product rule, given events C and X:

P(C ∩ X) = P(C|X) · P(X) = P(X|C) · P(C)

• Bayes Rule:

P(C|X) = P(X|C) · P(C) / P(X),   P(X) ≠ 0        Equation (1)

– P(C): Prior probability, the initial probability that C holds before seeing any evidence.

– P(X): Probability that X is observed.

– P(X|C): Likelihood, the probability of observing X given that C holds.

– P(C|X): Posterior probability, the probability that C holds given that X is observed.
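As a quick worked example of Equation (1) (with made-up numbers): suppose 20% of incoming SMS are spam, so P(C = spam) = 0.2; the word "win" appears in 30% of spam messages, so P(X|C = spam) = 0.3; and "win" appears in 8% of all messages, so P(X) = 0.08. Then P(C = spam | X) = (0.3 × 0.2) / 0.08 = 0.75.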

Page 13: Ai group-seminar-2013 nbc


Deriving NB Classification Algorithm

• Given a set of feature vectors for each possible class in C, the task of the NBC (NB classification) algorithm is to approximate the probability that new input features X are present in C, that is, the class posterior, or simply the greatest P(c|X), c ∈ C.


• Assume a boolean random variable C and a vector space X containing n boolean attributes:

– If ci is the i-th possible value of C and xk denotes the k-th attribute of X,

– applying the NB probability theorem (Equation (1)):

P(C=ci | X=xk) = P(X=xk | C=ci) · P(C=ci) / Σj P(X=xk | C=cj) · P(C=cj)        Equation (2)

Page 14: Ai group-seminar-2013 nbc


Deriving NBC Algorithm

• NB conditional independence assumption: features (term presence) are independent of each other given the class. A new document of n features can therefore be classified into one of the classes in C using Equation (2), with:

P(X|C) = ∏ (k = 1..n) P(xk | C)

• The aim of the classifier is to return the maximum posterior probability of c, thus:

c = argmax (over ci) [ P(C=ci) · ∏k P(xk | C=ci) / Σj P(C=cj) · ∏k P(xk | C=cj) ]

• Further, because the sample space (the denominator) is always constant for all the classes and does not depend on any class ci of C, the NBC theorem is given as:

c = argmax (over ci) P(C=ci) · ∏k P(xk | C=ci)        Equation (3)

Page 15: Ai group-seminar-2013 nbc


Training Naïve Bayes Text Classifier

• During the training process, the classification function ft extracts and selects the most useful features from the example corpus and labels them with their appropriate class.

– Construct and store a mapping of feature-set:label pair sets (the training data set), which ft will learn from.

• feature-set is a list of preprocessed and unique term occurrences from the document samples.

• label is the known class of that feature-set.
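A minimal sketch of this training step with NLTK (the toy feature-set:label pairs below are invented for illustration):

    import nltk

    # Hypothetical training data set: each feature-set marks the unique,
    # preprocessed terms present in one SMS, paired with its known label.
    train_set = [
        ({'win': True, 'free': True, 'prize': True}, 'spam'),
        ({'free': True, 'airtime': True, 'call': True}, 'spam'),
        ({'meeting': True, 'tomorrow': True, 'campus': True}, 'ham'),
        ({'send': True, 'report': True, 'tomorrow': True}, 'ham'),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify({'free': True, 'prize': True}))  # 'spam'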

Page 16: Ai group-seminar-2013 nbc


Feature Representation

• Features describe and represent texts in a format suitable for further machine processing.

• Final performance depends on how descriptive the features used for text description are.

• Supervised learning classifiers can use any sort of feature: URLs, email addresses, punctuation, capitalization, dictionaries, network features.

• Word-based features (Bag of Words): a feature extraction process that transforms the plain documents, which are merely strings of text, into a feature set containing the (frequency of) occurrence of each word, usable by a classifier.
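A minimal sketch of such a word-based feature extractor, reusing the tokenize_sms and remove_stopwords helpers sketched earlier (this is the presence-of-word variant, matching the unique term occurrences described above):

    def bag_of_words(text):
        # Transform a plain text string into a {word: True} feature set,
        # keeping only alphanumeric, non-stop-word tokens.
        tokens = remove_stopwords(tokenize_sms(text))
        return {t: True for t in tokens if t.isalnum()}

    print(bag_of_words("You have WON a free prize, call now!"))
    # {'won': True, 'free': True, 'prize': True, 'call': True}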

Page 17: Ai group-seminar-2013 nbc


Feature Selection

• Text collections have a large number of features, yet some classifiers cannot deal with a very large number of features. Performing feature selection therefore reduces training time and improves performance, as it eliminates noise from the features and avoids overfitting.

• Term weighting: each term in a document vector must be associated with a value (weight) which measures the importance of that term and denotes how much it contributes to the categorization task of the document.

– Based on information theory: frequency count of every word.

– Chi-squared statistical distribution: score measure of the bigrams of each word per label.
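A minimal sketch of chi-squared word scoring with NLTK (one common way to set up the per-label counts, not necessarily the project's exact code; the toy labeled_messages list is invented):

    from nltk.metrics import BigramAssocMeasures
    from nltk.probability import ConditionalFreqDist, FreqDist

    labeled_messages = [
        (['win', 'free', 'prize'], 'spam'),
        (['free', 'airtime', 'call'], 'spam'),
        (['meeting', 'tomorrow', 'campus'], 'ham'),
        (['send', 'report', 'tomorrow'], 'ham'),
    ]

    word_fd = FreqDist()                    # overall word frequencies
    label_word_fd = ConditionalFreqDist()   # per-label word frequencies
    for tokens, label in labeled_messages:
        for token in tokens:
            word_fd[token] += 1
            label_word_fd[label][token] += 1

    total_count = word_fd.N()
    spam_count = label_word_fd['spam'].N()

    # Chi-squared score: how strongly each word is associated with 'spam'.
    scores = {
        word: BigramAssocMeasures.chi_sq(
            label_word_fd['spam'][word], (freq, spam_count), total_count)
        for word, freq in word_fd.items()
    }
    print(sorted(scores, key=scores.get, reverse=True)[:5])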

Page 18: Ai group-seminar-2013 nbc


Text Classification

• A one-step classifier testing process of taking the built text classifier ft and running it on unknown content to determine class membership for that content.

• A new input (test) SMS stream is passed to the classifier.

• It preprocesses the stream and compares it with the set of pre-classified examples (the training set).


Numerical underflow

• In Equation (3), many conditional probabilities are multiplied, one for each position of X.

• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

• Since log(x·c) = log(x) + log(c), it is better to perform all computations by summing natural logs of the probabilities rather than multiplying them. Therefore, during text classification, the normalized NBC equation given below is used.

c = argmax (over ci) [ log P(C=ci) + Σ (k = 1..n) log P(xk | C=ci) ]
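A quick sketch illustrating the underflow problem (numbers invented):

    import math

    probs = [1e-5] * 100              # 100 small conditional probabilities

    product = 1.0
    for p in probs:
        product *= p
    print(product)                    # 0.0 -- floating-point underflow

    log_sum = sum(math.log(p) for p in probs)
    print(log_sum)                    # about -1151.3 -- still usable for argmax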

Page 19: Ai group-seminar-2013 nbc


Implementation Pseudo Algorithm

for a given unknown input document:

• break the input stream into word tokens

• preprocess the tokens

• for a given training set:

– count the number of documents in each class

– for every training document:

• for each class:

– if a preprocessed token appears in the document: increment the count for that token

• for each class:

– for each preprocessed token: divide the token count by the total token count to get the conditional probabilities

• return the log conditional probabilities for each class

for all the individual class log conditional probabilities:

• compute a comparison of the probability values

• return the class with the greatest probability (the maximum likelihood hypothesis).
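A runnable sketch of this outline in plain Python (an illustration, not the project's exact code; add-one smoothing is added so unseen tokens do not zero out a class):

    import math
    from collections import Counter, defaultdict

    def train_nb(training_set):
        # training_set: list of (tokens, label) pairs.
        class_docs = Counter()                  # documents per class
        token_counts = defaultdict(Counter)     # per-class token counts
        vocabulary = set()
        for tokens, label in training_set:
            class_docs[label] += 1
            for token in tokens:
                token_counts[label][token] += 1
                vocabulary.add(token)
        return class_docs, token_counts, vocabulary

    def classify_nb(tokens, class_docs, token_counts, vocabulary):
        total_docs = sum(class_docs.values())
        scores = {}
        for label in class_docs:
            # log prior plus the sum of log conditional probabilities.
            score = math.log(class_docs[label] / total_docs)
            total_tokens = sum(token_counts[label].values())
            for token in tokens:
                count = token_counts[label][token] + 1   # add-one smoothing
                score += math.log(count / (total_tokens + len(vocabulary)))
            scores[label] = score
        # Return the class with the greatest log probability.
        return max(scores, key=scores.get)

    train = [(['win', 'free', 'prize'], 'spam'),
             (['meeting', 'tomorrow', 'campus'], 'ham')]
    model = train_nb(train)
    print(classify_nb(['free', 'prize'], *model))  # 'spam'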

Page 20: Ai group-seminar-2013 nbc


Evaluation and Implementation Approach

• Evaluation: test SMS text documents are used to assess classifier success at predicting the class:

Accuracy = Correct_predictions / Total_number_of_tests

• Implementation: a complete text classification application with a user-interactive interface.


– Natural Language Processing approach

• The Natural Language Toolkit (NLTK) is used with the Python programming language.

– NLTK is entirely self-contained and provides convenient functions and wrappers that can be used as building blocks for common NLP tasks.
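A minimal sketch of the evaluation step with NLTK, reusing the train_set from the training sketch above (the held-out test_set here is invented):

    import nltk

    test_set = [
        ({'win': True, 'airtime': True}, 'spam'),
        ({'meeting': True, 'report': True}, 'ham'),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    # nltk.classify.accuracy computes correct predictions / total tests.
    print(nltk.classify.accuracy(classifier, test_set))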

Page 21: Ai group-seminar-2013 nbc


BIBLIOGRAPHY

Aloysius Ochola. Automated Classification of Short Messaging Services (SMS) Messages for Optimized Handling. MSc. Computer Science Project, Makerere University Kampala (2013). [email protected]

Page 22: Ai group-seminar-2013 nbc


DEMO . . .

• Training samples were collected from manually categorized SMS messages compiled by Ureport, an SMS-based opinion forum.

• Problem: they receive up to 10,000 SMS messages in a day and are supposed to reply to every message that is relevant and worthy.

smsTextClassificationApplication