SUPERVISED CLASSIFICATION OF TEXT DOCUMENTS
Ravi. N.
Vivek Shenoy T
Veer Prakash S.
Contents
• Introduction
• Literature Review
• Document Representation
• Text Classifiers
• Implementation Aspects
• Results and Analysis
• Conclusion
• Future Enhancements
• References
• Current scenario of the documents on the Web
- Structured data
- Unstructured data
• Information Retrieval (IR)
- Deals with the representation, storage, organization of, and access to information items.
- This representation of the information is used to manipulate the unstructured data.
• Goal of IR
- To provide users with those documents which satisfy their information needs.
Objective of the project
• To classify the documents in the corpus into various classes. A particular document is assigned to a class if the document is relevant to the query.
• To provide a comparative study of two classifiers, namely
- Centroid-based classifier
- K-Nearest Neighbour classifier
• Definition of Information Retrieval (IR)
IR is finding material of an unstructured nature that satisfies an information need from within large collections [28].
• Different fields of Information Retrieval (IR)
There are 2 categories:
- General applications of IR
- Domain-specific applications
• IR Process
The IR process is a 6-step process, as shown in the following figure.
Fig. Schematic representation of Information Retrieval, with the stages:
- Problem recognition and acceptance
- Query Formulation
- Query Execution
- Examination of the Result
- Information Retrieval
• Machine learning
A system that can autonomously acquire and integrate knowledge and is capable of learning from experience, analytical observation and other means, resulting in a system that can continuously self-improve and thereby offer increased efficiency and effectiveness [5].
• Types of Machine Learning:
- Supervised learning, in which the training data is labeled with the correct answers, e.g. “spam”.
- Unsupervised learning, in which the user is given a collection of unlabeled data, which has to be analyzed to discover patterns.
• Text Classification (Document Classification)
- Classifying a set of documents into classes; the classification is based upon the relevance of the document to the query given by the user [4].
• Types of Text Classification:
- Supervised document classification: the training data is labeled with the correct answers, e.g. “spam”.
- Unsupervised document classification (document clustering): the classification must be done entirely without reference to external information.
• Definition of Text Classification
Let C = {c1, c2, ..., cm} be a set of categories and D = {d1, d2, ..., dn} a set of documents. The task of text classification consists in assigning to each pair (ci, dj) of C × D (with 1 ≤ i ≤ m and 1 ≤ j ≤ n) a value of 0 or 1, i.e. the value 1 if the document dj belongs to ci, and the value 0 if it does not. This mapping is done with the help of a decision matrix [17].
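The decision matrix described above can be sketched in a few lines of Python. This is only an illustration: the category names, document ids and the `MEMBERSHIP` labels are hypothetical and do not come from the project's corpus.

```python
# Hypothetical categories, documents and labels, for illustration only.
CATEGORIES = ["cricket", "movies"]
DOCUMENTS = ["d1", "d2", "d3"]
MEMBERSHIP = {"d1": "cricket", "d2": "movies", "d3": "cricket"}


def decision_matrix(categories, documents, membership):
    """Return a |C| x |D| matrix: entry [i][j] is 1 if dj belongs to ci, else 0."""
    return [
        [1 if membership[doc] == cat else 0 for doc in documents]
        for cat in categories
    ]


MATRIX = decision_matrix(CATEGORIES, DOCUMENTS, MEMBERSHIP)
for cat, row in zip(CATEGORIES, MATRIX):
    print(cat, row)
```

Each row corresponds to one category ci and each column to one document dj, matching the (ci, dj) pairs of C × D in the definition.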
• Phases of IR Development
There are several phases in the development of IR:
- 1st phase, 1950s-1960s: the research phase.
- 2nd phase, 1970s: IR struggled for adoption.
- 3rd phase, 1980s-1990s: IR reached the acceptance phase in terms of free-text search systems.
- Nowadays the influence of IR is such that it is moving towards projects in sound and image retrieval, along with electronic provision [26].
• Definition of TC by H.P. Luhn
H.P. Luhn gave a definition of TC in 1958, which marked the start of the text classification era [32]. The definition is as follows:
“…utilize data-processing machines for auto-abstracting and auto-encoding of
documents and for creating interest profiles for each of the ‘action points’ in an
organization. Both incoming and internally generated documents are
automatically abstracted, characterized by a word pattern, and sent
automatically to appropriate action points.”
• Need for Document Representation (DR)
- The task of information retrieval is to extract relevant documents from a large collection of documents in response to user queries. The documents contain primarily unrestricted text.
Document representation basically involves generating a representation R of a document such that for any text items D1 and D2, R(D1) ≈ R(D2), where R is a function known as the relevance of the document, which is obtained by matching the keywords in the query with the document set.
In order to reduce the complexity of the documents and make them clearer and easier to handle, we transform each document from its full-text version to a document vector which describes the contents of the document.
The terms that occur in a document are the parameters of the document representation. The types of parameters determine the type of the document representation.
• Different Types:
- Binary Document Representation
- Term Frequency Representation (Frequency vector)
- Probabilistic Representation
Example Documents:
Document   Document content                                No. of unique words (excluding stop words)
D0         Gold silver truck                               3
D1         Shipment of gold damaged in a fire              4
D2         Delivery of silver arrived in a silver truck    4
D3         Shipment of gold arrived in a truck             4
• Binary Document Representation
The Binary Independence Model (BIM) introduces some simple assumptions. Here, the term “binary” is equivalent to Boolean: documents and queries are both represented as binary term incidence vectors. That is, a document d is represented by the vector x = (x1, ..., xM), where xt = 1 if term t is present in document d and xt = 0 if t is not present in d [22].
• Representation of the Example Documents:
Doc id   Arrived   Damaged   Delivery   Fire   Gold   Shipment   Silver   Truck
D0       0         0         0          0      1      0          1        1
D1       0         1         0          1      1      1          0        0
D2       1         0         1          0      0      0          1        1
D3       1         0         0          0      1      1          0        1
• Drawback:
It does not take into account the frequency of a term within a document. All the terms within a document are given equal weight irrespective of the number of their occurrences in the document.
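As a minimal sketch of the binary representation, the following Python builds the term incidence vectors for the four example documents. The stop-word list (`of`, `in`, `a`) and the helper names are assumptions made for this example; the project's own stop list is not shown in the slides.

```python
# Example documents from the slides.
DOCS = {
    "D0": "Gold silver truck",
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
STOPWORDS = {"of", "in", "a"}  # assumption: minimal stop list for this example


def tokenize(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


# Vocabulary: the sorted unique terms across the whole collection.
VOCAB = sorted({word for text in DOCS.values() for word in tokenize(text)})


def binary_vector(text):
    """1 if the vocabulary term occurs in the document, 0 otherwise."""
    present = set(tokenize(text))
    return [1 if term in present else 0 for term in VOCAB]


BINARY = {doc_id: binary_vector(text) for doc_id, text in DOCS.items()}
for doc_id, vector in BINARY.items():
    print(doc_id, vector)
```

The vocabulary comes out in alphabetical order (arrived, damaged, delivery, fire, gold, shipment, silver, truck), so the vectors line up with the columns of the table.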
• Term Frequency Representation (Frequency vector)
In this type of representation we assign to each term in a document a weight that depends on the number of occurrences of the term in the document.
• Representation of the Example Documents:
Doc id   Arrived   Damaged   Delivery   Fire   Gold   Shipment   Silver   Truck
D0       0         0         0          0      1      0          1        1
D1       0         1         0          1      1      1          0        0
D2       1         0         1          0      0      0          2        1
D3       1         0         0          0      1      1          0        1
• Drawback:
This approach does not weigh the terms in a document with respect to other documents in the dataset.
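A term frequency vector can be sketched the same way, counting occurrences instead of recording presence. As before, the stop-word list and the helper names are assumptions for illustration only.

```python
from collections import Counter

# Example documents from the slides.
DOCS = {
    "D0": "Gold silver truck",
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
STOPWORDS = {"of", "in", "a"}  # assumption: minimal stop list for this example


def tokenize(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


VOCAB = sorted({word for text in DOCS.values() for word in tokenize(text)})


def tf_vector(text):
    """Number of occurrences of each vocabulary term in the document."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in VOCAB]


TF = {doc_id: tf_vector(text) for doc_id, text in DOCS.items()}
for doc_id, vector in TF.items():
    print(doc_id, vector)
```

Only D2 differs from its binary vector, because "silver" occurs twice in it.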
• Probabilistic Representation
In this scheme every component of the vector denotes the probability of occurrence of the corresponding term within the document. The probability of a particular term is found as follows:
Probability = (number of occurrences of the term t in the document d) / (total number of terms in the document d)
• Representation of the Example Documents:
Doc id   Arrived   Damaged   Delivery   Fire   Gold   Shipment   Silver   Truck
D0       0         0         0          0      1/3    0          1/3      1/3
D1       0         1/4       0          1/4    1/4    1/4        0        0
D2       1/5       0         1/5        0      0      0          2/5      1/5
D3       1/4       0         0          0      1/4    1/4        0        1/4
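The probability formula can be checked with a short sketch. It follows the formula as stated, dividing by the total number of terms left in the document after stop-word removal (so D2, which contains "silver" twice, has 5 terms). The stop-word list and the helper names are assumptions for this example.

```python
from collections import Counter
from fractions import Fraction

# Example documents from the slides.
DOCS = {
    "D0": "Gold silver truck",
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
STOPWORDS = {"of", "in", "a"}  # assumption: minimal stop list for this example


def tokenize(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


VOCAB = sorted({word for text in DOCS.values() for word in tokenize(text)})


def probability_vector(text):
    """Occurrences of each term divided by the total number of terms."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)  # total terms in the document, repeats included
    return [Fraction(counts[term], total) for term in VOCAB]


PROB = {doc_id: probability_vector(text) for doc_id, text in DOCS.items()}
for doc_id, vector in PROB.items():
    print(doc_id, [str(p) for p in vector])
```

Exact fractions are used so that each vector can be seen to sum to 1.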
• tf-idf (term frequency - inverse document frequency) representation
The main idea behind tf-idf is that a term occurring infrequently across the collection should be given a higher weight than a term that occurs frequently.
• Important definitions in the tf-idf context:
t = number of distinct terms in the document collection.
tfij = number of occurrences of term tj in document Di. This is also referred to as the term frequency.
dfj = number of documents which contain tj.
idfj = log(d / dfj), where d is the total number of documents. This is the inverse document frequency. The example table below uses base-10 logarithms.
• Weighting factor of each term:
The weighting factor for each term in the document is calculated as the product of the term frequency and the inverse document frequency of the term:
dij = tfij * idfj
tf-idf assigns to term t a weight in document d that is
1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.
The values thus computed are then filled into the document vectors.
• Representation of the Example Documents:
Doc id   Arrived   Damaged   Delivery   Fire   Gold   Shipment   Silver   Truck
D0       0         0         0          0      0.12   0          0.3      0.12
D1       0         0.6       0          0.6    0.12   0.3        0        0
D2       0.3       0         0.6        0      0      0          0.6      0.12
D3       0.3       0         0          0      0.12   0.3        0        0.12
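The tf-idf weights can be reproduced with the sketch below, which applies dij = tfij * idfj with a base-10 logarithm to the four example documents. The base of the logarithm is inferred from table values such as 0.12 for Gold (log10(4/3)) and 0.6 for Silver in D2 (2 * log10(4/2)); the stop-word list and helper names are assumptions.

```python
import math
from collections import Counter

# Example documents from the slides.
DOCS = {
    "D0": "Gold silver truck",
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
STOPWORDS = {"of", "in", "a"}  # assumption: minimal stop list for this example


def tokenize(text):
    """Lowercase, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


TOKENS = {doc_id: tokenize(text) for doc_id, text in DOCS.items()}
VOCAB = sorted({word for tokens in TOKENS.values() for word in tokens})

# Document frequency: number of documents that contain each term.
DF = Counter(term for tokens in TOKENS.values() for term in set(tokens))
N_DOCS = len(DOCS)


def tfidf_vector(tokens):
    """Weight of each vocabulary term: tf * log10(d / df)."""
    tf = Counter(tokens)
    return [tf[term] * math.log10(N_DOCS / DF[term]) for term in VOCAB]


TFIDF = {doc_id: tfidf_vector(tokens) for doc_id, tokens in TOKENS.items()}
for doc_id, vector in TFIDF.items():
    print(doc_id, [round(weight, 2) for weight in vector])
```

Terms that occur in a single document (damaged, delivery, fire) get the largest idf, log10(4) ≈ 0.6, while terms that occur in three of the four documents (gold, truck) get log10(4/3) ≈ 0.12.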
• Refined Definition of Text Classification (TC):
TC is defined as the task of approximating the unknown target function
Φ: D × C → {T, F}
where Φ is called the classifier [29], and
C = {c1, ..., c|C|} is a predefined set of categories,
D is a (possibly infinite) set of documents.
If Φ(dj, ci) = T, then dj is called a positive example (or a member) of ci.
If Φ(dj, ci) = F, then dj is called a negative example (or not a member) of ci.
In order to build a classifier we need a set Ω of documents such that the value of Φ(dj, ci) is known for every (dj, ci) ∈ Ω × C. Usually Ω is partitioned into three disjoint sets Tr (the training set), Va (the validation set), and Te (the test set) [31].
• Training set: the set of documents by observing which the learner builds the classifier.
• Validation set: the set of documents on which the engineer fine-tunes the classifier, e.g. choosing, for a parameter p on which the classifier depends, the value that has yielded the best effectiveness when evaluated on Va.
• Test set: the set on which the effectiveness of the classifier is finally evaluated.
“Evaluating the effectiveness” means running the classifier on a set of pre-classified documents (Va or Te) and checking the degree of correspondence between the output of the classifier and the pre-assigned classes.
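The partition of Ω into Tr, Va and Te can be sketched as a seeded shuffle followed by slicing. The 60/20/20 proportions, the seed and the function name `split_corpus` are assumptions chosen for illustration; the slides do not state how the project split its documents.

```python
import random


def split_corpus(doc_ids, train_frac=0.6, val_frac=0.2, seed=0):
    """Partition doc_ids into three disjoint lists: training, validation, test."""
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must lie strictly between 0 and 1")
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return training, validation, test


# Hypothetical document ids.
TR, VA, TE = split_corpus([f"doc{i}" for i in range(10)])
print(len(TR), len(VA), len(TE))
```

Because the three lists are slices of one shuffled list, they are disjoint and together cover every document exactly once.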
• Types of classifiers
The following are some of the classifiers [37]:
- Naïve Bayesian classifier
- kNN classifier
- Linear classifiers
- C4.5
- Support Vector Machines, etc.
In this project we concentrate on only 2 classifiers:
- Centroid classifier
- kNN classifier
• CENTROID CLASSIFIER
• This type of classifier computes a centroid vector for every pre-defined class using all the training documents belonging to the class.
• Next, the test document (which must be classified) is compared with all these centroid vectors to compute the similarity coefficients.
• Finally, the class whose centroid most closely matches the test document is chosen (i.e. the class whose similarity coefficient score is the highest).
• Pseudo code of Centroid Classifier
Step 1) The input documents (under pre-defined categories) are split into a training set and a testing set.
Step 2) Scan through the entire training set to identify all the unique words across the collection. The total count of unique words decides the length of the document vector.
Step 3) For each of the unique terms (as identified in Step 2), compute the document frequency (i.e. the total number of documents in which that term occurs).
Step 4) Represent every training document as a vector. (Here we assume that tf-idf weights are used to represent the input documents. Any of the representation schemes explained earlier can also be used.) Thus a document vector is represented as
$\vec{d}_{tf} = (tf_1 \log(N/df_1),\ tf_2 \log(N/df_2),\ tf_3 \log(N/df_3),\ \ldots,\ tf_n \log(N/df_n))$
where N is the number of training documents and n is the number of unique terms.
Step 5) For every pre-defined class, compute a centroid vector using the following formula:
$\vec{C} = \frac{1}{|S|} \sum_{\vec{d} \in S} \vec{d}$
where S is the training set of the category/class for which the centroid vector is being computed.
Thus, if there are “m” different classes, there will be “m” different centroid vectors, denoted as $\vec{C}_1, \vec{C}_2, \ldots, \vec{C}_m$.
Step 6) For every test document “d”:
1) Use the document frequencies of the various terms computed from the training set to compute the tf-idf representation of d, i.e. $\vec{x}$.
2) Compute the similarity coefficient between $\vec{x}$ and all the m centroid vectors using the normalised cosine measure, which is computed as follows:
$\cos(\vec{x}, \vec{C}) = \frac{\vec{x} \cdot \vec{C}}{\|\vec{x}\|_2 \, \|\vec{C}\|_2}$
where $\vec{C}$ is any centroid vector of a class.
3) Based on the similarity coefficient score, assign document x to the class with which the score is the highest. It can be mathematically represented as
$\arg\max_{j = 1, \ldots, m} \cos(\vec{x}, \vec{C}_j)$
Thus, using the formulas discussed above, the classification of the document can be done.
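The steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the project's Perl/PDL implementation: the toy training documents, the class labels and the function names are hypothetical, documents are assumed to be tokenized already, and the centroid is taken as the mean of the tf-idf vectors of a class.

```python
import math
from collections import Counter, defaultdict


def tfidf_vector(tokens, vocab, df, n_docs):
    """tf-idf vector over the training vocabulary (unseen terms are ignored)."""
    tf = Counter(tokens)
    return [tf[term] * math.log10(n_docs / df[term]) for term in vocab]


def cosine(u, v):
    """Normalised cosine measure; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(train):
    """train: list of (tokens, label). Returns (vocab, df, n_docs, centroids)."""
    n_docs = len(train)
    df = Counter(term for tokens, _ in train for term in set(tokens))
    vocab = sorted(df)
    by_class = defaultdict(list)
    for tokens, label in train:
        by_class[label].append(tfidf_vector(tokens, vocab, df, n_docs))
    # Centroid of a class: component-wise mean of its document vectors.
    centroids = {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in by_class.items()
    }
    return vocab, df, n_docs, centroids


def classify(tokens, model):
    """Assign the class whose centroid has the highest cosine similarity."""
    vocab, df, n_docs, centroids = model
    x = tfidf_vector(tokens, vocab, df, n_docs)
    return max(centroids, key=lambda label: cosine(x, centroids[label]))


# Hypothetical toy training set.
TRAIN = [
    ("bat ball wicket".split(), "cricket"),
    ("ball wicket umpire".split(), "cricket"),
    ("actor film director".split(), "movies"),
    ("film actor scene".split(), "movies"),
]
MODEL = train_centroids(TRAIN)
print(classify("wicket ball run".split(), MODEL))
```

Classification needs only one cosine computation per class, which is why the number of comparisons does not grow with the number of training documents.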
• K Nearest Neighbor Classifier
• It is an instance-based learning algorithm which has been applied to text categorization.
• This classifier first computes the k nearest neighbors of a test document. Then the similarities of the test document to the k nearest neighbors are aggregated according to the class of the neighbors, and the test document is assigned to the most similar class (as measured by aggregate similarity) [37].
• Drawbacks:
- Each test document must be compared with all the training documents in order to decide its class. Thus it requires a huge amount of computation.
- It uses all the features equally in computing similarities. This may lead to poor similarity measures and to classification errors.
• Pseudo code of KNN Classifier
Step 1) The input documents (under pre-defined categories) are split into a training set and a testing set.
Step 2) Scan through the entire training set to identify all the unique words across the collection. The total count of unique words decides the length of the document vector.
Step 3) Fix a value for k. This value determines the number of nearest neighbors which will be considered during document classification.
Step 4) For every test document, compute the similarity coefficient with each of the training documents and record the similarity score in a hash table.
Step 5) Select the top “k” scores from the hash.
Step 6) Compute the aggregate score for each class. If several of the k nearest neighbors share a class, then the per-neighbor weights of that class are added together and the resulting weighted sum is used as the likelihood score of that class. Sort the scores of the candidate classes and generate a ranked list. The decision rule can be mathematically represented as
$score(d, c_i) = \sum_{d_j \in KNN(d)} sim(d, d_j) \, \delta(d_j, c_i)$
where
• “d” is the test document which is being classified and KNN(d) indicates the set of k nearest neighbors of document d;
• sim(d, dj) is the similarity coefficient between d and the training document dj;
• δ(dj, ci) represents the classification for document dj with respect to class ci (1 if dj belongs to ci, 0 otherwise).
Step 7) Test document “d” is assigned to the class that has the highest weighted aggregate score.
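The kNN steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the toy training set and function names are hypothetical, documents are assumed to be tokenized already, and plain term-frequency vectors with cosine similarity are used for brevity (any of the representation schemes could be substituted).

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_rank(tokens, train, k):
    """Return the candidate classes ranked by aggregate similarity.

    train is a list of (tokens, label) pairs.
    """
    x = Counter(tokens)
    # Step 4: similarity of the test document to every training document.
    scored = [(cosine(x, Counter(doc)), label) for doc, label in train]
    # Step 5: keep the k most similar training documents.
    top_k = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Step 6: add up the similarities of neighbours that share a class.
    scores = defaultdict(float)
    for similarity, label in top_k:
        scores[label] += similarity
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy training set.
TRAIN = [
    ("bat ball wicket".split(), "cricket"),
    ("ball wicket umpire".split(), "cricket"),
    ("actor film director".split(), "movies"),
    ("film actor scene".split(), "movies"),
]
RANKED = knn_rank("ball wicket run".split(), TRAIN, k=2)
print(RANKED[0][0])  # Step 7: the top-ranked class is assigned
```

Every test document is compared with every training document, which is the computational cost noted in the drawbacks.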
• PERL
Perl stands for “Practical Extraction and Report Language”. It was invented by Larry Wall. Perl is a very powerful, versatile scripting language, famously known as the “Swiss Army chainsaw of programming languages”.
• PDL “Perl Data Language”
• PDL is an object-oriented extension to Perl that is designed for scientific and bulk numeric data processing and display. It is a very powerful and at the same time fast array-oriented language.
• The PDL concept gives standard Perl the ability to compactly store and speedily manipulate the large N-dimensional data sets which are essential for scientific computing.
• PDL uses Perl `objects' to hold piddle data. An `object' is like a user-defined data type and is a very powerful feature of Perl; PDL creates its own class of `PDL' objects to store piddles.
• PDLs over Perl variables
• It is impossible to manipulate Perl `arrays' arithmetically as we would like, e.g.
    @y = @x * 2
  does not double each element (it evaluates @x in scalar context, i.e. as its length).
• Perl lists are intrinsically one-dimensional. We can have `lists of lists', but this is not the same thing as a pdl.
• Perl lists do not support the range of datatypes that piddles do (byte arrays, integer arrays, single precision, double precision, etc.).
• Perl lists consume a lot of memory: at least 20 bytes per number, of which only a few bytes are used for storing the actual value. This is because Perl lists are flexible and can contain text strings as well as numbers.
• Perl lists are scattered about memory. The list data structure means consecutive numbers are not stored in a neat block of consecutive memory addresses, as they are in C and other programming languages.
• Advantages of using Perl Data Language
• Both Perl and PDL are easily available, free of cost, under an open-source license.
• Since PDL is an extension of Perl, a Perl programmer has all the powerful features of Perl at hand. Thus, even in mainly numerically oriented programming, it is often extremely handy to have access to non-numeric functionality.
• Since it is a Perl package, PDL is extensible and interoperable.
• The syntax associated with PDL is very simple, making it a user-friendly package.
• Usage of pdl in our project:
A pdl holds each of the four document representations:
- Binary representation (pdl elements indicate the presence or absence of a word)
- Term frequency representation (pdl elements indicate the frequency, i.e. the number of times a word occurs in a file)
- Probabilistic representation (pdl elements indicate the probability of occurrence of a word)
- Tf-idf representation (pdl elements indicate the product of term frequency and inverse document frequency)
• Organization of our code
Classifiers: Centroid, KNN
Document representations:
- Binary representation
- Term frequency representation
- Probabilistic representation
- Tf-idf representation
Folders:
- Textfiles (contains all the training and testing documents)
- Freq (contains the files representing training and testing documents, which indicate the frequency of a word in a file)
- String (contains all the scripts and the result of classification)
- Actuals (contains predefined files which indicate the class to which each of the files belongs)
• Scripts of our project
There are 6 different scripts.
1) init.pl
2) main.pl
3) script1.pl
4) script2.pl
5) script3.pl
6) script4.pl
1) init.pl
This is the initialization script. Its purpose is to make all the necessary folders available for the smooth functioning of the code. It deletes the selected folders (for example, freq, source code/results etc., which hold all the necessary data) and recreates them.
2) main.pl
This is the main script. It invokes all the other scripts sequentially.
3) script1.pl
This script removes stopwords and other unwanted characters from the source file.
Note: the actual source file is not modified.
4) script2.pl
This script calculates the document frequency for each of the unique terms listed in uniquefile.txt.
5) script3.pl
This script performs the task of document classification.
6) script4.pl
This script generates input for an HTML browser, so as to display the results to the user.
The following are the elements of our project:
1) Pre-defined classes - 7
2) Training documents - 651
3) Testing documents - 47
1) Pre-defined classes:
The following are the 7 pre-defined classes:
S.no.   Class Name    No. of Documents
1.      Cricket       101
2.      Formula-1     90
3.      Hockey        109
4.      Ice-Hockey    109
5.      Movies        122
6.      Politics      20
7.      Religion      100
• Results for Centroid Classifier
Level of classification   Binary   Term frequency   Probabilistic   Tf-idf
Properly classified       36       38               43              33
Misclassified             11       9                4               14
Accuracy                  0.76     0.80             0.91            0.70
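The accuracy row can be recomputed from the two count rows. The sketch below assumes that accuracy means the number of properly classified documents divided by the total number of test documents (47); the slides do not define it explicitly, and the function and variable names are hypothetical.

```python
def accuracy(properly_classified, misclassified):
    """Fraction of test documents that were assigned to the correct class."""
    total = properly_classified + misclassified
    if total == 0:
        raise ValueError("no test documents")
    return properly_classified / total


# (properly classified, misclassified) counts from the Centroid classifier table.
CENTROID_COUNTS = {
    "Binary": (36, 11),
    "Term frequency": (38, 9),
    "Probabilistic": (43, 4),
    "Tf-idf": (33, 14),
}
CENTROID_ACCURACY = {
    scheme: accuracy(*counts) for scheme, counts in CENTROID_COUNTS.items()
}
for scheme, value in CENTROID_ACCURACY.items():
    print(f"{scheme}: {value:.4f}")
```

Each pair of counts adds up to the 47 test documents, and the probabilistic representation gives the highest value for the centroid classifier.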
• Level of Accuracy Achieved
Fig. 6.1.2 Accuracy obtained by the Centroid classifier in each document representation (bar chart; x-axis: document representation scheme; y-axis: accuracy obtained).
• Results for KNN Classifier
The following table shows the result of applying the KNN classifier to the document vectors when the user sets k = 2.
Level of classification   Binary   Term frequency   Probabilistic   Tf-idf
Properly classified       39       41               42              37
Misclassified             8        6                5               10
Accuracy                  0.83     0.87             0.89            0.78
• Level of Accuracy Achieved
Fig. 6.2.2 Accuracy obtained by the KNN (k=2) classifier in each document representation (bar chart; x-axis: document representation scheme; y-axis: accuracy obtained).
The following table shows the result of applying the KNN classifier to the document vectors when the user sets k = 20.
Level of classification   Binary   Term frequency   Probabilistic   Tf-idf
Properly classified       44       43               42              37
Misclassified             3        4                5               10
Accuracy                  0.94     0.91             0.89            0.78
• Level of Accuracy Achieved
Fig. Accuracy obtained by the KNN (k=20) classifier in each document representation (bar chart; x-axis: classification; y-axis: accuracy obtained).
• Comparison of Centroid Classifier and the KNN classifier
Fig. 6.3.1 Comparison of the Centroid and KNN (k=2) classifiers (bar chart; x-axis: document representation; y-axis: accuracy obtained; series: Centroid, KNN).
About KNN Classifier
We find that the KNN classifier provides the best results in terms of classification accuracy.
Drawbacks:
1) We cannot decide on the ideal value of k.
2) It requires a huge amount of computational resources.
3) It is impractical for very large document collections.
About Centroid Classifier
We also find that the Centroid classifier provides a classification accuracy very near to that of KNN.
Advantages of Centroid Classifier over KNN
1) It does not require a huge amount of computation.
2) It is very quick to decide the results of classification.
3) It is ideally suited to very large input document collections.
Future Enhancements
• To increase the number of classes.
• To build a suitable front end.
• To integrate the classifiers with a search engine to provide classification of websites.
• To enhance the centroid classifier by implementing a weighted centroid classifier.
• To incorporate a stemming algorithm, e.g. the Porter stemmer.
• To upgrade the implementation to incorporate standard data collections, such as Reuters-21578, TREC-5, TREC-6, the OHSUMED collection and the 20 Newsgroups data set.
References
[1] Ricardo Baeza-Yates, Berthier Ribeiro-Neto, “Modern Information Retrieval”, Addison-Wesley Longman Publishing Co., 1999.
[2] A. Spoerri, “Information Processing & Management”, Proceedings of the IEEE First International Conference on Computer Vision, Volume 43, pp. 1044-1058, 2007.
[3] Forrester, “Coping with complex data”, The Forrester Report, pp. 2-4, April 1995.
[4] W. Bruce Croft, “Intelligent Information Retrieval”, Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst, D-Lib Magazine, November 1995.
[5] Simon Colton, “AI Bite”, The Society for the Study of Artificial Intelligence and Simulation of Behaviour, pp. 66-67,