Table of Contents

Chapter 6. Applications to Text Mining
  6.1. Centroid-based Text Classification
    6.1.1. Formulation of centroid-based text classification
    6.1.2. Effect of term distributions
    6.1.3. Experimental Settings and Results
  6.2. Document Relation Extraction
    6.2.1. Document Relation Discovery using Frequent Itemset Mining
    6.2.2. Empirical Evaluation using Citation Information
    6.2.3. Experimental Settings and Results
  6.3. Application to Automatic Thai Unknown Word Detection
    6.3.1. Thai Unknown Words as a Word Segmentation Problem
    6.3.2. The Proposed Method
    6.3.3. Experimental Settings and Results
Chapter 6. Applications to Text Mining
As one application of data mining, text mining is a knowledge-intensive process that deals with a document collection over time using a set of natural language analysis tools. Text mining seeks to extract useful information from large textual data sources through the identification and exploration of interesting patterns. The data sources can be electronic documents, e-mail, web documents or any textual collections, and the interesting patterns are found not in formalized database records but in the unstructured textual data of the documents in these collections. Text mining and data mining share many high-level architectural similarities, including preprocessing routines, pattern-discovery algorithms, and visualization tools for presenting mining results. While data mining assumes that data have already been stored in a structured format and its preprocessing focuses on data cleansing and transformation, the preprocessing in text mining centers on feature extraction, i.e., usually the extraction of keywords from natural language documents. The number of features in text mining is much larger than in typical data mining, since the features involve words, which are highly diverse. Most text mining works exploit techniques and methodologies from the areas of information retrieval, information extraction, and corpus-based computational linguistics.
This chapter presents three examples of text mining applications: text classification, document relation extraction and unknown word detection in the Thai language. The original literature related to these three applications can be found in (Lertnattee and Theeramunkong, 2004a), (Sriphaew and Theeramunkong, 2007a) and (TeCho et al., 2009b).
Before explaining the applications, some basic concepts of text processing are provided as follows. Towards text mining, several preprocessing techniques have been proposed to transform raw textual data into structured document representations. Most techniques aim to use and produce domain-independent linguistic features with natural language processing (NLP) techniques. There are also text categorization and information extraction (IE) techniques, which directly deal with domain-specific knowledge. Note that a document is an abstract object, so it can have a variety of possible actual representations. To exploit the information in documents, we need a so-called document structuring process which transforms the raw representation into some kind of structured representation. This task involves at least three subtasks: (1) text pre-processing, (2) problem-independent tasks, and (3) problem-dependent tasks. As the first subtask, text pre-processing converts the raw representation into a structure suitable for further linguistic processing. For example, when the raw input is a document image or recorded speech, pre-processing converts the raw input into a stream of text, sometimes with text structures such as paragraphs, columns and tables, as well as some document-level fields, such as author, title, and abstract, inferred from the visual presentation. To convert document images to text, optical character recognition (OCR) is used, while speech recognition can be applied to transform audio speech into text. As the second subtask, the problem-independent tasks process text documents using general knowledge of natural language. These tasks may include word segmentation or tokenization, morphological analysis, POS tagging, and syntactic parsing in either shallow or deep processing. The output of these tasks is not specific to any particular problem, but is typically employed for further problem-dependent processing. Domain-related knowledge, however, can often enhance the performance of general-purpose NLP tasks and is often used at different levels of processing. As the last step, the problem-dependent tasks attempt to output a final representation suitable for the concerned task, such as text categorization or information extraction. However, it has been shown that the different analysis levels, including phonetic, morphological, syntactic, semantic, and pragmatic, occur simultaneously and depend on each other. Even now, how humans process a language remains unrevealed. Some works have tried to combine these levels into one single process but have not yet achieved a satisfactory level of performance. Therefore most text understanding methods use the divide-and-conquer strategy, separating the whole problem into several subtasks and solving them independently as follows.
Tokenization and Word Segmentation
The important first step towards text analysis is to break a continuous character stream down into meaningful constituents, such as chapters, sections, paragraphs, sentences, words, and even syllables or phonemes. Tokenization is the process of breaking text into sentences and words. In English, the main challenge is to identify sentence boundaries, since a period can mark either the end of a sentence or part of an abbreviation such as Dr., Mr., Ms., Prof., St., No. and so on. In general, a tokenizer may also extract token features, such as the type of capitalization and the inclusion of digits, punctuation, special characters, and so on. These features usually describe some superficial property of the sequence of characters that makes up the token. For languages without explicit word boundaries, such as Thai, Japanese, Korean and Chinese, word segmentation is necessary. This processing is very important for constructing the fundamental units for processing such languages.
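For illustration, here is a minimal rule-based tokenizer sketch, assuming a small hard-coded abbreviation list and a simplified regular expression; a production tokenizer would handle many more cases.

```python
import re

# A small, illustrative abbreviation list; real tokenizers use far larger ones.
ABBREVIATIONS = {"Dr.", "Mr.", "Ms.", "Prof.", "St.", "No."}

def tokenize(text):
    """Split text into rough word tokens, keeping known abbreviations intact."""
    tokens = []
    # Candidate tokens: a word optionally followed by a period, or punctuation.
    for tok in re.findall(r"\w+\.?|[^\w\s]", text):
        if tok.endswith(".") and tok not in ABBREVIATIONS:
            tokens.extend([tok[:-1], "."])   # detach the sentence-final period
        else:
            tokens.append(tok)
    return tokens

print(tokenize("Dr. Smith lives at No. 7 St. James Street."))
# ['Dr.', 'Smith', 'lives', 'at', 'No.', '7', 'St.', 'James', 'Street', '.']
```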
Part-of-Speech (POS) Tagging
POS tagging is the process of assigning a word type (category) to each word in a sentence, i.e., labeling it with the appropriate POS tag based on the context in which it appears. The POS tag of a word specifies the role the word plays in the sentence where it appears. It also provides initial information related to the semantic content of the word. Among several works, the most common set of tags includes seven different tags, i.e., article, noun, verb, adjective, preposition, number, and proper noun. Some systems contain a much more elaborate set of tags. For instance, there are at least 87 basic tags in the complete Brown Corpus. More types of tags mean more detailed analysis.
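As an illustration, the sketch below uses NLTK's off-the-shelf tagger, which outputs Penn Treebank tags (a far more elaborate set than the seven basic tags mentioned above). The example sentence is invented, and the model names may vary across NLTK versions.

```python
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The new drug shows promising results.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('new', 'JJ'), ('drug', 'NN'), ('shows', 'VBZ'), ...]
```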
Syntactical Parsing
Syntactical parsing is a process that applies a grammar to detect the structure of a sentence. In the
sentence structure, common constituents in grammars include noun phrases, verb phrases,
prepositional phrases, adjective phrases, and subordinate clauses. Following grammar rules, each
phrase or clause may consist of smaller phrases or words. For deeper analysis, the syntactical
structure of sentences may also elaborate the roles of different phrases, such as a noun phrase as a
subject, an object, or a complement. In the grammar, it is also possible to specify dependency among
phrases or clauses at several different levels. After analyzing a sentence, the output can be
represented as a sentence graph with connected components.
Shallow Parsing
In real situations, it is not easy to fully analyze the structure of a sentence, since language usage is sometimes complicated and flexible. Therefore it is almost impossible to construct a grammar that covers all cases. Moreover, while we revise a grammar to cover special cases, a lot of ambiguity is triggered in the grammar as a by-product. Such ambiguity needs to be resolved by higher processes, such as semantic or pragmatic processing. For this reason, traditional algorithms are normally too computationally expensive to process the large number of sentences in a very large corpus. They are also not robust enough. Instead of full analysis, shallow parsing is a practical alternative: it does not perform a complete analysis of the whole sentence but only treats those parts of the sentence that are simple and unambiguous. For example, shallow parsing finds only small and simple noun and verb phrases, but not complex clauses. We can therefore gain speed and robustness of processing by sacrificing depth of analysis. The most prominent dependencies are formed, but unclear and ambiguous ones are left unresolved.
For the purposes of information extraction, shallow parsing is usually sufficient and therefore preferable to full analysis because of its far greater speed and robustness.
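A minimal chunking sketch in the spirit of shallow parsing, assuming NLTK's RegexpParser and a single hand-written noun-phrase rule; the grammar is deliberately simple and is not the rule set of any particular system.

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Chunk only simple noun phrases: optional determiner, adjectives, then nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumped over the lazy dog."))
print(chunker.parse(tagged))
# Subtrees labelled NP mark simple, unambiguous noun phrases; the rest stays flat.
```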
Problem-Dependent Tasks
After text preprocessing and problem-independent processing, the final stage is to create meaningful representations for later, more sophisticated processing. Normally, to process a text, documents are expected to be represented as sets of features. Two common applications of text mining are text categorization (TC) and information extraction (IE). Both of these applications need a tagging (and sometimes parsing) process. TC and IE enable users to move from a machine-readable representation of the documents to a machine-understandable form of the documents.
Text categorization or text classification is the task of assigning a category (also called a class) to each document, such as giving the class 'political' to a political news article. The number of groups depends on user preference. The set of all possible categories is usually predefined manually, and the categories are usually unrelated to one another. Recently, however, multidimensional text classification and multi-class text classification have been explored intensively. Information extraction (IE) is the task of discovering important constituents in a text, such as what, where, who, whom, and when (5W). Without IE techniques, we would have much more limited knowledge discovery capabilities. IE is different from information retrieval (IR), which performs search. Information retrieval just discovers documents relevant to a given query and lets the user read the whole document. IE, on the other hand, aims to extract the relevant information and present it in a structured format, such as a table. IE can save us the time of reading the whole document by providing the essential information in a structured form.
6.1. Centroid-based Text Classification
With the fast growth of online text information, there has been a pressing need to find and organize relevant information in text documents. For this purpose, automatic text categorization (also known as text classification) has become a significant tool for utilizing text documents efficiently and effectively. As an application, it can improve text retrieval since it allows class-based retrieval instead of full retrieval. Given statistics acquired from a training set of labeled documents, text categorization uses these statistics to assign a class label to a new document. In the past, a variety of classification models were developed in different schemes, such as probabilistic models (e.g., Bayesian classification), decision trees and rules, regression models, example-based models (e.g., k-nearest neighbor or k-NN), linear models, support vector machines, neural networks and so on. Among these methods, a variant of linear models called a centroid-based or linear discriminant model is attractive since it requires relatively less computation than other methods in both the learning and classification stages. The traditional centroid-based method can be viewed as a specialization of the so-called Rocchio method proposed by Rocchio (1971) and used in several works on text categorization (Joachims, 1997). Based on the vector space model, a centroid-based method computes beforehand, for each class (category), an explicit profile (or class prototype), which is a centroid vector of all positive training documents of that category. The classification task is to find the class most similar to the vector of the document we would like to classify, for example by means of cosine similarity. Despite the lower computation time, centroid-based methods have been shown to achieve relatively high classification accuracy. In a centroid-based model, an individual class is modeled by weighting the terms appearing in the training documents assigned to the class. This makes the classification performance of the model strongly dependent on the weighting method applied in the model. Most previous works on centroid-based classification focused on weighting factors related to frequency patterns of words or documents in the class. Moreover, these factors are often obtained from statistics within a class (i.e., positive examples of the class). The most popular factors are term frequency (tf) and inverse document frequency (idf).
Text categorization or text classification (TC) is the task of assigning a Boolean value to each pair $(d_j, c_i) \in D \times C$, where $D$ is a domain of documents and $C = \{c_1, c_2, \ldots, c_{|C|}\}$ is a set of predefined categories. A value of T (i.e., true) is assigned to $(d_j, c_i)$ when the document $d_j$ is determined to belong to the category $c_i$. On the other hand, a value of F (i.e., false) is assigned to $(d_j, c_i)$ when the document $d_j$ is determined not to belong to the category $c_i$. In general, text classification is composed of two main phases, called the model training phase and the classification phase. In the training phase, the task is to approximate the unknown target function $\Phi: D \times C \rightarrow \{T, F\}$ that describes how documents should be classified. Based on a training set, a function $\hat{\Phi}: D \times C \rightarrow \{T, F\}$ called the classifier (also called a rule, hypothesis, or model) is acquired as the result of this approximation. A good classifier is a model that coincides with the target function as much as possible.

The TC task discussed above is general. However, some additional factors or constraints are possible for this task. They include single-label vs. multi-label, category-pivoted vs. document-pivoted, and hard vs. ranking classification. Single-label classification assigns exactly one category to each $d_j$, while multi-label classification may give more than one category to the same $d_j$. A special case of single-label TC is binary TC, where each $d_j$ must be assigned either to a category $c_i$ or to its complement $\bar{c}_i$. From the pivot aspect, there are two different ways of using a text classifier. Given $d_j \in D$, the task of finding all categories $c_i \in C$ to which the document $d_j$ belongs is called document-pivoted classification. Alternatively, given $c_i \in C$, the task of finding all documents $d_j \in D$ that belong to the category $c_i$ is named category-pivoted classification. This distinction is more pragmatic than conceptual, and it matters when the sets $C$ and $D$ might not be available in their entirety right from the start. Lastly, hard categorization assigns a T or F decision to each pair $(d_j, c_i)$, while ranking categorization ranks the categories in $C$ according to their estimated appropriateness to $d_j$, without taking any hard decision on any of them. The task of ranking categorization is to approximate the unknown target function $\Phi: D \times C \rightarrow [0, 1]$ by generating a classifier $\hat{\Phi}: D \times C \rightarrow [0, 1]$ that matches the target function as much as possible. The result is to assign a number between 0 and 1 to each pair $(d_j, c_i)$. This value represents the likelihood that the document $d_j$ is classified into the category $c_i$. Finally, for each $d_j$, a ranked list of categories is obtained. This list would be of great help to a human expert making the final categorization decision. By these definitions, the task focused on in this work is a single-label, category-pivoted and hard classification.
6.1.1. Formulation of centroid-based text classification
In centroid-based text categorization, an explicit profile of a class (also called a class prototype) is calculated and used as the representative of all positive documents of the class. The classification task is to find the class most similar to the document we would like to classify, by comparing the document with the class prototype of the focused class. This approach is characterized by at least three factors: (1) representation basics, (2) class prototype construction, i.e., term weighting and normalization, and (3) classification execution, i.e., query weighting and similarity definition. Their details are described in the rest of this section.
Representation Basics
The most frequently used document representation in IR and TC is the so-called bag of words (BOW), where the words in a document are used as the basics for representing that document. There are also some works that use additional information, such as word position and word sequence, in the representation. In centroid-based text categorization, a document (or a class) is represented by a vector using a vector space model with BOW. In this representation, each element (or feature) in the vector corresponds to a unique word with a weight. The method of giving a weight to a word varies from work to work, as described in the following section. In a more general framework, the concept of n-grams can be applied: instead of a single isolated word, a sequence of n words is used as the representation basic. In several applications, not specific to classification, the most popular n-grams are 1-grams (unigrams), 2-grams (bigrams) and 3-grams (trigrams). Alternatively, the combination of different n-grams, for instance the combination of unigrams and bigrams, can also be applied. The n-grams or their combinations form a set of so-called terms that are used for representing a document. Although a higher-order n-gram provides more information and this may improve classification accuracy, more training data and computational power are required.
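As a concrete illustration of these representation basics, the following is a minimal sketch, assuming scikit-learn is available; the toy documents and parameter values are illustrative and do not reproduce the experimental setup reported later in this section.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "overdose may cause severe drowsiness",
    "patient information describes overdose symptoms",
]

# Combined unigram+bigram terms with English stop-word removal; min_df
# plays the role of a minimum-occurrence threshold.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=1)
X = vectorizer.fit_transform(docs)          # document-by-term count matrix
print(vectorizer.get_feature_names_out())   # the unigram and bigram "terms"
print(X.toarray())
```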
Class Prototype Construction: Term Weighting and Normalization
Once we obtain the set of terms in a document, it is necessary to represent them numerically. Towards this, term weighting is applied to set the level of contribution of a term to a document. In the past, most existing works applied term frequency (tf) and inverse document frequency (idf) in the form of $tf \times idf$ for representing a document. In the vector space model, given a set of documents $D = \{d_1, d_2, \ldots, d_N\}$, a document $d_j$ is represented by a vector $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{mj})$, where $w_{ij}$ is the weight assigned to the term $t_i$ in the document. Here, assume that there are $m$ unique terms in the universe. The representation of the document is defined as follows:

$w_{ij} = tf_{ij} \times idf_i$

In this definition, $tf_{ij}$ is the term frequency of the term $t_i$ in the document $d_j$, and $idf_i$ is defined as $\log(N / df_i)$. Here, $N$ is the total number of documents in the collection and $df_i$ is the number of documents which contain the term $t_i$. Three alternative types of term frequency are (1) occurrence frequency, (2) augmented normalized term frequency and (3) binary term frequency. The occurrence frequency, the simplest and most intuitive one, is the number of occurrences of the term in a document. The augmented normalized term frequency is defined by $0.5 + 0.5 \times (tf / tf_{max})$, where $tf$ is the occurrence frequency and $tf_{max}$ is the maximum term frequency in the document. This compensates for relatively high term frequencies in the case of long documents. It works well when there are many technically meaningful terms in the documents. The binary term frequency is simply 1 for the presence and 0 for the absence of the term in the document. Term frequency alone may not be enough to represent the contribution of a term in a document. To achieve better performance, the well-known inverse document frequency can be applied to reduce the impact of frequent terms that exist in almost all documents.
Besides term weighting, normalization is another important factor in representing a document or a class. Without normalization, the classification result will strongly depend on document length. A long document is more likely to be selected than a short document, since it usually yields higher term frequencies and more unique terms in the document representation. The higher term frequencies of a long document increase the average contribution of its terms to the similarity between the document and the query. More unique terms also increase the similarity and the chance that longer documents are retrieved in preference to shorter documents. To solve this issue, all relevant documents should normally be treated as equally important for classification or retrieval. Normalization by document length is incorporated into the term weighting formula to equalize the lengths of document vectors. Although there are several normalization techniques, including cosine normalization and byte length normalization, cosine normalization is the most commonly used. It can solve the problem of overweighting due to both higher term frequencies and more unique terms. Cosine normalization divides every element of a vector by the length of the vector, that is,

$w^{(norm)}_{ij} = \frac{w_{ij}}{\sqrt{\sum_{k=1}^{m} w_{kj}^2}}$

where $w_{ij}$ is the weight of the term before normalization.
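The weighting and normalization above can be summarized in a few lines of code. The following is a minimal sketch on an invented term-frequency matrix; the natural logarithm for idf is an assumption, as the base is not fixed by the text.

```python
import numpy as np

# Rows = documents, columns = terms: raw occurrence frequencies (toy data).
tf = np.array([[3.0, 0.0, 1.0],
               [1.0, 2.0, 0.0],
               [0.0, 1.0, 4.0]])

N = tf.shape[0]                  # total number of documents in the collection
df = (tf > 0).sum(axis=0)        # number of documents containing each term
idf = np.log(N / df)             # inverse document frequency

w = tf * idf                     # tf-idf weights before normalization

# Cosine normalization: divide each document vector by its Euclidean length
# so that long documents no longer dominate the similarity computation.
w_norm = w / np.linalg.norm(w, axis=1, keepdims=True)
print(np.linalg.norm(w_norm, axis=1))   # every row is now a unit vector
```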
Given a class with the set of its assigned documents, there are two possible alternatives for creating a class prototype. One is to normalize each document vector in the class before summing up all document vectors to form a class prototype vector (normalization then merging). The other is to sum up all document vectors before normalizing the resulting vector (merging then normalization). The latter is also called a prototype vector, which is invariant to the number of documents per class. Both methods obtain high classification accuracy with small time complexity. The class prototype can be derived as follows. Let $D_k = \{\vec{d}_j \mid d_j$ is a document belonging to the class $c_k\}$ be the set of document vectors assigned to the class $c_k$. Here, a class prototype $\vec{c}_k$ is obtained by summing up all document vectors in $D_k$ and then normalizing the result by its length, as follows:

$\vec{c}_k = \frac{\sum_{\vec{d} \in D_k} \vec{d}}{\left\| \sum_{\vec{d} \in D_k} \vec{d} \right\|}$
Classification Execution: Query Weighting and Similarity Definition
The last but not least important factors are query weighting and similarity definition. For query weighting, the term weighting described above can also be applied to a query or a test document (i.e., a document to be classified). The simple term weighting for a query is $tf \times idf$. In the same way as in class prototype construction, there are three possible types of term frequency: occurrence frequency, augmented normalized term frequency and binary term frequency. Once a class prototype vector and a query vector have been constructed, the similarity between these two vectors can be calculated. The most popular measure is cosine similarity, which here can be calculated as the dot product between the two vectors. The test document will be assigned to the class whose class prototype vector is the most similar to the query vector $\vec{q}$ of the test document:

$c^{*} = \arg\max_{c_k} \ \vec{q} \cdot \vec{c}_k$

Here, as stated before, $\|\vec{c}_k\|$ is equal to 1 since the class prototype vector has been normalized. Moreover, the normalization of the test document has no effect on the ranking. Therefore, the test document is assigned to the class whose class prototype vector yields the highest dot product with the test document vector.
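Putting the above together, a centroid-based classifier reduces to a few lines. The sketch below follows the merging-then-normalization strategy and dot-product scoring described in this section; the toy document vectors and class labels are assumptions made only for demonstration.

```python
import numpy as np

def class_prototypes(doc_vectors, labels):
    """Merging then normalization: sum the document vectors of each class
    and scale the sum to unit length, giving one prototype per class."""
    protos = {}
    for c in set(labels):
        s = doc_vectors[[i for i, y in enumerate(labels) if y == c]].sum(axis=0)
        protos[c] = s / np.linalg.norm(s)
    return protos

def classify(query_vec, protos):
    """Assign the class whose prototype maximizes the dot product with the
    query; prototypes are unit vectors, so this ranks by cosine similarity."""
    return max(protos, key=lambda c: float(np.dot(query_vec, protos[c])))

# Toy weighted document vectors (e.g. tf-idf) and their class labels.
docs = np.array([[1.0, 0.2, 0.0],
                 [0.8, 0.1, 0.1],
                 [0.0, 0.3, 1.0],
                 [0.1, 0.2, 0.9]])
labels = ["sports", "sports", "politics", "politics"]

protos = class_prototypes(docs, labels)
print(classify(np.array([0.9, 0.2, 0.05]), protos))   # -> sports
```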
6.1.2. Effect of Term distributions
Originally, Lertnattee and Theeramunkong (2004a, 2006a, 2006b, 2007b) have done a series of
research works to investigate the effect of term distributions on classification accuracy.
Therefore, the reader can find the full description of this work in those publications. In this
section, the summary of this work is given. Here, three types of term distributions, called inter-
class, intra-class and in-collection distributions, are introduced. These distributions are expected
to increase classification accuracy by exploiting information of (1) term distribution among
classes, (2) term distribution within a class and (3) term distribution in the whole collection of
training data. They are used to represent importance or classification power to weight that term
in a document. Another objective of this work is to investigate the pattern of how these term
distributions contribute to weight a term in documents. For example, high term distribution of a
word (or term) should promote or demote importance of that word. Here, it is also possible to
consider unigram or bigram as document representation.
Term distributions
The first question is what characteristics make a term significant for representing a document or a class. In general, we can observe that (1) a significant term should appear frequently in a certain class and (2) it should appear in few documents. These two properties can be handled by the conventional term frequency and inverse document frequency, respectively. However, we can further observe that (1) a significant term should not be distributed very differently among documents in the whole collection, (2) it should be distributed very differently among classes, and (3) it should not be distributed very differently among documents within a class. These three characteristics cannot be represented by conventional tf and idf; it is necessary to use distribution (relative information) instead of frequency (absolute information). The distribution-related information that we can exploit includes the distributions of terms among classes, within a class and in the whole collection. These three kinds of information can be defined as the inter-class standard deviation (icsd), the class standard deviation (csd) and the standard deviation (sd). Let $tf_{ijk}$ be the term frequency of the term $t_i$ in the document $d_j$ of the class $c_k$. The formal definitions of icsd, csd and sd are given below:

$icsd_i = \sqrt{\frac{1}{|C|} \sum_{k} \left( \overline{tf_{ik}} - \overline{tf_i} \right)^2}, \qquad csd_{ik} = \sqrt{\frac{1}{n_k} \sum_{d_j \in c_k} \left( tf_{ijk} - \overline{tf_{ik}} \right)^2}, \qquad sd_i = \sqrt{\frac{1}{n} \sum_{j} \left( tf_{ij} - \overline{tf_i} \right)^2}$

where $\overline{tf_{ik}}$ is the average term frequency of the term $t_i$ over all documents within the class $c_k$, $\overline{tf_i}$ is the average term frequency of $t_i$ over the whole collection, $|C|$ is the number of classes, $n_k$ is the number of documents in the class $c_k$, and $n$ is the total number of documents.
1. Inter-class standard deviation (icsd):
The inter-class standard deviation of a term $t_i$ is calculated from the set of average frequencies $\overline{tf_{ik}}$, each of which is gathered from one class $c_k$. This deviation is an inter-class factor; the icsd of a term is therefore independent of any particular class. A term with a high icsd is distributed differently among classes and should have higher discriminating power for classification than the others. This factor promotes a term that exists in almost all classes but whose frequencies in those classes are quite different. In this situation, the conventional factors tf and idf are not helpful.

2. Class standard deviation (csd):
The class standard deviation of a term $t_i$ in a class $c_k$ is calculated from the set of term frequencies $tf_{ijk}$, each of which comes from the frequency of that term in one document of the class. This deviation is an intra-class factor; the csds of a term therefore vary from class to class. Different terms may appear with quite different frequencies among the documents in the class, and this difference is captured by this deviation. A term with a high csd appears in most documents of the class with quite different frequencies and should not be a good representative term of the class. A low csd of a term may be triggered by either of two reasons: the occurrences of the term are nearly equal for all documents in the class, or the term rarely occurs in the class.

3. Standard deviation (sd):
The standard deviation of a term $t_i$ is calculated from the set of term frequencies $tf_{ij}$, each of which comes from the frequency of that term in one document of the collection. This deviation is a collection factor; the sd of a term is independent of classes. Different terms may appear with quite different frequencies among the documents in the collection, and this difference is likewise captured by this deviation. A term with a high sd appears in most documents in the collection with quite different frequencies. A low sd of a term may be caused by either of two reasons: the occurrences of the term are nearly equal for all documents in the collection, or the term rarely occurs in the collection.
Enhancement of term weighting using term distributions
The second question is how the above-mentioned term distributions contribute to term weighting. The term distributions, i.e., icsd, csd and sd, can enhance the performance of a centroid-based classifier with the standard weighting $tf \times idf$. Two issues of consideration are whether these distributions should act as a promoter (multiplier) or a demoter (divisor), and how strongly they affect the weight. To capture these characteristics, term weighting can be designed using the following skeleton, where $w_{ik}$ is the weight given to the term $t_i$ in the class $c_k$:

$w_{ik} = tf_{ik} \times idf_i \times icsd_i^{\alpha} \times csd_{ik}^{\beta} \times sd_i^{\gamma}$

This weighting includes the term distribution factors. The parameters $\alpha$, $\beta$ and $\gamma$ are numeric values used for setting the contribution levels of icsd, csd and sd to the term weighting, respectively. For each parameter, a positive number means the factor acts as a promoter, while a negative one means the factor acts as a demoter. Moreover, the larger the magnitude of a parameter, the more the corresponding factor contributes to the term weighting as either a promoter or a demoter.
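To make the skeleton concrete, the sketch below computes icsd, csd and sd for a single term and combines them into the distribution factor $icsd^{\alpha} \times csd^{\beta} \times sd^{\gamma}$ that multiplies the standard $tf \times idf$ weight. The toy frequencies, the use of population standard deviations, and the small epsilon guard against zero factors are assumptions of this illustration.

```python
import numpy as np

def distribution_factor(tf, labels, alpha=0.5, beta=-0.5, gamma=-0.5):
    """Per-class factor icsd^alpha * csd^beta * sd^gamma for one term,
    where tf holds that term's frequency in every training document."""
    classes = sorted(set(labels))
    class_means = np.array([tf[labels == c].mean() for c in classes])

    icsd = class_means.std()                           # deviation across classes
    sd = tf.std()                                      # deviation in the collection
    csd = {c: tf[labels == c].std() for c in classes}  # deviation within each class

    eps = 1e-9  # guard: avoid 0 raised to a negative power
    return {c: (icsd + eps) ** alpha * (csd[c] + eps) ** beta * (sd + eps) ** gamma
            for c in classes}

tf = np.array([5.0, 4.0, 0.0, 1.0])       # a term's frequency in four documents
labels = np.array(["a", "a", "b", "b"])   # two classes
print(distribution_factor(tf, labels))    # TCB1-style powers: 0.5, -0.5, -0.5
```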
Data sets and experimental settings
The following describes a set of experiments to investigate the effect of term distributions on classification accuracy. Four data sets are used in the experiments: (1) Drug Information (DI), (2) Newsgroups (News), (3) WebKB1 and (4) WebKB2. The first data set, DI, is a set of web pages collected from www.rxlist.com. It includes 4480 English web pages in seven classes: adverse drug reaction, clinical pharmacology, description, indications, overdose, patient information, and warning. Each web page in this data set consists of informative content with a few links, and its structure is well organized. The second data set, Newsgroups, contains 19,997 documents. The articles are grouped into 20 different UseNet discussion groups; in this data set, some groups are very similar. The third and fourth data sets are constructed from WebKB, which contains 8145 web pages. These web pages were collected from the computer science departments of four universities, with some additional pages from other universities. The collection can be arranged into seven classes. In this experiment, the four most popular classes, student, faculty, course and project, are used as the third data set, called WebKB1; the total number of web pages is 4199. Alternatively, this reduced collection can be rearranged into five classes by university (WebKB2): cornell, texas, washington, wisconsin and misc (collected from other universities). The pages in WebKB vary in style, ranging from quite informative pages to link pages. Table 6-1 indicates the major characteristics of the data sets. More detail about the document distribution of each class in WebKB is shown in Table 6-2.
Table 6-1: Characteristics of the four data sets

Data sets              DI      News        WebKB1   WebKB2
1. Type of docs        HTML    Plain text  HTML     HTML
2. No. of docs         4480    19,997      4199     4199
3. No. of classes      7       20          4        5
4. No. of docs/class   640     1000        Varied   Varied
Table 6-2: The distribution of the documents in WebKB1 and WebKB2

WebKB1     Cornell   Texas   Washington   Wisconsin   Misc.   Subtotal
Course        44       38        77           85        686        930
Faculty       34       46        31           42        971       1124
Project       20       20        21           25        418        504
Student      128      148       126          156       1083       1641
Subtotal     226      252       255          308       3158       4199
(columns: the five WebKB2 classes)
For the HTML-based data sets (i.e., DI and WebKB), all HTML tags are eliminated from the documents in order to make the classification process depend not on the tag sets but on the content of the web documents. For a similar reason, all headers are omitted from the Newsgroups documents, the e-mail-based data set. For all data sets, a stop word list is applied to remove common words, such as a, for, the and so on, from the documents. This means that when a unigram model is employed, a vector is constructed from all features (words) except stop words. In the case of a bigram model, after eliminating stop words, any two contiguous words are combined into a term as the representation basic. Moreover, terms occurring fewer than three times are ignored.
6.1.3. Experimental Settings and Results
This section shows two experimental results as an investigation of the effect of term distributions. In the first experiment, term distribution factors are combined in different manners, and the efficiencies of these combinations are evaluated. From now on, let us call the classifiers that incorporate term distribution factors in their weighting term-distribution-based centroid-based classifiers (TCBs). In the second experiment, the top 10 TCBs obtained from the first experiment are selected for investigating the effect of term distribution factors under different types of frequency-based query weighting, in both unigram and bigram models. Three types of query weighting are investigated: term frequency, binary and augmented normalized term frequency. The TCBs are compared to a number of well-known methods as baselines: a standard centroid-based classifier (for short, SCB), a centroid-based classifier whose term weighting is modified with information gain (for short, SCBIG), k-NN and naïve Bayes (for short, NB). In both experiments, a data set is split into two parts: 90% for the training set and 10% for the test set. (In a further experiment on the effect of training set size, reported in the original publication, the test set is fixed at 10% of the whole data set while the training set is varied from 10% to 90%.) All experiments perform 10-fold cross validation.
One of the most important factors for a meaningful evaluation is the way classifier parameters are set. The parameters applied to these classifiers were determined by preliminary experiments. For SCB, we apply the standard term weighting $tf \times idf$. For SCBIG, a term goodness criterion called information gain (IG) is applied to adjust the weight in SCB, resulting in $tf \times idf \times IG$. The k values in k-NN are set to 20 for DI, 30 for Newsgroups and 50 for WebKB1 and WebKB2. Moreover, the term weighting used in k-NN is $(0.5 + 0.5 \times tf/tf_{max}) \times idf$, where $tf_{max}$ is the maximum term frequency in a document. These k values and this term weighting performed well in our pretests. For NB, two possible alternative methods for calculating the posterior probability are binary frequency and occurrence frequency. The occurrence frequency is selected for comparison since it outperforms the binary frequency. The query weighting for TCBs is the term frequency by default. As the performance indicator, classification accuracy is applied. It is defined as the ratio of the number of documents assigned their correct classes to the total number of documents in the test set.
Effect of term distribution factors
This experiment investigates combinations of the term distribution factors for improving classification accuracy. Although preliminary experiments on the individual factors already suggest the role of each term distribution factor, all possible combinations are explored here. The two following issues are taken into account: (1) which factors are suitable to work together and (2) what the appropriate combination of these factors is. To this end, we form all combinations of icsd, csd and sd by varying the power of each factor between -1 and 1 with a step of 0.5, and use them to modify the standard weighting $tf \times idf$. As before, a positive power means the factor acts as a promoter, while a negative one means it acts as a demoter. The total number of combinations is 125 (= 5 × 5 × 5). These combinations include the standard $tf \times idf$ itself (all powers equal to zero) and the six single-factor term weightings. From the results, we find that only 19 patterns give better performance than $tf \times idf$. The 20 best (top 20) and the 20 worst classifiers, according to average accuracy on the four data sets, are selected for evaluation. Table 6-3 shows the number of the best (worst) classifiers for each power of icsd, csd and sd; the numbers in parentheses show the numbers of the top 10 (worst 10) classifiers for each power. For more detail, the characteristics and performances of the top 20 term weightings are shown in Table 6-4. Both results were originally provided in (Lertnattee and Theeramunkong, 2004a).
Table 6-3: Descriptive analysis of term distribution factors (TDF) with different powers of each factor. Part A: the best 20 and Part B: the worst 20 (top 10 and worst 10 in parentheses) (source: Lertnattee and Theeramunkong, 2004a)

                       Power of the factor
TDF        -1       -0.5     0        0.5      1        Number of methods
Part A
icsd       0(0)     0(0)     6(2)     9(5)     5(3)     20(10)
csd        5(4)     7(4)     6(2)     2(0)     0(0)     20(10)
sd         9(4)     7(4)     4(2)     0(0)     0(0)     20(10)
Part B
icsd       6(1)     3(1)     3(2)     3(3)     5(3)     20(10)
csd        4(0)     0(0)     1(0)     6(2)     9(8)     20(10)
sd         1(0)     1(0)     1(0)     6(3)     11(7)    20(10)
Table 6-3 (part A) supports the same conclusion as those preliminary results: sd and csd are suitable as demoters rather than promoters, while icsd performs the opposite. Among the best classifiers there are almost no exceptions to this pattern, apart from csd, and the pattern is even more obvious for the top 10. On the other hand, Table 6-3 (part B) shows that performance is low when sd and csd are applied as promoters. However, it is not clear whether using icsd as a demoter harms the performance.
Table 6-4: Classification accuracy of the 20 best term weightings (source: Lertnattee and Theeramunkong, 2004a)

           Power of
Methods    icsd   csd    sd      DI      News    WebKB1   WebKB2   Avg.
TCB1*      0.5    -0.5   -0.5    96.81   79.52   82.45    92.67    87.86
TCB2*      0.5    -1     0       95.16   79.73   81.90    93.17    87.49
TCB3*      0.5    -1     -0.5    92.25   83.17   78.88    93.71    87.00
TCB4*      1      -0.5   -1      96.65   77.70   82.90    90.21    86.87
TCB5*      0.5    0      -1      96.14   77.67   81.50    91.24    86.63
TCB6*      0.5    -0.5   -1      92.57   83.13   78.64    91.62    86.49
TCB7       1      -1     -1      91.07   82.17   80.09    92.28    86.40
TCB8*      1      -1     -0.5    94.80   78.79   80.16    91.14    86.22
TCB9*      0      -0.5   0       93.75   80.70   80.90    89.19    86.13
TCB10*     0      0      -0.5    92.90   78.97   79.11    92.86    85.96
TCB11      0.5    -0.5   0       96.45   74.40   80.28    92.02    85.79
TCB12      0      -0.5   -0.5    90.56   83.08   76.28    92.40    85.58
TCB13      1      -0.5   -0.5    96.18   71.49   80.81    89.76    84.56
TCB14      0.5    0      -0.5    95.58   72.69   78.64    90.93    84.46
TCB15      0      0.5    -1      90.92   78.21   77.02    91.40    84.39
TCB16      0      0      -1      88.55   82.70   73.45    90.71    83.85
TCB17      1      0      -1      96.00   69.76   79.95    89.50    83.80
TCB18      0.5    0.5    -1      93.95   71.73   78.45    90.24    83.59
TCB19      0.5    -1     -1      90.67   82.09   70.64    90.64    83.51
SCB        0      0      0       91.67   74.76   77.71    88.76    83.23
Table 6-4 also marks with a * the classifiers that outperform the standard $tf \times idf$ in all four data sets. Here, nine such classifiers stand out. This fact shows that there are some term distribution combinations that are generally useful across all data sets. The best term weighting in this experiment is $tf \times idf \times icsd^{0.5} \times csd^{-0.5} \times sd^{-0.5}$ (TCB1); that is, the powers are 0.5 for icsd and -0.5 for both csd and sd. However, it can be observed that the appropriate powers of the term distribution factors depend on some characteristics of the data sets. For instance, when the power of csd changes from -0.5 to -1.0 (TCB1 to TCB3 in Table 6-4), the performance on DI and WebKB1 decreases but that on Newsgroups and WebKB2 increases. This suggests that csd, a class dependency factor, is more important for Newsgroups and WebKB2 than for DI and WebKB1.
Experiments with different query weightings, and unigram/bigram models
In this experiment, the top 10 TCBs obtained from the previous experiment are selected for exploring the effect of term distribution factors under different types of query weighting, in both unigram and bigram models. The TCBs are compared to SCB, SCBIG, k-NN and NB. Three types of query weighting are investigated: term frequency (n), binary (b) and augmented normalized term frequency (a). The simple query weighting (n) sets the term frequency (or occurrence frequency) tf as the weight of a term in a query. The binary query weighting (b) sets either 0 or 1 for the terms in a query. The augmented normalized term frequency (a) defines $0.5 + 0.5 \times tf/tf_{max}$ as the weight of a term in a query. The query term weighting is applied to all centroid-based classifiers, i.e., TCBs, SCB and SCBIG. Furthermore, the query term weighting is modified by multiplying the original weight by the inverse document frequency (idf). The results for the unigram and bigram models are shown in Table 6-5 (parts A and B), respectively.
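The three query weightings can be expressed compactly. The following minimal sketch is an illustration (the function name and the toy query are assumptions); in the experiments, each weight is additionally multiplied by idf as noted above.

```python
import numpy as np

def query_weight(tf, scheme="n"):
    """Query weightings: raw term frequency (n), binary (b), and
    augmented normalized term frequency (a)."""
    tf = np.asarray(tf, dtype=float)
    if scheme == "n":
        return tf
    if scheme == "b":
        return (tf > 0).astype(float)
    if scheme == "a":
        return 0.5 + 0.5 * tf / tf.max()  # tf.max(): largest frequency in the query
    raise ValueError(scheme)

q = [3, 0, 1]   # toy query term frequencies
for s in "nba":
    print(s, query_weight(q, s))
```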
Table 6-5: Accuracy of the top 10 TCBs with different types of query weighting compared to SCB, SCBIG, k-NN and NB for unigram and bigram models (source: Lertnattee and Theeramunkong, 2004a)

           DI                    News                  WebKB1                WebKB2
Method     n      b      a       n      b      a       n      b      a       n      b      a
Part A (Unigram)
TCB1       96.81  97.86  97.81   79.52  79.66  79.78   82.45  84.66  84.59   92.67  90.83  91.43
TCB2       95.16  95.96  95.87   79.73  80.93  80.95   81.90  85.33  85.12   93.17  93.05  93.21
TCB3       92.25  92.90  92.90   83.17  83.44  83.64   78.88  82.62  82.14   93.71  92.47  92.95
TCB4       96.65  97.46  97.39   77.70  77.72  77.88   82.90  85.12  85.02   90.21  88.02  88.83
TCB5       96.14  97.25  97.21   77.67  78.16  78.14   81.50  83.54  83.26   91.24  89.16  89.76
TCB6       92.57  93.06  93.01   83.13  83.30  83.46   78.64  81.07  80.61   91.62  89.19  89.76
TCB7       91.07  92.10  92.08   82.17  82.95  82.95   80.09  83.83  83.54   92.28  90.52  90.93
TCB8       94.80  96.36  96.32   78.79  79.53  79.62   80.16  84.12  84.02   91.14  89.69  90.12
TCB9       93.75  94.62  94.55   80.70  80.83  80.91   80.90  83.19  82.95   89.19  91.21  91.07
TCB10      92.90  94.51  94.38   78.97  79.38  79.57   79.11  81.47  81.21   92.86  91.71  92.19
SCB        91.67  92.99  93.01   74.76  75.29  75.37   77.71  78.66  78.73   88.76  91.12  91.07
SCBIG      96.19  97.43  97.39   60.83  59.31  60.40   75.02  78.78  78.26   90.26  89.59  89.95
k-NN       94.60                 82.69                 68.33                 89.16
NB         95.00                 80.82                 81.40                 87.45
Part B (Bigram)
TCB1       98.73  99.35  99.35   81.83  82.37  82.36   84.19  86.71  86.35   93.88  93.36  93.43
TCB2       90.33  94.75  94.53   82.27  83.00  83.04   83.66  87.88  87.47   94.67  94.71  94.81
TCB3       97.90  99.24  99.22   85.15  85.20  85.24   82.47  85.69  85.26   95.52  94.98  95.19
TCB4       98.64  99.38  99.33   80.32  80.94  80.88   84.57  87.43  87.09   92.02  91.19  91.45
TCB5       98.04  98.68  98.68   81.22  81.94  81.91   83.40  85.76  85.43   92.74  92.17  92.36
TCB6       98.95  99.31  99.31   85.58  85.66  85.71   82.78  85.45  85.12   94.14  93.36  93.52
TCB7       85.25  90.13  89.71   84.80  84.98  84.92   82.88  86.78  86.31   94.05  93.47  93.57
TCB8       80.36  85.98  85.60   81.43  82.31  82.25   81.88  86.88  86.19   92.76  92.33  92.45
TCB9       98.42  98.93  98.91   82.77  83.05  83.01   82.47  85.16  84.76   93.81  94.88  94.88
TCB10      97.54  98.17  98.15   82.37  82.71  82.75   81.09  84.14  83.71   94.43  94.17  94.26
SCB        96.07  97.41  97.37   77.40  78.44  78.37   79.14  81.71  81.31   92.31  93.62  93.74
SCBIG      97.83  99.00  98.88   62.50  61.82  62.50   76.07  80.50  80.23   92.24  92.97  93.02
k-NN       97.48                 82.75                 70.16                 91.62
NB         96.76                 82.83                 82.21                 94.02
According to the results, the TCBs outperform SCB, SCBIG, k-NN and NB in almost all cases, in both the unigram and bigram models, independently of the query weighting. In general, the bigram model achieves better performance than the unigram model, and in the bigram model the term distributions are still useful for improving classification accuracy. It is hard to determine which query weighting performs better than the others, but the term distributions are helpful for all types of query weighting. For SCBIG, the accuracy on DI improves significantly; however, its performance is on average slightly lower than that of SCB. TCB1, TCB2 and TCB3 seem to achieve higher accuracy than the others, even though TCB4 and TCB6 perform better in the bigram model for DI and News, respectively.
Related works on centroid-based classification
Term weighting plays an important role in achieving high performance in text classification. In the past, most approaches (Salton and Buckley, 1988; Skalak, 1994; Ittner et al., 1995; Chuang et al., 2000; Singhal et al., 1996; Sebastiani, 2002) used frequency-based factors, such as term frequency and inverse document frequency, for setting the weights of terms. In these approaches, the problem that a long document may suppress a short document is solved by performing normalization on document vectors or class prototype vectors. That is, the vector representing any document or class is transformed into a unit vector whose length equals 1. In spite of this, it is doubtful whether such frequency-based term weighting is enough to reflect the importance of terms in the representation of a document or a class. There have also been works on adjusting weights using the relevance feedback approach. Among them, two popular schemes are the vector space model and probabilistic networks. For the vector space model, the Rocchio feedback model (Rocchio, 1971; Salton, 1989; Joachims, 1997) is the most commonly used method. The method attempts to use both positive and negative instances in term weighting; one can expect a more effective profile representation generated from relevance feedback. In the probabilistic network approach, a query can be modified by adding the first m terms from a ranked list of all terms present in documents deemed relevant (Robertson and Sparck-Jones, 1976).
The probabilistic indexing technique was suggested by Fuhr (1989), and Joachims (1997) applied a probabilistic analysis of this technique to the Rocchio classifier with $tf \times idf$ term weighting. Deng et al. (2002) introduced an approach that uses statistics within a class, called the ''category relevance factor'', to improve classification accuracy. Debole and Sebastiani (2003) evaluated some feature selection methods, such as chi-square, information gain and gain ratio. These feature selection methods were applied in term weighting as substitutes for idf on three classifiers: k-NN, NB and Rocchio. Their results indicate that these methods might be useful for k-NN and support vector machines but seem useless for Rocchio. More recently, centroid-based classifiers that take term distributions into consideration have been explored by Han and Karypis (2000), Lertnattee and Theeramunkong (2004a; 2004b; 2005; 2006b; 2007a; 2007b; 2009) and Theeramunkong and Lertnattee (2007). As a kind of term distribution, normalization is also an important factor for better accuracy, as investigated by Singhal et al. (1995; 1996) and Lertnattee and Theeramunkong (2003; 2006a). A survey of statistical approaches to text categorization was given by Yang (1999) and Yang and Liu (1999). Text classification with semi-supervised learning can be found in (Nigam et al., 2000).
Conclusions
Section 6.1 has shown that term distributions are useful for improving accuracy in centroid-based classification. Three types of term distributions, the inter-class standard deviation (icsd), the class standard deviation (csd) and the standard deviation (sd), were introduced to exploit information outside and inside a class as well as in the whole collection. The distributions were used to represent the discriminating power of each term and then to weight that term. To investigate the pattern of how these term distributions contribute to weighting each term in documents, we varied their contributions to the term weighting and constructed a number of centroid-based classifiers with different term weightings. The effectiveness of term distributions was explored on various data sets. As baselines, a standard centroid-based classifier with $tf \times idf$ (SCB), a centroid-based classifier with $tf \times idf \times IG$ (SCBIG), and two well-known methods, k-NN and naïve Bayes, were employed. Furthermore, both unigram and bigram models were investigated. The experimental results showed the benefit of term distributions in classification and revealed a certain pattern in how they contribute to term weighting: terms with a low sd and a low csd should be emphasized, while terms with a high icsd should be given more importance. For more detail, the reader is referred to (Lertnattee and Theeramunkong, 2004a).
6.2. Document Relation Extraction
Nowadays, it has become difficult for researchers to follow the state of the art in their areas of interest, since the number of research publications increases continuously and quickly. Such a large volume of information poses a serious hindrance for researchers in positioning their own works against existing works, or in finding useful relations between them. Several research works, including (Kessler, 1963; Small, 1973; Ganiz, 2006), have worked towards a solution. Although the publication of each work may include a list of related articles (documents) as its references, it is still impossible to include all related works, due to either intentional reasons (e.g., limitations on paper length) or unintentional ones (e.g., works simply unknown to the authors). Enormous numbers of meaningful connections that permeate the literature may remain hidden.
Growing from a different field, the approach of discovering hidden and significant relations within a bibliographic database, known as literature-based discovery and led by Swanson (1986; 1990), has become popular in medical-related fields. As a content-based approach with manual and/or semi-automatic processes, a set of topical words or terms is extracted as concepts and then utilized to find connections between two literatures. Due to the simplicity and practicality of this approach, it was used in several areas by succeeding works (Gordon and Dumais, 1998; Lindsay and Gordon, 1999; Pratt et al., 1999). Some works proposed citation analysis based on so-called bibliographic coupling (Kessler, 1963) and co-citation (Small, 1973). While these were successfully applied in several works (Nanba et al., 2000; White and McCain, 1989; Rousseau and Zuccala, 2004) to obtain topically related documents, they are not fully automated and involve a lot of labor-intensive tasks. Based on association rule mining, an automated approach to discovering relations among documents in a research publication database was introduced in Sriphaew and Theeramunkong (2005; 2007a; 2007b). By mapping a term (a word or a pair of words) to a transaction in a transactional database, topic-based relations among scientific publications are revealed under various document representations. Although this work represented the first attempt to find document relations automatically by exploiting terms in documents, it utilized only a simple evaluation without elaborate consideration.
There has been little exploration of how to evaluate document relations discovered from text collections. Most works in text mining utilize a dataset which includes both queries and their corresponding correct answers as a test collection. They usually define certain measures and use them for performance assessment on the test collection. For instance, classification accuracy is applied for assessing the class to which a document is assigned in text categorization (TC) (Rosch, 1978), while recall and precision are used to evaluate retrieved documents with regard to given query keywords in information retrieval (IR) (Salton and McGill, 1983). As a more naive evaluation method, human judgment has been used in more recent works on mining web documents, such as HITS (Kleinberg, 1999) and PageRank (Page et al., 1998), where there is no standard dataset. However, this manual evaluation is a labor-intensive task and quite subjective. Moreover, there is a lack of standard criteria for evaluating document relations. So far, while there have been several benchmark datasets for TC and IR tasks, e.g., the UCI Repository (www.ics.uci.edu/~mlearn/MLRepository.html), WebKB (www.webkb.org) and TREC data (trec.nist.gov/data.html), there is no standard dataset for the task of document relation discovery.
Toward resolving these issues, this section gives a brief introduction to a research work that uses the citation information in research publications as a source for evaluating the discovered document relations. The full description of this work can be found in (Sriphaew and Theeramunkong, 2007a).
Conceptually, the relations among documents can be formulated as a subgraph where each node represents a document and each arc represents a relation between two documents. Based on this formulation, a number of scoring methods are introduced for evaluating the discovered document relations in order to reflect their quality. Moreover, this work also introduces a generative probability, derived from probability theory, and uses it to compute an expected score that captures objectively how good the evaluation results are.
6.2.1. Document Relation Discovery using Frequent Itemset Mining
A formulation of the association rule mining (ARM) task for document relation discovery can be summarized as follows. Let $D = \{d_1, d_2, \ldots, d_n\}$ be a set of documents (items) and $T = \{t_1, t_2, \ldots, t_m\}$ be a set of terms (transactions). Also let $e_{ij}$ represent the existence (0 or 1) of a term $t_i$ in a document $d_j$. A subset of $D$ is called a docset, whereas a subset of $T$ is called a termset. Furthermore, a docset $X \subseteq D$ with $k$ documents is called a $k$-docset (or a docset with length $k$). The support of $X$ is the fraction of terms (transactions) that occur in every document of $X$:

$supp(X) = \frac{|\{\, t_i \in T \mid \forall d_j \in X,\ e_{ij} = 1 \,\}|}{|T|}$

A docset with support greater than a predefined minimum support is called a frequent $k$-docset. We will use the terms ''docset'', ''frequent docset'' and ''document relation'' interchangeably. Some kind of evaluation is then needed to assess which document relations are better, as shown below.
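As an illustration of this formulation, the sketch below assumes a toy term-to-document existence mapping: terms play the role of transactions, docsets play the role of itemsets, and the support of a docset is the fraction of terms shared by every document in it.

```python
from itertools import combinations

# Toy existence data: each term (transaction) maps to the documents containing it.
term_docs = {
    "mining":   {"d1", "d2", "d3"},
    "itemset":  {"d1", "d2"},
    "citation": {"d2", "d3"},
    "centroid": {"d1"},
}
docs = sorted({d for ds in term_docs.values() for d in ds})

def support(docset):
    """Fraction of terms that occur in every document of the docset."""
    shared = sum(all(d in ds for d in docset) for ds in term_docs.values())
    return shared / len(term_docs)

min_sup = 0.25
for k in (2, 3):
    for docset in combinations(docs, k):
        if support(docset) >= min_sup:
            print(docset, support(docset))   # frequent k-docsets = candidate relations
```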
6.2.2. Empirical Evaluation using Citation Information
This subsection presents a method that uses citations (references) among technical documents in a scientific publication collection to evaluate the quality of the discovered document relations. Intuitively, two documents are expected to be related under one of three basic situations: (1) one document cites the other (direct citation), (2) both documents cite the same document (bibliographic coupling) (Kessler, 1963), or (3) both documents are cited by the same document (co-citation) (Small, 1973). Citation analysis has been applied in several interesting applications (Nanba et al., 2000; White and McCain, 1989; Rousseau and Zuccala, 2004).
Besides these basic situations, two documents may be related to each other via a more complicated concept called transitivity. For example, if a document A cites a document B, and the document B in turn cites a document C, then one could assume a relation between A and C. In this work, using the transitivity property, the concept of order citation is proposed to express an indirect connection between two documents. With the assumption that a direct or indirect connection between two documents implies a topical relation between them, such connections can be used for evaluating the results of document relation discovery.
In the rest of this section, the u-th order citation and the v-th order accumulative citation matrix are introduced. Then, the so-called validity is proposed as a measure for evaluating discovered docsets using the information in the citation matrix. Finally, the expected validity is mathematically defined by exploiting the concepts of generative probability and estimation.
The Citation Graph and Its Matrix Representation
Conceptually, the citations among documents in a scientific publication collection form a citation graph, where a node corresponds to a document and an arc corresponds to a direct citation of one document by another. Based on this citation graph, an indirect citation can be defined using the concept of transitivity. The formulation of direct and indirect citations is given in terms of the u-th order citation and the v-th order accumulative citation matrix as follows.
Definition 1 (the u-th order citation): Let $D$ be a set of documents (items) in the database. For $x, y \in D$, $y$ is the u-th order citation of $x$ iff the number of arcs on the shortest path between $x$ and $y$ in the citation graph is $u$ ($u \geq 1$). Conversely, $x$ is also called the u-th order citation of $y$.
Figure 6-1: An Example of a citation graph. (source: Sriphaew and Theeramunkong, 2007a)
For example, given a set of six documents {d1, d2, ..., d6} and a set of six citations, d2 to d1,
d3 to d2 and d4, d6 to d4, and d5 to d2 and d3, the citation graph can be depicted in Figure
6-1. In the figure, d2 is the first, d3 and d5 are the second, and d4 is the third order citation of
the document d1. Note that although each citation has a direction, the direction is not taken into
account since the task is to detect document relations, for which the citation direction is of no
concern. Moreover, using only textual information without explicit citation or temporal
information, it is difficult to determine the direction of a citation between any two documents.
Based on the concept of the u-th order citation, the v-th order accumulative citation matrix is
introduced to express a set of citation relations stating whether any two documents can be
transitively reached by a shortest path of length at most v.
Definition 2 (the v-th order accumulative citation matrix): Given a set of n distinct documents,
the v-th order accumulative citation matrix (for short, v-OACM) is an n × n matrix, each element
of which represents the citation relation r^v(x,y) between two documents x, y, where
r^v(x,y) = 1 when x is the u-th order citation of y with u ≤ v, and r^v(x,y) = 0 otherwise. Note that
r^v(x,x) = 1 and r^v(x,y) = r^v(y,x).
For the previous example, the 1-, 2- and 3-
OACMs can be created as shown in Table 6-6. Here, the 1-, 2- and 3-OACMs are represented
together, each cell giving the triple of values [r^1(x,y), r^2(x,y), r^3(x,y)].
Table 6-6: The 1-, 2- and 3-OACMs, each cell showing the triple [r^1(x,y), r^2(x,y), r^3(x,y)]
for the row document x and the column document y.

Document   d1        d2        d3        d4        d5        d6
d1         [1,1,1]   [1,1,1]   [0,1,1]   [0,0,1]   [0,1,1]   [0,0,0]
d2         [1,1,1]   [1,1,1]   [1,1,1]   [0,1,1]   [1,1,1]   [0,0,1]
d3         [0,1,1]   [1,1,1]   [1,1,1]   [1,1,1]   [1,1,1]   [0,1,1]
d4         [0,0,1]   [0,1,1]   [1,1,1]   [1,1,1]   [0,1,1]   [1,1,1]
d5         [0,1,1]   [1,1,1]   [1,1,1]   [0,1,1]   [1,1,1]   [0,0,1]
d6         [0,0,0]   [0,0,1]   [0,1,1]   [1,1,1]   [0,0,1]   [1,1,1]
The 1-OACM can be straightforwardly constructed from the set of first-order citations (direct
citations). The (v+1)-OACM (mathematically denoted by a matrix R^(v+1)) can be recursively created
from the operation between the v-OACM (R^v) and the 1-OACM (R^1) according to the following formula:

r^(v+1)(i,j) = ∨_{k=1..n} ( r^v(i,k) ∧ r^1(k,j) )

where ∨ is an OR operator, ∧ is an AND operator, r^v(i,k) is the element at the i-th row and the k-th
column of the matrix R^v, and r^1(k,j) is the element at the k-th row and the j-th column of the matrix
R^1. Note that any v-OACM is a symmetric matrix.
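Since this recursion is exactly a Boolean matrix product, a v-OACM can be computed mechanically. The sketch below (ours, for illustration) multiplies integer 0/1 matrices and thresholds the result; R1 encodes the 1-OACM of the six-document example in Table 6-6.

import numpy as np

def next_oacm(R_v, R_1):
    # (v+1)-OACM: entry (i, j) is 1 iff some k has R_v[i][k] = R_1[k][j] = 1
    return ((R_v @ R_1) > 0).astype(int)

# 1-OACM of the six-document example (first element of each triple in Table 6-6)
R1 = np.array([[1, 1, 0, 0, 0, 0],
               [1, 1, 1, 0, 1, 0],
               [0, 1, 1, 1, 1, 0],
               [0, 0, 1, 1, 0, 1],
               [0, 1, 1, 0, 1, 0],
               [0, 0, 0, 1, 0, 1]])
R2 = next_oacm(R1, R1)   # 2-OACM
R3 = next_oacm(R2, R1)   # 3-OACM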
Validity: Quality of Document Relation
This section defines the validity, which is used as a measure for evaluating the quality of the
discovered docsets. The concept of the validity calculation is to investigate how the documents in a
discovered docset are related to each other according to the citation graph. Based on this concept,
the most preferable situation is that all documents in a docset directly cite and/or are cited by
at least one document in that docset, and thereby form one connected group. Since in
practice only a few references are given in a document, it is quite rare and unrealistic that all
related documents cite each other. As a generalization, we can assume that all documents in a
docset should cite and/or be cited by one another within a specific range in the citation graph.
Here, the shorter the specific range is, the more restrictive the evaluation is. With the concept of
the v-OACM stated in the previous section, we can realize this generalized evaluation by a so-called
v-th order validity (for short, v-validity), where v corresponds to the range mentioned above.
Regarding the criteria of evaluation, two alternative scoring methods can be employed for
defining the validity of a docset. In the first method, the score is computed as the ratio of the
number of citation relations that the most popular document in a docset has, to the maximum
possible number of such relations. The most popular document is the document that has the most
relations with the other documents in the docset. Note that it is possible to have more than one
most popular document in a docset. The score calculated by this method is called soft validity.

In the second method, a stricter criterion for scoring is applied. The score is set to 1 only
when the most popular document connects to all the other documents in the docset; otherwise, the
score is set to 0. This score is called hard validity. The soft v-validity and hard v-validity
of a docset X (|X| ≥ 2), denoted by V_s^v(X) and V_h^v(X) respectively, are defined as follows:

V_s^v(X) = ( max_{x∈X} Σ_{y∈X, y≠x} r^v(x,y) ) / ( |X| − 1 )

For simplicity, we denote the numerator in the above equation by δ^v(X). Then,

V_h^v(X) = 1 if δ^v(X) = |X| − 1, and 0 otherwise.
Here, r^v(x,y) is the citation relation defined in Definition 2. It can be observed that the soft v-
validity of a docset ranges from 0 to 1, i.e., 0 ≤ V_s^v(X) ≤ 1, while the hard v-validity is a binary
value of 0 or 1, i.e., V_h^v(X) ∈ {0, 1}. In both cases, the v-validity achieves its minimum (i.e., 0) when
there is no citation relation among any documents in the docset. On the other hand, it achieves its
maximum (i.e., 1) when there is at least one document that has a citation relation with all other
documents in the docset. Intuitively, the validity of a bigger docset tends to be lower than that of a
smaller docset since the probability that one document cites and/or is cited by all other documents in
the same docset becomes lower.
In practice, instead of an individual docset, the whole set of discovered docsets needs to be
evaluated. The easiest method is to use an arithmetic mean. However, directly using the
arithmetic mean is not fair since a bigger docset tends to have lower validity than a smaller one. We
need an aggregation method that reflects docset size in the summation of validities. One
reasonable method is a weighted mean, where each weight reflects the docset size. Therefore,
the soft v-validity and hard v-validity for a set of discovered docsets F, denoted by V_s^v(F) and
V_h^v(F) respectively, can be defined as follows:

V_s^v(F) = Σ_{X∈F} w(X) · V_s^v(X) / Σ_{X∈F} w(X)
V_h^v(F) = Σ_{X∈F} w(X) · V_h^v(X) / Σ_{X∈F} w(X)

where w(X) is the weight of a docset X. In this work, w(X) is set to |X| − 1, the maximum value that
the numerator δ^v(X) of a docset X can gain. As an example, given the 1-OACM in Table 6-6 and a
set F of discovered docsets, the set soft 1-validity V_s^1(F) and the set hard 1-validity V_h^1(F)
follow directly from these weighted means.
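For illustration, the following sketch (our own naming, not from the original work) computes the soft and hard v-validity of a single docset and the weighted set validity with weight w(X) = |X| − 1, given a v-OACM as a 0/1 matrix indexed by document number.

def docset_validity(docset, R, hard=False):
    # delta: largest number of citation relations that any single document
    # in the docset has with the other members (the most popular document)
    delta = max(sum(R[x][y] for y in docset if y != x) for x in docset)
    limit = len(docset) - 1
    if hard:
        return 1.0 if delta == limit else 0.0
    return delta / limit

def set_validity(docsets, R, hard=False):
    # Weighted mean over docsets, with weight w(X) = |X| - 1
    weights = [len(X) - 1 for X in docsets]
    scores = [docset_validity(X, R, hard) for X in docsets]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# With R1 from the previous sketch (documents indexed 0..5),
# docset_validity([0, 1, 2], R1) gives 1.0, since d2 relates to both d1 and d3.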
The Expected Validity
The evaluation of discovered docsets depends on the citation relation r^v(x,y), which is
represented by a v-OACM. As stated in the previous section, the lower v is, the more restrictive the
evaluation becomes. Therefore, to compare evaluations based on different v-OACMs, we need
a value, independent of the restrictiveness of the evaluation, that represents the expected validity of
a given set of docsets under each individual v-OACM. This section describes the method to
estimate the theoretical validity of a set of docsets based on probability theory. Towards this
estimation, the probability that two documents are related to each other under a v-OACM (later
called the base probability p^v) needs to be calculated. This probability is the ratio of the
number of existing citation relations to the number of all possible citation relations (i.e.,
n(n−1)), as shown in the following equation:

p^v = Σ_{x≠y} r^v(x,y) / ( n(n−1) )
For example, using the citation relations in Table 6-6, the base probabilities for the 1-, 2-, and 3-
OACMs are 0.40 (12/30), 0.73 (22/30) and 0.93 (28/30), respectively. Note that the base
probability of a higher-order OACM is always higher than or equal to that of a lower-order OACM.
Using the concept of expectation, the expected set v-validity E(V^v(F)) can be formulated as follows:

E(V^v(F)) = Σ_{X∈F} w(X) · E(V^v(X)) / Σ_{X∈F} w(X),   E(V^v(X)) = Σ_{ρ∈P_X} IV(ρ) · Pr(ρ)
where E(V^v(X)) is the expected v-validity of a docset X, P_X is the set of all possible citation
patterns for X, IV(ρ) is the invariant validity of the pattern ρ, and Pr(ρ) is the generative probability
of the pattern ρ estimated from the base probability p^v under the v-OACM. Theoretically, finding the
possible patterns of a docset can be transformed into a set enumeration problem. Given a docset
with the length of k (a k-docset), there are 2^(k(k−1)/2) possible citation patterns.
With different scoring methods, an invariant validity is defined individually for each criterion,
regardless of the v-OACM. To distinguish them, the notation IV(ρ) is specialized to IV_s(ρ) and IV_h(ρ)
for the invariant validity calculated from soft validity and hard validity, respectively. Similar to
V_s^v(X), the invariant validity of ρ for soft validity is defined as follows:

IV_s(ρ) = ( max_{x∈X} Σ_{y∈X, y≠x} r̃ρ(x,y) ) / ( |X| − 1 )

For simplicity, we denote the numerator in the above equation by δ̃(ρ). The invariant validity
of ρ based on hard validity is given by:

IV_h(ρ) = 1 if δ̃(ρ) = |X| − 1, and 0 otherwise.

In the above equations, r̃ρ(x,y) is the citation relation between two documents x, y in the citation
pattern ρ, where r̃ρ(x,y) = 1 when the citation relation exists and r̃ρ(x,y) = 0 otherwise. Note that
all ρ's have the same docset but represent different citation patterns. The following shows two
examples of how to calculate the expected v-validity for 2-docsets and 3-docsets. For simplicity,
the expected v-validity based on soft validity is described first, and the one based on hard
validity is discussed afterwards.

In the simplest case, a 2-docset, there are only two possible citation patterns. Therefore,
the expected v-validity based on soft validity of any 2-docset X can be calculated as follows:

E(V_s^v(X)) = p^v × 1 + (1 − p^v) × 0 = p^v
Figure 6-2: All possible citation patterns for a 3-docset. (source: Sriphaew and Theeramunkong, 2007a)
In the case of a 3-docset, there are eight possible patterns, as shown in Figure 6-2. Here, we can
calculate the invariant validity based on soft validity (IV_s) of each pattern as follows. The first to
fourth patterns have an invariant validity of 1 (i.e., IV_s(ρ1) = IV_s(ρ2) = IV_s(ρ3) = IV_s(ρ4) = 1).
The fifth to seventh patterns gain an invariant validity of 0.5 (i.e., IV_s(ρ5) = IV_s(ρ6) = IV_s(ρ7) = 0.5),
while the last pattern has an invariant validity of 0 (i.e., IV_s(ρ8) = 0).

The generative probability of the first pattern is (p^v)³ since there are three citation relations, and
that of each of the second to fourth patterns equals (p^v)²(1 − p^v) since there are two citation
relations and one missing citation relation. The generative probabilities of the other patterns can be
calculated in the same manner. From the generative probabilities shown in Figure 6-2, the expected
v-validity based on soft validity can be calculated as follows:

E(V_s^v(X)) = (p^v)³ × 1 + 3(p^v)²(1 − p^v) × 1 + 3 p^v (1 − p^v)² × 0.5 + (1 − p^v)³ × 0

Here, the first term comes from the first pattern, the second term is derived from the second to
fourth patterns, the third term is obtained from the fifth to seventh patterns, and the last
term is for the eighth pattern.
With the other criterion, hard validity, the expected v-validity for a 2-docset is still the same,
but a difference occurs for a 3-docset. The invariant validity based on hard validity (IV_h) equals
1 for the first to fourth patterns and becomes 0 for the other patterns. The expected v-validity for
a 3-docset based on hard validity is then reduced to:

E(V_h^v(X)) = (p^v)³ + 3(p^v)²(1 − p^v)

The above examples illustrate the calculation of the expected validity of a single docset. To
calculate the expected v-validity of several docsets in a given set, the weighted mean of their
expected validities can be derived. The outcome is used as the expected value against which the
results obtained from our method for discovering document relations are evaluated.
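The enumeration above is easy to mechanize. The following sketch (ours) walks over all 2^(k(k−1)/2) citation patterns of a k-docset and accumulates the expected soft or hard v-validity for a given base probability p; for k = 2 it returns p, and for k = 3 it reproduces the two closed forms just derived.

from itertools import combinations, product

def expected_validity(k, p, hard=False):
    pairs = list(combinations(range(k), 2))     # possible citation relations
    expected = 0.0
    for pattern in product([0, 1], repeat=len(pairs)):
        prob = 1.0
        degree = [0] * k                        # relations per document
        for (x, y), present in zip(pairs, pattern):
            prob *= p if present else (1.0 - p)
            degree[x] += present
            degree[y] += present
        delta = max(degree)                     # most popular document
        if hard:
            iv = 1.0 if delta == k - 1 else 0.0
        else:
            iv = delta / (k - 1)                # invariant soft validity
        expected += prob * iv
    return expected

# Sanity checks against the worked examples (p = 0.4, the 1-OACM base probability):
p = 0.4
assert abs(expected_validity(2, p) - p) < 1e-12
assert abs(expected_validity(3, p) - (p**3 + 3*p**2*(1-p) + 1.5*p*(1-p)**2)) < 1e-12
assert abs(expected_validity(3, p, hard=True) - (p**3 + 3*p**2*(1-p))) < 1e-12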
6.2.3. Experimental Settings and Results
This subsection presents three experimental results in which the quality of discovered docsets is
investigated under several empirical evaluation criteria. The three experiments are (1) to
investigate the characteristics of the evaluation by soft validity and hard validity on docsets
discovered from different document representations, including their minimum support
thresholds and mining times, (2) to study the quality of the discovered relations when using either
direct citation or indirect citation as the evaluation criterion, and (3) to investigate the significance
of the discovered docsets by comparing their validity against the expected validity. More complete
results can be found in (Sriphaew and Theeramunkong, 2007a).
Towards the first objective, several term definitions are explored in the process of encoding
the documents. To define terms in a document, techniques of n-gram, stemming and stopword
removal can be applied. The discovered docsets are ranked by their supports, and then the top-N
ranked relations are evaluated using both soft validity and hard validity. Here, the value of N can
be varied to observe the characteristic of the discovered docsets. For the second objective, the
evaluation is performed based on various v-OACMs, where the 1-OACM considers only direct
citation while a higher-OACM also includes indirect citation. Intuitively, the evaluation becomes
less restricted when a higher-order OACM is applied as the calibration. To fulfill the third objective,
the expected set validity for each set of discovered relations is calculated. By comparing the validity
of the discovered docsets against this expected validity, their significance is investigated.
To implement a mining engine for document relation discovery, the FP-tree algorithm,
originally introduced by Han et al. (2000), is modified to mine docsets in a document-term
database. In this work, frequent itemsets, instead of association rules, are considered. Since a 1-
docset contains no relation, it is negligible and omitted from our evaluation. That is, only
discovered docsets with at least two documents are considered. The experiments were
performed on a Pentium IV 2.4 GHz machine with Hyper-Threading, 1 GB of physical memory and
2 GB of virtual memory, running Linux TLE 5.0 as the operating system. The preprocessing steps,
i.e., n-gram construction, stemming and stopword removal, consume trivial computational time.
Evaluation Material
There is no gold-standard dataset for evaluating the results of document relation discovery. To
solve this problem, an evaluation material is constructed from the scientific research publications
in the ACM Digital Library (www.portal.acm.org). As a seed for constructing the citation graph,
200 publications are retrieved from each of three computer-related classes, coded B (Hardware),
E (Data) and J (Computer). In the PDF version, each publication is attached with an information
page in which citation (i.e., reference) information is provided. The reference publications
appearing in these 600 publications are further collected and added into the evaluation dataset.
In the same way, the publications referred to by these newly collected publications are also
gathered and appended to the dataset. In total, 10,817 research publications are collected as the
evaluation material. After converting the collected publications to ASCII text format, the
reference section (normally found at the end of each publication) is removed by a semi-automatic
process, e.g., using clue words such as References and Bibliography. With the use of the
information page attached to each publication, the 1-OACM can be constructed and used for
evaluating the discovered docsets. A v-OACM can then be constructed from the (v−1)-OACM and
the 1-OACM. In our dataset, the average number of citation relations per document is 8 for the
1-OACM, 148 for the 2-OACM, and 1,008 for the 3-OACM. It takes 1.14 seconds to generate the
2-OACM from the 1-OACM, and 15.83 seconds to generate the 3-OACM from the 2-OACM. For
text preprocessing, the BOW library by McCallum (1996) is used as a tool for constructing the
document-term database. Using a list of 524 stopwords provided by Salton and McGill (1986),
common words such as ‘a,’ ‘an,’ ‘is,’ and ‘for’ are discarded. Besides these stopwords, terms with
very low frequency are also omitted; such terms are numerous and usually negligible.
Experimental Results
As stated at the beginning of this section, several term definitions can be used as factors to obtain
various patterns of document representation. In our experiment, eight distinct patterns are
explored. Each pattern is denoted by a 3-digit code. The first digit represents the usage of n-gram,
where `U' stands for unigram and `B' means bigram. The second digit has a value of either `O' or
`X', expressing whether the stemming scheme is applied or not. Also the last digit is either `O' or
`X', telling us whether the stopword removal scheme is applied or not. For example, `UXO' means
document representation generated by unigram, non-stemming and stopword removal.
Table 6-7 and Table 6-8 express the set 1-validity (soft validity/hard validity) of the
discovered docsets when various document representations are applied, for the bigram and
unigram cases, respectively. The minimum support and the mining time needed by each document
representation to discover the specified number of top-N ranked docsets are also given in the tables.
Table 6-7: Set 1-validity (soft validity/hard validity, %) for various top-N rankings of discovered
docsets, with the minimum support (minsup) and mining time (seconds) of each document
representation, for the bigram cases. (source: Sriphaew and Theeramunkong, 2007a)

N        BXO                          BOO                          BXX                          BOX
1000     45.47/43.95                  46.14/44.33                  6.29/6.29                    7.09/7.09
         minsup=0.53, time=174.49     minsup=0.67, time=155.92     minsup=3.94, time=442.95     minsup=4.76, time=402.14
5000     29.31/23.88                  29.13/27.24                  3.83/3.33                    3.88/3.59
         minsup=0.35, time=188.88     minsup=0.47, time=166.96     minsup=3.15, time=612.82     minsup=3.79, time=570.65
10000    24.49/19.33                  24.40/20.50                  3.13/2.33                    3.20/2.63
         minsup=0.32, time=189.52     minsup=0.39, time=170.17     minsup=2.84, time=681.40     minsup=3.42, time=627.61
50000    19.29/6.36                   18.88/8.62                   2.46/0.98                    2.36/1.19
         minsup=0.25, time=195.39     minsup=0.29, time=176.48     minsup=2.31, time=816.43     minsup=2.71, time=767.25
100000   19.51/3.67                   18.40/4.11                   2.30/0.63                    2.18/0.77
         minsup=0.21, time=212.14     minsup=0.28, time=176.57     minsup=2.13, time=862.84     minsup=2.48, time=832.77
Average  27.61/19.64                  27.39/20.96                  3.60/2.71                    3.74/3.05
         minsup=0.33, time=192.08     minsup=0.42, time=169.22     minsup=2.87, time=683.29     minsup=3.43, time=640.08
Table 6-8: Set 1-validity (soft validity/hard validity, %) for various top-N rankings of discovered
docsets, with the minimum support (minsup) and mining time (seconds) of each document
representation, for the unigram cases. (source: Sriphaew and Theeramunkong, 2007a)

N        UXO                          UOO                          UXX                          UOX
1000     3.88/3.78                    2.36/2.26                    2.79/2.79                    1.76/1.76
         minsup=32.72, time=122.49    minsup=46.35, time=74.77     minsup=55.61, time=160.98    minsup=74.78, time=89.39
5000     3.77/3.35                    2.38/1.99                    2.37/2.28                    1.55/1.48
         minsup=26.98, time=240.57    minsup=40.04, time=175.72    minsup=48.46, time=359.18    minsup=66.84, time=198.16
10000    3.47/2.63                    2.16/1.53                    2.09/1.75                    1.35/1.11
         minsup=24.68, time=312.69    minsup=37.63, time=231.41    minsup=45.66, time=466.00    minsup=63.76, time=277.67
50000    2.78/1.44                    1.75/0.74                    1.68/0.84                    1.12/0.49
         minsup=19.95, time=478.97    minsup=32.26, time=412.79    minsup=39.64, time=808.61    minsup=57.08, time=539.55
100000   2.71/1.02                    1.68/0.48                    1.66/0.57                    1.14/0.32
         minsup=18.37, time=564.65    minsup=30.40, time=531.10    minsup=37.40, time=1008.38   minsup=54.55, time=691.02
Average  3.32/2.44                    2.06/1.40                    2.12/1.64                    1.38/1.03
         minsup=24.54, time=343.87    minsup=37.34, time=285.16    minsup=45.35, time=560.63    minsup=63.40, time=359.16
From the tables, some interesting observations can be made as follows. First, with the same
document representation, soft validity is always higher than or equal to hard validity since the
former is obtained by a less restrictive evaluation than the latter. Both validities involve valid
relations between pairs of documents in a discovered docset. A relation between two
documents is called valid when there is a link between those two documents under the v-OACM
(v=1 in this experiment). The evaluation based on soft validity focuses on the probability that
any two documents in a docset have a valid relation. On the other hand, the evaluation
based on hard validity concentrates on the probability that at least one document has valid
relations with all of the other documents in the docset. For example, in the case of the top-100000
ranking with the `BXO' representation (as shown in Table 6-7), 19.51% of the relations in the
discovered docsets are valid while only 3.67% of the discovered docsets are perfect, i.e., have at
least one document with valid relations to all of the other documents in that docset.
Second, in every document representation, both soft validity and hard validity become lower
when more ranks (i.e., top-N rankings with a larger N) are considered. As an implication of this
result, our proposed evaluation method indicates that better docsets are located at higher ranks.
Third, given two representations, say A and B, if the soft validity of A is better than that of B, then
the hard validity of A tends to be higher than that of B. Fourth, the results of the bigram cases
(`B**') are much better than those of the unigram cases (`U**'). One reason is that bigrams
are quite superior to unigrams in representing the content of a document. Fifth, in the cases
of bigram, the stopword removal process is helpful while the stemming process does not help
much. Sixth, in the cases of unigram, non-stemming is preferable while the stopword removal
process is not useful. Finally, the performance of `BXO' and `BOO' is comparable and much higher
than that of `BOX' and `BXX', while the performance of `UXO' is much higher than the other unigram
cases. However, on average, `UXX' seems to be the second best unigram case. Since the
soft validity is more flexible than the hard validity, a higher soft validity is preferable. Although the
performance of `BOO' seems to be slightly better than that of `BXO' at the higher ranks, `BXO' performs
better on average. In our task, the performance ranking for bigram is `BXO' > `BOO' > `BOX' > `BXX'
and the performance ranking for unigram is `UXO' > `UXX' > `UOO' > `UOX'.
In terms of minimum support and computation time, we can conclude as follows. First, since
a docset discovered from the bigram cases tends to have a lower support than the unigram cases,
it is necessary to set a small minimum support in order to obtain the same number of docsets.
Second, the cases with stopword removal run faster than those without stopword removal since
they consider fewer words. Moreover, they tend to have a lower minimum support.
Besides 1-OACM, the discovered docsets can be evaluated with the criteria of 2-OACM and 3-
OACM. In this assessment, only the four best representations, two from the unigram cases (`UXO' and
`UXX') and two from the bigram cases (`BXO' and `BOO'), are taken into consideration. Figure 6-3
displays the soft validity (the left graph) and the hard validity (the right graph) under 1-, 2-, and
3-OACMs. Since the minimum support and mining time in each case is the same as shown in
Table 6-7 and Table 6-8, they are omitted from the figure. In the figure, we use the notation `v:XXX'
to represent the evaluation, under the specified v-OACM, of docsets discovered from the document
representation XXX. For example, `3:BXO' means the evaluation under the 3-OACM of docsets
discovered using the BXO document representation (bigram, non-stemming and stopword
removal). Being
consistent for both soft validity and hard validity, the set 3-validity (one calculated under the 3-
OACM) of discovered docsets is higher than the set 2-validity (one calculated under the 2-OACM),
and in the same way the set 2-validity is much higher than the set 1-validity (one calculated
under the 1-OACM). Compared to the evaluation using only direct citation (1-OACM), more
relations in the discovered docsets are valid when both direct and indirect citations (2- and 3-
OACMs) are taken into consideration.
Similar to the 1-OACM case, `BXO' and `BOO' are comparable and perform as the best cases for both
soft validity and hard validity under the same OACM. Moreover, in the cases of bigram evaluated
under the 1- and 2-OACMs, the set validity drops remarkably when top-N rankings with a larger
N are considered: the quality of docsets at the higher ranks (smaller N) outperforms that at the
lower ranks. This outcome implies that our evaluation based on direct/indirect citations is a
reasonable method for assessing docsets. For all types of document representation, the bigram
cases perform better than the unigram cases when they are evaluated under the same v-OACM.
Especially under the 3-OACM, the two bigram cases (`3:BXO' and `3:BOO') are almost 100% valid
while the two unigram cases (`3:UXO' and `3:UXX') are approximately 50% valid. This phenomenon
shows the advantage of bigrams as a document representation for document relation discovery,
with the documents in each docset citing one another within the specific range in the citation
graph. Furthermore, the performance gap between bigram and unigram becomes smaller when
top-N rankings with a larger N are considered. For a top-N ranking with a larger N, the bigram
cases tend to have bigger docsets than the unigram cases and thus obtain lower validity, since a
bigger docset naturally tends to have lower validity.
Figure 6-3: Set validity based on the 1-, 2- and 3-OACMs when various top-N rankings of discovered docsets are considered: soft validity (left) and hard validity (right). (source: Sriphaew and Theeramunkong, 2007a)
Conclusions
Section 6.2 shows a method to use citation information in research publications as a source for
evaluating the discovered document relations. Three main contributions of this work are as
follows. First, soft validity and hard validity are developed to express the quality of docsets
(document relations), where the former focuses on the probability that any two documents in a
docset have a valid relation while the latter concentrates on the probability that at least one
document in a docset has valid relations with all of the other documents in that docset. Second, a
method to use direct and indirect citations as comparison criteria is proposed to assess the
quality of docsets. Third, the so-called expected validity is introduced, using probability theory,
to evaluate the quality of discovered docsets in relative terms. By comparing the results to the
expected validity, the evaluation becomes impartial even under different comparison criteria. A
manual evaluation was also done for performance comparison. Using more than 10,000 documents
obtained from a research publication database and frequent itemset mining as the process to
discover document relations, the proposed method was shown to be a powerful way to evaluate
the relations in several aspects: soft/hard scoring, direct/indirect citation, and relative quality over
the expected validity. For more detail, the reader can refer to (Sriphaew and Theeramunkong,
2007a).
6.3. Application to Automatic Thai Unknown Detection
Unknown word recognition plays an important role in natural language processing (NLP) since
words, fundamental units of a language, may be newly developed and invented. Most NLP
applications need to identify words in sentences before further manipulation. Word recognition
can be basically done by using a predefined lexicon, designed to include as many words as
possible. However, in practice, it is impossible to have a complete lexicon that includes all words
in a language. Therefore, it is necessary to develop techniques to handle words not present in
the lexicon, so-called unknown words. In languages with explicit word boundaries, it is
straightforward to identify an unknown word and its boundary. This simplicity does not carry over
to languages without word boundaries (later called unsegmented languages, such as Thai, Japanese
and Chinese), where words run together without any explicit space or punctuation mark (Cheng et al.,
1999; Charoenpornsawat et al., 1998; Ling et al., 2003; Ando and Lee, 2000). Whereas analyzing
such languages requires word segmentation, the existence of unknown words makes segmentation
(or word recognition) accuracy lower (Theeramunkong and Tanhermhong, 2004; Asahara and
Matsumoto, 2004; Jung-Shin and Su, 1997). Accurate detection of unknown words and their
boundaries is mandatory towards high-performance word segmentation. As a similar task, word
extraction in unsegmented languages has also been explored in several studies (Su et al., 1994;
Chang and Su, 1995; Ge et al., 1999; Zhang et al., 2000; Zhang et al., 2008). Instead of segmenting
a running text into words, word extraction methods directly detect a set of unknown words from
the text without determining boundaries of all words in the text. In Thai, our target language,
major sources of unknown words are (1) Thai transliterations of foreign words, (2) newly invented
Thai technical words, and (3) emerging Thai proper names. For example, Thai medical
texts often abound in transliterated or technical words/terms related to diseases,
organs, medicines, instruments or herbs, which may not be in any dictionary. Thai news articles
usually include a lot of proper names related to persons, organizations, locations and so forth.
Indirectly related to unknown word recognition, Thai compound word extraction and word
segmentation without dictionaries were explored in (Sornlertlamvanich and Tanaka, 1996;
Theeramunkong and Tanhermhong, 2004; Sornil and Chaiwanarom, 2004). Without any
dictionary, these methods applied pure statistics with machine learning techniques to
detect compound words by observing frequently occurring substrings in texts. However, it seems
natural to utilize a dictionary for segmentation and simultaneously recognize unknown words
when substrings do not exist in the dictionary.
In the past, several works (Kawtrakul et al., 1997; Charoenpornsawat et al., 1998) have been
proposed to recognize both explicit and implicit unknown words. Formed from multiple
contiguous words, an implicit unknown word can be detected by observing its co-occurrence
frequency. On the other hand, an explicit unknown word is triggered by an undefined substring,
and its boundary can be found by first generating boundary candidates with respect to a set of
predefined rules and then applying statistical techniques to select the most probable one. However,
one of the shortcomings of most previous approaches is that they require a set of manually
constructed rules to restrict the generation of unknown word boundary candidates. To get rid of
this limitation, this work proposes a method to generate the set of all possible candidates without
being constrained by any handcrafted rules. However, with this relaxation, a large set of candidates
may be generated, inducing the problem of unbalanced class sizes, where the number of positive
unknown word candidates is dominantly smaller than that of negative candidates. To solve the
problem, a technique called group-based ranking evaluation (GRE) is incorporated into ensemble
learning, namely boosting, in order to generate a sequence of classification models that later
collaborate to select the most probable unknown word from multiple candidates. As the boosting
step, given a classification model, the GRE technique is applied to build a dataset for training the
succeeding model by weighting each of its candidates according to their ranks and correctness,
where the candidates of an unknown word are considered as one group. In the experiments, the
proposed method, namely V-GRE, is evaluated using a large Thai medical text corpus.
Although research on unknown word recognition in the Thai language has not been as widely
conducted as that in other languages, two approaches have been proposed for detecting
unknown words from a large corpus of Thai texts, later called the machine learning-based (ML-
based) approach and the dictionary-based approach (Theeramunkong et al., 2000; Theeramunkong
and Tanhermhong, 2004). In the ML-based approach, unknown word recognition can be viewed as
a process to detect new compound words in a text without using a dictionary to segment the text
into words. The dictionary-based approach attempts to identify the boundary of an unknown word
when the system encounters a character sequence that is not registered in the dictionary while
segmenting a text into a sequence of words. As an early work of the first approach,
Sornlertlamvanich and Tanaka (1996) presented a method that uses the frequency difference
between the occurrences of two adjoining sorted n-grams (a special case of sorted sistrings) to
extract open compounds (uninterrupted sequences of words) from text corpora.
Moreover, competitive and unified selections are applied to discriminate between an illegible
string and a potential unknown word. By specifying different thresholds on the frequency
difference, the method can extract a varying number of strings (unknown words), with an inherent
trade-off between the quantity and the quality of the extracted strings. The method has two
limitations: it requires manual setting of the threshold, and it relies only on the frequency
difference, which may not be enough to express the distinction between an unknown word and a
common prefix of words.
common prefix of words. To solve these shortcomings, some works (Kawtrakul et al., 1997;
Theeramunkong et al., 2000; Sornlertlamvanich et al., 2000; Theeramunkong and Tanhermhong,
2004; Sornil and Chaiwanarom, 2004) applied machine learning (ML) techniques to detect an
unknown word by using statistical information of contexts surrounding that potential unknown
word. Sornlertlamvanich et al. (2000) presented a corpus-based method to learn a decision tree
for the purpose of extracting compound words from corpora. In the same period, a similar
approach was proposed in (Theeramunkong et al., 2000; Theeramunkong and Tanhermhong,
2004) to construct a decision tree that enables us to segment a text without making use of a
dictionary. It was shown that, even without a dictionary, the ML-based methods could achieve up to
85%-95% word segmentation accuracy or word extraction rate. As the second approach,
Kawtrakul et al. (1997) used the combination of a statistical semantic segmentation model and a
set of context sensitive rules to detect unknown words in the context of a running text. The
context sensitive rules were applied to extract information related to such an unknown word,
mostly representing a name of an entity, such as person, animal, plant, place, document, disease,
organization, equipment, and activity. Charoenpornsawat et al. (1998) considered unknown
word recognition as a classification problem and proposed a feature-based approach to identify
Thai unknown word boundaries. Features used in the approach are built from the specific
information in context surrounding the target unknown words. Winnow proposed by Blum
(1997) is an ML algorithm used to automatically extract features from the training corpus.
As a more recent work, Haruechaiyasak et al. (2006) proposed a semi-automated framework
that utilizes statistical and corpus-based concepts for detecting unknown words, and then
introduced a collaborative framework among a group of corpus builders to refine the obtained
results. In the automated process, unknown word boundaries are identified using string
frequencies. In (Haruechaiyasak et al., 2008), a comparison of the dictionary-based and ML-based
approaches to word segmentation was presented, where unknown word detection is implicitly
handled. Since each of the dictionary-based and ML-based approaches has its
advantages, most previous works (Kawtrakul et al., 1997; Charoenpornsawat et al., 1998;
Theeramunkong et al., 2000) combined them to handle unknown words. Although several works
exist in both approaches, they have some shortcomings: 1) most works separate the learning
process from the word segmentation process; 2) they use only local information to learn a set of
rules for word segmentation/unknown word detection with a single-level learning process (a
single classifier); and 3) they require a set of handcrafted rules to restrict the generation of
unknown word boundary candidates. To overcome these disadvantages, this work provides a
framework that combines the word segmentation process with a learning process utilizing long-
distance context to learn a set of rules for unknown word detection during word segmentation,
where no manual rules are required. Moreover, our learning process also employs boosting
techniques to improve classification accuracy.
6.3.1. Thai Unknown Words as Word Segmentation Problem
Most word segmentation algorithms use a lexicon (or a dictionary) to parse a text at the
character level. In general, when a system meets an unknown word, three possible segmented
results can be expected as output. The first one is to obtain one or more sequences of known
words from an unknown (out-of-dictionary) word, especially in the case of a compound word.
For example, มะม่วงอกร่อง (meaning: a kind of mango) can be segmented into มะม่วง (meaning:
mango), อก (meaning: breast), and ร่อง (meaning: crack). All of these sub words are found in the
lexicon. The second one is to gain a sequence of unknown segments which are undefined in the
lexicon. For example, we cannot detect any sub word from an out-of-dictionary word วิสญัญี
(meaning: Anesthetic) since all of its substrings do not exist in the dictionary. The last one is to
get a sequence of known words mixed with unknown segments. For instance, an unknown word
ลูคีเมีย (meaning: Leukemia) can be segmented into two portions: an unknown segment (ลูคี,
meaning: unknown) and a known word (เมีย, meaning: wife).
In terms of processing, these three different results can be interpreted as follows. When we
get a result of the first type, it is hard to know whether the result is an unknown word,
since it may be mistaken for multiple words existing in the dictionary. This type of
unknown word is known as a hidden unknown word. Called an explicit unknown word, the
second type is easily recognized since the whole word is composed of only unknown segments.
Named a mixed unknown word, a third-type unknown word is also hard to recognize since the
boundary of the unknown word is unclear.

Furthermore, it is also difficult to distinguish between the second and the third types.
However, the second and third types contain unknown segments, later called unregistered
portions, that signal the existence of an unknown word. This work focuses on the
recognition of unknown words of the second and third types, i.e., detectable unknown words.
6.3.2. The Proposed Method
This section describes the proposed method in brief; the reader can find the full description in
(TeCho et al., 2009b). The proposed method consists of three processes: (1) unregistered portion
detection, (2) unknown word candidate generation and reduction, and (3) unknown word
identification, as shown in Figure 6-4.
Figure 6-4: Overview of the proposed method (source: TeCho et al., 2009b)
Unregistered Portions Detection
Normally, when we apply word segmentation to a Thai running text with some unknown words,
we may encounter a number of unrecognizable units due to out-of-vocabulary words. Moreover,
without any additional constraints, an existing algorithm may place segmentation boundaries at
obviously incorrect positions. For example, the system may place an impossible word boundary
between a consonant and a vowel. To resolve such obvious mistakes, recently several works
(Sornil and Chaiwanarom, 2004; Haruechaiyasak et al., 2006; Theeramunkong and Usanavasin,
2001; Viriyayudhakorn et al., 2007; Limcharoen, 2008) have applied a useful concept, namely a
Thai Character Cluster (TCC) (Theeramunkong et al., 2000; Theeramunkong and Tanhermhong,
2004), which is defined as an inseparable group of Thai characters based on the Thai writing
system. Unlike word segmentation, segmenting a text into TCCs can be completely done without
error and ambiguity by a small set of predefined rules. The result from TCC segmentation can be
used to guide word segmentation not to segment at unallowable positions. To detect unknown
words, TCCs can be used as basic units of processing.
Using techniques originally proposed by TeCho et al. (2008a; 2008b; 2009a; 2009b), this
work employs the combination of TCCs and the LEXiTRON dictionary (2008) to facilitate word
segmentation. Here, longest-matching word segmentation (Poowarawan, 1986) is applied to
segment the text in either a left-to-right (LtoR) or right-to-left (RtoL) manner, and the two
results are compared to select the one with the smaller number of unregistered portions. If the
number of unregistered portions from LtoR longest matching equals that of RtoL, the result of
the LtoR longest matching is selected.
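As a rough sketch of this step (ours; it assumes the text has already been split into TCCs, and in a real system adjacent unmatched TCCs would be merged into single unregistered portions), the following code runs longest matching in both directions and keeps the direction with fewer unmatched units, preferring LtoR on a tie.

def longest_match_ltor(tccs, lexicon, max_len=10):
    # Greedy left-to-right longest matching over a TCC sequence;
    # a TCC with no dictionary word starting at it is flagged as unregistered.
    out, i = [], 0
    while i < len(tccs):
        for j in range(min(len(tccs), i + max_len), i, -1):
            word = "".join(tccs[i:j])
            if word in lexicon:
                out.append((word, True))
                i = j
                break
        else:
            out.append((tccs[i], False))
            i += 1
    return out

def longest_match_rtol(tccs, lexicon, max_len=10):
    # Greedy right-to-left longest matching: find the longest word ending at j.
    out, j = [], len(tccs)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            word = "".join(tccs[i:j])
            if word in lexicon:
                out.append((word, True))
                j = i
                break
        else:
            out.append((tccs[j - 1], False))
            j -= 1
    return out[::-1]

def segment(tccs, lexicon):
    ltor = longest_match_ltor(tccs, lexicon)
    rtol = longest_match_rtol(tccs, lexicon)
    unregistered = lambda seg: sum(1 for _, known in seg if not known)
    return ltor if unregistered(ltor) <= unregistered(rtol) else rtol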
Unknown Word Candidate Generation and Reduction
For candidate generation, up to ±h TCCs surrounding an unregistered portion are merged to form an
unknown word candidate. By this setting, (h + 1)^2 possible candidates can be generated for each
unregistered portion. Since a word in Thai cannot contain any special characters, it is
possible to reduce the number of candidates using surface constraints, such as spaces or
punctuation. To filter out unrealistic candidates, two sets of separation markers are considered.
The first set contains four types of marker words: (1) conjunctive words, e.g., ก็ต่อเม่ือ
(meaning: when), นอกจากน้ี (meaning: besides this); (2) preposition words, e.g., ตั้งแต่ (meaning:
since), ส าหรับ (meaning: for); (3) adverb words, e.g., เด๋ียวน้ี (meaning: at this moment), มากกวา่
(meaning: more than); and (4) special verbal words, e.g., หมายความวา่ (meaning: means),
ประกอบดว้ย (meaning: comprises). The second set includes five types of special characters:
(1) inter-word separators, i.e., a white space; (2) punctuation marks, e.g., ?, -, (…);
(3) general typographic signs, e.g., %, ฿; (4) numbers, including both Arabic (0, …, 9) and Thai
(๐, …, ๙) numerals; and (5) foreign characters, e.g., English letters (including capital letters).
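A minimal sketch of the candidate generation (our illustration; SEPARATORS below is a toy stand-in for the two marker sets described above):

SEPARATORS = set(" ?-%0123456789")   # illustrative subset of the separation markers

def generate_candidates(tccs, pos, h=3):
    # Merge up to h TCCs on each side of the unregistered portion tccs[pos],
    # giving at most (h + 1)**2 candidates; extension stops at a separator.
    candidates = []
    for left in range(h + 1):
        lo = pos - left
        if lo < 0 or any(ch in SEPARATORS for ch in "".join(tccs[lo:pos])):
            break
        for right in range(h + 1):
            hi = pos + 1 + right
            if hi > len(tccs) or any(ch in SEPARATORS for ch in "".join(tccs[pos + 1:hi])):
                break
            candidates.append("".join(tccs[lo:hi]))
    return candidates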
Unknown Word Identification
In the past, most previous works on Thai unknown word recognition (Sornlertlamvanich and
Tanaka, 1996; Theeramunkong and Tanhermhong, 2004; Sornil and Chaiwanarom, 2004;
Kawtrakul et al., 1997; Charoenpornsawat et al., 1998) treated unknown word candidates
independently. However, in a real situation, the set of candidates generated from an unregistered
portion should be considered dependently and treated as a group. In the learning process, each
candidate in a group is labeled as a positive or negative instance. Although several candidates can
be generated from an unregistered portion, typically only a few (just one or two) candidates are
potential unknown words. This phenomenon creates an unbalanced dataset problem. For
example, Table 6-9 shows the rank of each candidate, where only two out of forty-two candidates
are eligible unknown words, i.e., those at ranks 1 and 32. After the ranking process, the most
probable candidate is selected as the suggested unknown word.
Table 6-9: Example output of predicted unknown word candidates ranked in a group by the
proposed method (source: TeCho et al., 2009b)

Rank  Unknown Word Candidate (c)  P(+|c)           Actual Class  Predicted Class
1     คโีตโคนาโซล                    9.99885×10^-01   Y             Y
2     ใชแ้ชมพคู ี                      9.98995×10^-01   N             Y
3     ใชแ้ชมพคูโีต                    9.95521×10^-01   N             Y
…     …                           …                …             …
30    มพคูโีตโคนาโซล                  4.33515×10^-04   N             N
31    แชมพคูโีตโคนาโซ                 1.06612×10^-04   N             N
32    แชมพคูโีตโคนาโซล                8.53279×10^-05   Y             N
…     …                           …                …             …
40    คโีตโค                         2.63289×10^-22   N             N
41    คโีต                           8.88017×10^-53   N             N
42    ค ี                            4.59288×10^-97   N             N
Feature Extraction
As stated in the previous section, TCCs are used as processing units. We therefore use a sequence
of TCCs instead of a sequence of characters to denote an unknown word candidate. To specify
whether a candidate is the most probable unknown word or not, a set of suitable features needs to
be considered. In this work, several statistics collected from the context around an unknown word
are used as features. In order to speed up the process of collecting statistics from a text,
we apply the algorithm proposed by Nagao and Mori (1994), which utilizes sorted sistrings. For
each sistring (i.e., unknown word candidate), eight types of features, (f1)-(f8), are extracted. To
explain these features, the following description is first given.

Let A be a set of possible Thai characters, B be a set of possible TCCs, E be a set of possible
special characters, C = c1c2c3…c|C| (ci ∈ A ∪ E) be a corpus, and di ∈ D be the i-th document, which is
a substring of C, Di = C[bi:ei] (bi is the position of the first character in the document Di, ei−1 is the
position of the last character in the document Di, C[ei:ei] is a special character specifying the end
of the document Di, and bi = ei−1+1). Let T = t1t2t3…t|T| (ti ∈ B) be the segmented corpus of C as a TCC
sequence, with t1 = C[1:u], ti = C[v:w], ti+1 = C[(w+1):x], and t|T| = C[y:|C|], and let W be a set of all
possible words in the dictionary. An unknown word candidate S can be defined by a substring of T,
ST = T[p:q] (= tp…tq), where p and q are the starting and ending TCC positions of S, respectively. Also,
the candidate S can be expressed by a substring of C, SC = C[r:s] (= cr…cs), where r and s are the
starting and ending character positions of S, respectively. As one restriction, no special character
is allowed in S. With the above description, the eight features, (f1)-(f8), can be formally defined in
sequence as follows.
(f1) Number of TCCs (Nt)
The number of TCCs can be used as a clue to detect unknown words. Intuitively, many unknown
words are technical words, each a transliteration of an English technical term, and
many of them are very long. Formally, the number of TCCs in an unknown word candidate S,
Nt(S), is defined as follows:
Nt(S) = |ST|
(f2) Number of Characters (Nc)
Similar to the number of TCCs, the number of characters in a sequence is another factor for
determining whether the sequence is a potential word; concretely, an unknown word tends
to be long. The number of characters in an unknown word candidate S is defined as follows:
Nc(S) = |SC|
(f3) Number of known words (Nw)
As in several languages, some unknown words in Thai can be viewed as compound
words that contain a number of known words. Therefore, the number of known words inside a
sistring can be used as a clue to identify whether the sistring is an unknown word. The number
of known words is defined as follows:
Nw(S) = |{w | w = S[a:b] ∧ w ∈ W}|
where S[a:b] is a substring of S ranging from position a to position b.
(f4) Sistring Frequency (Nf )
The sistring frequency is useful information for determining whether a sistring is a word. The
number of occurrences of a sistring that is an unknown word tends to be higher than that of a
sistring that cannot be a word. The definition of the sistring frequency is as follows:
Nf(S) = |{C[c:d] | C[c:d] = S ∧ 1 ≤ c ≤ d ≤ |C|}|
where C[c:d] is a substring of C starting from c to d, with c and d ranging from 1 to |C|.
(f5) Left and Right TCCs variety (Lv,Rv)
The variety expresses the range of TCCs that can come before or after a string; it implies
impurity or uncertainty. The left (right) variety is defined as the number of distinct TCCs actually
occurring before (after) an unknown word candidate. A high variety of distinct TCCs on the
left-hand (right-hand) side is one indicator that the candidate should be detected as an unknown
word. We therefore use the numbers of distinct TCCs on the left- and right-hand sides as features.
The definitions of the left and right TCC variety are as follows:
Lv(S) = |d({T[a:a] | T[(a+1):b] = S ∧ 1 ≤ a < b ≤ |T|})|
Rv(S) = |d({T[b:b] | T[a:(b-1)] = S ∧ 1 ≤ a < b ≤ |T|})|
where d(L) returns the set of distinct elements in L, and T[a:b] is a substring of T with a and b
ranging from 1 to |T|. T[a:a] and T[b:b] are the TCCs that co-occur on the left-hand side
and the right-hand side of S in the corpus, respectively.
(f6) Probability of a special character on left and right (Ls,Rs)
The probability that a special character co-occurs on the left-hand side or the right-hand side of
the candidate under consideration indicates that the candidate is located next to delimiters and
should be detected as an unknown word. We therefore use these probabilities as features. The
probabilities of a special character on the left and right (Ls and Rs) are defined as follows:
Ls(S) = |{C[c:d] | C[c:d] = S ∧ C[(c-1):(c-1)] ∈ E}| / Nf(S)
Rs(S) = |{C[c:d] | C[c:d] = S ∧ C[(d+1):(d+1)] ∈ E}| / Nf(S)
where C[c:d] is a substring of C with c and d ranging from 1 to |C|, and Nf(S) is the number of
occurrences of the unknown word candidate S in the corpus.
(f7) Inverse Document Frequency (IDF)
The inverse document frequency is a good measure of the specificity of a sistring: an unknown
word does not appear in many documents, but appears frequently in a few specific documents, so
a high IDF means the sistring is likely to be an unknown word. It is obtained by dividing the
number of all documents by the number of documents containing the sistring, and then taking the
logarithm of that quotient. The formal definition of IDF(S) (the inverse document frequency of S) is:
IDF(S) = log( |D| / |DS| )
where log is the natural logarithm, |D| is the total number of documents in the corpus, and |DS| is
the number of documents where S appears.
(f8) Term Frequency with Inverse Document Frequency (TFIDF)
The TFIDF is a weight often used in information retrieval and text mining. This weight is a
statistical measure used to evaluate how important a word is to a document in a collection or
corpus. The importance increases proportionally to the number of times a word appears in the
document but is offset by the frequency of the word in the corpus. The definition of TFIDF(S) is:
TFIDF(S) = TF(S) × IDF(S)
where TF(S) is the term frequency of S.
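The following sketch (ours) computes several of the above features for one candidate. For brevity it scans strings naively instead of using the sorted-sistring index of Nagao and Mori, and taking TF in (f8) to be the raw sistring frequency is our assumption.

import math

def extract_features(cand_tccs, corpus, documents, lexicon):
    s = "".join(cand_tccs)
    Nt = len(cand_tccs)                          # (f1) number of TCCs
    Nc = len(s)                                  # (f2) number of characters
    Nw = sum(1 for w in lexicon if w in s)       # (f3) known words inside S
    Nf = corpus.count(s)                         # (f4) sistring frequency
    Ds = sum(1 for d in documents if s in d)     # documents containing S
    idf = math.log(len(documents) / Ds) if Ds else 0.0   # (f7)
    tfidf = Nf * idf                             # (f8), TF taken as Nf (assumption)
    return [Nt, Nc, Nw, Nf, idf, tfidf]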
Ensemble Classification with the Group-based Ranking Evaluation Technique
This section describes four main aspects of our proposed approach to learning an ensemble
classifier for identifying unknown words. As the first aspect, exploiting the features extracted
from the training corpus, naïve Bayes is applied to learn a base classifier that assigns a
probability to each unknown word candidate, representing how likely the candidate is to be the
suitable unknown word for an unregistered portion. Second, a mechanism named Group-based
Ranking Evaluation (GRE) is introduced to select the most probable unknown word for an
unregistered portion, with consideration of the ranking within the group of unknown word
candidates generated from the same unregistered portion at a specific location. Third, GRE-based
boosting is employed to generate a sequence of classifiers, where each consecutive classifier in the
sequence works as an expert in classifying the instances that were not classified correctly by its
preceding classifier, and a confidence weight is given to each generated classifier based on its
GRE-based performance. Fourth, a so-called Voting Group-based Ranking Evaluation (V-GRE)
technique is implemented to combine the results obtained from the sequence of classifiers in
classifying a test instance, taking into account the confidence weight of each classifier. The details
of these aspects are presented in order as follows.
Naïve Bayesian Classification
Based on the naïve Bayesian method, the probability that a generated candidate c (characterized
by a set of features F = {f1, f2, ..., f|F|}) is an unknown word can be defined as follows:

P(+|c) = P~(+|c) / ( P~(+|c) + P~(−|c) ),   P~(+|c) = P(+) × ∏i P(fi|+),   P~(−|c) = P(−) × ∏i P(fi|−)

where P(+|c) is the probability that the candidate c is an unknown word, P~(+|c) is the
unnormalized probability that the candidate c is an unknown word (positive class), P~(−|c)
is the unnormalized probability that the candidate c is not an unknown word (negative
class), P(+) (or P(−)) is the prior probability that the class is positive (or negative), and P(fi|+)
(or P(fi|−)) is the probability that the i-th feature takes the value fi when the class is positive
(or negative). Here, both P~(+|c) and P~(−|c) are derived from the independence assumption of
naïve Bayes. For continuous attributes (fi), a Gaussian distribution with smoothing can be applied
as follows:

P(fi|+) = ( 1 / ( √(2π) (σi+ + ε) ) ) exp( −(fi − μi+)² / ( 2 (σi+ + ε)² ) )

where μi+ (or μi−) is the mean of the positive (or negative) class, σi+ (or σi−) is the standard
deviation of the positive (or negative) class, and ε is a small positive constant used for smoothing
to resolve the sparseness problem. It is set to 0.000001 in our experiments.
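A small sketch of this base classifier (ours); adding ε to the standard deviation is one plausible placement of the smoothing term, not necessarily the one used in the original experiments.

import math

EPS = 1e-6   # smoothing constant, set to 0.000001 as in the experiments

def gaussian(x, mu, sigma):
    # Gaussian likelihood with the standard deviation smoothed by EPS (assumption)
    sd = sigma + EPS
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (math.sqrt(2 * math.pi) * sd)

def p_unknown(feats, prior_pos, prior_neg, stats_pos, stats_neg):
    # Naive Bayes: unnormalized class scores from the independence assumption;
    # stats_* hold per-feature (mean, std) pairs for each class
    up, un = prior_pos, prior_neg
    for x, (mu_p, sd_p), (mu_n, sd_n) in zip(feats, stats_pos, stats_neg):
        up *= gaussian(x, mu_p, sd_p)
        un *= gaussian(x, mu_n, sd_n)
    return up / (up + un)    # P(+|c)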
Group-based Ranking Evaluation
Unlike the evaluation model in a traditional classifier, our proposed Group-based Ranking
Evaluation (GRE) technique puts all candidates produced for the same unregistered portion
location into the same group. This technique ranks all candidates within their group according to
their probabilities of being an unknown word, and then selects the candidate with the highest
probability within that group as the predicted unknown word:

ĉi = argmax_{c ∈ Gi} P(+|c)

where ĉi is the most probable candidate of the i-th group, Gi is the group of candidates generated
from the i-th unregistered portion, and P(+|c) is the probability that c is an unknown word. To
be more flexible, it is also possible to relax the selection to accept the top-t candidates as potential
unknown words.
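The selection itself is compact; the sketch below (ours) ranks one group of candidates by a supplied probability function and returns the top-t:

def gre_select(group, prob, t=1):
    # prob(c) should return P(+|c), e.g. via p_unknown above
    return sorted(group, key=prob, reverse=True)[:t]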
GRE-based Boosting
AdaBoost (Freund and Schapire, 1999) is a technique that repeatedly constructs a sequence of
classifiers based on a base learning method. In this technique, each instance in the training set is
attached with a weight (initially set to 1.0). In each iteration, the base learning method constructs
a classifier using all instances in the training set together with their weights, which express their
importance. After evaluating the obtained classifier, the weights of the misclassified examples are
increased to make the learning method focus more on the misclassified examples in the next
iteration.

Originally, AdaBoost evaluates each instance and updates its weight individually. This is not
suitable for unknown word data, which we treat as groups of unknown word candidates. We
therefore propose a new technique called GRE-based boosting to apply the AdaBoost technique to
unknown word data efficiently. In this technique, a weight is assigned to each group of candidates.
After constructing a base classifier, each group is evaluated based on the GRE technique explained
in the previous section. The classifier is considered to misclassify a group when the top-ranked
candidate in the group is not a correct unknown word. The weight of such a group is then increased
so that the group receives more focus in the next iteration.

Figure 6-5 shows the overall process of the proposed GRE-based boosting technique. Initially,
the training set with all groups weighted by 1.0 is fed to INDUCER, a base learning method, in
order to generate a classifier. The obtained model is passed to GRE-INCOR to evaluate it and
obtain the misclassified groups. Then α, the confidence weight of the classifier, and β, the ratio of
the success rate to the failure rate, are calculated from the misclassification rate (as explained in Algorithm
1). The confidence weight (α) represents the performance of the classifier; it is later used to
represent the strength of the classifier when the results from several classifiers are combined in
the evaluation step. The ratio of the success rate to the failure rate (β) is used as the new weight
of the misclassified groups in the next iteration. Basically, this ratio is larger than 1; hence, the
classifier constructed in the next iteration will specialize in the previously misclassified instances.
Figure 6-5: GRE-based Boosting (source: TeCho et al., 2009b)
Algorithm 1: GRE-based Boosting
Input: G(1), an initial training set with all group weights set to 1.0;
K, the number of iterations.
Output: M, a set of base classifiers with their confidence weights.
1: M := {};
2: wi(1) := 1 for every group Gi;
3: for k = 1 to K do
4: mk := INDUCER(G(k));
5: MISk := GRE-INCOR(mk, G(k));
6: εk := (Σ_{Gi∈MISk} wi(k)) / (Σ_{all i} wi(k));
7: βk := (1 − εk) / εk;
8: αk := log(βk);
9: M := M ∪ {(mk, αk)};
10: foreach Gi do
11: if Gi ∈ MISk then
12: wi(k+1) := βk;
13: else
14: wi(k+1) := 1;
15: end
16: end
17: end
18: return M;
Algorithm 1 shows the GRE-based boosting technique in detail. The algorithm starts with the initial training set T1 = {(gi, wi) | i = 1, ..., n} with gi = {cij | j = 1, ..., ni} and cij = (xij, yij), where gi is the group of unknown word candidates generated for the i-th unregistered portion, cij is the j-th candidate of the i-th unregistered portion, wi is an initial weight (set to 1 at the first iteration) given to gi, n is the number of unregistered portions, ni is the number of unknown word candidates generated for the i-th unregistered portion, xij is the set of feature values representing cij, and yij is the target attribute of cij (designated as the class label), stating whether cij is the correct unknown word (+1) or not (−1). K iterations are conducted to construct a sequence of base classifiers. At the k-th iteration, the training set T is fed to INDUCER to construct a base classifier mk. The classifier is then evaluated by GRE-INCOR, yielding Gmis, a set of misclassified groups. εk, the error rate of the classifier mk, can be calculated from Gmis. It is used to calculate αk and βk, which are the parameters showing the confidence level of the classifier and the weight for the next iteration, respectively. Finally, the weight of each misclassified group is set to βk; the weights of the other groups are set to 1.
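To make the control flow of Algorithm 1 concrete, the following Python sketch implements the boosting loop under stated assumptions: the inducer callable and the model's score method are hypothetical interfaces, and the confidence weight is taken as ln(βk), a standard AdaBoost choice from which the paper's exact formula may differ.

    import math

    def gre_boost(groups, inducer, K=10):
        # groups: one list of (features, label) candidates per unregistered
        #         portion; label is +1 for the correct unknown word, else -1.
        # inducer(groups, weights) -> model, where model.score(features)
        #         estimates the probability that a candidate is an unknown word.
        n = len(groups)
        weights = [1.0] * n                  # one weight per candidate group
        ensemble = []                        # collected (model, alpha_k) pairs
        for _ in range(K):
            model = inducer(groups, weights)
            # GRE-INCOR: a group is misclassified when its top-scored
            # candidate is not the correct unknown word.
            missed = {i for i, g in enumerate(groups)
                      if max(g, key=lambda c: model.score(c[0]))[1] != 1}
            eps = len(missed) / n            # misclassification rate eps_k
            if eps == 0.0 or eps >= 1.0:     # degenerate case: stop boosting
                ensemble.append((model, 1.0))
                break
            beta = (1.0 - eps) / eps         # beta_k, larger than 1 if eps < 0.5
            ensemble.append((model, math.log(beta)))  # alpha_k = ln(beta_k)
            # Re-weight: misclassified groups get beta_k, the others get 1.
            weights = [beta if i in missed else 1.0 for i in range(n)]
        return ensemble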
Voting Group-based Ranking Evaluation
From the previous step, we obtain a sequence of base classifiers, each attached with its confidence weight (αk). In this section, we propose a technique called Voting Group-based Ranking Evaluation to evaluate a group of unknown word candidates and predict the unknown word by combining votes from all base classifiers. Figure 6-6 shows the process of evaluating a given group of unknown word candidates. Each candidate in the group is fed to all the classifiers to obtain the probabilities that the candidate is a correct unknown word. Each probability is weighted by the confidence weight of the corresponding classifier, and these weighted probabilities are summed for each candidate. Finally, the candidate with the highest summed probability is chosen as the unknown word.
Figure 6-6: Voting Group-based Ranking Evaluation (source: TeCho et al., 2009b)
Algorithm 2: Voting GRE (V-GRE)
Input:  M = {(m1, α1), ..., (mK, αK)} is a set of base classifiers with their confidence weights
        G is a set of unknown word groups.
Output: U is a set, a member of which is the set of the top-t suggested unknown words
        for each unregistered portion.
1:  U ← ∅;
2:  foreach gi in G do
3:    S ← ∅;
4:    foreach cij in gi do
5:      sij ← 0;
6:      foreach (mk, αk) in M do
7:        p ← CLASSIFIER(mk, cij);  // probability that cij is an unknown word under mk
8:        sij ← sij + αk · p;
9:      end
10:     S ← S ∪ {(cij, sij)};
11:   end
12:   ui ← TOP-t-CANDIDATE(S);      // the t candidates with the highest summed scores
13:   U ← U ∪ {ui};
14: end
Algorithm 2 shows the evaluation process in detail. This algorithm uses as inputs a set of classification models M = {(m1, α1), ..., (mK, αK)} and a testing set G = {g1, ..., gn} with gi = {cij | j = 1, ..., ni}, where mk is the model generated at the k-th iteration, αk is the confidence weight of mk, gi is the group of unknown word candidates generated for the i-th unregistered portion, cij is the j-th candidate of the i-th unregistered portion, n is the number of unregistered portions, and ni is the number of unknown word candidates generated for the i-th unregistered portion. Each base classifier and each candidate are fed to the function CLASSIFIER to get p, the probability that the candidate is an unknown word according to that model. This probability is weighted by αk and added into the corresponding summation sij. Finally, the top-t candidates are chosen and returned as the set of predicted unknown words by TOP-t-CANDIDATE.
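A companion Python sketch of Algorithm 2, reusing the hypothetical (model, alpha) pairs produced by the gre_boost sketch above (all interface names remain illustrative):

    def v_gre(ensemble, groups, t=1):
        # For each group, rank candidates by the confidence-weighted sum of
        # the base classifiers' probabilities and keep the top-t candidates.
        suggestions = []
        for group in groups:                 # one group per unregistered portion
            ranked = sorted(
                group,
                key=lambda c: sum(alpha * model.score(c[0])
                                  for model, alpha in ensemble),
                reverse=True)
            suggestions.append(ranked[:t])   # TOP-t-CANDIDATE
        return suggestions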
6.3.3. Experimental Settings and Results
In the experiment, we used a corpus of 16,703 medical-related documents gathered from the WWW, taken from (Theeramunkong et al., 2007), with a size of 8.4 MB, for evaluation. The corpus was first preprocessed by removing HTML tags and all undesirable punctuation. To construct a set of features, we applied TCCs and the sorted sistring technique. After applying word segmentation to the running text, we detected 55,158 unregistered portions. Based on these unregistered portions, 3,209,306 unknown word candidates were generated according to the process described previously. Moreover, these 55,158 unregistered portions came from only 3,763 distinct words. In practice, each group of candidates may contain one or two positive labels. Therefore, 62,489 unknown word candidates were assigned as positive and 3,146,819 as negative. The average number of unknown word candidates in a group is around 58. Based on a preliminary statistical analysis of the Thai lexicon, we found that the average number of TCCs in a word is around 4.5.
In this work, to limit the number of generated unknown word candidates, the maximum number of TCCs surrounding an unregistered portion (h) is set to nine, twice the average number of TCCs in a word. With h = 9, the number of generated unknown word candidates per unregistered portion becomes 100 (up to h + 1 = 10 possible boundary positions on each side, giving 10 × 10 = 100 combinations). Moreover, it is possible to use the two sets of separation markers introduced earlier to reduce the number of candidates. Table 6-10 shows the numbers of candidates generated with and without applying the two sets of separation markers. The second and fourth columns indicate the distinct and total numbers, respectively. The third and fifth columns show the ratios over the numbers of candidates generated without considering any separation markers, for the distinct and total cases, respectively.
Table 6-10: Numbers of candidates generated with/without applying the two sets of separation markers, and their proportions compared to 'None' (source: TeCho et al., 2009b)
Marker Set # Distinct % Portion # Total % Portion
None 2,567,463 100.00 7,632,300 100.00
First Set 2,363,829 92.07 7,158,875 93.80
Second Set 1,295,737 50.47 4,241,097 55.57
First + Second Set 1,153,867 44.94 3,891,845 50.99
Exploiting a naïve Bayes classifier as the base classifier, the proposed methods, GRE-based boosting (hereafter GRE) and V-GRE, are used to learn ensemble classifiers and to identify unknown words. For V-GRE, the number of boosting iterations is set to ten; that is, ten classifiers are sequentially generated and used as a classification committee. Moreover, to evaluate our proposed method in detail, we conducted experiments to examine the effect of the eight features, (f1)-(f8), on the classification result by comparing the performance of each possible feature combination with the others.
In the experiments, 10-fold cross validation is employed to compare the proposed methods (GRE and V-GRE) with the record-based naive Bayesian method (R-NB). R-NB is a traditional naive Bayesian method in which all instances in the training/testing set are assumed to be independent of each other.
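Because candidates from the same unregistered portion form one group, the folds should not split a group between training and testing. One way to realize such a split is sketched below with scikit-learn's GroupKFold (the synthetic data, the GaussianNB stand-in, and all names here are illustrative; the original work uses its own naïve Bayes with smoothing over the eight features):

    import numpy as np
    from sklearn.model_selection import GroupKFold
    from sklearn.naive_bayes import GaussianNB  # stand-in base learner

    # Synthetic stand-in data: feature vectors X, labels y (+1/-1), and a
    # group id marking which unregistered portion each candidate came from.
    rng = np.random.default_rng(0)
    X = rng.random((580, 8))                   # e.g., 10 groups of 58 candidates
    y = np.where(np.arange(580) % 58 == 0, 1, -1)
    group_ids = np.arange(580) // 58

    # 10-fold CV that never splits a candidate group across folds.
    for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, group_ids):
        clf = GaussianNB().fit(X[train_idx], y[train_idx])
        # ...evaluate clf on the held-out groups with GRE / V-GRE...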
We investigated the performance of GRE, V-GRE, and R-NB when the top-t candidates, with t ranging from 1 to 10, are considered as correct answers. Table 6-11 displays the performance of the two group-based evaluations, GRE and V-GRE, as well as R-NB, for the all-feature set (f1-f8) and the best-5 feature sets ((f3,f4,f7), (f3,f4,f5), (f3,f4,f8), (f2,f4,f6,f7), (f4,f6,f8)). More precisely, the all-feature set performs well, ranking 12th among all 255 possible feature combinations. According to the results, a number of conclusions can be drawn as follows.
Firstly, V-GRE outperformed GRE on both the all-feature set and the best-5 feature sets for all top-t ranks. For the top-1 rank of the all-feature case, V-GRE achieved an accuracy of 90.93%±0.50 while GRE gained 84.14%±0.19. For higher ranks, V-GRE still outperformed GRE even though the gap becomes smaller; e.g., at rank 10, V-GRE gains 97.90%±0.26 while GRE gains 97.25%±0.17. V-GRE outperforms GRE with a gap of 6.79 points (90.93% − 84.14%) for the top-1 rank, but this gap is very small for the top-10 rank, i.e., 0.01 points (97.26% − 97.25%). In the case of the best feature set (f3,f4,f7), V-GRE achieves up to 93.93%±0.22 and 98.85%±0.15 accuracy for the top-1 and top-10 ranks, respectively, while GRE obtains 84.15%±0.64 and 97.24%±0.27. The result indicates that V-GRE is superior to GRE, with gaps of 9.78 and 1.61 points for the top-1 and top-10 ranks, respectively.
Secondly, V-GRE obtains higher accuracy than the record-based naive Bayesian method (R-NB) in most cases. GRE, however, may not be superior to R-NB at the top-1 rank, although it outperforms R-NB from the top-2 rank onward. Thirdly, the proposed V-GRE and GRE can find the correct unknown words within the top-10 ranks with a relatively high accuracy of 97%-98%.
Table 6-11: Accuracy comparison (%, mean±SD, for top-t ranks t = 1 to 10) among GRE, V-GRE and a naïve Bayes classifier (R-NB). Here, h is set to nine. For R-NB, which does not rank candidates within groups, a single accuracy is reported per feature set. (source: TeCho et al., 2009b)

Feature set    Technique  Top-1       Top-2       Top-3       Top-4       Top-5       Top-6       Top-7       Top-8       Top-9       Top-10
(f3,f4,f7)     GRE        84.15±0.64  91.18±0.43  93.49±0.40  94.85±0.36  95.74±0.33  96.28±0.33  96.59±0.32  96.82±0.30  97.04±0.28  97.24±0.27
               V-GRE      93.93±0.22  95.44±0.26  96.30±1.78  97.15±0.23  97.81±0.23  98.15±0.20  98.41±0.20  98.59±0.19  98.72±0.20  98.85±0.15
               R-NB       89.96±0.21
(f3,f4,f5)     GRE        89.48±0.46  90.43±0.41  92.57±0.29  94.98±0.41  95.36±0.37  95.67±0.34  95.82±0.33  96.07±0.29  96.18±0.28  96.42±0.26
               V-GRE      93.48±0.46  94.43±0.41  94.94±1.18  95.36±0.37  95.67±0.34  95.82±0.33  95.07±0.29  96.18±0.28  96.42±0.26  96.57±0.29
               R-NB       81.96±0.11
(f3,f4,f8)     GRE        90.63±0.32  95.28±0.21  95.42±0.32  95.89±0.26  96.12±0.27  96.43±0.25  96.57±0.22  96.79±0.23  96.94±0.22  97.15±0.19
               V-GRE      92.63±0.32  95.42±0.32  95.88±0.27  96.12±0.27  96.43±0.25  96.57±0.22  96.79±0.23  96.94±0.22  97.15±0.19  97.28±0.21
               R-NB       89.70±0.05
(f2,f4,f6,f7)  GRE        86.03±0.59  98.90±0.18  93.50±0.40  95.35±0.36  96.65±0.34  97.57±0.27  98.08±0.19  98.52±0.20  98.77±0.19  99.06±0.15
               V-GRE      91.95±0.39  96.52±0.20  97.44±0.18  98.01±0.13  98.45±0.10  98.66±0.11  98.85±0.11  98.98±0.14  99.06±0.15  99.12±0.16
               R-NB       89.70±0.08
(f4,f6,f8)     GRE        78.87±0.49  94.62±0.26  87.01±0.46  89.25±0.51  90.55±0.43  91.65±0.38  92.60±0.30  93.43±0.26  94.04±0.29  95.13±0.27
               V-GRE      90.78±0.76  93.99±0.42  94.83±0.34  95.31±0.33  95.73±0.30  96.08±0.34  96.41±0.30  96.69±0.29  96.85±0.28  97.03±0.25
               R-NB       87.89±0.07
(f1-f8)        GRE        84.14±0.19  91.71±0.22  93.52±0.33  94.86±0.25  95.74±0.24  96.29±0.20  96.60±0.17  96.83±0.19  97.04±0.22  97.25±0.17
               V-GRE      90.93±0.50  94.92±0.43  96.05±0.42  96.63±0.43  97.04±0.43  97.27±0.40  97.49±0.36  97.65±0.34  97.79±0.31  97.26±0.26
               R-NB       82.48±0.12
Conclusions
Section 6.3 has presented an automated method to recognize unknown words in Thai running text. We described how to map the problem to a classification task. A naïve Bayes classifier with a smoothing technique is investigated using eight features: the number of TCCs, the number of known words, the string length, the variety of left and right TCCs, the probability of special characters occurring on the left and right, the number of documents in which the candidate is found, the term frequency, and the TFIDF score. In practice, the unknown word candidates have relationships among them. To reduce the complexity of unknown word boundary identification, reduction approaches are employed that decrease the number of generated unknown word candidates by about 49%. This section also proposed the group-based ranking evaluation (GRE) technique, which considers the unknown word candidates as groups and thereby alleviates the unbalanced-dataset problem. To further improve the prediction of a classifier, we applied a boosting technique with voting under group-based ranking evaluation (V-GRE). We conducted a set of experiments on real-world data to evaluate the performance of the proposed approach. From the experimental results, the proposed technique achieves accuracies on the order of 90.93%±0.50 at the first rank and 97.90%±0.26 at the tenth rank. Our proposed ensemble method achieves an increase in classification accuracy on the order of 8.45% and 6.79% at the first rank when compared to the ordinary evaluation (R-NB) and the group-based ranking evaluation (GRE) technique, respectively. For more detail, the reader is referred to (TeCho et al., 2009b).