CHAPTER – 5
The Study of Related Technology
5.1 WordNet
5.2 Vector Space Model
5.3 Stop Words
5.4 Porter’s Stemming Algorithm
5.5 Summary
CHAPTER – 5
THE STUDY OF RELATED TECHNOLOGY
This chapter covers the study of the related technology required for
developing an integrated approach to ontology mapping, and includes
a brief discussion of the following:
WordNet, an on-line lexical database.
Vector Space Model, an Information Retrieval technique.
Stop Words, non-functional words.
Porter's Stemming Algorithm, a widely used algorithm for stemming.
5.1 WordNet
5.1.1 Introduction
WordNet [41] is an electronic lexical database of the English language
developed at Princeton University under the direction of George A.
Miller. It organizes lexical information in terms of word meanings
(semantics) rather than word forms (morphology). Mappings between
forms and meanings are many-to-many: some forms have several
different meanings (polysemy), and some meanings can be expressed
by several different forms (synonymy). WordNet organizes words
according to these relationships.
The basic unit of meaning in WordNet is the synset, which represents
a group of words with a synonymous meaning. A synset denotes a
concept (or a sense) shared by a group of terms. A word may appear in
more than one synset, indicating that the word has multiple senses.
WordNet is thus a semantic network of word senses.
Information in the WordNet database is organized around logical
groupings called synsets. Each synset consists of a list of synonymous
word forms and semantic pointers that describe relationships between
the current synset and other synsets. A word form can be a single
word, or two or more words connected by underscores, called a
collocation. The semantic pointers represent relations such as
hypernymy/hyponymy (superordinate/subordinate), antonymy,
entailment, and meronymy/holonymy between synsets.
All synsets in WordNet are categorized into four parts of speech:
nouns, verbs, adjectives, and adverbs. The current study uses only
the noun organization of WordNet. WordNet contains different types of
relations between noun synsets, but its main hierarchy is built on
hyponym/hypernym relations, which are similar to a
subclass/superclass structure. It also provides other relations, such
as meronymy (part-of relations), as well as textual descriptions of the
concepts (definitions and, optionally, examples), called glosses.
Some useful terminology used by WordNet in relation to its noun
organization is described below:
Synset: A synset is a synonym set; a set of words that are
interchangeable in some context without changing the truth value of
the proposition in which they are embedded.
Collocation: A collocation is a sequence of two or more words,
connected by spaces or hyphens, that together form a specific
meaning, such as “car pool”, “blue-collar”, and “line of products”.
Base form: The base form of a word or collocation is the form to which
inflections are added.
Lemma: The lowercase ASCII text of a word as found in the WordNet
database index files; usually the base form of a word or collocation.
Gloss: Each synset contains a gloss consisting of a definition and,
optionally, example sentences.
Semantic pointer: A semantic pointer indicates a relation between
synsets.
Synonymy/Antonymy: Synonymy and antonymy are lexical relations
between word forms, not semantic relations between word meanings.
Synonymy: Two expressions are synonymous in a linguistic context C
if the substitution of one for the other in C does not alter the truth
value.
Antonymy: Antonyms are synsets which are opposite in meaning.
Hyponymy/Hypernymy: Unlike synonymy and antonymy, which are
lexical relations between word forms, hyponymy/hypernymy is a
semantic relation between word meanings: e.g., {maple} is a hyponym
of {tree}, and {tree} is a hyponym of {plant}. Hyponymy/hypernymy is
also called subordination/superordination, subset/superset, or the
IS-A relation.
Hyponym (subordinate): The specific term used to designate a member
of a class. X is a hyponym of Y if X is a (kind of) Y; that is, a synset
which is a particular kind of another synset. For example, dog is a
hyponym of canine. A concept represented by the synset {x, x’, . . .} is
said to be a hyponym of the concept represented by the synset {y, y’, .
. .} if native speakers of English accept sentences constructed from
such frames as An x is a (kind of) y. Hyponymy is transitive and
asymmetrical.
Hypernym (super-ordinate): The generic term used to designate a
whole class of specific instances. Y is a hypernym of X if X is a (kind
of) Y. That is, a synset which is the more general class of another
synset. For example, canine is a hypernym of dog.
Coordinate: Coordinate terms are nouns or verbs that have the same
hypernym. Y is a coordinate term of X if X and Y share a hypernym.
For example, wolf is a coordinate term of dog, and dog is a coordinate
term of wolf.
Meronymy/Holonymy: The Part-of/Has-Part, Member-of/Has-Member,
and Substance-of/Has-Substance relations are types of the
meronym/holonym relation.
Meronym: The name of a constituent part of, the substance of, or a
member of something. X is a meronym of Y if X is a part of Y; that is,
a synset which is a part of another synset. For example, window is a
meronym of building. Meronymy is the part-whole (or HAS-A) relation.
A concept represented by the synset {x, x’, . . .} is a meronym of a
concept represented by the synset {y, y’, . . .} if native speakers of
English accept sentences constructed from such frames as A y has an
x (as a part) or An x is a part of y. The meronym relation is transitive
(with qualifications) and asymmetrical (Cruse, 1986), and can be used
to construct a part hierarchy with certain restrictions, since a meronym
can have many holonyms.
Holonym: The name of the whole of which the meronym names a part.
Y is a holonym of X if X is a part of Y; that is, a synset which is the
whole of which another synset is a part. For example, building is a
holonym of window.
5.1.2 Usage of WordNet
WordNet is widely used in linguistic applications as a lexical resource
to perform various tasks such as Word Sense Disambiguation,
Machine Translation, IR, and Named Entity Recognition. It is used in
ontology mapping to derive the relationship between two terms (labels
of concepts) t1 and t2 by exploring the semantic information related to
these two concepts, as described below.
A very common usage is to find synonyms for a given term. If
two terms belong to some common synset, they are similar;
that is, t1 = t2 if both are part of one synset in WordNet.
To measure the distance between the synsets (corresponding to two
terms) in the hypernym structure. If the path connecting two
synsets is short, the similarity between the two terms is high, and
vice versa.
To find the distance between two synsets using their glosses. It is
assumed that the more overlapping words there are in the glosses
related to two terms, the more similar the terms are.
To decide that t1 is less general than t2 if t1 is a hyponym or
meronym of t2 (e.g., Dog << Canine).
To decide that t1 is more general than t2 if t1 is a
hypernym or holonym of t2 (e.g., India >= Gujarat).
To determine that t1 and t2 are disjoint if they are connected by
antonym relations or if they are siblings in the part-of hierarchy
(a small illustrative sketch of these lookups is given below).
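As a brief illustration, the sketch below shows how such lookups might be
performed in Python using the WordNet interface bundled with NLTK. The choice
of NLTK, and the particular synsets queried, are assumptions made for
illustration only; the thesis does not prescribe a specific API.

    # Assumes NLTK is installed and the WordNet corpus has been downloaded
    # via nltk.download('wordnet').
    from nltk.corpus import wordnet as wn

    # Synonym test: two terms are similar if they share at least one synset.
    print(set(wn.synsets('car')) & set(wn.synsets('automobile')))   # non-empty set

    # Hypernym structure: dog is a (kind of) canine, so dog is less general.
    dog, canine = wn.synset('dog.n.01'), wn.synset('canine.n.02')
    print(canine in dog.hypernyms())                                # True

    # Gloss of a sense, usable for gloss-overlap comparison.
    print(dog.definition())

    # Path-based distance in the hypernym hierarchy
    # (a shorter path means higher similarity).
    print(dog.path_similarity(wn.synset('cat.n.01')))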
Thus, WordNet is used in ontology mapping to calculate the semantic
similarity between two concept labels, unlike Edit Distance, which
computes morphological similarity. Many similarity measures based on
the WordNet knowledge base have been proposed. For example, Lin et al.,
as cited in [61], define the similarity between two senses in WordNet as
Sim_WN(ei1, ej2) = 2 × log(p(s)) / ( log(p(s1)) + log(p(s2)) )
where p(s) = count(s)/total is the probability of a randomly selected
word occurring in synset s or any of its sub-synsets, total is the
number of words in WordNet, and the synset s is the common
hypernym of the senses s1 and s2 (of ei1 and ej2) in WordNet.
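The measure can be computed directly from WordNet. The following is a minimal
sketch using NLTK's WordNet interface together with the Brown
information-content file, which supplies the probabilities p(s); both the
toolkit and the chosen senses are assumptions made for illustration.

    # Assumes nltk.download('wordnet') and nltk.download('wordnet_ic') have been run.
    from nltk.corpus import wordnet as wn, wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')      # corpus-based estimates of p(s)
    s1 = wn.synset('student.n.01')
    s2 = wn.synset('teacher.n.01')
    print(s1.lin_similarity(s2, brown_ic))        # value in [0, 1]; higher means more similar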
Figure 27 shows the different senses of the noun “Student”, and
Figure 28 shows a snapshot of the hyponym hierarchy for the noun
“Employee” in WordNet.
Figure 27: Senses for noun “Student” in WordNet
WordNet may be used through a local GUI/CLI or a Web-based interface.
APIs in various programming languages, written by different WordNet
users, are also available.
5.2 Vector Space Model (VSM)
5.2.1 Introduction
Traditional Information Retrieval (IR) systems consider documents and
queries as bags of words, assuming that the set of words in a document
is representative of its content and meaning. They also assume that if a
document shares common words with the query, then it is likely to be
a relevant answer to that query.
Figure 28: Partial hyponym hierarchy for the noun “Employee” in
WordNet
This assumption serves as the underlying foundation of the traditional
keyword search engine, in which a user describes his or her need by
entering keywords. The major techniques used to implement IR
systems are: Boolean Matching, Extended Boolean Matching, Vector
Space Model, Probabilistic Modeling, Fuzzy Set Model, Latent
Semantic Indexing, and Neural Networks. One of the most popular
methods of implementing IR systems is the VSM.
The VSM is an algebraic model used for Information Retrieval. It is
based on the assumption that the meaning of a document can be
understood from the document’s constituent terms. It represents a
natural language document in a formal manner by the use of vectors
in a multi-dimensional space [22]. It represents the terms (words) of
documents and the terms of queries in an n-dimensional term space,
allows the similarities between queries and documents to be computed,
and allows the results of the computation to be ranked according to
the similarity measure between them.
Example - 30: Two Documents d1 and d2; and a Query q
Document d1=“ISTAR runs MCA program and MIT program”
Document d2=“GDCST runs MCA and PhD programs”
Query q=”GDCST”
The steps of the basic VSM [76] for Example - 30 are shown below:
Step-1: Each document is broken down into a word frequency
table. The tables are called vectors and can be stored as arrays.
The word frequency table for document d1 shown in Example - 30 is:
ISTAR Runs MCA Program And MIT
1 1 1 2 1 1
The word frequency table for document d2 shown in Example - 30 is:
GDCST Runs MCA And PHD programs
1 1 1 1 1 1
Step-2: A vocabulary of distinct words is built from all the words
taken together from all documents in the system. Usually, the
vocabulary is sorted. The vocabulary for Example - 30 is:
ISTAR runs MCA Program and MIT GDCST PhD Programs
Hence, the sorted vocabulary for Example - 30 is:
And GDCST ISTAR MCA MIT PhD program programs runs
Step-3: Each document and user query is represented as a vector
against this sorted vocabulary.
The word frequency table for document d1 with respect to above
sorted vocabulary is:
And GDCST ISTAR MCA MIT PhD Program Programs runs
1 0 1 1 1 0 2 0 1
Therefore, the vector for document d1, denoted by dv1, is:
dv1 = (1, 0, 1, 1, 1, 0, 2, 0, 1).
The word frequency table for document d2 with respect to above
sorted vocabulary is:
And GDCST ISTAR MCA MIT PhD Program Programs runs
1 1 0 1 0 1 0 1 1
Therefore, the vector for document d2, denoted by dv2, is:
dv2 = (1, 1, 0, 1, 0, 1, 0, 1, 1).
A user’s query can be represented as a vector in the same way as the
documents. The vector for the query shown in Example - 30 is:
qv = (0, 1, 0, 0, 0, 0, 0, 0, 0)
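The construction of these vectors can be expressed in a few lines of plain
Python. The sketch below is an illustrative implementation of Steps 1 to 3
for Example - 30 (note that Python's sorted() places capitalized words before
lowercase ones, so the vocabulary order differs slightly from the table above;
the similarity scores are unaffected).

    from collections import Counter

    d1 = "ISTAR runs MCA program and MIT program"
    d2 = "GDCST runs MCA and PhD programs"
    q  = "GDCST"

    def tokenize(text):
        return text.split()          # naive whitespace tokenization

    # Step-1: word frequency table (vector) per document.
    freq_d1, freq_d2 = Counter(tokenize(d1)), Counter(tokenize(d2))
    freq_q = Counter(tokenize(q))

    # Step-2: sorted vocabulary of all distinct words.
    vocabulary = sorted(set(tokenize(d1)) | set(tokenize(d2)))

    # Step-3: represent each document and the query against the vocabulary.
    dv1 = [freq_d1[w] for w in vocabulary]
    dv2 = [freq_d2[w] for w in vocabulary]
    qv  = [freq_q[w] for w in vocabulary]
    print(vocabulary)
    print(dv1, dv2, qv)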
Step-4: Calculate the similarity measure of the query with every document
in the collection
Using a similarity measure, a set of documents can be compared to a
query, and the most similar documents are returned. The similarity in
the VSM is determined using associative coefficients based on the
inner product of the document vector and the query vector, where word
overlap indicates similarity. There are many different ways to measure
how similar two vectors are, such as the Cosine Measure, the Dice
Coefficient, and the Jaccard Coefficient. The most popular similarity
measure is the cosine coefficient, which measures the angle between a
document vector and the query vector in a high-dimensional virtual
space [67], as depicted in Figure 29. In practice, it is easier to calculate
the cosine of the angle between the vectors, called the cosine similarity,
rather than the angle itself.
Figure 29: Cosine of the angle between the vectors d1, d2 and query q
For two vectors dv1 and dv2, the cosine similarity, denoted by
cos(dv1, dv2), is given by:
cos(dv1, dv2) = (dv1 · dv2) / (|dv1| × |dv2|)
Here, dv1 · dv2 is the inner (dot) product of dv1 and dv2, calculated by
multiplying corresponding frequencies together and summing the results.
|dv1| and |dv2| are the norms of the vectors dv1 and dv2, respectively.
The norm of a vector v, denoted by |v|, is given by:
|v| = SQRT(v1² + v2² + ... + vn²), where vi is the i-th component of v.
The similarity between document d1 and query q, denoted by sim(d1, q),
is:
sim(d1, q) = (dv1 · qv) / (|dv1| × |qv|) = 0/(3 × 1) = 0
since
dv1 · qv = (1, 0, 1, 1, 1, 0, 2, 0, 1) · (0, 1, 0, 0, 0, 0, 0, 0, 0)
= 1×0 + 0×1 + 1×0 + 1×0 + 1×0 + 0×0 + 2×0 + 0×0 + 1×0
= 0
|dv1| = SQRT(1² + 0² + 1² + 1² + 1² + 0² + 2² + 0² + 1²) = SQRT(9) = 3
|qv| = SQRT(0² + 1² + 0² + 0² + 0² + 0² + 0² + 0² + 0²) = SQRT(1) = 1
The similarity between document d2 and query q is:
sim(d2, q) = (dv2 · qv) / (|dv2| × |qv|) = 1/(2.45 × 1) = 0.41
since
dv2 · qv = (1, 1, 0, 1, 0, 1, 0, 1, 1) · (0, 1, 0, 0, 0, 0, 0, 0, 0)
= 1×0 + 1×1 + 0×0 + 1×0 + 0×0 + 1×0 + 0×0 + 1×0 + 1×0
= 1
|dv2| = SQRT(1² + 1² + 0² + 1² + 0² + 1² + 0² + 1² + 1²) = SQRT(6) = 2.45
|qv| = SQRT(0² + 1² + 0² + 0² + 0² + 0² + 0² + 0² + 0²) = SQRT(1) = 1
Step-5: Rank the documents by relevance and display them to the user
The relevance ranking of documents is calculated by comparing the
deviation of angles between each document vector and the query
vector. That is, the query is compared to all documents using the
similarity measure, and the user is shown the documents in
decreasing order of their similarity to the query term. The ranks of
documents d1 and d2 with respect to their similarity to query q are
given in the following table; it indicates that document d2 is more
relevant to query q than document d1.

Query   Document   Similarity Score   Rank
q       d2         0.41               1
q       d1         0.00               2
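A minimal sketch of Steps 4 and 5 in Python, using the document and query
vectors from the tables above, is shown below; it reproduces the scores 0 and
0.41 and ranks the documents accordingly.

    import math

    # Vectors in the vocabulary order used above:
    # And GDCST ISTAR MCA MIT PhD Program Programs runs
    dv1 = [1, 0, 1, 1, 1, 0, 2, 0, 1]
    dv2 = [1, 1, 0, 1, 0, 1, 0, 1, 1]
    qv  = [0, 1, 0, 0, 0, 0, 0, 0, 0]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))       # inner product
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # Step-4: similarity of the query with every document.
    scores = {"d1": cosine(dv1, qv), "d2": cosine(dv2, qv)}

    # Step-5: rank documents in decreasing order of similarity.
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    print(ranking)   # [('d2', 0.408...), ('d1', 0.0)] -> d2 is more relevant to q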
Generally, the similarity score of every term in the collection to every
document in the collection is pre-calculated and kept ready, so that
when a query term is posed, the top-ranked documents for that term
can be returned to the user immediately.
5.2.2 Variation in VSM
The various additional refinements used in VSM include the notion of
Term Weighting, Normalized Term Frequency, Inverse Document
Frequency, and Stemming.
Not all words are equally useful. A word is most likely to be highly
relevant to document d if it is infrequent in other documents and
frequent in document d. The cosine measure needs to be modified to
reflect this, which is done with the tf-idf measure, where tf is the term
frequency and idf is the inverse document frequency, as described
below.
5.2.2.1 Term weighting/Term Frequency (tf)
The simplest approach to assigning a weight to a term is to count the
number of occurrences of term t in document d. This weighting
scheme is called term frequency (tf), denoted by tf(t, d), and is given
by:
tf(t, d) = f(t, d), where f(t, d) is the frequency of term t in document d.
It may be scaled logarithmically, for example:
tf(t, d) = 1 + log(f(t, d)), or 0 when f(t, d) = 0.
5.2.2.2 The normalized term frequency
The tf is generally normalized with respect to the term having the
highest number of occurrences in a document, in order to prevent bias
towards longer documents, i.e., to prevent large documents from
scoring higher than small documents. It is given by:
ntf(t, d) = tf(t, d) / max(tf(w, d)), where w is any word in document d.
Alternatively, it may be given by:
ntf(t, d) = log(1 + n(d, t) / n(d)),
where n(d, t) is the number of occurrences of the term t in document d,
and n(d) is the number of terms in document d [67]. In the following
sections, tf represents either tf or ntf.
5.2.2.3 Inverse Document Frequency (IDF)
The tf alone considers all terms equally important (as having the same
discriminating power) in determining relevance, including terms that
appear in almost all documents. To discount the importance of
common terms that occur too often in the collection, the concept of
collection frequency (cf) is used; it scales down the tf of terms with a
high collection frequency. The cf is defined as the total number of
occurrences of a term in the collection of documents. To be more
precise in determining the discriminating power of terms, the
document frequency (df) is used instead of cf. The df is defined as the
number of documents in the collection that contain a term t, and is
denoted by df(t, D). Based on this concept, the inverse document
frequency is calculated, which is designed to make rare words more
important than common words: IDF yields high values for rare words
and low values for common words [67].
The idf of a term t with respect to a collection of documents D is given by:
idf(t, D) = log(N / df(t, D))
where N is the total number of documents in the collection, and df(t, D)
is the number of documents containing the term t.
The base of the log function is immaterial, as it simply constitutes a
constant multiplicative factor. For a query containing a novel term, the
denominator may be zero; hence, 1 + df(t, D) is sometimes used instead
of df(t, D) in the denominator of the above formula.
5.2.2.4 Tf-idf weighting
The tf-idf is a composite weight assigned to a term t in a document d. It
combines tf (a local parameter) and idf (a global parameter) and reflects
how important a word is to a document in a collection.
It is given by:
tf-idf(t, d, D) = tf(t, d) × idf(t, D)
The tf-idf value increases proportionally with the number of times a term
appears in the document, but is offset by the frequency of the term in
the collection.
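A small sketch of tf-idf weighting applied to the documents of Example - 30 is
given below; the raw term frequency and an unsmoothed idf with a base-10
logarithm are assumptions made purely for illustration.

    import math
    from collections import Counter

    docs = {
        "d1": "ISTAR runs MCA program and MIT program".split(),
        "d2": "GDCST runs MCA and PhD programs".split(),
    }
    N = len(docs)                                                  # number of documents
    tf = {name: Counter(tokens) for name, tokens in docs.items()}  # term frequencies
    df = Counter(t for tokens in docs.values() for t in set(tokens))  # document frequencies

    def tf_idf(term, doc_name):
        idf = math.log10(N / df[term])   # df > 0 whenever the term occurs in the collection
        return tf[doc_name][term] * idf

    print(tf_idf("program", "d1"))   # occurs only in d1 -> positive weight (about 0.6)
    print(tf_idf("MCA", "d1"))       # occurs in both documents -> idf = log10(1) = 0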
5.2.2.5 Stop words
Commonly occurring words are unlikely to give useful information and
may be removed from the vocabulary to expedite processing, since
doing so reduces the dimensionality of the term space.
5.2.2.6 Stemming
Stemming is the process of removing suffixes from words to obtain their
common root. In statistical analysis, it greatly helps when
comparing texts to be able to identify words with a common meaning
and form as being identical. For example, we would like to count the
words stopped and stopping as being the same and derived from stop.
Stemming identifies these common forms [09].
5.2.3 Pros and Cons of VSM
Pros
Simple model based on linear algebra and fairly cheap to
compute.
Yields decent effectiveness. Allows computing a continuous
degree of similarity between queries and documents, and hence
allows ranking documents according to their relevance to the
query.
Very popular.
Cons
No theoretical foundation, i.e., there is no real theoretical basis
for the assumption of a term space. It is more a device for
visualization than a model with a firm theoretical basis.
Most similarity measures work about the same regardless of
model.
Weighting is intuitive but not very formal. Weights in the vectors
are very arbitrary.
Terms are not really orthogonal dimensions.
Assumes term independence. Terms are not independent of
other terms in the document.
Results in low recall if query and documents use different
vocabulary to represent the same concept.
The order of terms is ignored due to bag-of-words model.
5.3 Stop Words
Stop words are non-context-bearing words that are excluded from
processing by many IR systems. Sometimes they also include frequently
occurring words specific to the application domain in which they need to
be ignored. For example, while processing HTML documents, all HTML
tags may be considered stop words. The stop-word list may be
pre-defined or may be generated dynamically for a given corpus of
documents. The removal of stop words may expedite processing.
However, stop-word elimination must be exercised carefully;
otherwise, it may affect the system adversely. Lists of commonly
used stop words are available at the following URLs (last accessed:
09-03-2011).
http://www.textfixer.com/resources/common-english-words.txt
http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
http://www.lextek.com/manuals/onix/stopwords1.html
http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
Table 5 shows the top 20 stop words according to their average
frequency per 1000 words, which are: the, of, and, to, a, in, that, is,
was, he, for, it, with, as, his, on, be, at, by, and I [24]. Figure 30
lists the most commonly used stop words, balancing the coverage and
the size of the stop-word set.
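The following is a minimal sketch of stop-word removal in Python; the small
stop-word set used here is only illustrative (a real system would load one of
the lists cited above).

    STOP_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "was", "he",
                  "for", "it", "with", "as", "his", "on", "be", "at", "by", "i"}

    def remove_stop_words(text):
        # Keep only the words that are not in the stop-word set.
        return [w for w in text.lower().split() if w not in STOP_WORDS]

    print(remove_stop_words("The study of related technology for ontology mapping"))
    # ['study', 'related', 'technology', 'ontology', 'mapping']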
5.4 Porter’s Stemming Algorithm
The morphological variants of words have similar semantic
interpretations and can be considered equivalent in some contexts.
For this reason, many applications that need to compare terms
(words) use a stemming process to reduce a word to its stem
(root/base) form.
Table 5: The top 20 stop words
Rank  Word   Per 1000     Rank  Word   Per 1000
1     THE    70           11    FOR    9
2     OF     36           12    IT     9
3     AND    29           13    WITH   7
4     TO     26           14    AS     7
5     A      23           15    HIS    7
6     IN     21           16    ON     7
7     THAT   11           17    BE     6
8     IS     10           18    AT     5
9     WAS    10           19    BY     5
10    HE     10           20    I      5
Using stemming algorithms, called stemmers, terms having
morphological variants are conflated to a single representative form.
This reduces the term space to be processed; thus, it saves not only
storage space but also processing time.
However, it should be noted that stemming does not give a 100% success
rate [55]. For example, connect, connected, connecting, connection,
and connections have similar meanings and are reduced to a
single stem by a stemmer, while the same is not true for wand and wander.
Despite this fact, stemming algorithms are used extensively by
thesauri, NLP and linguistic applications, search engines, and many
other IR applications. Examples of different stemming algorithms
are [69]:
Paice/Husk Stemming Algorithm
Porter Stemming Algorithm
Lovins Stemming Algorithm
Dawson Stemming Algorithm
Krovetz Stemming Algorithm
a, able, about, above, across, after, afterwards, again, against, all, almost, alone,along, already, also, although, always, am, among, amongst, amount, an, and,another, any, anybody, anyhow, anyone, anything, anyway, anywhere, are, area,areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing,backs, be, became, because, become, becomes, becoming, been, before, beforehand,began, behind, being, beings, below, beside, besides, best, better, between, beyond,big, bill, both, bottom, but, by, c, call, came, can, cannot, cant, case, cases, certain,certainly, clear, clearly, co, come, computer, con, could, couldn’t, cry, d, de, dear,describe, detail, did, differ, different, differently, do, does, done, down, downed,downing, downs, due, during, e, each, early, e.g., eight, either, eleven, else,elsewhere, empty, end, ended, ending, ends, enough, etc, even, evenly, ever, every,everybody, everyone, everything, everywhere, except, f, face, faces, fact, facts, far,felt, few, fifteen, fifty, fill, find, finds, fire, first, five, for, former, formerly, forty,found, four, from, front, full, fully, further, furthered, furthering, furthers, g, gave,general, generally, get, gets, give, given, gives, go, going, good, goods, got, great,greater, greatest, group, grouped, grouping, groups, h, had, has, hasn’t, have,having, he, hence, her, here, hereafter, hereby, herein, hereupon, hers, herself, high,higher, highest, him, himself, his, how, however, hundred, i, i.e., if, important, in,inc, indeed, interest, interested, interesting, interests, into, is, it, its, itself, j, just, k,keep, keeps, kind, knew, know, known, knows, l, large, largely, last, later, latest,latter, latterly, least, less, let, lets, like, likely, long, longer, longest, ltd, m, made,make, making, man, many, may, me, meanwhile, member, members, men, might,mill, mine, more, moreover, most, mostly, move, Mr., Mrs., much, must, my, myself,n, name, namely, necessary, need, needed, needing, needs, neither, never,nevertheless, new, newer, newest, next, nine, no, nobody, non, none, nor, not,nothing, now, nowhere, number, numbers, o, of, off, often, old, older, oldest, on,once, one, only, onto, open, opened, opening, opens, or, order, ordered, ordering,orders, other, others, otherwise, our, ours, ourselves, out, over, own, p, part, parted,parting, parts, per, perhaps, place, places, please, point, pointed, pointing, points,possible, present, presented, presenting, presents, problem, problems, put, puts, q,quite, r, rather, re, really, right, room, rooms, s, said, same, saw, say, says, second,seconds, see, seem, seemed, seeming, seems, sees, serious, several, shall, she,should, show, showed, showing, shows, side, sides, since, sincere, six, sixty, small,smaller, smallest, so, some, somebody, somehow, someone, something, sometime,sometimes, somewhere, state, states, still, such, sure, system, t, take, taken, ten,than, that, the, their, them, themselves, then, thence, there, thereafter, thereby,therefore, therein, thereupon, these, they, thick, thin, thing, things, think, thinks,third, this, those, though, thought, thoughts, three, through, throughout, thru,thus, to, today, together, too, took, top, toward, towards, turn, turned, turning,turns, twelve, twenty, two, u, un, under, until, up, upon, us, use, used, uses, v,very, via, w, want, wanted, wanting, wants, was, way, ways, we, well, wells, went,were, what, whatever, when, whence, whenever, where, whereas, whereby, wherein,whereupon, wherever, whether, which, while, whither, 
who, whoever, whole, whom,whose, why, will, with, within, without, work, worked, working, works, would, x, y,year, years, yet, you, young, younger, youngest, your, yours, yourself, yourselves, z,
Figure 30: List of Most Common Stop words
One of the most popular stemming algorithms is the Porter Stemming
Algorithm. It is used, as part of a term normalization process, to
remove common morphological and inflectional endings from
words in English. In English, most of the morphological variation
takes place at the right-hand end of a word form, and hence a stemmer
for English is sometimes also known as a suffix stripping algorithm. The
Porter Stemming Algorithm uses an explicit list of suffixes and, for each
suffix, a criterion under which it can be removed from a word
to leave a valid stem. It treats complex suffixes as compounds made
up of simple suffixes, and removes simple suffixes in a number of
steps. The complete description of the algorithm is given in [55]. The
original stemmer was written in the BCPL programming language, but it
has been adapted to various modern programming languages.
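The following is a minimal sketch using the Porter stemmer implementation
available in NLTK (the use of NLTK is an assumption; any other implementation
of the algorithm would serve equally well).

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    words = ["connect", "connected", "connecting", "connection", "connections"]
    print([stemmer.stem(w) for w in words])   # all five variants reduce to the stem 'connect'
    print(stemmer.stem("wand"), stemmer.stem("wander"))   # unrelated words keep distinct stems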
5.5 Summary
WordNet organizes information around logical groupings called
synsets. Each synset consists of a list of synonymous word forms and
semantic pointers that describe relationships between the current
synset and other synsets. It is widely used to derive semantic
similarity between words.
The VSM is an algebraic model used for Information Retrieval. It is
based on the assumption that the meaning of a document can be
understood from the document’s constituent terms. It allows ranking
documents for their relevance to query using a similarity measure
such as cosine similarity.
Stop words, also known as non-context-bearing words, are commonly
occurring words that are unlikely to give useful information. Hence,
they are excluded from processing by many IR systems to reduce
storage space and processing time.
Words with the same meaning may appear in various morphological
forms. To capture their similarity, they are normalized using a
stemming algorithm into a common root/base form, called the stem.