CHAPTER – 5
The Study of Related Technology
5.1 WordNet
5.2 Vector Space Model
5.3 Stop Words
5.4 Porter’s Stemming Algorithm
5.5 Summary
CHAPTER – 5
THE STUDY OF RELATED TECHNOLOGY
This chapter covers the study of the related technology required for
developing an integrated approach to ontology mapping, and includes
a brief discussion of the following:
WordNet, an on-line lexical database.
Vector Space Model, an Information Retrieval technique.
Stop Words, non-functional words.
Porter's Stemming Algorithm, a widely used algorithm for stemming.
5.1 WordNet
5.1.1 Introduction
WordNet [41] is an electronic lexical database of the English language
developed at Princeton University under the direction of George A.
Miller. It organizes lexical information in terms of word meanings
(semantics) rather than word forms (morphology). Mappings between
forms and meanings are many-to-many: some forms have several
different meanings (polysemy), and some meanings can be expressed
by several different forms (synonymy). WordNet organizes words
according to these relationships.
The basic unit of meaning in WordNet is the synset, which represents
a group of words with a synonymous meaning. A synset denotes a
concept (or a sense) shared by a group of terms. A word may appear in
more than one synset, indicating that the word has multiple senses.
WordNet is thus a semantic network of word senses.
Information in the WordNet database is organized around logical
groupings called synsets. Each synset consists of a list of synonymous
word forms and semantic pointers that describe relationships between
the current synset and other synsets. A word form can be a single
word, or two or more words connected by underscores, called a
collocation. The semantic pointers represent relations such as
hypernymy/hyponymy (superordinate/subordinate), antonymy,
entailment, and meronymy/holonymy between synsets.
All synsets in WordNet are categorized into four parts of speech:
nouns, verbs, adjectives, and adverbs. The current study uses only
the noun organization of WordNet. WordNet contains different types of
relations between noun synsets, but its main hierarchy is built on
hyponym/hypernym relations, which are similar to a
subclass/superclass structure. It also provides other relations, such
as meronymy (part-of relations), as well as textual descriptions of the
concepts (definitions and, optionally, examples), called glosses.
Some useful terminology used by WordNet in relation to its noun
organization is described below:
Synset: A synset is a synonym set; a set of words that are
interchangeable in some context without changing the truth value of
the proposition in which they are embedded.
Collocation: A collocation is a sequence of two or more words,
connected by spaces or hyphens, that together form a specific
meaning, such as “car pool”, “blue-collar”, and “line of products”.
Base form: The base form of a word or collocation is the form to which
inflections are added.
Lemma: The lowercase ASCII text of a word as found in the WordNet
database index files; usually the base form of a word or collocation.
Gloss: Each synset contains a gloss consisting of a definition and,
optionally, example sentences.
Semantic pointer: A semantic pointer indicates a relation between
synsets.
Synonymy/Antonymy: Synonymy and antonymy are lexical relations
between word forms, not semantic relations between word meanings.
Synonymy: Two expressions are synonymous in a linguistic context C
if the substitution of one for the other in C does not alter the truth
value.
Antonymy: Antonyms are synsets which are opposite in meaning.
Hyponymy/Hypernymy: Unlike synonymy and antonymy, which are
lexical relations between word forms, hyponymy/hypernymy is a
semantic relation between word meanings: e.g., {maple} is a hyponym
of {tree}, and {tree} is a hyponym of {plant}. Hyponymy/hypernymy is
also called subordination/superordination, subset/superset, or the
IS-A relation.
Hyponym (subordinate): The specific term used to designate a member
of a class. X is a hyponym of Y if X is a (kind of) Y; that is, a synset
which is a particular kind of another synset. For example, dog is a
hyponym of canine. A concept represented by the synset {x, x’, . . .} is
said to be a hyponym of the concept represented by the synset {y, y’, .
. .} if native speakers of English accept sentences constructed from
such frames as An x is a (kind of) y. Hyponymy is transitive and
asymmetrical.
Hypernym (super-ordinate): The generic term used to designate a
whole class of specific instances. Y is a hypernym of X if X is a (kind
of) Y. That is, a synset which is the more general class of another
synset. For example, canine is a hypernym of dog.
Coordinate: Coordinate terms are nouns or verbs that have the same
hypernym. Y is a coordinate term of X if X and Y share a hypernym.
For example, wolf is a coordinate term of dog, and dog is a coordinate
term of wolf.
Meronymy/Holonymy: The Part-of/Has-Part, Member-of/Has-Member,
and Substance-of/Has-Substance relations are types of the
meronym/holonym relation.
Meronym: The name of a constituent part of, the substance of, or a
member of something. X is a meronym of Y if X is a part of Y; that is,
a synset which is a part of another synset. For example, window is a
meronym of building. Meronymy is the part-whole (or HAS-A) relation.
A concept represented by the synset {x, x’, . . .} is a meronym of a
concept represented by the synset {y, y’, . . .} if native speakers of
English accept sentences constructed from such frames as A y has an
x (as a part) or An x is a part of y. The meronym relation is transitive
(with qualifications) and asymmetrical (Cruse, 1986), and can be used
to construct a part hierarchy with certain restrictions, since a meronym
can have many holonyms.
Holonym: The name of the whole of which the meronym names a part.
Y is a holonym of X if X is a part of Y; that is, a synset which is the
whole of which another synset is a part. For example, building is a
holonym of window.
5.1.2 Usage of WordNet
WordNet is widely used in linguistic applications as a lexical resource
to perform various tasks such as Word Sense Disambiguation,
Machine Translation, IR, and Named Entity Recognition. It is used in
ontology mapping to derive the relationship between two terms (labels
of concepts) t1 and t2 by exploring the semantic information related to
these two concepts, as described below.
A very common usage is to find synonyms for a given term. If
two terms belong to some common synset, they are similar;
that is, t1 = t2 if both are part of one synset in WordNet.
To measure the distance between the synsets (corresponding to two
terms) in the hypernym structure. If the path connecting two
synsets is short, the similarity between the two terms is high, and
vice versa.
To find the distance between two synsets using their glosses. It is
assumed that the more overlapping words there are in the glosses
related to two terms, the more similar the terms are.
To decide that t1 is less general than t2 if t1 is a hyponym or
meronym of t2 (e.g., Dog << Canine).
To decide that t1 is more general than t2 if t1 is a
hypernym or holonym of t2 (e.g., India >= Gujarat).
To determine that t1 and t2 are disjoint if they are connected by
antonym relations or if they are siblings in the part-of hierarchy
(a small illustrative sketch of these lookups is given below).
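As a brief illustration, the sketch below shows how such lookups might be
performed in Python using the WordNet interface bundled with NLTK. The choice
of NLTK, and the particular synsets queried, are assumptions made for
illustration only; the thesis does not prescribe a specific API.

    # Assumes NLTK is installed and the WordNet corpus has been downloaded
    # via nltk.download('wordnet').
    from nltk.corpus import wordnet as wn

    # Synonym test: two terms are similar if they share at least one synset.
    print(set(wn.synsets('car')) & set(wn.synsets('automobile')))   # non-empty set

    # Hypernym structure: dog is a (kind of) canine, so dog is less general.
    dog, canine = wn.synset('dog.n.01'), wn.synset('canine.n.02')
    print(canine in dog.hypernyms())                                # True

    # Gloss of a sense, usable for gloss-overlap comparison.
    print(dog.definition())

    # Path-based distance in the hypernym hierarchy
    # (a shorter path means higher similarity).
    print(dog.path_similarity(wn.synset('cat.n.01')))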
Thus, WordNet is used in ontology mapping to calculate the semantic
similarity between two concept labels, unlike Edit Distance, which
computes morphological similarity. Many similarity measures based on
the WordNet knowledge base have been proposed. For example, Lin et al.,
as cited in [61], define the similarity between two senses in WordNet as
Sim_WN(ei1, ej2) = 2 × log(p(s)) / ( log(p(s1)) + log(p(s2)) )
where p(s) = count(s)/total is the probability of a randomly selected
word occurring in synset s or any of its sub-synsets, total is the
number of words in WordNet, and the synset s is the common
hypernym of the senses s1 and s2 (of ei1 and ej2) in WordNet.
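The measure can be computed directly from WordNet. The following is a minimal
sketch using NLTK's WordNet interface together with the Brown
information-content file, which supplies the probabilities p(s); both the
toolkit and the chosen senses are assumptions made for illustration.

    # Assumes nltk.download('wordnet') and nltk.download('wordnet_ic') have been run.
    from nltk.corpus import wordnet as wn, wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')      # corpus-based estimates of p(s)
    s1 = wn.synset('student.n.01')
    s2 = wn.synset('teacher.n.01')
    print(s1.lin_similarity(s2, brown_ic))        # value in [0, 1]; higher means more similar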
Figure 27 shows the different senses of the noun “Student”, and
Figure 28 shows a snapshot of the hyponym hierarchy for the noun
“Employee” in WordNet.
Figure 27: Senses for noun “Student” in WordNet
WordNet may be used through a local GUI/CLI or a Web-based interface.
APIs in various programming languages, written by different WordNet
users, are also available.
5.2 Vector Space Model (VSM)
5.2.1 Introduction
Traditional Information Retrieval (IR) systems consider documents and
queries as bags of words, assuming that the set of words in a document
is representative of its content and meaning. They also assume that if a
document shares common words with the query, then it is likely to be
a relevant answer to that query.
Figure 28: Partial hyponym hierarchy for the noun “Employee” in
WordNet
This assumption serves as the underlying foundation of the traditional
keyword search engine, in which a user describes his or her need by
entering keywords. The major techniques used to implement IR
systems are: Boolean Matching, Extended Boolean Matching, Vector
Space Model, Probabilistic Modeling, Fuzzy Set Model, Latent
Semantic Indexing, and Neural Networks. One of the most popular
methods of implementing IR systems is the VSM.
The VSM is an algebraic model used for Information Retrieval. It is
based on the assumption that the meaning of a document can be
understood from the document’s constituent terms. It represents a
natural language document in a formal manner by the use of vectors
in a multi-dimensional space [22]. It represents the terms (words) of
documents and the terms of queries in an n-dimensional term space,
allows the similarities between queries and documents to be computed,
and allows the results of the computation to be ranked according to
the similarity measure between them.
Example - 30: Two Documents d1 and d2; and a Query q
Document d1=“ISTAR runs MCA program and MIT program”
Document d2=“GDCST runs MCA and PhD programs”
Query q=”GDCST”
The steps of the basic VSM [76] for Example - 30 are shown below:
Step-1: Each document is broken down into a word frequency
table. The tables are called vectors and can be stored as arrays.
The word frequency table for document d1 shown in Example - 30 is:
ISTAR Runs MCA Program And MIT
1 1 1 2 1 1
The word frequency table for document d2 shown in Example - 30 is:
GDCST Runs MCA And PHD programs
1 1 1 1 1 1
Step-2: A vocabulary of distinct words is built from all the words
taken together from all documents in the system. Usually, the
vocabulary is sorted. The vocabulary for Example - 30 is:
ISTAR runs MCA Program and MIT GDCST PhD Programs
Hence, the sorted vocabulary for Example - 30 is:
And GDCST ISTAR MCA MIT PhD program programs runs
Step-3: Each document and user query is represented as a vector
against this sorted vocabulary.
The word frequency table for document d1 with respect to above
sorted vocabulary is:
And GDCST ISTAR MCA MIT PhD Program Programs runs
1 0 1 1 1 0 2 0 1
Therefore, the vector for document d1, denoted by dv1, is:
dv1 = (1, 0, 1, 1, 1, 0, 2, 0, 1).
The word frequency table for document d2 with respect to above
sorted vocabulary is:
And GDCST ISTAR MCA MIT PhD Program Programs runs
1 1 0 1 0 1 0 1 1
Therefore, the vector for document d2, denoted by dv2, is:
dv2 = (1, 1, 0, 1, 0, 1, 0, 1, 1).
A user’s query can be represented as a vector in the same way as the
documents. The vector for the query shown in Example - 30 is:
qv = (0, 1, 0, 0, 0, 0, 0, 0, 0)
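The construction of these vectors can be expressed in a few lines of plain
Python. The sketch below is an illustrative implementation of Steps 1 to 3
for Example - 30 (note that Python's sorted() places capitalized words before
lowercase ones, so the vocabulary order differs slightly from the table above;
the similarity scores are unaffected).

    from collections import Counter

    d1 = "ISTAR runs MCA program and MIT program"
    d2 = "GDCST runs MCA and PhD programs"
    q  = "GDCST"

    def tokenize(text):
        return text.split()          # naive whitespace tokenization

    # Step-1: word frequency table (vector) per document.
    freq_d1, freq_d2 = Counter(tokenize(d1)), Counter(tokenize(d2))
    freq_q = Counter(tokenize(q))

    # Step-2: sorted vocabulary of all distinct words.
    vocabulary = sorted(set(tokenize(d1)) | set(tokenize(d2)))

    # Step-3: represent each document and the query against the vocabulary.
    dv1 = [freq_d1[w] for w in vocabulary]
    dv2 = [freq_d2[w] for w in vocabulary]
    qv  = [freq_q[w] for w in vocabulary]
    print(vocabulary)
    print(dv1, dv2, qv)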
Step-4: Calculate the similarity measure of the query with every document
in the collection
Using a similarity measure, a set of documents can be compared to a
query, and the most similar documents are returned. The similarity in
the VSM is determined using associative coefficients based on the
inner product of the document vector and the query vector, where word
overlap indicates similarity. There are many different ways to measure
how similar two vectors are, such as the Cosine Measure, the Dice
Coefficient, and the Jaccard Coefficient. The most popular similarity
measure is the cosine coefficient, which measures the angle between a
document vector and the query vector in a high-dimensional virtual
space [67], as depicted in Figure 29. In practice, it is easier to calculate
the cosine of the angle between the vectors, called the cosine similarity,
rather than the angle itself.
Figure 29: Cosine of the angle between the vectors d1, d2 and query q
For two vectors dv1 and dv2, the cosine similarity, denoted by
cos(dv1, dv2), is given by:
cos(dv1, dv2) = (dv1 · dv2) / (|dv1| × |dv2|)
Here, dv1 · dv2 is the inner (dot) product of dv1 and dv2, calculated by
multiplying corresponding frequencies together and summing the results.
|dv1| and |dv2| are the norms of the vectors dv1 and dv2, respectively.
The norm of a vector v, denoted by |v|, is given by:
|v| = SQRT(v1² + v2² + ... + vn²), where vi is the i-th component of v.
The similarity between document d1 and query q, denoted by sim(d1, q),
is:
sim(d1, q) = (dv1 · qv) / (|dv1| × |qv|) = 0/(3 × 1) = 0
since
dv1 · qv = (1, 0, 1, 1, 1, 0, 2, 0, 1) · (0, 1, 0, 0, 0, 0, 0, 0, 0)
= 1×0 + 0×1 + 1×0 + 1×0 + 1×0 + 0×0 + 2×0 + 0×0 + 1×0
= 0
|dv1| = SQRT(1² + 0² + 1² + 1² + 1² + 0² + 2² + 0² + 1²) = SQRT(9) = 3
|qv| = SQRT(0² + 1² + 0² + 0² + 0² + 0² + 0² + 0² + 0²) = SQRT(1) = 1
The similarity between document d2 and query q is:
sim(d2, q) = (dv2 · qv) / (|dv2| × |qv|) = 1/(2.45 × 1) = 0.41
since
dv2 · qv = (1, 1, 0, 1, 0, 1, 0, 1, 1) · (0, 1, 0, 0, 0, 0, 0, 0, 0)
= 1×0 + 1×1 + 0×0 + 1×0 + 0×0 + 1×0 + 0×0 + 1×0 + 1×0
= 1
|dv2| = SQRT(1² + 1² + 0² + 1² + 0² + 1² + 0² + 1² + 1²) = SQRT(6) = 2.45
|qv| = SQRT(0² + 1² + 0² + 0² + 0² + 0² + 0² + 0² + 0²) = SQRT(1) = 1
Step-5: Rank the documents by relevance and display them to the user
The relevance ranking of documents is calculated by comparing the
deviation of angles between each document vector and the query
vector. That is, the query is compared to all documents using the
similarity measure, and the user is shown the documents in
decreasing order of their similarity to the query term. The ranks of
documents d1 and d2 with respect to their similarity to query q are
given in the following table; it indicates that document d2 is more
relevant to query q than document d1.

Query   Document   Similarity Score   Rank
q       d2         0.41               1
q       d1         0.00               2
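A minimal sketch of Steps 4 and 5 in Python, using the document and query
vectors from the tables above, is shown below; it reproduces the scores 0 and
0.41 and ranks the documents accordingly.

    import math

    # Vectors in the vocabulary order used above:
    # And GDCST ISTAR MCA MIT PhD Program Programs runs
    dv1 = [1, 0, 1, 1, 1, 0, 2, 0, 1]
    dv2 = [1, 1, 0, 1, 0, 1, 0, 1, 1]
    qv  = [0, 1, 0, 0, 0, 0, 0, 0, 0]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))       # inner product
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # Step-4: similarity of the query with every document.
    scores = {"d1": cosine(dv1, qv), "d2": cosine(dv2, qv)}

    # Step-5: rank documents in decreasing order of similarity.
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    print(ranking)   # [('d2', 0.408...), ('d1', 0.0)] -> d2 is more relevant to q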
Generally, the similarity score of every term in the collection to every
document in the collection is pre-calculated and kept ready, so that
when a query term is posed, the top-ranked documents for that term
can be returned to the user immediately.
5.2.2 Variation in VSM
The various additional refinements used in VSM include the notion of
Term Weighting, Normalized Term Frequency, Inverse Document
Frequency, and Stemming.
Not all words are equally useful. A word is most likely to be highly
relevant to document d if it is infrequent in other documents and
frequent in document d. The cosine measure needs to be modified to
reflect this, which is done with the tf-idf measure, where tf is the term
frequency and idf is the inverse document frequency, as described
below.
5.2.2.1 Term weighting/Term Frequency (tf)
The simplest approach to assigning a weight to a term is to count the
number of occurrences of term t in document d. This weighting
scheme is called term frequency (tf), denoted by tf(t, d), and is given
by:
tf(t, d) = f(t, d), where f(t, d) is the frequency of term t in document d.
It may be scaled logarithmically, for example:
tf(t, d) = 1 + log(f(t, d)), or 0 when f(t, d) = 0.
5.2.2.2 The normalized term frequency
The tf is generally normalized with respect to the term having the
highest number of occurrences in a document, in order to prevent bias
towards longer documents, i.e., to prevent large documents from
scoring higher than small documents. It is given by:
ntf(t, d) = tf(t, d) / max(tf(w, d)), where w is any word in document d.
Alternatively, it may be given by:
ntf(t, d) = log(1 + n(d, t) / n(d)),
where n(d, t) is the number of occurrences of the term t in document d,
and n(d) is the number of terms in document d [67]. In the following
sections, tf represents either tf or ntf.
5.2.2.3 Inverse Document Frequency (IDF)
The tf alone considers all terms equally important (as having the same
discriminating power) in determining relevance, including terms that
appear in almost all documents. To discount the importance of
common terms that occur too often in the collection, the concept of
collection frequency (cf) is used; it scales down the tf of terms with a
high collection frequency. The cf is defined as the total number of
occurrences of a term in the collection of documents. To be more
precise in determining the discriminating power of terms, the
document frequency (df) is used instead of cf. The df is defined as the
number of documents in the collection that contain a term t, and is
denoted by df(t, D). Based on this concept, the inverse document
frequency is calculated, which is designed to make rare words more
important than common words: IDF yields high values for rare words
and low values for common words [67].
The idf of a term t with respect to a collection of documents D is given by:
idf(t, D) = log(N / df(t, D))
where N is the total number of documents in the collection, and df(t, D)
is the number of documents containing the term t.
The base of the log function is immaterial, as it simply constitutes a
constant multiplicative factor. For a query containing a novel term, the
denominator may be zero; hence, 1 + df(t, D) is sometimes used instead
of df(t, D) in the denominator of the above formula.
5.2.2.4 Tf-idf weighting
The tf-idf is a composite weight assigned to a term t in a document d. It
combines tf (a local parameter) and idf (a global parameter) and reflects
how important a word is to a document in a collection.
It is given by:
tf-idf(t, d, D) = tf(t, d) × idf(t, D)
The tf-idf value increases proportionally with the number of times a term
appears in the document, but is offset by the frequency of the term in
the collection.
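A small sketch of tf-idf weighting applied to the documents of Example - 30 is
given below; the raw term frequency and an unsmoothed idf with a base-10
logarithm are assumptions made purely for illustration.

    import math
    from collections import Counter

    docs = {
        "d1": "ISTAR runs MCA program and MIT program".split(),
        "d2": "GDCST runs MCA and PhD programs".split(),
    }
    N = len(docs)                                                  # number of documents
    tf = {name: Counter(tokens) for name, tokens in docs.items()}  # term frequencies
    df = Counter(t for tokens in docs.values() for t in set(tokens))  # document frequencies

    def tf_idf(term, doc_name):
        idf = math.log10(N / df[term])   # df > 0 whenever the term occurs in the collection
        return tf[doc_name][term] * idf

    print(tf_idf("program", "d1"))   # occurs only in d1 -> positive weight (about 0.6)
    print(tf_idf("MCA", "d1"))       # occurs in both documents -> idf = log10(1) = 0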
5.2.2.5 Stop words
Commonly occurring words are unlikely to give useful information and
may be removed from the vocabulary to expedite processing, since
doing so reduces the dimensionality of the term space.
5.2.2.6 Stemming
Stemming is the process of removing suffixes from words to obtain their
common root. In statistical analysis, it greatly helps when
comparing texts to be able to identify words with a common meaning
and form as being identical. For example, we would like to count the
words stopped and stopping as being the same and derived from stop.
Stemming identifies these common forms [09].
5.2.3 Pros and Cons of VSM
Pros
Simple model based on linear algebra and fairly cheap to
compute.
Yields decent effectiveness. Allows computing a continuous
degree of similarity between queries and documents, and hence
allows ranking documents according to their relevance to the
query.
Very popular.
Cons
No theoretical foundation, i.e., there is no real theoretical basis
for the assumption of a term space. It is more a device for
visualization than a model with a firm theoretical basis.
Most similarity measures work about the same regardless of
model.
Weighting is intuitive but not very formal. Weights in the vectors
are very arbitrary.
Terms are not really orthogonal dimensions.
Assumes term independence. Terms are not independent of
other terms in the document.
Results in low recall if query and documents use different
vocabulary to represent the same concept.
The order of terms is ignored due to bag-of-words model.
5.3 Stop Words
Stop words are non-context-bearing words that are excluded from
processing by many IR systems. Sometimes they also include frequently
occurring words specific to the application domain in which they need to
be ignored. For example, while processing HTML documents, all HTML
tags may be considered stop words. The stop-word list may be
pre-defined or may be generated dynamically for a given corpus of
documents. The removal of stop words may expedite processing.
However, stop-word elimination must be exercised carefully;
otherwise, it may affect the system adversely. Lists of commonly
used stop words are available at the following URLs (last accessed:
09-03-2011).
http://www.textfixer.com/resources/common-english-words.txt
http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
http://www.lextek.com/manuals/onix/stopwords1.html
http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
Table 5 shows the top 20 stop words according to their average
frequency per 1000 words, which are: the, of, and, to, a, in, that, is,
was, he, for, it, with, as, his, on, be, at, by, and I [24]. Figure 30
lists the most commonly used stop words, balancing the coverage and
the size of the stop-word set.
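The following is a minimal sketch of stop-word removal in Python; the small
stop-word set used here is only illustrative (a real system would load one of
the lists cited above).

    STOP_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "was", "he",
                  "for", "it", "with", "as", "his", "on", "be", "at", "by", "i"}

    def remove_stop_words(text):
        # Keep only the words that are not in the stop-word set.
        return [w for w in text.lower().split() if w not in STOP_WORDS]

    print(remove_stop_words("The study of related technology for ontology mapping"))
    # ['study', 'related', 'technology', 'ontology', 'mapping']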
5.4 Porter’s Stemming Algorithm
The morphological variants of words have similar semantic
interpretations and can be considered equivalent in some contexts.
For this reason, many applications that need to compare terms
(words) use a stemming process to reduce a word to its stem
(root/base) form.
Table 5: The top 20 stop words
Rank  Word   Per 1000     Rank  Word   Per 1000
1     THE    70           11    FOR    9
2     OF     36           12    IT     9
3     AND    29           13    WITH   7
4     TO     26           14    AS     7
5     A      23           15    HIS    7
6     IN     21           16    ON     7
7     THAT   11           17    BE     6
8     IS     10           18    AT     5
9     WAS    10           19    BY     5
10    HE     10           20    I      5
Using stemming algorithms, called stemmers, terms having
morphological variants are conflated to a single representative form.
This reduces the term space to be processed; thus, it saves not only
storage space but also processing time.
However, it should be noted that stemming does not give a 100% success
rate [55]. For example, connect, connected, connecting, connection,
and connections have similar meanings and are reduced to a
single stem by a stemmer, while the same is not true for wand and wander.
Despite this fact, stemming algorithms are used extensively by
thesauri, NLP and linguistic applications, search engines, and many
other IR applications. Examples of different stemming algorithms
are [69]:
Paice/Husk Stemming Algorithm
Porter Stemming Algorithm
Lovins Stemming Algorithm
Dawson Stemming Algorithm
Krovetz Stemming Algorithm
a, able, about, above, across, after, afterwards, again, against, all, almost, alone,along, already, also, although, always, am, among, amongst, amount, an, and,another, any, anybody, anyhow, anyone, anything, anyway, anywhere, are, area,areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing,backs, be, became, because, become, becomes, becoming, been, before, beforehand,began, behind, being, beings, below, beside, besides, best, better, between, beyond,big, bill, both, bottom, but, by, c, call, came, can, cannot, cant, case, cases, certain,certainly, clear, clearly, co, come, computer, con, could, couldn’t, cry, d, de, dear,describe, detail, did, differ, different, differently, do, does, done, down, downed,downing, downs, due, during, e, each, early, e.g., eight, either, eleven, else,elsewhere, empty, end, ended, ending, ends, enough, etc, even, evenly, ever, every,everybody, everyone, everything, everywhere, except, f, face, faces, fact, facts, far,felt, few, fifteen, fifty, fill, find, finds, fire, first, five, for, former, formerly, forty,found, four, from, front, full, fully, further, furthered, furthering, furthers, g, gave,general, generally, get, gets, give, given, gives, go, going, good, goods, got, great,greater, greatest, group, grouped, grouping, groups, h, had, has, hasn’t, have,having, he, hence, her, here, hereafter, hereby, herein, hereupon, hers, herself, high,higher, highest, him, himself, his, how, however, hundred, i, i.e., if, important, in,inc, indeed, interest, interested, interesting, interests, into, is, it, its, itself, j, just, k,keep, keeps, kind, knew, know, known, knows, l, large, largely, last, later, latest,latter, latterly, least, less, let, lets, like, likely, long, longer, longest, ltd, m, made,make, making, man, many, may, me, meanwhile, member, members, men, might,mill, mine, more, moreover, most, mostly, move, Mr., Mrs., much, must, my, myself,n, name, namely, necessary, need, needed, needing, needs, neither, never,nevertheless, new, newer, newest, next, nine, no, nobody, non, none, nor, not,nothing, now, nowhere, number, numbers, o, of, off, often, old, older, oldest, on,once, one, only, onto, open, opened, opening, opens, or, order, ordered, ordering,orders, other, others, otherwise, our, ours, ourselves, out, over, own, p, part, parted,parting, parts, per, perhaps, place, places, please, point, pointed, pointing, points,possible, present, presented, presenting, presents, problem, problems, put, puts, q,quite, r, rather, re, really, right, room, rooms, s, said, same, saw, say, says, second,seconds, see, seem, seemed, seeming, seems, sees, serious, several, shall, she,should, show, showed, showing, shows, side, sides, since, sincere, six, sixty, small,smaller, smallest, so, some, somebody, somehow, someone, something, sometime,sometimes, somewhere, state, states, still, such, sure, system, t, take, taken, ten,than, that, the, their, them, themselves, then, thence, there, thereafter, thereby,therefore, therein, thereupon, these, they, thick, thin, thing, things, think, thinks,third, this, those, though, thought, thoughts, three, through, throughout, thru,thus, to, today, together, too, took, top, toward, towards, turn, turned, turning,turns, twelve, twenty, two, u, un, under, until, up, upon, us, use, used, uses, v,very, via, w, want, wanted, wanting, wants, was, way, ways, we, well, wells, went,were, what, whatever, when, whence, whenever, where, whereas, whereby, wherein,whereupon, wherever, whether, which, while, whither, 
who, whoever, whole, whom,whose, why, will, with, within, without, work, worked, working, works, would, x, y,year, years, yet, you, young, younger, youngest, your, yours, yourself, yourselves, z,
Figure 30: List of Most Common Stop words
One of the most popular stemming algorithms is the Porter Stemming
Algorithm. It is used, as part of a term normalization process, to
remove common morphological and inflectional endings from
words in English. In English, most of the morphological variation
takes place at the right-hand end of a word form, and hence a stemmer
for English is sometimes also known as a suffix stripping algorithm. The
Porter Stemming Algorithm uses an explicit list of suffixes and, for each
suffix, a criterion under which it can be removed from a word
to leave a valid stem. It treats complex suffixes as compounds made
up of simple suffixes, and removes simple suffixes in a number of
steps. The complete description of the algorithm is given in [55]. The
original stemmer was written in the BCPL programming language, but it
has been adapted to various modern programming languages.
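The following is a minimal sketch using the Porter stemmer implementation
available in NLTK (the use of NLTK is an assumption; any other implementation
of the algorithm would serve equally well).

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    words = ["connect", "connected", "connecting", "connection", "connections"]
    print([stemmer.stem(w) for w in words])   # all five variants reduce to the stem 'connect'
    print(stemmer.stem("wand"), stemmer.stem("wander"))   # unrelated words keep distinct stems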
5.5 Summary
WordNet organizes information around logical groupings called
synsets. Each synset consists of a list of synonymous word forms and
semantic pointers that describe relationships between the current
synset and other synsets. It is widely used to derive semantic
similarity between words.
The VSM is an algebraic model used for Information Retrieval. It is
based on the assumption that the meaning of a document can be
understood from the document’s constituent terms. It allows ranking
documents for their relevance to query using a similarity measure
such as cosine similarity.
Stop words, also known as non-context-bearing words, are commonly
occurring words that are unlikely to give useful information. Hence,
they are excluded from processing by many IR systems to reduce
storage space and processing time.
Words with the same meaning may appear in various morphological
forms. To capture their similarity, they are normalized using a
stemming algorithm into a common root/base form, called the stem.