shodhganga.inflibnet.ac.inshodhganga.inflibnet.ac.in/bitstream/10603/21171/5/5.doc · web...
Post on 19-Jan-2020
4. PREPROCESSING
4.1 INTRODUCTION
Chapter 2 reviewed information retrieval and cross-language information
retrieval. Based on that literature review, a framework was proposed in
Chapter 3. This chapter explains the pre-processing of the user's query.
The preprocessing stage accepts the query in Telugu and processes it using the
grammar rules and ontology to arrive at an intermediate English construct. This will then
be given to the search engine in the post processing stage. The grammar rule structure
and the ontological model are also explained. Finally, case studies of how the input
Telugu query is converted to the output English intermediate constructs are shown.
4.2 METHODOLOGY OF PROPOSED PRE-PROCESSING
The major objective of this research work (pre-processing stage) is to convert the
user query in Telugu into the relevant English constructs. There are three distinct
components that contribute to the success of the pre-processing. Figure 4.1 shows the
overall process of query pre-processing.
Figure 4.1 Overall process of query pre-processing
4.2.1 Tokenizer
The user gives the query to the system. The tokenizer divides the text into a
sequence of tokens. Each contiguous string of alphabetic characters forms one token.
Figure 4.2 shows the tokenizer component in pre-processing.
Figure 4.2 Tokenization component
Tokens are separated by whitespace characters, such as a space or line break,
or by punctuation characters. Figure 4.3 shows how a sample user-given Telugu
query is tokenized.
Figure 4.3 Tokenizer process
(Figure 4.3 flow: Input Telugu query, Tokenization, Output)
The steps in the tokenization of a user query are given below.
Segmenting Text into Words: Boundary identification is a fairly simple task,
since most word boundaries in Telugu text are marked explicitly. A simple
program can replace white spaces with word boundaries and cut off leading and
trailing quotation marks, parentheses and punctuation. Figure 4.4 shows a
sample text segmentation.
Figure 4.4 Simple Telugu sentence tokenization
Handling Abbreviations: In Telugu, a period is attached directly to the
previous word. However, when a period follows an abbreviation, it is an integral
part of the abbreviation and should be tokenized together with it. Figure 4.5
shows a sample sentence with abbreviations.
Figure 4.5 Tokenizer example
Numerical and special expressions are difficult to handle in Telugu. They can
confuse a tokenizer because they usually involve rather complex alphanumeric
and punctuation syntax. For this reason, the blank spaces between the words
are used as the token boundaries. Figure 4.6 shows a sample tokenization of a
query with special expressions.
Example:
Telugu query: సచిన్ ఆడుతున్న మ్యాచ్ (Sachin playing match)
Tokenized terms: సచిన్ (Sachin), ఆడుతున్న (playing), మ్యాచ్ (match)
Example:
Telugu query: దీని ధర ఎంత (How Much)
Tokenized terms: దీని, ధర, ఎంత
Figure 4.6 Tokenizer example for special expressions
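The tokenization steps described in this sub section can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the function name `tokenize`, the punctuation set and the abbreviation list are assumptions made for the example.

```python
# Minimal tokenizer sketch. PUNCT and ABBREVIATIONS are illustrative
# assumptions; the chapter does not enumerate either set.
PUNCT = "\"'()[]{}.,;:!?"
ABBREVIATIONS = {"డా."}  # hypothetical entry ("Dr."), kept with its period


def tokenize(query, abbreviations=ABBREVIATIONS):
    """Split on whitespace, then trim leading/trailing punctuation."""
    tokens = []
    for chunk in query.split():           # whitespace marks word boundaries
        word = chunk.lstrip(PUNCT)        # cut leading quotes, parentheses
        if word in abbreviations:         # the period belongs to the abbreviation
            tokens.append(word)
            continue
        word = word.rstrip(PUNCT)         # cut trailing punctuation
        if word:
            tokens.append(word)
    return tokens


print(tokenize("సచిన్ ఆడుతున్న మ్యాచ్"))
print(tokenize("Dr. Rao (home)", abbreviations={"Dr."}))
```

Checking the abbreviation list before trimming trailing punctuation is what keeps the period attached to an abbreviation while still removing it from ordinary words.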
4.2.2 Language Grammar Rules
The tokens are sent to the language grammar rule component to process. The
detailed flow of the grammar structure is explained in Appendix 1. In this sub section,
the essence is explained briefly.
The essence of Telugu grammar is as follows.
It follows the Subject, Object and Verb (SOV) pattern.
There are three persons, namely first person, second person and third person;
a two-way distinction in number, namely singular (sg.) and plural (pl.); and a
three-way distinction in gender, namely masculine, feminine and neuter.
Feminine singular belongs to the neuter and feminine plural belongs to the
human.
Apart from the three tenses, namely past, present and future, Telugu has one
more special tense, the future habitual.
Example:
Telugu query: రక్షక భటులని పిలవండి! (Call the security person)
Tokenized terms: రక్షక (the security), భటులని (person), పిలవండి! (Call)
Figure 4.7 shows the language grammar rules component.
Figure 4.7 Language Grammar rules component
The grammar rules are used to preprocess the text. The idea is to identify the
appropriate word sense in the text, which helps to avoid out-of-vocabulary
issues. If the user query is complex, the reordered sentence is sent to the
morphological analyzer to identify the tense of the verb and the inflections
added to it. Telugu verbs inflect for tense, person, gender and number; nouns
inflect for plural, oblique, case and postpositions. Figure 4.8 shows how a
sample user-given Telugu query is processed.
Figure 4.8 Grammar rules component process
The structure of verbal complexity is unique and capturing this complexity in a
machine analyzable and generatable format is a challenging task. Inflections of the
Telugu verbs include finite, infinite, adjectival, adverbial and conditional markers. The
verbs are classified into certain number of paradigms based on the inflections.
For computational purposes, 37 verb paradigms are identified for Telugu, each
with 160 inflections, and sixty-seven paradigms are identified for Telugu
nouns, each with 117 sets of inflected forms. Root words are classified into
groups based on the nature of their inflections. An example of Telugu sentence
order is shown in Table 4.1.
Table 4.1 Sample Telugu sentence order
Sentence: దినేష్ పనికి వెళ్లతాడు.
Words: దినేష్ / పనికి / వెళ్లతాడు
Transliteration: Dinesh / paniki / veḷtāḍu
Gloss: Dinesh / to work / goes
Parts: Subject / Object / Verb
Converted: Dinesh goes to work.
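The reordering in Table 4.1 (Telugu SOV to English SVO) can be sketched as a stable sort over role-tagged phrases. The role tags "S", "O" and "V" and the function name `sov_to_svo` are assumptions made for this example; in practice the roles would come from the grammar rule component.

```python
# Sketch of SOV -> SVO reordering over (role, phrase) pairs.
SVO_ORDER = {"S": 0, "V": 1, "O": 2}


def sov_to_svo(parts):
    """Reorder role-tagged phrases into Subject, Verb, Object order.

    sorted() is stable, so phrases sharing a role keep their original order.
    """
    return [phrase for role, phrase in sorted(parts, key=lambda p: SVO_ORDER[p[0]])]


gloss = [("S", "Dinesh"), ("O", "to work"), ("V", "goes")]
print(" ".join(sov_to_svo(gloss)))  # Dinesh goes to work
```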
Telugu pronouns include personal pronouns (the persons speaking, the persons
spoken to, or the persons or things spoken about), demonstrative pronouns,
reflexive pronouns (in which the object of a verb is acted on by the verb's
subject), interrogative pronouns, indefinite pronouns, demonstrative adjective
and interrogative adjective pronouns, possessive adjective pronouns, pronouns
referring to numbers, and distributive pronouns.
Telugu uses postpositions to mark words for different cases. With the use of
postpositions, there are eight possible cases (vibhakti); see Table 4.2.
A Telugu noun carries markings for gender, number, person and case. Three noun
distinctions are identified: human male/female, singular/plural, and
non-human. A noun denoting a human male ends with the inflection “-du”, and
one denoting a human female ends with “-di”.
Number marking on nouns distinguishes singular and plural. For a large number
of nouns the plural inflection is “-lu”, while for some nouns of the human
male category the plural suffix alternant is “-ru”. Gender, number and person
marking on nouns is explicit only in the 1st and 2nd person, in both singular
and plural. Telugu uses a wide variety of case markers and postposition
suffixes, which express grammatical case relations such as nominative,
accusative, dative, instrumental, genitive, comitative, vocative and causal.
Table 4.2 Post positions for Telugu sentence order
Telugu | English | Significance | Usual suffixes | Transliteration of suffixes
Panchami Vibhakti (పంచమీ విభక్తి) | Ablative of motion from | Motion from an animate/inanimate object | వలనన్, కంటెన్, పట్టి | valanan, kaMTen, paTTi
Dviteeya Vibhakti (ద్వితీయా విభక్తి) | Accusative | Object of action | నిన్, నున్, లన్, కూర్చి, గురించి | nin, nun, lan, kUrci, guriMci
Chaturthi Vibhakti (చతుర్థీ విభక్తి) | Dative | Object to whom action is performed; object for whom action is performed | కొఱకున్, కై | korakun, kai
Shashthi Vibhakti (షష్ఠీ విభక్తి) | Genitive | Possessive | కిన్, కున్, యొక్క, లోన్, లోపలన్ | kin, kun, yokka, lOn, lOpalan
Truteeya Vibhakti (తృతీయా విభక్తి) | Instrumental, Social | Means by which action is done (Instrumental); association, or means by which action is done (Social) | చేతన్, చేన్, తోడన్, తోన్ | cEtan, cEn, tODan, tOn
Saptami Vibhakti (సప్తమీ విభక్తి) | Locative | Place in which; on the person of (animate); in the presence of | అందున్, నన్ | aMdun, nan
Prathama Vibhakti (ప్రథమా విభక్తి) | Nominative | Subject of sentence | డు, ము, వు, లు | Du, mu, vu, lu
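One way to use Table 4.2 computationally is a longest-suffix lookup over the transliterated suffixes. The sketch below is an assumption about how such a lookup could be written, not the procedure used in this work; the function name `identify_case` and the sample words are hypothetical.

```python
# Case lookup sketch built from the transliterated suffixes of Table 4.2.
CASE_SUFFIXES = {
    "ablative": ["valanan", "kaMTen", "paTTi"],
    "accusative": ["nin", "nun", "lan", "kUrci", "guriMci"],
    "dative": ["korakun", "kai"],
    "genitive": ["kin", "kun", "yokka", "lOn", "lOpalan"],
    "instrumental/social": ["cEtan", "cEn", "tODan", "tOn"],
    "locative": ["aMdun", "nan"],
    "nominative": ["Du", "mu", "vu", "lu"],
}

# Longest suffixes first, so "lOpalan" is tried before "lan".
_BY_LENGTH = sorted(
    ((suffix, case) for case, suffixes in CASE_SUFFIXES.items() for suffix in suffixes),
    key=lambda pair: -len(pair[0]),
)


def identify_case(word):
    """Return (case, suffix) for the longest matching suffix, else None."""
    for suffix, case in _BY_LENGTH:
        if len(word) > len(suffix) and word.endswith(suffix):
            return case, suffix
    return None


print(identify_case("rAmuDu"))       # hypothetical transliterated noun
print(identify_case("iMTilOpalan"))  # hypothetical transliterated noun
```

Trying the longest suffix first resolves overlaps such as "lOpalan" (genitive) ending in "lan" (accusative). Real text would still need disambiguation beyond what this lookup offers.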
A verb in a Telugu sentence is either finite or non-finite, and occurs
according to situations such as rising pitch (meaning a question), level pitch,
and falling pitch (meaning a command). In Telugu all verbs have finite and
non-finite forms.
A finite form is one that can stand as the main verb of a sentence and occur
before a final pause (full stop). A non-finite form cannot stand as a main verb
and rarely occurs before a final pause. There are eight finite rules for Telugu
verbs, arranged in three verbal structures: stem or inflection root, tense-mode
suffix and personal suffix. These rules are shown in Table 4.3 for the verb
“ఆట్లాడు” (playing) with the root word “ఆట్ల” (play).
Table 4.3 Finite verb rules
Type | Structure rule | Example
Inflection or stem root | (Rule 1) Imperative: singular –du; plural –andi | atla –du; atla –andi
Tense-mode suffix | (Rule 2) Admonitive or abusive: kAlu (to burn), kUlu (to fall), pagulu (to break) | In this case, due to semantic restrictions, many verbs cannot occur in this mode
Tense-mode suffix | (Rule 3) Obligative (in all persons): -Ali | atlad –Ali (I, We, You) (singular, plural)
Personal suffix(es) | (Rule 4) Habitual-future or non-past: -ta- | atla – ta – Am (we shall play); atla – ta – Adu (he shall play); atla – tun – di (she will play); atla – ta – Anu (I shall play); atla – ta – Ava (you will play); atla – ta – Ay (they play); atla – ta – Aru (they will play)
Personal suffix(es) | (Rule 5) Past tense: -din- | atla – din – Anu (I played); atla – din – Ava (you played (singular)); atla – din – Aru (you played (plural)); atla – din – Am (we played); atla – din – Adu (he played); atla – din – di (she/it played); atla – din – Aru (they played)
Personal suffix(es) | (Rule 6) Hortative: -da- | atla – da – tAm (let us play, or we shall play)
Personal suffix(es) | (Rule 7) Negative tense: -data- | atla – data – va (you (do, did, and shall) not play); atla – data – Du (he (does, did, and shall) not play); atla – data – nu (I (do, did, and shall) not play); atla – data – m (we (do, did, and shall) not play); atla – data – ru (they (do, did, and shall) not play); atla – data – du (she/it (does, did, and shall) not play)
Personal suffix(es) | (Rule 8) Negative imperative or prohibitive: -Ak- | atla – Ak – andi (you (plural) don’t play); atla – Ak – u (you (singular) don’t play)
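The three-part structure of Table 4.3 (root, tense-mode suffix, personal suffix) lends itself to simple table-driven generation. The sketch below covers only Rule 5 (past) and Rule 7 (negative), with the suffixes copied from the table; the person keys such as "1sg" and the function name `finite_form` are labels assumed for the example.

```python
# Table-driven generation of finite forms: root + tense-mode + personal suffix.
TENSE_MODE = {"past": "din", "negative": "data"}

PERSONAL = {
    # Rule 5 (past tense)
    "past": {"1sg": "Anu", "2sg": "Ava", "2pl": "Aru", "1pl": "Am",
             "3sg_m": "Adu", "3sg_fn": "di", "3pl": "Aru"},
    # Rule 7 (negative tense); the table lists a single second-person form
    "negative": {"1sg": "nu", "2": "va", "1pl": "m",
                 "3sg_m": "Du", "3sg_fn": "du", "3pl": "ru"},
}


def finite_form(root, tense, person):
    """Join root, tense-mode suffix and personal suffix with hyphens."""
    return "-".join([root, TENSE_MODE[tense], PERSONAL[tense][person]])


print(finite_form("atla", "past", "1sg"))      # atla-din-Anu (I played)
print(finite_form("atla", "negative", "1pl"))  # atla-data-m (we ... not play)
```

Keeping a separate personal-suffix table per tense reflects the fact that the past and negative rows of Table 4.3 do not share the same endings.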
Similarly, there are ten non-finite verb rules, which are arranged into two
structural types, bound and unbound. These rules are shown in Table 4.4.
Table 4.4 Non-finite verb rules
Type | Structure rule | Example
Bound type | (Rule 9) Present: -ta-un- | atladu- ta- unnAnu (I am playing); atladu - ta- un- nA (even playing (now)); atladu - ta- un- tE (if playing); atladu - ta- un- na (that playing)
Unbound type | (Rule 10) Concessive: -dinA | atla- dinA (even though played)
Unbound type | (Rule 11) Conditional: -itE | atla- itE (if played)
Unbound type | (Rule 12) Present participle: -dutu | atla- dutU (playing)
Unbound type | (Rule 13) Past participle: -di | atla- di (having played)
Unbound type | (Rule 14) Infinitive: -ta | atla –ta (to play)
Unbound type | (Rule 15) Past adjective: -dina | atla- dina (that played)
Unbound type | (Rule 16) Negative adjective: -dani | atla- dani (not played)
Unbound type | (Rule 17) Negative participle: -aku | atla- aku (not playing)
Unbound type | (Rule 18) Habitual adjective: -dE | atla- dE (that plays)
The subject, object, verb and inflection are identified using the above grammar
rules.
4.2.3 Bilingual Ontology
The terms are looked up in the ontology to find their English equivalents. The
bilingual ontology for information retrieval is constructed based on the
English Telugu vocabulary relationships. In this research work the ontology is
a key element for the pre-processing of the query and the post-processing of
the results. The block diagram of the bilingual ontology component is shown in
Figure 4.9.
Figure 4.9 Ontology Component
An ontology may take a variety of forms, but it necessarily includes a
vocabulary of terms and some specification of their meaning. It includes
definitions and an indication of how the concepts are inter-related, which
collectively impose a structure on the domain and constrain the possible
interpretations of the terms. Figure 4.10 illustrates the workflow of the
bilingual ontology component in the preprocessing stage for CLIR, and
Figure 4.11 shows the relationships connecting the ontology terms.
Figure 4.10 Process flow of bilingual ontology component
Firstly, the English terms are mapped to Telugu terms, which come from a
Telugu English bilingual dictionary. Consequently, the English Telugu ontology
may contain terms that do not appear in the original Telugu English bilingual
dictionary, or vice versa. The number of terms in both versions is compared.
The terms that do not appear in both languages are considered Out Of
Vocabulary (OOV) terms. The result of the alignment is a term list, which is
treated as the basis for extending the ontology. Each Telugu term in the list
is considered a seed term, which is used to search for Telugu synonyms online.
Secondly, the search engine is used to retrieve results in Telugu for each Telugu
term, which are assumed to contain candidate Telugu synonyms. Thirdly, Telugu
translations of terms are extracted from the retrieved results using sequential
application of the following: a) linguistic rules, which provide the text segments
potentially containing translations; b) mutual information filtering, which refines the
candidate translations. Fourthly, the frequencies of each English term and Telugu
translation in the results retrieved by search engine are calculated; and term weights
are computed using these frequencies.
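The chapter does not give the formulas for the mutual information filter or the term weights, so the following is only a sketch under stated assumptions: pointwise mutual information (PMI) computed from counts of retrieved results, a threshold of zero, and weights taken as normalized co-occurrence frequencies. The function names and the sample counts are hypothetical.

```python
import math


def pmi(n_both, n_en, n_te, n_total):
    """Pointwise mutual information from result counts (assumed formula).

    n_both: results containing both terms; n_en / n_te: results containing
    the English / Telugu term; n_total: all retrieved results.
    """
    if min(n_both, n_en, n_te) == 0:
        return float("-inf")
    return math.log2(n_both * n_total / (n_en * n_te))


def filter_and_weight(candidates, n_en, n_total, threshold=0.0):
    """Keep candidates whose PMI exceeds the threshold, then normalize.

    candidates maps a candidate translation to (n_te, n_both).
    """
    kept = {
        term: n_both
        for term, (n_te, n_both) in candidates.items()
        if pmi(n_both, n_en, n_te, n_total) > threshold
    }
    total = sum(kept.values())
    return {term: count / total for term, count in kept.items()} if total else {}


# Hypothetical counts for two candidate translations of one English term.
weights = filter_and_weight({"cand_a": (10, 10), "cand_b": (80, 15)}, n_en=20, n_total=100)
print(weights)
```

Here "cand_b" co-occurs with the English term less often than chance would predict, so its PMI is negative and it is filtered out before weighting.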
Figure 4.11 Ontology Relationship Hierarchies
Finally, the aligned term pairs, the English translations, term weights, and
the ontology entry terms are merged according to the ontology hierarchy,
forming the Telugu English bilingual ontology. The order in which the
suggestions are displayed is shown in Figure 4.11: the meaning, relationship
terms, and related terms are expanded in that order and shown to the users.
(Figure 4.11 labels: Query; Meaning, Relationship, Related, Relevant; nodes numbered 1 to 11)
In this research work, the ontology terms are organized into four types of
records: meaning, related, relevant and supplementary concept records. All of
them are used in the day-to-day life of the users. A sample structure of the
ontology is shown in Figure 4.12.
Figure 4.12 Sample ontology structure
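The four record types could be held in a small data structure such as the one sketched below. The class name `OntologyEntry`, the display order used in `suggestions` and the grouping of the sample terms are assumptions for illustration; the chapter does not specify how the records are stored.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyEntry:
    """One ontology term with its four record types."""
    term: str
    meaning: list = field(default_factory=list)
    related: list = field(default_factory=list)
    relevant: list = field(default_factory=list)
    supplementary: list = field(default_factory=list)

    def suggestions(self):
        # Assumed display order: meaning first, then the other record types.
        return self.meaning + self.related + self.relevant + self.supplementary


# Hypothetical grouping of sample sports terms.
cricket = OntologyEntry(
    term="Cricket",
    meaning=["క్రికెట్"],
    related=["Players", "Umpire"],
    relevant=["Tournaments"],
)
print(cricket.suggestions())
```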
4.2.4 OOV Component
Terms that are not available in the ontology are considered out-of-vocabulary
terms. These terms are handled by the Out of Vocabulary (OOV) component. The
block diagram of the OOV component is shown in Figure 4.13.
(Figure 4.12 node labels, sample ontology terms: Sports: ఆటలు; Clubs: క్లబ్స్; Competitions: పోటీలు; Personal: వ్యక్తిగత; Tournaments: తౌర్నమెంత్స్; Family: గ్రూప్; Fav team: జట్టు; Group: గుంపు; Location: ప్రాంతం; Players: ఆటగాలు; Cricket: క్రికెట్; Football: ఫుట్బాల్; Tennis: టెన్నిస్; Audience: ప్రేక్షకులు; Umpire: అంపైర్; Regions: ప్రదేశం ప్రాంతాలు; Country: దేశం. Edge labels: has, is a, sub class of, not)
Figure 4.13 Out of Vocabulary Component
The out-of-vocabulary processing system transliterates the term into the
target language, which avoids the out-of-vocabulary problem. The terms are
then rearranged and the query is converted into the target language. Once the
pre-processing is done, the converted query is sent to the web, along with the
user-given query, to retrieve results related to the queries.
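A character-level transliteration of the kind the OOV component performs can be sketched as follows. The mapping tables are deliberately partial, and the romanization choices ("ch" for చ, "n" for the anusvara, an inherent "a" after a bare consonant) are assumptions made so that the example yields the form "manchi" used in this chapter; this is not the scheme of the actual system.

```python
# Partial Telugu -> Latin transliteration sketch (illustrative tables only).
CONSONANTS = {
    "\u0c2e": "m",   # మ
    "\u0c1a": "ch",  # చ
    "\u0c28": "n",   # న
    "\u0c32": "l",   # ల
}
VOWEL_SIGNS = {
    "\u0c3f": "i",   # ి
    "\u0c3e": "aa",  # ా
    "\u0c4b": "o",   # ో
}
ANUSVARA = "\u0c02"  # ం
VIRAMA = "\u0c4d"    # ్ (suppresses the inherent vowel)


def transliterate(word):
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:       # explicit vowel sign replaces "a"
                out.append(VOWEL_SIGNS[nxt])
                i += 1
            elif nxt == VIRAMA:          # conjunct: no vowel after consonant
                i += 1
            else:
                out.append("a")          # inherent vowel
        elif ch == ANUSVARA:
            out.append("n")
        else:
            out.append(ch)               # unmapped characters pass through
        i += 1
    return "".join(out)


print(transliterate("\u0c2e\u0c02\u0c1a\u0c3f"))  # మంచి -> manchi
```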
Case 1 shows an example of how the user-given query is processed and converted
into English by the pre-processing system. The step-by-step process is as
follows:
Step 1: The user enters the query “చెన్నైలో మంచి భోజనశాల” (good hotel in Chennai).
Step 2: The tokenizer splits the query into tokens:
చెన్నైలో (token 1) మంచి (token 2) భోజనశాల (token 3)
Step 3: The grammar rules are applied to the tokens. The tokens are first
checked for inflections. If an inflection is found, the equivalent grammar
rule is used to identify the subject, object and verb.
In the above tokens, “లో” is the inflection term; it is attached to the
subject “చెన్నై”, the verb here is “మంచి” and the object is “భోజనశాల”.
Step 4: Once the terms (subject, object, verb and inflection) are identified,
they are looked up in the ontology for equivalent terms.
Here, the terms చెన్నై (Chennai) and భోజనశాల (hotel) are found in the ontology
and the inflection లో (in) is taken from the inflection table. The term మంచి
(good) is not available in the ontology.
Step 5: The terms that are not available in the ontology are sent to the OOV
component to be transliterated literally.
Step 6: Once the terms are converted, the query is constructed in English
using the subject, object, verb and inflection. Here the above query is
constructed as “manchi hotel in Chennai”.
Step 7: The query is sent to the next stage for results.
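The steps of Case 1 can be traced with a short end-to-end sketch. Everything here is a stand-in: the ontology, the inflection table and the OOV output are tiny hypothetical dictionaries, and placing the inflected phrase last is a simplification of the reconstruction step, not the grammar rule set of this work.

```python
# End-to-end sketch of Case 1 with hypothetical mini-resources.
CHENNAI, LO, GOOD, HOTEL = "చెన్నై", "లో", "మంచి", "భోజనశాల"

ONTOLOGY = {CHENNAI: "Chennai", HOTEL: "hotel"}  # assumed ontology entries
INFLECTIONS = {LO: "in"}                         # assumed inflection table
OOV_TRANSLIT = {GOOD: "manchi"}                  # stand-in for the OOV component

# Built from the parts above so stems match the dictionary keys exactly.
QUERY = f"{CHENNAI}{LO} {GOOD} {HOTEL}"


def split_inflection(token):
    """Strip a known inflection suffix; return (stem, English preposition)."""
    for suffix, english in INFLECTIONS.items():
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)], english
    return token, None


def translate(term):
    """Ontology lookup, falling back to transliteration for OOV terms."""
    if term in ONTOLOGY:
        return ONTOLOGY[term]
    return OOV_TRANSLIT.get(term, term)


def preprocess(query):
    plain, inflected = [], []
    for token in query.split():                  # Step 2: tokenize
        stem, preposition = split_inflection(token)  # Step 3: find inflection
        english = translate(stem)                # Steps 4-5: lookup / OOV
        if preposition:
            inflected.append(f"{preposition} {english}")
        else:
            plain.append(english)
    return " ".join(plain + inflected)           # Step 6: reconstruct


print(preprocess(QUERY))  # manchi hotel in Chennai
```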
The flow chart for the pre-processing system is shown in Figure 4.14.
Figure 4.14 Flow chart for the pre-processing stage
(Flow chart boxes: Start; User enters the query in Telugu; Tokenize the user query into tokens; Language grammar rules to identify subject, object and verb; Rule identification based on the inflection and verb; Inflection table lookup; Lookup in the ontology for equivalent terms (Yes: query reconstruction into target language; No: transliteration); Stop)
4.3 CONCLUSION
The user-given query is processed in the pre-processing stage and converted
into the target language using the language grammar rules and the ontology.
The grammar rules play a major role in identifying the terms (subject, object
and verb) and in selecting the rule used to convert the query. With the help
of the ontology the terms are easily looked up, and the terms that are not
available in the ontology are transliterated by the OOV component. The Telugu
query is thus converted into its English equivalent; the query is then
processed by the search engine and the relevant results are retrieved.