a hybrid approach to urdu verb phrase chunking
TRANSCRIPT
-
8/3/2019 A Hybrid Approach to Urdu Verb Phrase Chunking
1/5
A hybrid approach to Urdu Verb Phrase Chunking
Wajid AliDepartment of Computer SciencesNational University of Computer and Emergence Sciences
Lahore, [email protected]
Sarmad HussainDepartment of Computer Sciences
National University of Computer and Emergence SciencesLahore, Pakistan
AbstractA variety of verb phrases exist in Urdu including
simple verb phrases, conjunct verb phrases and compound
verb phrases. This paper explains the structure of Urdu
verb phrases, and details a series of experiment to
automatically tag them. A 100,000 words Urdu corpus is
manually tagged with VP chunk tags. The corpus is then
used to develop a hybrid approach using HMM basedstatistical chunking and correction rules. Initially, a
statistical baseline model is developed. The technique is
enhanced by changing chunking direction and merging
chunk and POS tags. The automatically chunked data is
compared with manually tagged held-out data to identifyand analyze the errors. Based on the analysis, correction
rules are extracted to address the errors. By applying these
rules after statistical tagging, further improvement is
achieved in chunking accuracy. The results of all
experiments are reported with maximum overall accuracy
of 98.44% achieved using hybrid approach with extended
tagset.
Keywords-Urdu Verb Phrase; Phrase Chunking; Statistical
chunking; Hybrid chunking;
I. INTRODUCTIONUrdu, spoken by more than 100 million speakers in, is
national language of Pakistan, state language of India and isspoken across middle east and many other countries. It is anIndo-Aryan language, and is a free phrase-order language, i.e.the phrases within a sentence have free order and
1
1Though the meaning does not change, the phrase-order may change the
emphasis of the phrase within a sentence.
, however,the words within a phrase have a fixed order. As the order ofthe phrases is variable, the case markers (CM), which areexplicitly written in Urdu as separate words, help determine therole of each phrase in the sentence. Verb Phrase (VP) is thehead of the sentence and licenses the number as well as role ofthe other phrases in a sentence, e.g. subject, object, etc. Thenumbers of arguments licensed depend on the valency of theverb, also categorized as intransitive, transitive and di-transitive. This information is normally encoded in the sub-categorization frame of a verb, which lists the number and typeof arguments the verb licenses. Determining these phraseswithin a sentence is very useful for a variety of applications in
Natural Language Processing (NLP), and the process whichdirectly labels these phrases is called chunking. Chunkinghelps to identify phrases in a sentence which are used in thedevelopment of natural language processing (NLP)applications like parsing, machine translation, questionanswering application and information extraction. The current
work focuses on chunking VP in Urdu.Relevant Urdu verb phrase analysis is summarized in Section2. Section 3 presents some relevant chunking related work.Section 4 contains the detail of the tagged corpus developed forthis task. Methodology is discussed in Section 5. The resultsand discussion are presented in Section 6. Section 7 concludesthe word.
II. VERB PHRASES IN URDUMinimally, an Urdu VP is represented by a single verb.
However, a typical Urdu verb phrase contains a verb followed by one or more auxiliaries (AUX) and verb tense markers(VBT). Each is represented by a separate word. Some of the
tense and aspect information is also encoded within the verbmorphology. An Urdu verb phrase can be categorized into asimple verb phrase or complex verb phrase. In a complex verbphrase, the verb is formed by a combination of nominal + verb(called conjunct verb) or a verb + verb (called compound verb).These complex verbs are also referred to as complexpredicates.
Figure 1: Categorization of Urdu verb phrase
The following subsections explain these types of verbphrases.
A. Simple Verb PhraseA simple verb phrase consists of a verb root followed by
auxiliaries and verb tense, if any, as shown in (1).
Verb Phrase
Simple verbphrase
Complex verbphrase
Conjunt verbphrase
Compoundverb phrase
-
8/3/2019 A Hybrid Approach to Urdu Verb Phrase Chunking
2/5
(1)
1F2Ali school gayaAli school wentAli went to school
Ali school jaata hai
Ali school go VBT-presentAli goes to the school
B. Complex PredicateA conjunct verb phrase consists of a nominal or adjective
and a verb, optionally followed by auxiliaries and verb tensemarker, as shown in (2).
(2)
Ali ne sabaq yaad kiya
Ali CMlessonUlearn do-past
Ali learnt the lesson
Ali ne kamraa saaf kiya
Ali CMroomUclean do-past
UVBT-past
Ali cleaned the room
A compound verb phrase consists of two verbs, optionallyfollowed by auxiliaries and verb tense marker. The first verb ismain verb and contributes the meaning of the sentence. Thesecond verb adds additional information as shown in (3).
(3)
Ali kam kar baeTha haiAli work
Udo sit
UVBT-present
Ali has done the work
Detailed analysis of Urdu VP is not in the scope of thispaper. For further discussion, see e.g. [2] and [5].
III. EARLIER WORK ON CHUNKINGThe work on chunking based on machine learning was
introduced by Church for English [8]. The idea of parsing bychunks, defining the chunks in English by assuming that achunk has syntactic structure was proposed [1]. Chunking wasused to convert sentences into non-overlapping phrases, likeVP and Noun Phrase (NP), to parse the sentence. A probabilistic chunker based on [1] was proposed [4]. Thetransformation based learning using a large annotated corpusfor English was used [11]. They proposed chunking as an IOBtagging task, where I marks the words which are Inside achunk, O marks the words which are Outside the chunk and B
2Urdu sentences are written from right to left.
marks the words which are at the Beginning of a chunk.Overall recall and precision achieved by this approach is about88%.
The standard HMM based tagging methods used to modelthe chunking process, and achieved an accuracy of 91.99% precision and 92.25% recall using a contextual lexicon [14].The memory based phrase chunking used and accuracy of91.05% precision and 92.03% recall for English was achieved[7]. The accuracy of 93.48% for English was achieved by usingsupport vector machines [13]. HMM based chunk tagger forHindi was presented [12]. They achieved 92% precision with100% recall for chunk boundaries by the HMM based chunker.The rule based chunking using XML was proposed [6]. Theyreported 90.18% precision and 92.49% recall for verb groupchunker for English.
A hybrid approach using rule based and memory basedlearning to chunk the phrases of Korean language wasproposed [10]. First, the rule based chunker is applied to chunkthe phrases then memory based learning technique is used forthe correction of errors which were not handled by rule basedchunker.
IV. CORPUS AND TAGSETPart of speech (POS) tagged corpus containing 4585
sentences and consisting of 101414 words is used for this work[9]. Complex phrase is composed of a nominal, adjective or averb combined with a light verb. In the POS tagged corpusused, light verbs are not tagged separately. However, taggingsuch verbs as light verbs helps determine whether thepreceding noun is part of verb phrase. So, we customized thetagset by introducing light verb tag (VBL) and infinitive lightverb tag (VBLI) to better address the compound and conjunctverb phrases. The IOB tagset is used to prepare chunkannotated data. The I tag marks the words which are Inside a
chunk, O marks the words which are Outside the chunk and Bmarks the words which are in the Beginningof a chunk. Thedata of 3650 sentences containing 81430 words is for training,530 sentences containing 9985 words are used for analysisduring the implementation of methodologies and the remaining405 sentences with 9999 words are used for testing.
V. METHODOLOGYA hybrid approach is used for VP chunking. First, statistical
approach is used and then rules are applied for correction. Themethodology is described below.
A. Statistical ChunkingGiven a sequence ofn words, there are corresponding t1 to
tn POS tags and c1 to cn chunk tags. The aim is to find the mostprobable chunk sequence for given the POS tags.
= max (1 1
) .(1) (1)
We assume that the probability of a POS tag depends on itsown chunk tag and the probability of a chunk tag is dependentonly on the previous two chunk tags. Using chain rule, problemis reduced to the following equation.
-
8/3/2019 A Hybrid Approach to Urdu Verb Phrase Chunking
3/5
= max ( ).( 1 , 2 )=1 (2)
TnT tagger is used for training and testing. All theexperiments are executed using its default option of secondorder HMM (Trigram model).
B. Chunking by rulesA rule based system used in post-processing for error
correction. Here we discuss some errors and correspondingrules for correction. The complete list of rules is included in theappendix. The most frequent error was assigning I tag to VBT,when it was not preceded by a verb, as it can come by itself. Inthis case, it should have been assigned B tag, as it is the beginning of the verb phrase. For example, see the tagsunderlined below.
-
8/3/2019 A Hybrid Approach to Urdu Verb Phrase Chunking
4/5
TABLE II. RESULTS OF ALL EXPERIMENTS
No. Methodology Over all result (%)B-tag I-tag O-tag
Precision Recall Precision Recall Precision Recall
1. Statistical using IOB tagset and RTL scanning 95.14 82.52 75.08 88.45 85.17 97.54 99.22
2. Hybrid using IOB tagset and RTL scanning 98.14 96.21 87.54 90.92 96.62 99.42 99.72
3. Statistical using IOB Extended tagset and RTL scanning 95.95 83.98 79.13 88.88 86.41 98.27 99.51
4. Hybrid using IOB Extended tagset and RTL scanning 98.44 97.16 90.07 93.10 96.62 99.45 99.77
5. Statistical using IOB tagset and LTR scanning 95.06 81.64 74.77 88.20 84.93 97.62 99.22
6. Hybrid using IOB tagset and LTR scanning 98.02 95.95 86.32 91.57 96.62 99.28 99.69
7. Statistical using IOB Extended tagset and LTR scanning 95.86 83.24 78.01 88.45 85.67 98.78 99.518. Hybrid using IOB Extended tagset and LTR scanning 98.29 96.80 88.75 91.78 96.62 99.42 99.74
Experiments are also performed using extended tagset bymerging IOB tag with POS tag in statistical based chunkingapproach and hybrid approach. The overall accuracy ofexperiment is improved to 95.95%. Rule based chunker isapplied on output of statistical chunker and overall accuracy is98.44% of this hybrid approach.
When the scanning direction is changed to Left to Right(LTR), the overall accuracy of statistical approach with simpletagset is 95.06% and 98.02% overall accuracy is obtained usinghybrid approach with simple tagset. Now extended tagset isused with statistical and hybrid approaches and the overallaccuracy of 95.86% and 98.29% is achieved respectively.
B.DiscussionThe aim of this research was to perform automatic verb
phrase chunker in Urdu. To get maximum accuracy differentexperiments were conducted using statistical and hybridapproach. The intention was to mark the factors which wereimportant for producing high accuracy. The experiment inwhich simple tagset (IOB), statistical based chunking and rightto left (RTL) scanning is followed is used as base lineexperiment and achieve 95.14% accuracy. It is observed that afew simple rules can give a significant 3% improvement in
accuracy. So statistical chunking and rules based correction isapplied for best results. The best overall accuracy achieved inall experiments is 98.44%. Merging POS tag with IOB taggives minor improvement in accuracy but reversing scanningdirection decreases accuracy.
VII. CONCLUSIONIn this paper, we have proposed a hybrid approach to learn
verb phrase chunking in Urdu which is base on HMM basedstatistical chunking and rule based correction afterwards. Wehave performed different experiments to get maximumaccuracy and found the experiment's scheme based on hybridapproach with extended tagset and Right to Left scanning givesthe best accuracy in all experiment schemes. The best accuracy
is 98.44%.
APPENDIX
The rules for verb phrase chunking are as following:
1. If () = and (1) = , Thenchunk tag for is.
2. If () = and (1) = , Thenchunk tag for is.
3. If () = and (1) = , Thenchunk tag for is.
4. If () = and (1) = , Thenchunk tag for is.
5. If () = and (1) = , Thenchunk tag for is.
6. If () = , (1) = and(2) = , Then chunk tag for is.
7. If () = , (1) = and(2) = , Then chunk tag for1 is
and is.
8.
If () = , (1) = and(2) = , Then chunk tag for1 is
and is.
9. If () = , (1) = and(2) = , Then chunk tag for1 is
and is.
10. If () = , (1) = and (2) = , Then chunk tag for1
is and is.
11. If () = , (+1) = and(+2) = , Then chunk tag for+1 is.
12. If () = , (+1) = and(+2) = , Then chunk tag for+1 is.
REFERENCES
[1] S. Abney, Parsing by Chunks: Principle based parsing, KluwerAcademic Publishers, Dordrecht.
[2] M. Butt, The Structure of Complex Predicates in Urdu, Stanford, CA:CSLI Publications, 1995.
[3] T. Brant, TnT: a statistical part of speech tager, In proceeding of thesixth conference on applied natural language processing, Seattle,Washington, 2000, pp. 224231.
[4] C. Kuang-hua and C. Hsin-His, A Probablistic Chunker, Inproceedings of ROCLING VI, 1993.
[5] D. Chakrabarti, H. Mandalia, R. Priya, V. Sarma, and P. Bhattacharyya,Hindi Compound Verbs and their Automatic Extraction,
Computational Linguistics (COLING08), Manchester, UK, 2008.
[6] C. Grover and R. Tobin, Rule Based Chunking and Reusability, In proceedings of the Fifth international conference on LanguageResources, 2007.
[7] J. Veenstra and A. van den Bosch, Single- Classifier Memory-BasedPhrase Chunking, In proceedings of CoNLL-2000 and LLL-2000,2000, pp. 157159.
[8] K. Church, A stochastic parts program and noun phrase parser forunrestricted text, In proceeings of Second Conference on AppliedNatural Language Processing, pp. 136143.
[9] A. Muaz, A. Ali and S. Hussain, Analysis and Development of UrduPOS Tagged Corpora, In proceedings of the 7th Workshop on AsianLanguage Resources, IJCNLP09, Suntec City, Singapore, 2009.
-
8/3/2019 A Hybrid Approach to Urdu Verb Phrase Chunking
5/5
[10] P. Seong-Bae and Z. Byoung-Tak, Text Chunking by CombiningHand-Crafted Rules and Memory-Based Learning, In proceedings ofthe 41st Annual Meeting on Association for Computational Linguistics -
Volume 1, 2003, pp. 497504.
[11] A. L. Ramshaw and M. P. Marcus, Text chunking using transformation based learning, In proceedings of the third ACL workshop on VeryLarge Corpora, Somerset, NJ, 1995, pp. 8294.
[12] A. Singh, S. Bendre and R. Sangal, HMM Based Chunker for Hindi,In proceedings of Posters, Intl. Joint Conf. on Natural LanguageProcessing (IJCNLP), 2001.
[13] T. Kudo and Y. Matsumoto, Chunking with Support Vector Machines,In proceedings of NAACL, 2001, pp. 10131015.
[14] G. Zhou, J. Su and T. Tey, Hybrid Text Chunking, In proceedings ofCoNLL- 2000 and LLL-2000, pp. 163165.