a hybrid approach to urdu verb phrase chunking

Upload: muhammad-kashif-ali

Post on 06-Apr-2018


  • 8/3/2019 A Hybrid Approach to Urdu Verb Phrase Chunking


    A hybrid approach to Urdu Verb Phrase Chunking

    Wajid Ali
    Department of Computer Sciences
    National University of Computer and Emerging Sciences
    Lahore, Pakistan
    [email protected]

    Sarmad Hussain
    Department of Computer Sciences
    National University of Computer and Emerging Sciences
    Lahore, Pakistan
    [email protected]

    Abstract: A variety of verb phrases exist in Urdu, including simple verb phrases, conjunct verb phrases and compound verb phrases. This paper explains the structure of Urdu verb phrases and details a series of experiments to automatically tag them. A 100,000-word Urdu corpus is manually tagged with VP chunk tags. The corpus is then used to develop a hybrid approach that combines HMM-based statistical chunking with correction rules. Initially, a statistical baseline model is developed. The technique is enhanced by changing the chunking direction and by merging chunk and POS tags. The automatically chunked data is compared with manually tagged held-out data to identify and analyze the errors. Based on this analysis, correction rules are extracted to address the errors. Applying these rules after statistical tagging further improves chunking accuracy. The results of all experiments are reported, with a maximum overall accuracy of 98.44% achieved using the hybrid approach with the extended tagset.

    Keywords: Urdu verb phrase; phrase chunking; statistical chunking; hybrid chunking

    I. INTRODUCTION

    Urdu, spoken by more than 100 million speakers, is the national language of Pakistan and a state language of India, and is spoken across the Middle East and in many other countries. It is an Indo-Aryan language with free phrase order: the phrases within a sentence may appear in any order¹, whereas the words within a phrase have a fixed order. Because the order of the phrases is variable, the case markers (CM), which are explicitly written in Urdu as separate words, help determine the role of each phrase in the sentence. The Verb Phrase (VP) is the head of the sentence and licenses the number as well as the role of the other phrases in a sentence, e.g. subject, object, etc. The number of arguments licensed depends on the valency of the verb, also categorized as intransitive, transitive and ditransitive. This information is normally encoded in the sub-categorization frame of a verb, which lists the number and type of arguments the verb licenses.

    ¹ Though the meaning does not change, the phrase order may change the emphasis of the phrase within a sentence.

    Determining these phrases within a sentence is very useful for a variety of applications in Natural Language Processing (NLP), and the process that directly labels these phrases is called chunking. The phrases identified by chunking are used in the development of NLP applications such as parsing, machine translation, question answering and information extraction. The current work focuses on chunking VPs in Urdu.

    Relevant Urdu verb phrase analysis is summarized in Section II. Section III presents some relevant chunking-related work. Section IV contains the details of the tagged corpus developed for this task. The methodology is discussed in Section V. The results and discussion are presented in Section VI. Section VII concludes the work.

    II. VERB PHRASES IN URDU

    Minimally, an Urdu VP is represented by a single verb. However, a typical Urdu verb phrase contains a verb followed by one or more auxiliaries (AUX) and verb tense markers (VBT), each represented by a separate word. Some of the tense and aspect information is also encoded within the verb morphology. An Urdu verb phrase can be categorized as a simple verb phrase or a complex verb phrase. In a complex verb phrase, the verb is formed by a combination of nominal + verb (called a conjunct verb) or verb + verb (called a compound verb). These complex verbs are also referred to as complex predicates.

    Verb Phrase
      Simple verb phrase
      Complex verb phrase
        Conjunct verb phrase
        Compound verb phrase

    Figure 1: Categorization of Urdu verb phrase

    The following subsections explain these types of verb phrases.

    A. Simple Verb Phrase

    A simple verb phrase consists of a verb root followed by auxiliaries and a verb tense marker, if any, as shown in (1).


    (1)²
    Ali school gaya
    Ali school went
    Ali went to school

    Ali school jaata hai
    Ali school go VBT-present
    Ali goes to the school

    B. Complex Predicate

    A conjunct verb phrase consists of a nominal or adjective and a verb, optionally followed by auxiliaries and a verb tense marker, as shown in (2).

    (2)
    Ali ne sabaq yaad kiya
    Ali CM lesson learn do-past
    Ali learnt the lesson

    Ali ne kamraa saaf kiya
    Ali CM room clean do-past VBT-past
    Ali cleaned the room

    A compound verb phrase consists of two verbs, optionally followed by auxiliaries and a verb tense marker. The first verb is the main verb and contributes the meaning of the sentence. The second verb adds additional information, as shown in (3).

    (3)
    Ali kam kar baeTha hai
    Ali work do sit VBT-present
    Ali has done the work

    A detailed analysis of the Urdu VP is beyond the scope of this paper. For further discussion, see e.g. [2] and [5].

    III. EARLIER WORK ON CHUNKING

    Chunking based on machine learning was introduced for English by Church [8]. The idea of parsing by chunks, which defines chunks in English on the assumption that a chunk has syntactic structure, was proposed in [1]. Chunking was used to convert sentences into non-overlapping phrases, such as VP and Noun Phrase (NP), in order to parse the sentence. A probabilistic chunker based on [1] was proposed in [4]. Transformation-based learning using a large annotated corpus for English was used in [11], which proposed chunking as an IOB tagging task, where I marks the words that are Inside a chunk, O marks the words that are Outside the chunk and B marks the words that are at the Beginning of a chunk. The overall recall and precision achieved by this approach are about 88%.

    ² Urdu sentences are written from right to left.

    Standard HMM-based tagging methods were used to model the chunking process in [14], achieving 91.99% precision and 92.25% recall using a contextual lexicon. Memory-based phrase chunking achieved 91.05% precision and 92.03% recall for English [7]. An accuracy of 93.48% for English was achieved using support vector machines [13]. An HMM-based chunk tagger for Hindi was presented in [12]; it achieved 92% precision with 100% recall for chunk boundaries. Rule-based chunking using XML was proposed in [6], which reported 90.18% precision and 92.49% recall for a verb group chunker for English.

    A hybrid approach using rule-based and memory-based learning to chunk the phrases of the Korean language was proposed in [10]. First, the rule-based chunker is applied to chunk the phrases; then a memory-based learning technique is used to correct the errors that were not handled by the rule-based chunker.

    IV. CORPUS AND TAGSET

    A part-of-speech (POS) tagged corpus containing 4,585 sentences and 101,414 words is used for this work [9]. A complex verb phrase is composed of a nominal, an adjective or a verb combined with a light verb. In the POS tagged corpus used, light verbs are not tagged separately. However, tagging such verbs as light verbs helps determine whether the preceding noun is part of the verb phrase. We therefore customized the tagset by introducing a light verb tag (VBL) and an infinitive light verb tag (VBLI) to better address compound and conjunct verb phrases. The IOB tagset is used to prepare the chunk-annotated data. The I tag marks the words that are Inside a chunk, O marks the words that are Outside the chunk and B marks the words that are at the Beginning of a chunk. Of the data, 3,650 sentences containing 81,430 words are used for training, 530 sentences containing 9,985 words are used for analysis during the implementation of the methodologies, and the remaining 405 sentences with 9,999 words are used for testing.
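    To make the IOB scheme concrete, the following minimal Python sketch converts VP chunk spans into one IOB tag per word. The function name `spans_to_iob`, the span representation and the chunk boundary chosen for the example sentence are illustrative assumptions, not the annotation tooling used for the corpus.

```python
def spans_to_iob(n_words, vp_spans):
    """Map VP chunk spans (start, end), end exclusive, to one IOB tag per word."""
    tags = ["O"] * n_words               # words outside every VP chunk
    for start, end in vp_spans:
        tags[start] = "B"                # first word of the chunk
        for i in range(start + 1, end):
            tags[i] = "I"                # remaining words inside the chunk
    return tags

# Romanized example (1) from Section II: "Ali school jaata hai"
words = ["Ali", "school", "jaata", "hai"]
# Assumed VP span: the verb "jaata" plus the tense marker "hai"
iob = spans_to_iob(len(words), [(2, 4)])
print(list(zip(words, iob)))
```

    Under this assumed span, the verb receives B and the tense marker that follows it receives I, while the two nouns stay outside the chunk.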

    V. METHODOLOGY

    A hybrid approach is used for VP chunking. First, a statistical approach is used, and then rules are applied for correction. The methodology is described below.

    A. Statistical Chunking

    Given a sequence of n words, there are corresponding POS tags t1 to tn and chunk tags c1 to cn. The aim is to find the most probable chunk sequence given the POS tags.

    $$\hat{c}_1^n = \arg\max_{c_1^n} P(t_1^n \mid c_1^n)\, P(c_1^n) \qquad (1)$$

    We assume that the probability of a POS tag depends only on its own chunk tag and that the probability of a chunk tag depends only on the previous two chunk tags. Using the chain rule, the problem reduces to the following equation.


    $$\hat{c}_1^n = \arg\max_{c_1^n} \prod_{i=1}^{n} P(t_i \mid c_i)\, P(c_i \mid c_{i-1}, c_{i-2}) \qquad (2)$$

    The TnT tagger [3] is used for training and testing. All the experiments are executed using its default option of a second-order HMM (trigram model).
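    The decoding step behind such a model can be sketched as a Viterbi search over pairs of chunk tags, maximizing the product of P(t_i | c_i) and P(c_i | c_{i-1}, c_{i-2}) described above. This is a toy illustration and not TnT's implementation: the function name `viterbi_trigram`, the probability tables, the `<s>` padding symbol and the constant `floor` used for unseen events are all assumptions made for the example.

```python
import math

TAGS = ["B", "I", "O"]
START = "<s>"  # assumed padding symbol for the two positions before the sentence

def viterbi_trigram(pos_tags, emit, trans, floor=1e-6):
    """Most probable chunk-tag sequence for a POS-tag sequence.

    emit[(c, t)]       ~ P(t | c)
    trans[(c2, c1, c)] ~ P(c | c1, c2), where c2 is the tag two positions back.
    Unseen events get probability `floor` (a crude stand-in for smoothing).
    """
    def logp(table, key):
        return math.log(table.get(key, floor))

    # Each state is the pair (previous chunk tag, current chunk tag).
    best = {(START, START): (0.0, [])}
    for t in pos_tags:
        nxt = {}
        for (c2, c1), (score, path) in best.items():
            for c in TAGS:
                s = score + logp(trans, (c2, c1, c)) + logp(emit, (c, t))
                if (c1, c) not in nxt or s > nxt[(c1, c)][0]:
                    nxt[(c1, c)] = (s, path + [c])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]

# Hand-picked toy probabilities (not estimated from the corpus).
emit = {("O", "NNP"): 0.3, ("O", "NN"): 0.6, ("B", "VB"): 0.8,
        ("I", "VB"): 0.1, ("I", "VBT"): 0.7, ("B", "VBT"): 0.2}
trans = {(START, START, "O"): 0.8, (START, "O", "O"): 0.6,
         ("O", "O", "B"): 0.5, ("O", "B", "I"): 0.7}
chunks = viterbi_trigram(["NNP", "NN", "VB", "VBT"], emit, trans)
print(chunks)
```

    Keeping the last two chunk tags in the state is what makes the model second-order; a first-order model would track only the previous tag.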

    B. Chunking by Rules

    A rule-based system is used in post-processing for error correction. Here we discuss some errors and the corresponding correction rules. The complete list of rules is included in the appendix. The most frequent error was assigning the I tag to a VBT that was not preceded by a verb, since a VBT can occur by itself. In this case it should have been assigned the B tag, as it is the beginning of the verb phrase. For example, see the tags underlined below.
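    Separately from the paper's own example, the correction just described can be sketched in Python as a single post-processing pass. The function name `fix_standalone_vbt` and the set `VERB_LIKE` are hypothetical; in particular, which POS tags count as a preceding verb (and whether AUX belongs among them) is an assumption of this sketch, not something the text specifies.

```python
# Assumed set of POS tags that may precede VBT inside the same verb phrase.
VERB_LIKE = {"VB", "VBL", "VBLI", "AUX"}

def fix_standalone_vbt(pos_tags, chunk_tags, verb_like=VERB_LIKE):
    """Relabel a VBT from I to B when the previous word is not verb-like."""
    fixed = list(chunk_tags)
    for i, (pos, chunk) in enumerate(zip(pos_tags, chunk_tags)):
        prev_pos = pos_tags[i - 1] if i > 0 else None
        if pos == "VBT" and chunk == "I" and prev_pos not in verb_like:
            fixed[i] = "B"   # the tense marker starts its own verb phrase
    return fixed

pos = ["NNP", "NN", "VBT"]        # hypothetical POS sequence with a bare VBT
predicted = ["O", "O", "I"]       # statistical output containing the error
corrected = fix_standalone_vbt(pos, predicted)
print(corrected)
```

    A VBT that does follow a verb, as in the sequence VB VBT tagged B I, is left unchanged by this pass.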


    TABLE II. RESULTS OF ALL EXPERIMENTS

    No.  Methodology                                             Overall  B-tag P  B-tag R  I-tag P  I-tag R  O-tag P  O-tag R
    1.   Statistical using IOB tagset and RTL scanning           95.14    82.52    75.08    88.45    85.17    97.54    99.22
    2.   Hybrid using IOB tagset and RTL scanning                98.14    96.21    87.54    90.92    96.62    99.42    99.72
    3.   Statistical using IOB Extended tagset and RTL scanning  95.95    83.98    79.13    88.88    86.41    98.27    99.51
    4.   Hybrid using IOB Extended tagset and RTL scanning       98.44    97.16    90.07    93.10    96.62    99.45    99.77
    5.   Statistical using IOB tagset and LTR scanning           95.06    81.64    74.77    88.20    84.93    97.62    99.22
    6.   Hybrid using IOB tagset and LTR scanning                98.02    95.95    86.32    91.57    96.62    99.28    99.69
    7.   Statistical using IOB Extended tagset and LTR scanning  95.86    83.24    78.01    88.45    85.67    98.78    99.51
    8.   Hybrid using IOB Extended tagset and LTR scanning       98.29    96.80    88.75    91.78    96.62    99.42    99.74

    Overall = overall result (%); P = precision (%); R = recall (%).
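    The kinds of figures reported in Table II can be computed by comparing predicted chunk tags with manually assigned ones. The sketch below assumes that the overall result is token-level accuracy and that precision and recall are computed per tag over tokens; the function name `evaluate` and the six-token example are illustrative only and do not come from the test data.

```python
def evaluate(gold, predicted, labels=("B", "I", "O")):
    """Return overall accuracy and per-tag (precision, recall), all in percent."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    correct = sum(g == p for g, p in zip(gold, predicted))
    accuracy = 100.0 * correct / len(gold)
    per_tag = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        n_pred = sum(p == label for p in predicted)   # tokens predicted as label
        n_gold = sum(g == label for g in gold)        # tokens truly labelled so
        precision = 100.0 * tp / n_pred if n_pred else 0.0
        recall = 100.0 * tp / n_gold if n_gold else 0.0
        per_tag[label] = (precision, recall)
    return accuracy, per_tag

gold      = ["O", "O", "B", "I", "O", "B"]
predicted = ["O", "O", "B", "I", "O", "I"]   # last token: B mistaken for I
accuracy, per_tag = evaluate(gold, predicted)
print(round(accuracy, 2), per_tag)
```

    In this toy case the single B/I confusion lowers B-tag recall and I-tag precision while leaving the O tag untouched, which mirrors how the B and I columns in the table move together.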

    Experiments are also performed using an extended tagset, formed by merging the IOB tag with the POS tag, in both the statistical and the hybrid approach. With the extended tagset, the overall accuracy of the statistical chunker improves to 95.95%. When the rule-based chunker is applied to the output of the statistical chunker, the overall accuracy of this hybrid approach is 98.44%.
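    One way to picture the extended tagset is as a string that joins each chunk tag with the POS tag of the same word. The text does not give the exact label format, so the hyphen separator, the resulting labels such as `B-VB`, and the helper names `merge_tags` and `split_chunk_tags` are assumptions of this sketch.

```python
def merge_tags(chunk_tags, pos_tags, sep="-"):
    """Build extended tags such as 'B-VB' from parallel chunk and POS tags."""
    return [f"{c}{sep}{t}" for c, t in zip(chunk_tags, pos_tags)]

def split_chunk_tags(extended_tags, sep="-"):
    """Recover the plain IOB tag from each extended tag."""
    return [tag.split(sep, 1)[0] for tag in extended_tags]

chunk_tags = ["O", "O", "B", "I"]
pos_tags = ["NNP", "NN", "VB", "VBT"]      # hypothetical POS sequence
extended = merge_tags(chunk_tags, pos_tags)
print(extended)
print(split_chunk_tags(extended))
```

    Merging enlarges the label set the statistical model has to predict, and splitting afterwards maps its output back to plain IOB tags for evaluation.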

    When the scanning direction is changed to Left to Right (LTR), the overall accuracy with the simple tagset is 95.06% for the statistical approach and 98.02% for the hybrid approach. With the extended tagset, the statistical and hybrid approaches achieve overall accuracies of 95.86% and 98.29%, respectively.

    B. Discussion

    The aim of this research was to build an automatic verb phrase chunker for Urdu. To maximize accuracy, different experiments were conducted using statistical and hybrid approaches, with the intention of identifying the factors that matter most for high accuracy. The experiment using the simple tagset (IOB), statistical chunking and right-to-left (RTL) scanning serves as the baseline and achieves 95.14% accuracy. A few simple rules give a significant improvement of about 3 percentage points in accuracy, so statistical chunking followed by rule-based correction is applied for the best results. The best overall accuracy achieved across all experiments is 98.44%. Merging the POS tag with the IOB tag gives a minor improvement in accuracy, whereas reversing the scanning direction from RTL to LTR decreases accuracy.
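    The size of each effect can be read off the overall accuracies in Table II with a few subtractions. The short script below is only a convenience for that arithmetic; the dictionary layout and variable names are assumptions, while the numbers are the overall results reported in the table.

```python
# Overall accuracy (%) from Table II: (tagset, direction) -> (statistical, hybrid)
RESULTS = {
    ("IOB", "RTL"): (95.14, 98.14),
    ("IOB Extended", "RTL"): (95.95, 98.44),
    ("IOB", "LTR"): (95.06, 98.02),
    ("IOB Extended", "LTR"): (95.86, 98.29),
}

# Gain from applying correction rules after statistical chunking
rule_gain = {cfg: round(hyb - stat, 2) for cfg, (stat, hyb) in RESULTS.items()}

# Gain of the extended tagset over the simple tagset (statistical runs)
tagset_gain = {d: round(RESULTS[("IOB Extended", d)][0] - RESULTS[("IOB", d)][0], 2)
               for d in ("RTL", "LTR")}

# Gain of RTL over LTR scanning (statistical runs)
direction_gain = {t: round(RESULTS[(t, "RTL")][0] - RESULTS[(t, "LTR")][0], 2)
                  for t in ("IOB", "IOB Extended")}

for name, table in [("rules", rule_gain), ("tagset", tagset_gain),
                    ("direction", direction_gain)]:
    print(name, table)
```

    The rules contribute between 2.43 and 3.00 percentage points, the extended tagset about 0.8, and the scanning direction under 0.1, which matches the ordering of effects described in the discussion.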

    VII. CONCLUSION

    In this paper, we have proposed a hybrid approach to verb phrase chunking in Urdu, which is based on HMM-based statistical chunking followed by rule-based correction. We performed different experiments to maximize accuracy and found that the scheme based on the hybrid approach with the extended tagset and right-to-left scanning gives the best accuracy among all experiment schemes. The best accuracy is 98.44%.

    APPENDIX

    The rules for verb phrase chunking are as follows:

    1. If () = and (1) = , then chunk tag for is.
    2. If () = and (1) = , then chunk tag for is.
    3. If () = and (1) = , then chunk tag for is.
    4. If () = and (1) = , then chunk tag for is.
    5. If () = and (1) = , then chunk tag for is.
    6. If () = , (1) = and (2) = , then chunk tag for is.
    7. If () = , (1) = and (2) = , then chunk tag for 1 is and is.
    8. If () = , (1) = and (2) = , then chunk tag for 1 is and is.
    9. If () = , (1) = and (2) = , then chunk tag for 1 is and is.
    10. If () = , (1) = and (2) = , then chunk tag for 1 is and is.
    11. If () = , (+1) = and (+2) = , then chunk tag for +1 is.
    12. If () = , (+1) = and (+2) = , then chunk tag for +1 is.

    REFERENCES

    [1] S. Abney, "Parsing by Chunks," in Principle-Based Parsing, Kluwer Academic Publishers, Dordrecht.

    [2] M. Butt, The Structure of Complex Predicates in Urdu, Stanford, CA: CSLI Publications, 1995.

    [3] T. Brants, "TnT: a statistical part-of-speech tagger," in Proceedings of the Sixth Conference on Applied Natural Language Processing, Seattle, Washington, 2000, pp. 224-231.

    [4] C. Kuang-hua and C. Hsin-Hsi, "A Probabilistic Chunker," in Proceedings of ROCLING VI, 1993.

    [5] D. Chakrabarti, H. Mandalia, R. Priya, V. Sarma, and P. Bhattacharyya, "Hindi Compound Verbs and their Automatic Extraction," Computational Linguistics (COLING'08), Manchester, UK, 2008.

    [6] C. Grover and R. Tobin, "Rule-Based Chunking and Reusability," in Proceedings of the Fifth International Conference on Language Resources, 2007.

    [7] J. Veenstra and A. van den Bosch, "Single-Classifier Memory-Based Phrase Chunking," in Proceedings of CoNLL-2000 and LLL-2000, 2000, pp. 157-159.

    [8] K. Church, "A stochastic parts program and noun phrase parser for unrestricted text," in Proceedings of the Second Conference on Applied Natural Language Processing, pp. 136-143.

    [9] A. Muaz, A. Ali and S. Hussain, "Analysis and Development of Urdu POS Tagged Corpora," in Proceedings of the 7th Workshop on Asian Language Resources, IJCNLP'09, Suntec City, Singapore, 2009.


    [10] P. Seong-Bae and Z. Byoung-Tak, "Text Chunking by Combining Hand-Crafted Rules and Memory-Based Learning," in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, 2003, pp. 497-504.

    [11] L. A. Ramshaw and M. P. Marcus, "Text chunking using transformation-based learning," in Proceedings of the Third ACL Workshop on Very Large Corpora, Somerset, NJ, 1995, pp. 82-94.

    [12] A. Singh, S. Bendre and R. Sangal, "HMM Based Chunker for Hindi," in Proceedings of Posters, Intl. Joint Conf. on Natural Language Processing (IJCNLP), 2001.

    [13] T. Kudo and Y. Matsumoto, "Chunking with Support Vector Machines," in Proceedings of NAACL, 2001, pp. 1013-1015.

    [14] G. Zhou, J. Su and T. Tey, "Hybrid Text Chunking," in Proceedings of CoNLL-2000 and LLL-2000, pp. 163-165.