Research Article
July 2017
© www.ijarcsse.com, All Rights Reserved Page | 193
International Journals of Advanced Research in
Computer Science and Software Engineering
ISSN: 2277-128X (Volume-7, Issue-7)
Punjabi POS Tagger: Rule Based and HMM
Umrinderpal Singh, Vishal Goyal
Punjabi University, Patiala,
Punjab, India
DOI: 10.23956/ijarcsse/V7I7/0106
Abstract: A Part of Speech tagger assigns a tag to every input word in a given sentence. The tags cover the parts of speech of a particular language, such as noun, pronoun, verb, adjective, conjunction etc., and may include subcategories of these tags. Part of Speech tagging is a basic preprocessing task for most Natural Language Processing (NLP) applications, such as Information Retrieval, Machine Translation and Grammar Checking. The task belongs to a larger set of problems, namely sequence labeling problems. Part of Speech tagging for Punjabi is not widely explored territory. We discuss a Rule Based and an HMM based Part of Speech tagger for Punjabi, along with a comparison of the accuracies of both approaches. The system is developed using 35 different standard part of speech tags. We evaluate our system on unseen data and achieve state-of-the-art accuracy of 93.3%.
General Terms: NLP, Part of Speech Tagger
Additional Key Words and Phrases: POS, Punjabi, Rule Based, HMM
I. INTRODUCTION
Part of Speech (POS) is a linguistic category assigned to a word based on a predefined set of part of speech tags or word classes. Parts of speech can be divided into open and closed word classes. Lexical words that convey the core meaning of a sentence belong to the open word classes, such as nouns, verbs and adjectives. The closed word classes comprise a small number of categories such as prepositions, postpositions and determiners. A Part of Speech tagger assigns a particular tag to a word based on its context in a sentence. The POS tagger is an important component used in many Natural Language Processing tasks such as Machine Translation, Information Extraction, Grammar Checking and Parsing. In most cases, the accuracy of these NLP applications depends upon the accuracy of the POS tagger. Therefore, efforts are put into developing an accurate POS tagger for Punjabi. There are different approaches to implementing a POS tagger; we have opted to use the Rule Based and HMM approaches. The rule based approach uses linguistic rules to tag a given word in a sentence and resolves ambiguities based on these rules. The HMM tagger is a machine learning approach based on a probability model, where the system is trained on an annotated corpus.
II. ABOUT PUNJABI LANGUAGE
Punjabi is an Indo-Aryan language spoken by about 130 million people worldwide
(http://en.wikipedia.org/wiki/Punjabi_language). Punjabi is the tenth most widely spoken language in the world and the eleventh most widely spoken language in India. Punjabi is the mother tongue of the Indian state of Punjab and of Pakistani Punjab, but it is written differently in the two regions: in India Punjabi is written in the Gurmukhi script, while in Pakistan it is written in the Shahmukhi script. There are also many Punjabi speakers in the UK and Canada. There is thus a huge number of Punjabi speakers worldwide, and a vast amount of literature is written in the Punjabi language. Natural Language Processing tools are needed so that Punjabi speakers and other communities can benefit from them. Much work on Punjabi has been done at Punjabi University, Patiala (http://www.learnpunjabi.org/), which has developed many Punjabi language systems such as a Punjabi to Hindi Machine Translation (MT) system, a Hindi to Punjabi MT system, Punjabi OCR, a Legacy Font Converter and a Grammar Checker.
III. RELATED WORK
Mandeep Singh Gill [3,4] developed the first part of speech tagger for Punjabi, as a module in a Punjabi grammar checker system. Around 630 tags were used. The system was developed using a rule based approach: the tagger used handwritten linguistic rules to tag words and to remove ambiguities. The accuracy of the system was reported to be 80.29%. Sapna Kanwar [7] developed an HMM based part of speech tagger for Punjabi, built on a bigram HMM model; the accuracy of the system was 86.2%. Sharma, S.K. [8] developed an HMM based Punjabi part of speech tagger, also using a bigram HMM model, trained on an annotated corpus of 20,000 words; the accuracy of the system is 90.11%. Dinesh Kumar [1] developed a part of speech tagger using a neural network, the first part of speech tagger developed with this approach. A Multilayer Perceptron with fixed context length was used, trained with the Back-Propagation learning algorithm; the accuracy of the system was 85.95%. Navneet Garg [6] developed a rule based tagger for Hindi; the system used lexical resources, morphological information and various rules to tag unknown words and to resolve ambiguities. The work of S. Singh [11] showed how morphological information can be used to handle unknown words and how it can be helpful for resource-poor languages. For other part of speech tagger systems for Indian languages, the reader may refer to Shambhavi B.R. 2010 [17]. Significant state-of-the-art POS tagger systems have been developed for English and French using the
Wall Street Journal portion of the Penn Treebank and the French Treebank. Thorsten Brants' [15] TnT POS tagger, trained on 50k sentences, delivers 96.7% accuracy using an HMM (http://aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)).
To the best of our knowledge, there is only one Punjabi POS tagger system available online for use (http://pgc.learnpunjabi.org/#Tagger). None of the Punjabi POS taggers developed so far used a standard POS tag set, and without standardization it is difficult to compare the accuracy of one tagger with another. The Punjabi annotated corpus of 20,000 words used by Sharma, S.K. [7,8] to train their HMM model is not adequate training data for any statistical model.
In our work, we have used the standard tag set proposed by TDIL (Technology Development for Indian Languages) and annotated data developed under the ILCI (Indian Languages Corpora Initiative) consortium of TDIL, DeitY, GoI, New Delhi, India, for training the POS tagger.
IV. APPROACHES TO POS TAGGER
4.1 Rule Based Approach: The Rule Based approach is based on linguistic rules. Different linguistic rules are used to tag a given word in a sentence [13]. Developing linguistic rules requires a deep understanding of the target language, and building a rule-based system is a time consuming task. These systems mainly use a large lexicon and many language specific rules [2]. Creating such large dictionaries and rules needs ample time and language specific proficiency. Rule Based systems are mostly domain specific: a rule based system developed for one domain cannot easily be extended to another, and these systems are not easily portable.
4.2 Statistical System: The statistical approach to Part of Speech tagging is based on probabilities. The system estimates the probability of occurrence of a given word-tag pair from the training corpus. These systems need a large annotated corpus for training. In the training data, we try to incorporate examples covering all domains, so that the system performs well on all kinds of input. Developing an annotated corpus for training statistical systems is itself a challenging and time consuming task. Statistical systems use different models such as SVM, CRF, MaxEnt, HMM etc.
We have used a trigram HMM model to develop the POS tagger for Punjabi. An HMM is a set of interlocked states connected through a set of transition probabilities, where a transition probability defines the probability of moving from one state to another.
The output of the HMM is the highest probability tag sequence, which is hidden, corresponding to the given observed sequence. For a sentence (s1, s2, ..., sn) and tag sequence (t1, t2, ..., tn), the generative model is defined as:

f(s_1 … s_n) = argmax_{t_1 … t_n} P(s_1 … s_n, t_1 … t_n)   (1)

For input s_1 to s_n we take the highest probability tag sequence as the output of the model. The trigram HMM model is defined as:

p(s_1 … s_n, t_1 … t_n) = ∏_{i=1}^{n} q(t_i | t_{i-2}, t_{i-1}) · e(s_i | t_i)   (2)

q(t_n | t_{n-2}, t_{n-1}) = F(t_{n-2} t_{n-1} t_n) / F(t_{n-2} t_{n-1})   (3)

e(s_i | t_i) = F(t_i → s_i) / F(t_i)   (4)

where F(t_{n-2} t_{n-1} t_n) is the number of occurrences of the trigram (t_{n-2}, t_{n-1}, t_n) in the corpus and F(t_{n-2} t_{n-1}) is the number of occurrences of the bigram (t_{n-2}, t_{n-1}). A trigram HMM consists of a finite vocabulary V of possible words and a finite set K of possible tags. Any trigram (t_{n-2}, t_{n-1}, t_n) is such that t_n ∈ K ∪ {stop} and t_{n-2}, t_{n-1} ∈ K ∪ {start}. The value q(t_n | t_{n-2}, t_{n-1}) can be interpreted as the probability of seeing tag t_n immediately after the tag bigram (t_{n-2}, t_{n-1}). F(t_i → s_i) is the number of occurrences of word s_i with tag t_i, so e(s_i | t_i), for any s_i ∈ V and t_i ∈ K, is this count divided by the total occurrences of tag t_i; it can be interpreted as the probability of seeing observation s_i paired with state t_i. To find the highest probability tag sequence we use the Viterbi algorithm for decoding the HMM. The Viterbi algorithm is defined as:
ALGORITHM 1. Viterbi
Input: a sentence x_1 … x_n and parameters q(c | a, b), e(x | c)
Define K_k as the set of all tags for k = 1 … n, with K_{-1} = K_0 = {start}
Set π(0, start, start) = 1
For k = 1 … n:
  For b ∈ K_{k-1}, c ∈ K_k:
    π(k, b, c) = max_{a ∈ K_{k-2}} ( π(k-1, a, b) · q(c | a, b) · e(x_k | c) )
Return max_{b ∈ K_{n-1}, c ∈ K_n} ( π(n, b, c) · q(stop | b, c) )
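The paper's system was implemented in C#.NET; purely as an illustration, the count-based estimates of Eqs. (3) and (4) and the Viterbi decoding of Algorithm 1 can be sketched in Python as follows (a toy-scale sketch with no smoothing; all names are ours, not the paper's):

```python
from collections import defaultdict

START, STOP = "<s>", "</s>"

def train(tagged_sentences):
    """Maximum-likelihood estimates for q (Eq. 3) and e (Eq. 4) from counts."""
    trig = defaultdict(int)       # F(t_{i-2} t_{i-1} t_i)
    big = defaultdict(int)        # F(t_{i-2} t_{i-1})
    emit = defaultdict(int)       # F(t_i -> s_i)
    tag_total = defaultdict(int)  # F(t_i)
    tags = set()
    for sent in tagged_sentences:
        seq = [START, START] + [t for _, t in sent] + [STOP]
        for i in range(2, len(seq)):
            trig[(seq[i - 2], seq[i - 1], seq[i])] += 1
            big[(seq[i - 2], seq[i - 1])] += 1
        for word, tag in sent:
            emit[(tag, word)] += 1
            tag_total[tag] += 1
            tags.add(tag)
    def q(c, a, b):  # q(c | a, b)
        return trig[(a, b, c)] / big[(a, b)] if big[(a, b)] else 0.0
    def e(w, t):     # e(w | t)
        return emit[(t, w)] / tag_total[t] if tag_total[t] else 0.0
    return q, e, tags

def viterbi(words, q, e, tags):
    """Algorithm 1: return the highest-probability tag sequence."""
    n = len(words)
    K = lambda k: {START} if k <= 0 else tags  # K_{-1} = K_0 = {start}
    pi = {(0, START, START): 1.0}
    bp = {}
    for k in range(1, n + 1):
        for b in K(k - 1):
            for c in K(k):
                best, arg = 0.0, None
                for a in K(k - 2):
                    p = pi.get((k - 1, a, b), 0.0) * q(c, a, b) * e(words[k - 1], c)
                    if p > best:
                        best, arg = p, a
                pi[(k, b, c)] = best
                bp[(k, b, c)] = arg
    # termination: multiply in q(stop | b, c) and pick the best final tag pair
    best, pair = 0.0, None
    for b in K(n - 1):
        for c in K(n):
            p = pi.get((n, b, c), 0.0) * q(STOP, b, c)
            if p > best:
                best, pair = p, (b, c)
    y = [None] * (n + 1)
    y[n - 1], y[n] = pair              # valid for n >= 1 (y[0] is discarded)
    for k in range(n - 2, 0, -1):      # follow the back-pointers
        y[k] = bp[(k + 2, y[k + 1], y[k + 2])]
    return y[1:]
```

On a toy corpus of a couple of tagged sentences this recovers the expected tag sequence; the actual system is trained on the 44387-sentence ILCI corpus and adds rule-based handling for unseen words.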
4.3 Hybrid System:
A hybrid system is a combination of the Rule Based and Statistical approaches. It is trained on an annotated corpus and additionally applies language specific rules, as in the rule based approach [14].
V. TAG SET USED BY OTHER TAGGERS FOR PUNJABI
The first tag set for Punjabi was proposed in 2008 [3]. That tag set included many new tags for Punjabi; the reason for developing a new tag set was that existing tag sets for other languages, such as the Penn Treebank tag set, could not be used directly for Punjabi grammar checking. Moreover, there was no well defined and standardized tag set available for any Indian language [3]. Around 630 tags were used in this tagger, so choosing the best possible tag for a word was a challenging task. Word specific tags were also included in the tag set: instead of a single tag representing all postpositions, each postposition was given a different tag name. Table I shows an excerpt of the tag set.
Table I: Word Specific Tags
Word POS Description
ਦਾ PPIDA Post Position Inflected DA
ਦੀ PPIDA Post Position Inflected DA
ਨੇ PPUNE Post Position Uninflected NE
ਨੂੂੰ PPUNU Post Position Uninflected NU
ਤੋਂ PPUTON Post Position Uninflected TON
ਕੁ PTUKU Particle Uninflected KU
ਕ PTUKE Particle Uninflected KE
VI. POS TAG SET USED FOR OUR TAGGER
TDIL (Technology Development for Indian Languages) is a Government body constituted by DeitY (Department of Electronics and Information Technology), MoC&IT (Ministry of Communications and Information Technology), GoI (Government of India), New Delhi, India for setting standards and developing language resources for Indian languages, using the expertise of linguists and engineers working in various reputed institutes and universities throughout India such as IITs, NITs, state universities and autonomous bodies. TDIL has released the BIS Part of Speech tag set for Indian languages, available on its website for free use after obtaining permission. This is the first standard Part of Speech (POS) tag set proposed for Indian languages, with contributions from various organizations across India. 22 Indian languages are included in this POS tag set standardization process, and Punjabi is one of them. In the Punjabi POS tag set, 35 tags have been defined. There are 11 main categories, each further divided into sub-categories; for example, the verb category is divided into five sub-categories.
Table II: Verb POS Categories
POS Category Main-Category Sub Category POS Tag
Verb
Main Verb V_VM
Non-Finite V_VM_VNF
Infinite V_VM_VINF
Gerund V_VM_VNG
Auxiliary Verb V_VAUX
All 35 tags are given in Appendix 1. We have used this standard POS tag set for implementing POS tagger.
VII. ANALYSIS OF ANNOTATED CORPUS
ILCI (Indian Language Corpora Initiative) Consortium, led by Jawahar Lal Nehru University, New Delhi, covering 12
Indian Languages, composed of 17 institutions. The main aim of these consortium's members to develop the annotated
parallel corpora in 12 Indian Languages. This consortium has already completed the annotation task of 50K sentences for
Health and Tourism domain. Moreover, this consortium is working on Entertainment and Agriculture domain. We have
used Punjabi annotated corpus submitted by this consortium to TDIL which has further been evaluated from various
linguists. The corpus has been tagged using different 35 standard tags of Punjabi, corpus includes 49319 sentences and
around 63000 unique words. Following is the analysis of this annotated corpus in the tables below:
Table III: Most Frequent Word-Tag pair
S.No. Words POS Freq
1 | RD_PUNC 49319
2 (Hai) V_VAUX 35737
3 ਦ(Dē) PSP 31256
4 , RD_PUNC 21113
5 ਅਤ(Atē) CC_CCD 16358
6 ਦੀ(Dī) PSP 16161
7 ਵਿੱ ਚ(Vica) PSP 15542
8 ਦਾ(Dā) PSP 14886
9 ਨੂੂੰ (Nū) PSP 12590
10 ਵਚ(Vica) PSP 12096
Chart 1: Percentage of Occurrence of Tags in Training Corpus
[Bar chart omitted in this transcript: per-tag percentage of occurrence in the training corpus for all 35 tags, on a 0-30% scale; N_NN and PSP show the highest percentages.]
Chart 1 shows the percentage of each tag in the corpus. As can be seen, N_NN and PSP are the most frequently used tags. During this analysis, we also found various ambiguous words along with their frequencies. These ambiguous words helped us create linguistic rules to resolve ambiguities. Some ambiguous words are shown in Table IV.
Table IV: Ambiguous Words
S.No. Word POS Freq
1 ੀਰਾ(Hīrā) N_NN 4
2 ੀਰਾ(Hīrā) N_NNP 2
3 ੀ(Hī) RP_RPD 4497
4 ੀ(Hī) QT_QTF 6
5 ੀ(Hī) V_VAUX 3
6 ੀ(Hī) N_NN 2
7 ੀ(Hī) V_VM 2
Table V shows the most frequent words in the corpus. Punctuation marks and symbols are not considered here; otherwise the ' | ' punctuation mark would be the most frequent token, occurring 49319 times.
Table V: Most Frequent Words
S.No. Word Freq
1 (Hai) 37768
2 ਦ(Dē) 49296
3 ਅਤ(Atē) 20488
4 ਦੀ(Dī) 33332
5 ਵਿੱ ਚ(Vica) 16952
6 ਦਾ(Dā) 44096
7 ਨੂੂੰ (Nū) 16848
8 ਵਚ(Vica) 28912
9 ਨ(Hana) 11492
The purpose of this analysis was to create tagging rules for unknown and ambiguous words. We have used all these rules in the HMM and Rule Based taggers to tag unknown words that are not part of the training data.
VIII. RULE BASED TAGGER
The Rule Based system uses a large gazetteer list and various linguistic rules to tag input words. We have developed linguistic rules to tag the input text and a large gazetteer list of word-tag mappings. This list contains 73713 unique words along with their tags. From the analysis of the annotated corpus, we identified various ambiguous words that can belong to different word classes and created an ambiguous word list containing 20214 unique words; ambiguous words thus make up 27% of all unique words. To remove ambiguity and assign the correct tag to a word, we have developed various rules. For example, to resolve the ambiguity of the PR_PRP tag, we have used the following rules to retag it as DM_DMD:
CW="PR_PRP" AND NW="PR_PRP" THEN CW="DM_DMD"
CW="PR_PRP" AND NW="N_NN" THEN CW="DM_DMD"
Here CW, PW and NW denote the Current Word, Previous Word and Next Word respectively. Analysis of the corpus shows that this rule is always valid for disambiguating the PR_PRP tag. Figure 1 shows the Rule-Based POS tagger.
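As an illustrative sketch only (not the paper's implementation; the representation is ours), such CW/NW context rules can be expressed as pattern pairs applied left to right over an initial tag assignment:

```python
# Each rule: (current word's tag, next word's tag) -> corrected current tag.
# The two PR_PRP -> DM_DMD disambiguation rules from the text:
CONTEXT_RULES = [
    (("PR_PRP", "PR_PRP"), "DM_DMD"),
    (("PR_PRP", "N_NN"), "DM_DMD"),
]

def apply_context_rules(tags, rules=CONTEXT_RULES):
    """Rewrite each tag using the tag of the next word (CW/NW patterns)."""
    out = list(tags)
    for i in range(len(out) - 1):  # the last word has no NW context
        for (cw, nw), new_tag in rules:
            if out[i] == cw and out[i + 1] == nw:
                out[i] = new_tag
                break
    return out
```

For instance, the sequence PR_PRP N_NN PSP is rewritten to DM_DMD N_NN PSP by the second rule.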
Fig 1: Rule Based System
Firstly, the system looks up the current word in the word-tag mapping lexicon. This lexicon contains unique, unambiguous words; if the word is found in the lexicon, we simply tag it with its POS. Otherwise, the system checks the word against the ambiguous word list and applies various rules to resolve its ambiguity.
To tag unknown words that are not part of our lexicon, we have developed many additional rules. These rules are also used to resolve the ambiguity of a given word-tag token. The following rules have been used to tag unknown and ambiguous words.
Rule 1: if PW="N_NN" and NW="N_NN" then CW="PSP"
Exp: ਉਨਹ ਾਂ\DM_DMD ਦੀਆ\ਂPSP ਅਿੱਖਾਂ\N_NN ਵਚ\PSP ੂੰ ਝੂ\N_NN ਨ\V_VAUX { Unhāṁ\DM_DMD dī'āṁ\PSP akhāṁ\N_NN vica\PSP hajhū\N_NN}
English: Tears in their eyes.
Rule 2: if PW=" V_VM_VF" and NW=" RD_PUNC" then CW=" V_VAUX"
Exp: ਕਾਰਨ\N_NN ਫਣ\V_VM ਸ਼ਕਦਾ\V_VM_VF \V_VAUX |\RD_PUNC { Kārana\N_NN baṇa\V_VM sakadā\V_VM_VF hai\V_VAUX |\RD_PUNC }
English: This may be the cause of.
Rule 3: if PW=" PSP" and NW="N_NN" then CW=" PSP"
Exp: ਉਸ਼\PR_PRP ਦੀ\PSP ਘਰ\N_NN ਾਲੀ\PSP {Usa\PR_PRP dī\PSP ghara\N_NN vālī\PSP}
English: His wife.
Rule 4: if PW="PSP" and NW="N_NN" then CW="JJ"
Exp: ਸ਼ਰਕਾਰ\N_NN ਦੀਆ\ਂPSP ਅਵਭ\JJ ਰਾਤੀਆ\ਂN_NN { Sarakāra\N_NN dī'āṁ\PSP ahima\JJ prāpatī'āṁ\N_NN }
English: Main achievement of government.
Rule 5: if PW="CC_CCD" and NW="PSP" then CW="N_NN"
Exp: ਭਸ਼ੂੜ\N_NN ਅਤ\CC_CCD ਦੂੰਦਾਂ\N_NN ਦੀਆ\ਂPSP ਿੱਡੀਆ\ਂN_NN { Masūṛē\N_NN atē\CC_CCD dadāṁ\N_NN dī'āṁ\PSP haḍī'āṁ\N_NN}
English: Gums and teeth bones.
Rule 6: if NW="PSP" and NNW="N_NN" then CW="N_NN"
Exp: ਫੁਤ\QT_QTF ਲਕ\N_NN ਦੂੰਦਾਂ\N_NN ਦੀ\PSP ਦਖਬਾਲ\N_NN ਵਚ\PSP {Bahutē\QT_QTF lōka\N_NN dadāṁ\N_NN dī\PSP dēkhabhāla\N_NN vica\PSP}
English: Caring of teeth.
Rule 7: if NW="PSP" and NNW="JJ" then CW="N_NN"
Exp: ਦੂੰਦਾਂ\N_NN ਵਚ\PSP ਠੂੰ ਢਾ\JJ -\RD_PUNC ਗਰਭ\JJ ਾਣੀ\N_NN
{ Dadāṁ\N_NN vica\PSP ṭhaḍhā\JJ -\RD_PUNC garama\JJ pāṇī\N_NN}
English: Sensitivity in teeth.
Rule 8: if PW="N_NN" and NW="PSP" then CW="JJ"
Exp: ਰਗਾਂ\N_NN ਦਾ\PSP ਭੂਲ\JJ ਕਾਰਨ\N_NN ਇਕ\QT_QTC { Rōgāṁ\N_NN dā\PSP mūla\JJ kārana\N_NN ika\QT_QTC}
English: Main cause of disease.
After adding all these rules, the system performed well on unknown and ambiguous words, but it still left many words untagged or chose wrong tags. The main problem was with proper nouns (N_NNP): names of locations and persons. Therefore, we collected Punjabi person names along with their surnames and created a list of popular cities and towns. Using this proper name list, the system is able to tag proper noun words correctly. We have also developed rules to tag location and person names which are not part of the dictionary. These Named Entity Recognition rules also help us resolve ambiguity in person names; for example, the word "ਲਾਲ" (lal) (meaning: Red) is ambiguous: it can be a common noun or a proper noun. Table VI shows information about the collection of Named Entities, and Table VII shows some ambiguous NE words and their candidate POS tags found in the corpus.
Table VI. Proper Name Collection
S.No. Proper Noun Tag Total
1 Person Names 3012
2 Locations 121
3 Month 12
4 Days 9
5 Person Last name 150
6 Middle Name 10
Table VII. Ambiguous Proper Noun
S.No. Word POS_1 POS_2 POS_3
1 ਲਾਲ(Lāl) N_NNP N_NN JJ
2 ੀਰਾ(Hīrā) N_NNP N_NN
3 ਤਾਰਾ(Tārā) N_NNP N_NN
4 ਲਾਬ(Lābha) N_NNP N_NN JJ
Alongside this, we have used some morphological features. Suffix stripping is used to determine the POS tag of OOV (out of vocabulary) words. To tag unknown locations, we created various rules based on the suffix of the word. In Punjab, as well as in other states of India [9], most location names end with suffixes such as "ੁਰ"(pur) and "ਫਾਦ"(abad). Therefore, we have used these clues to tag such words as proper nouns.
The following rules help to tag person names with the help of last names.
Rule 1: if CW="Last Name" then PW="N_NNP"
Exp: ਗੁਜਰਾਤ\N_NNP ਦ\PSP ਭੁਿੱ ਖ\JJ ਭੂੰਤਰੀ\N_NNP ਨਵਰੂੰਦਰ\N_NNP ਭਦੀ\N_NNP ਨੇ { Gujarāta\N_NNP dē\PSP mukha\JJ matarī\N_NNP naridara\N_NNP mōdī\N_NNP nē }
English: Narinder Modi Chief Minister of Gujarat.
Rule 2: if CW="Middle Name" then PW="N_NNP" AND NW=N_NNP
Exp: ਰਣਫੀਰ\N_NNP ਲਾਲ\N_NNP ਗੁਰਦਾਸ਼ੁਰ\N_NNP { Raṇabīra\N_NNP lāla\N_NNP guradāsapura\N_NNP }
English: Ranbir lal Gurdaspur
The following rules, based on the suffix of the word, tag locations:
Rule 1: if CW="Ends with ੁਰ" then CW="N_NNP"
Exp: ਭਣੀੁਰ\N_NNP ਦ\PSP ਲਕ\N_NN { Maṇīpura\N_NNP dē\PSP lōka\N_NN }
English: People of Manipur.
Rule 2: if CW="Ends with ੁਰਾ" then CW="N_NNP"
Exp: ਉ\PR_PRP ਵਕਵਨੁਰਾ\N_NNP ਵਚ\PSP ਰਵਦਾ\V_VM_VF \V_VAUX ।\RD_PUNC
{Uha\PR_PRP kiśanapurā\N_NNP vica\PSP rahidā\V_VM_VF hai\V_VAUX।\RD_PUNC} English: He lives in Kishanpura.
Rule 3 if CW="Ends with ਫਾਦ" then CW="N_NNP"
Exp: ਉ\PR_PRP ਅਵਭਦਾਫਾਦ\N_NNP ਵਗਆ\V_VM_VF ।\RD_PUNC
{Uha\PR_PRP ahimadābāda\N_NNP gi'ā\V_VM_VF।\RD_PUNC} English: He went to Ahmadabad.
OOV words ending with "ਾਂਗ"(Vāṅgē), "ਾਂਗਾ"(Vāṅgā) or "ਾਂਗੀ"(Vāṅgī) are tagged as Infinitive Verbs, for example: ਲਾਂਗ(Lavāṅgē) ("will take"), ਲਾਂਗੀ(Lavāṅgī) ("will take", feminine), ਦਾਂਗਾ(Dēvāṅgā) ("will give"). Words ending with "ਾਕ"(Vākē) are tagged as Non-finite Verbs, for example: ਫਣਾਕ(Baṇavākē).
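The suffix heuristics above can be sketched as an ordered suffix table (an illustration, not the paper's code: the suffixes are shown in Latin transliteration rather than Gurmukhi, and the N_NN fallback for unmatched words is our assumption):

```python
# Ordered (suffix, tag) pairs; longer/more specific suffixes are safe here
# because none is a suffix of another entry that precedes it.
SUFFIX_TAGS = [
    ("vange", "V_VM_VINF"),  # "will take/give ..." -> infinitive verb
    ("vangi", "V_VM_VINF"),
    ("vanga", "V_VM_VINF"),
    ("vake", "V_VM_VNF"),    # non-finite verb ending
    ("abad", "N_NNP"),       # location names: Ahmadabad ...
    ("pur", "N_NNP"),        # location names: Manipur, Gurdaspur ...
    ("pura", "N_NNP"),       # location names: Kishanpura ...
]

def tag_oov(word, default="N_NN"):
    """Guess a POS tag for an out-of-vocabulary word from its suffix."""
    for suffix, tag in SUFFIX_TAGS:
        if word.lower().endswith(suffix):
            return tag
    return default
```

For example, an unseen "manipur" or "ahmadabad" would be tagged N_NNP, while "lavange" would be tagged V_VM_VINF.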
IX. HMM BASED TAGGER
Along with the Rule Based system, we have developed a Trigram HMM based tagger, the statistical sequence model described in Section 4, which infers a hidden tag sequence from an observable word sequence. HMMs have a rich history in sequence data modeling [10] and have been used very successfully by many researchers [5,12,15]. We developed the HMM system using C#.NET. To train the system we used the annotated corpus provided by TDIL, which has 49 thousand sentences. Around 44387 sentences were used to train the HMM system, which is 90% of the total available corpus.
Fig 2: HMM based System
The Tokenization and Normalization module cleans the text and tokenizes it into sentences and words for further processing. The Viterbi algorithm is used for decoding. To handle new and unknown words that are not part of the training data, we used the same rules developed for the rule-based tagger; the suffix of a word is used as a feature to tag a token as an NE. With the help of these rules, we are able to cover all the unknown words.
X. EVALUATION
Evaluation is done on 10% of the corpus data. The test data includes 4932 sentences from the Health and Tourism domains. Besides this, we have also collected test data related to Health and Tourism news from the online news website http://www.ajitjalandhar.com; this data contains 250 sentences. Details of the testing data are given in Table VIII:
Table VIII. Detail of testing data
Test Set Domain No. of Sentences No. of Words
Set 1 Health Related Data 2098 36023
Set 2 Tourism Related Data 3084 40144
Total 5182 76167
We have used 150 rules to tag unknown words and to resolve ambiguity. During the testing phase, we found that some of the rules failed on certain sentences. We analyzed which rules were used frequently and which were used rarely; the accuracy and the rules applied to tag words may vary with the domain of the test data. In the testing data, 699 words were unknown or new; these were tagged using assorted rules. An analysis of the rules used throughout the testing phase of the HMM and Rule Based taggers to tag unknown words is shown in the following chart.
Chart 2. Rules used in Testing
Rule 1 is the most frequently used rule for tagging unknown words, defined as:
if NW="PSP" then CW="N_NN"
Similarly, Rule 2 is the next most used rule:
if NW="N_NN" then CW="PSP"
In total, 29 of the 150 rules were used during testing. Rule details are presented in Appendix 2. Sometimes the system failed to tag a given word with the correct POS, for example:
Punjabi Sentence: ਭੈਂ ਨੀਂ ਵਦਿੱਲੀ ਵਗਆ| (Maiṁ navīṁ dilī gi'ā) English: I went to New Delhi.
Output: ਭੈਂ\PR_PRP ਨੀਂ\JJ ਵਦਿੱਲੀ\N_NNP ਵਗਆ\V_VM_VF ।\RD_PUNC
Here "New Delhi" is a proper noun, but the system tagged the word "New" as JJ (adjective), which is also a valid candidate tag for that word.
The accuracy of the system has been calculated using the standard evaluation metrics Precision, Recall and F1-Score. When the system assigns the correct tag to a given word, it counts as a TP (True Positive); when it assigns an incorrect tag, as an FP (False Positive); and when it does not assign any tag to a given word, as an FN (False Negative). The F1-Score combines both precision and recall into a single measure of the test's accuracy.
Recall R = TP / (TP + FN)   (5)

Precision P = TP / (TP + FP)   (6)

F1-Score = 2 · (R · P) / (R + P)   (7)
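Equations (5)-(7) can be checked with a few lines (a generic sketch; the counts in the test below are illustrative, not the paper's):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Eqs. (5)-(7) from true/false positive and false negative counts."""
    recall = tp / (tp + fn)        # Eq. (5)
    precision = tp / (tp + fp)     # Eq. (6)
    f1 = 2 * recall * precision / (recall + precision)  # Eq. (7)
    return precision, recall, f1

def f1_from_pr(precision, recall):
    """Eq. (7) directly from precision and recall."""
    return 2 * recall * precision / (recall + precision)
```

As a sanity check against Table X, the Set 2 HMM row (P = 0.921, R = 0.945) gives F1 = 2 · 0.921 · 0.945 / (0.921 + 0.945) ≈ 0.933, matching the reported score.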
[Bar chart for Chart 2 omitted in this transcript: the number of times each of the 29 rules fired on the test data, with Rule 1 (179 uses) the most frequent.]
Table IX: Accuracy using Rule Based System
Test Set Precision Recall F1-Score
Set 1 0.867 0.984 0.922
Set 2 0.92 0.929 0.924
Table X: Accuracy using HMM based System
Test Set Precision Recall F1-Score
Set 1 0.869 0.993 0.927
Set 2 0.921 0.945 0.933
Chart 3. Accuracy of the System
[Bar chart omitted: F1-Scores (in %) per test set. Rule Based: Set 1 92.2, Set 2 92.4; HMM: Set 1 92.7, Set 2 93.3.]
Chart 4. Accuracy of the Rules
[Bar chart omitted: per-rule tagging accuracy for the 29 rules used in testing, ranging from roughly 90% to 100%.]
XI. CONCLUSION AND FUTURE WORK
We have developed a POS tagger for Punjabi using two different approaches. In the Rule Based approach, various rules were developed to tag unknown words and to resolve ambiguity. The HMM model is a statistical model which is much simpler and faster than other statistical models at finding the correct tag of a given word. The results show that 49 thousand sentences of annotated text are quite enough to train the HMM system. We conclude from the evaluation results that the HMM system performed better than the Rule Based system. We achieved a maximum F1-Score of 0.933 on the Tourism domain data and 0.927 on the Health domain data; the results may vary for other domains. To the best of our knowledge, this is the highest accuracy achieved by any POS tagger for Punjabi developed so far. In future work, we will try to incorporate more morphological features to handle OOV words, and to cover other domains such as science, agriculture and entertainment we will try to collect annotated data for these domains.
ACKNOWLEDGEMENT
This research work is an effort toward the automation objective of the project funded to the author Vishal Goyal by TDIL, DeitY, MoC&IT, GoI, New Delhi, India. The project, titled "Indian Languages Corpora Initiative (ILCI) - Phase II", is an initiative under a consortium of 17 universities/institutes led by JNU, New Delhi, India for manually developing annotated corpora for the Agriculture and Entertainment domains. The annotation task for the Health and Tourism domains was already completed in Phase I of this project. We are thankful to TDIL for supporting us and providing the annotated corpus of the Health and Tourism domains, which is used for training the tagger.
REFERENCES
[1] Dinesh Kumar and Gurpreet Singh Josan, 2010. Part of Speech Taggers for Morphologically Rich Indian Languages: A Survey , International Journal of Computer Applications (0975 – 8887) Volume 6– No.5, 1-7
[2] Eric Brill, 1992. A Simple Rule-Based Part of Speech Tagger, in HLT '91 Proceedings of the workshop on Speech and Natural Language, 112-116
[3] Mandeep Singh Gill, Gurpreet Singh Lehal and Shiv Sharma Joshi, 2008. Part of Speech Tag Set for Grammar Checking of Punjabi, Apeejay Journal of Management and Technology, Vol. 3, No. 2, 146-152
[4] Mandeep Singh Gill and Gurpreet Singh Lehal, 2008. A Grammar Checking System for Punjabi, Coling 2008: Companion volume – Posters and Demonstrations, 149–152
[5] Manish Shrivastava, Pushpak Bhattacharyya, 2008. Hindi POS Tagger Using Naive Stemming : Harnessing Morphological Information Without Extensive Linguistic Knowledge, In Proceedings of ICON-2008: 6th
International Conference on Natural Language Processing, Macmillan Publishers, India.
[6] Navneet Garg, Vishal Goyal and Gurpeet Singh Lehal, 2012. Rule Based Hindi Part of Speech Tagger, Proceedings of COLING 2012: Demonstration Papers, 163–174
[7] Sapna Kanwar, Mr Ravishankar and Sanjeev Kumar Sharma, 2011. POS Tagging of Punjabi language using Hidden Markov Model, Research Cell: An International Journal of Engineering Sciences ISSN: 2229-6913
Issue July 2011, Vol. 1. 98-106
[8] Sanjeev Kumar Sharma and Gurpreet Singh Lehal, 2011. Using Hidden Markov Model to improve the accuracy of Punjabi POS Tagger, Computer Science and Automation Engineering (CSAE), 2011 IEEE International
Conference, Vol.2. 697-701
[9] UmrinderPal Singh, Vishal Goyal and Gurpreet Singh Lehal, 2012. Named Entity Recognition System for Urdu, In Proceedings of COLING 2012 Technical Papers, 2507–2518
[10] L. Rabiner, 1989. A Tutorial on Hidden Markov Model and Selected Application in Speech Recognition, in Proceeding on the IEEE, Vol. 77, Issue. 2, 257-286
[11] S. Singh , K. Gupta , M. Shrivastava and P. Bhattacharya, 2006. Morphological Richness offsets Resources Demand- Experiences in Construction a POS Tagger for Hindi, In proceeding of COLING 2006, 779-786
[12] Manish Shrivastava and Pushpak Bhattacharyya, 2008. Hindi POS Tagger Using Naive Stemming: Harnessing Morphological Information Without Extensive Linguistic Knowledge, In proceeding of ICON 08, Pune, India,
December, 2008
[13] A. Bharati, V. Chaitanya, R. Sangal, 1995. Natural Language Processing : A Paninian Perspective . Prentice Hall India
[14] Sandipan Dandapat, Sudeshna Sakar and Anupam Basu, 2004. A hybrid model for part-of-speech tagging and its application to Bengali, in proceeding of International Conference on computation intelligence, 169-172
[15] Thorsten Brants, 2000. TnT -- A Statistical Part-of-Speech Tagger, In proceeding of the 6th Applied NLP Conference, 224-231
[16] Anne Abeille,´ Nicolas Barrier, 2003. Building a Treebank for French, Text, Speech and Language Technology, Vol 20. 165-187
[17] Shambhavi. B.R, Dr. Ramakanth Kumar P, 2010. Current State of Art POS Tagging for Indian Languages - A Study, International Journal of Computer Science and Technology, Vol 1. 250-260
Appendix 1: Tag Set Proposed by TDIL, DeitY, GoI, New Delhi, India
No. Tag Description Example
1 N_NN Common Noun ਘਰ ਵਕਤਾਫ (Ghara kitāba)
2 N_NNP Proper Noun ਵਰੂੰਦਰ, ਵਦਿੱਲੀ, ਤਾਜਵਭਲ (Harivadara, dilī, tājamihala)
3 N_NST Noun (locative) ਉਤ ਥਿੱਲ ਅਿੱਗ ਵਿੱ ਛ (Utē thalē agē pichē)
4 PR_PRP Personal Pronoun ਭ ਤੁੂੰ ਉ (Mai tu uha)
5 PR_PRF Reflexive Pronoun ਆਣਾ ਆ ਖੁਦ (Āpaṇā āpa khuda)
6 PR_PRL Relative Pronoun ਜ, ਵਜਸ਼ ਵਜਡ਼, ਜਦੋਂ (Jō, jisa jihaṛa, jadōṁ)
7 PR_PRC Reciprocal Pronoun ਆਸ਼ (Āpasa)
8 PR_PRQ Wh-word Pronoun ਕਣ ਕਦੋਂ ਵਕਿੱ ਥ (Kauṇa kadōṁ kithē)
9 PR_PRI Indefinite Pronoun ਕਈ, ਵਕਸ਼ (Kō'ī, kisa)
10 DM_DMD Deictic Demonstrative ਇ ਉ (Iha uha)
11 DM_DMR Relative Demonstrative ਜ ਵਜਸ਼ (Jō jisa)
12 DM_DMQ Wh-word Demonstrative ਕਣ (Kauṇa)
13 DM_DMI Indefinite Demonstrative ਕਈ ਵਕਸ਼ (Kō'ī kisa)
14 V_VM Main Verb ਆਇਆ ਜਾ ਕਰਦਾ ਭਾਰਾਂਗਾ ਵਰੂੰਦਾ (Ā'i'ā jā karadā mārāṅgā rihadā)
15 V_VM_VNF Non-finite Verb ਜਾਵਦਆਂ, ਆਵਦਆਂ, ਕਵਰਦਆਂ, ਖਾਕ, ਜਾਕ (Jādi'āṁ, ādi'āṁ, karida'āṁ, khākē, jākē)
16 V_VM_VINF Infinitive Verb ਵਗਆਂ ਆਇਆਂ ਵਕਰਆਂ (Gi'āṁ ā'i'āṁ kira'āṁ)
17 V_VM_VNG Gerund Verb ਜਾਣੋਂ ਖਾਣੋਂ ੀਣੋਂ ਭਰਨੋਂ (Jāṇōṁ khāṇōṁ pīṇōṁ maranōṁ)
18 V_VAUX Auxiliary Verb , ਸ਼ੀ, ਵਸ਼ਕਆ ਇਆ (Hai, sī, sika'ā hō'i'ā)
19 JJ Adjective ਸ਼ਣਾ ਚੂੰਗਾ ਭਾਡ਼ਾ ਕਾਲਾ (Sōhaṇā cagā māṛā kālā)
20 RB Adverb ਲ਼ੀ ਕਾਲੀ (Hauḻī kāhalī)
21 PSP Postposition ਨੇ ਨੂੂੰ ਤ ਨਾਲ (Nē nū tō nāla)
22 CC_CCD Co-ordinator ਅਤ ਜਾਂ (Atē jāṁ)
23 CC_CCS Subordinator ਵਕਉਵਕ ਵਕ ਜ ਤਾਂ (Ki'uki ki jō tāṁ)
24 RP_RPD Default Particle ੀ ਤਾਂ ੀ (Vī tāṁ hī)
25 RP_INJ Interjection Particle ਉਏ ਅਵਡਆ ਨੀ ਜਨਾਫ (U'ē aḍi'ā nī janāba)
26 RP_INTF Intensifier Particle ਫੁਤ ਫਡਾ (Bahuta baḍā)
27 RP_NEG Negation ਨੀਂ ਨਾ ਵਫਨਾਂ ਗਰ (Nahīṁ nā bināṁ vagaira)
28 QT_QTF General Quantifier ਥਡ਼ਾ ਫੁਤਾ ਕਾਪੀ ਕੁਝ ਇਿੱਕ ਵਲਾ (Thōṛā bahutā kāphī kujha ika pihalā)
29 QT_QTC Cardinals ਇਿੱਕ ਦ ਵਤੂੰ ਨ (Ika dō tina)
30 QT_QTO Ordinals ਵਲਾ ਦੂਜਾ (Pihalā dūjā)
31 RD_RDF Foreign Word (Residual)
32 RD_SYM Symbol (Residual) $, &, *, (, )
33 RD_PUNC Punctuation ., : ;
34 RD_UNK Unknown
35 RD_ECH Echo-words ਾਣੀ-ਧਾਣੀ, ਚਾ-ਚੂ (Pāṇī-dhāṇī, cāha-cūha)
Appendix 2: Rules Used by the System
Rules to tag Unknown Words (CW = tag of the current word, PW = tag of the previous word, NW = tag of the next word):
Rule 1: if NW="PSP" then CW="N_NN"
Rule 2: if NW="N_NN" then CW="PSP"
Rule 18: if PW="PSP" and NW="PSP" then CW="N_NN"
Rule 19: if PW="N_NN" and NW="PSP" then CW="N_NN"
Rule 20: if PW="JJ" and NW="PSP" then CW="N_NN"
Rule 22: if PW="PSP" and NW="N_NN" then CW="N_NN"
Rule 23: if PW="RD_PUNC" and NW="PSP" then CW="N_NN"
Rule 26: if PW="V_VM_VF" and NW="V_VAUX" then CW="V_VM_VF"
Rule 27: if PW="N_NNP" and NW="PSP" then CW="N_NNP"
Rule 28: if PW="RD_PUNC" and NW="N_NN" then CW="N_NN"
Rule 29: if PW="V_VM_VNF" and NW="N_NN" then CW="PSP"
Rule 31: if PW="CC_CCD" and NW="PSP" then CW="N_NN"
Rule 34: if PW="N_NN" and NW="RB" then CW="PSP"
Rule 35: if PW="JJ" and NW="N_NN" then CW="N_NN"
Rule 36: if PW="JJ" and NW="RD_PUNC" then CW="V_VAUX"
Rule 38: if PW="PSP" and NW="N_NN" then CW="JJ"
Rule 142: if PW="DM_DMD" and NW="QT_QTC" then CW="PSP"
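Rules of this form amount to a lookup over the tags of the neighbouring words. The sketch below shows one minimal way to encode and apply a small subset of them; the names `UNKNOWN_WORD_RULES` and `tag_unknown`, the rule ordering, and the fallback tag are our own illustrative assumptions, not taken from the paper.

```python
# Context rules for tagging an unknown current word, encoded as
# (PW pattern, NW pattern, tag for CW). None matches any context tag.
# Only a few of the rules listed above are included for illustration.
UNKNOWN_WORD_RULES = [
    (None,   "PSP",  "N_NN"),   # Rule 1
    (None,   "N_NN", "PSP"),    # Rule 2
    ("PSP",  "PSP",  "N_NN"),   # Rule 18
    ("N_NN", "PSP",  "N_NN"),   # Rule 19
    ("JJ",   "N_NN", "N_NN"),   # Rule 35
]

def tag_unknown(prev_tag, next_tag, default="N_NN"):
    """Return a tag for an unknown word from its left/right context.

    Rules constraining both PW and NW are tried before the
    single-context (NW-only) rules; the first match wins.
    """
    for pw, nw, tag in sorted(UNKNOWN_WORD_RULES,
                              key=lambda r: r[0] is None):
        if (pw is None or pw == prev_tag) and (nw is None or nw == next_tag):
            return tag
    return default

print(tag_unknown("PSP", "PSP"))   # Rule 18 fires -> N_NN
print(tag_unknown(None, "N_NN"))   # Rule 2 fires -> PSP
```

Sorting on `r[0] is None` is just a compact way to put the more specific two-context rules first; a real system would fix the priority order explicitly.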
Rules to resolve Ambiguity:
Rule 3: if CW="JJ" then NW="N_NN"
Rule 5: if CW="PR_PRL" then NW="N_NN"
Rule 7: if CW="PR_PRF" then NW="N_NN"
Rule 8: if CW="PR_PRP" then NW="N_NN"
Rule 9: if CW="PSP" then PW="N_NN"
Rule 10: if CW="V_VM" then PW="N_NN"
Rule 11: if CW="N_NN" then PW="N_NN" or NW="N_NN"
Rule 12: if CW="PR_PRP" and NW="PR_PRP" then CW="DM_DMD"
Rule 13: if CW="PR_PRP" and NW="N_NN" then CW="DM_DMD"
Rule 14: if CW="Last Name" then PW="N_NNP"
Rule 15: if CW="Middle Name" then PW="N_NNP" and NW="N_NNP"
Rule 17: if CW="DM_DMD" and NW="PSP" then CW="PR_PRP"
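Unlike the unknown-word rules, these ambiguity rules propagate a confidently assigned current tag outward to an undecided neighbour (e.g. a word after an adjective becomes a common noun, a word before a postposition becomes a common noun). A minimal sketch of that propagation over a tag sequence, with `None` standing for an unresolved tag; the function name and the use of `None` are our own assumptions:

```python
def resolve_ambiguity(tags):
    """Fill unresolved (None) tags from confidently tagged neighbours,
    in the spirit of the context rules listed above."""
    # CW tag -> tag forced on the NEXT word (e.g. Rules 3, 5, 7, 8)
    next_rules = {"JJ": "N_NN", "PR_PRL": "N_NN", "PR_PRF": "N_NN"}
    # CW tag -> tag forced on the PREVIOUS word (e.g. Rules 9, 10)
    prev_rules = {"PSP": "N_NN", "V_VM": "N_NN"}
    tags = list(tags)
    for i, t in enumerate(tags):
        if t in next_rules and i + 1 < len(tags) and tags[i + 1] is None:
            tags[i + 1] = next_rules[t]
        if t in prev_rules and i > 0 and tags[i - 1] is None:
            tags[i - 1] = prev_rules[t]
    return tags

print(resolve_ambiguity([None, "PSP", "JJ", None]))
# -> ['N_NN', 'PSP', 'JJ', 'N_NN']
```

A single left-to-right pass suffices here because each rule only writes into slots that are still `None`; rules that overwrite an already assigned tag (e.g. Rules 12, 13, 17) would need a second, rewriting pass.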