sanskrit tag-sets and part-of-speech tagging methods - a survey

6
International Journal of Innovative and Emerging Research in Engineering Volume 2, Issue 1, 2015 46 Available online at www.ijiere.com International Journal of Innovative and Emerging Research in Engineering e-ISSN: 2394 - 3343 p-ISSN: 2394 - 5394 Sanskrit Tag-sets and Part-Of-Speech Tagging Methods - A Survey Sulabh Bhatt a , Krunal Parmar a and Miral Patel b a Department Of Computer Science, Gujarat University, Ahmedabad, India b Department Of I.T., G H Patel College of Engineering and Technology, Vallabh Vidhyanagar, India ABSTRACT: This paper presents some great features of Sanskrit- one of the oldest language of the world from Natural Language Processing (NLP) perspective. Part Of Speech (POS) tagging is the most initial step for developing any NLP application. POS tagger assigns a tag like noun, verb, pronoun, adjective or other category that best suits to the word and also the context of the sentence to which it belongs. Moreover, this paper also provides brief introduction to various approaches and the working of two most famous statistical methods used for POS tagging: Hidden Markov Model (HMM) and Conditional Random Fields (CRF). Keywords: NLP, POS tagger, tagset, Sanskrit, Stochastic Tagging, n-gram, HMM, CRF I. INTRODUCTION Natural Language Processing (NLP): The aim of Natural Language Processing is to make the computers understand how humans naturally speak, write, and communicate. The main goal of NLP is to bridge the human-computer gap by reducing extra efforts needed by a human to communicate with a computer. Major applications of NLP are Machine translation, Named entity recognition (NER), Part of speech (POS) tagging, Question answering, etc. Most of the research have been done for English language, but English has ambiguous grammar so we need a strong and unambiguous grammar which is provided by maharishi Panini in the form of Ashtadhyayi [4]. Sanskrit grammar is the most simplest and ancient grammar in the world. Unlike other languages of the world, there has never been any kind of change in the structure of Sanskrit grammar. One of the most important and initial activities in processing a natural language is POS tagging. A POS Tagger assigns a POS tag to each word of a sentence. The design of the annotated Sanskrit corpus is rapidly going on in India [5]. Natural languages are basically very complicated and Sanskrit also falls in the same group. In the development of a POS Tagger, the quality of the POS annotation in a corpus is very much important. A tagged corpus can be used for many activities. A good tagger can serve as a base for many NLP related activities. It can be also used for linguistic research. Moreover, it can be used further for stemming in information retrieval (IR), since knowing a word’s part of speech can help in telling us which morphological axes it can take. Automatic POS taggers can be used for developing automatic word-sense disambiguating algorithms; POS tagging can also be used for developing text to speech related activities. II. DISCUSSION OF SANSKRIT TAG-SETS Based on some general principles for tagset design strategy, many POS tagsets have been developed by different organizations. This section provides a survey of some tagsets developed especially for Sanskrit or for Indian natural languages and then their evaluation with respect to Sanskrit. Tags are shown in figures in their respective sections. A. JNU Sanskrit tagset(JPOS) This POS tagset is developed by Dr. R. Chandrashekar as a part of his Ph.D. thesis ‘Part -of-Speech Tagging for Sanskrit’ (2007) [19]. This tagset very strictly follows Paninian grammatical rules. This tagset is fine-grained and captures linguistic features as per the traditional grammar of Sanskrit language. This tagset can be classified as per the morphological structure of Sanskrit words. This tagset contains three kinds of tags. Word class main tags, feature sub-tags and punctuation tags [5]. This tagset is having 134 tags. 65 word class tags, 43 feature sub-tags and 25 punctuation tags and one tag AJ to tag unknown words [18].

Upload: sujal1310

Post on 20-Dec-2015

28 views

Category:

Documents


5 download

DESCRIPTION

This paper presents some great features of Sanskrit- one of the oldest language of the world from Natural Language Processing (NLP) perspective. Part Of Speech (POS) tagging is the most initial step for developing any NLP application. POS tagger assigns a tag like noun, verb, pronoun, adjective or other category that best suits to the word and also the context of the sentence to which it belongs. Moreover, this paper also provides brief introduction to various approaches and the working of two most famous statistical methods used for POS tagging: Hidden Markov Model (HMM) and Conditional Random Fields (CRF).

TRANSCRIPT

Page 1: Sanskrit Tag-sets and Part-Of-Speech Tagging Methods - A survey

International Journal of Innovative and Emerging Research in Engineering

Volume 2, Issue 1, 2015

46

Available online at www.ijiere.com

International Journal of Innovative and Emerging

Research in Engineering e-ISSN: 2394 - 3343 p-ISSN: 2394 - 5394

Sanskrit Tag-sets and Part-Of-Speech Tagging Methods

- A Survey Sulabh Bhatta, Krunal Parmara and Miral Patelb

aDepartment Of Computer Science, Gujarat University, Ahmedabad, India bDepartment Of I.T., G H Patel College of Engineering and Technology, Vallabh Vidhyanagar, India

ABSTRACT:

This paper presents some great features of Sanskrit- one of the oldest language of the world from Natural

Language Processing (NLP) perspective. Part Of Speech (POS) tagging is the most initial step for developing

any NLP application. POS tagger assigns a tag like noun, verb, pronoun, adjective or other category that

best suits to the word and also the context of the sentence to which it belongs. Moreover, this paper also

provides brief introduction to various approaches and the working of two most famous statistical methods

used for POS tagging: Hidden Markov Model (HMM) and Conditional Random Fields (CRF).

Keywords: NLP, POS tagger, tagset, Sanskrit, Stochastic Tagging, n-gram, HMM, CRF

I. INTRODUCTION

Natural Language Processing (NLP): The aim of Natural Language Processing is to make the computers understand

how humans naturally speak, write, and communicate. The main goal of NLP is to bridge the human-computer gap by

reducing extra efforts needed by a human to communicate with a computer. Major applications of NLP are Machine

translation, Named entity recognition (NER), Part of speech (POS) tagging, Question answering, etc.

Most of the research have been done for English language, but English has ambiguous grammar so we need a

strong and unambiguous grammar which is provided by maharishi Panini in the form of Ashtadhyayi [4].

Sanskrit grammar is the most simplest and ancient grammar in the world. Unlike other languages of the world, there

has never been any kind of change in the structure of Sanskrit grammar.

One of the most important and initial activities in processing a natural language is POS tagging. A POS Tagger

assigns a POS tag to each word of a sentence. The design of the annotated Sanskrit corpus is rapidly going on in India

[5]. Natural languages are basically very complicated and Sanskrit also falls in the same group.

In the development of a POS Tagger, the quality of the POS annotation in a corpus is very much important. A tagged

corpus can be used for many activities. A good tagger can serve as a base for many NLP related activities. It can be

also used for linguistic research. Moreover, it can be used further for stemming in information retrieval (IR), since

knowing a word’s part of speech can help in telling us which morphological affixes it can take.

Automatic POS taggers can be used for developing automatic word-sense disambiguating algorithms; POS tagging

can also be used for developing text to speech related activities.

II. DISCUSSION OF SANSKRIT TAG-SETS

Based on some general principles for tagset design strategy, many POS tagsets have been developed by different

organizations. This section provides a survey of some tagsets developed especially for Sanskrit or for Indian natural

languages and then their evaluation with respect to Sanskrit.

Tags are shown in figures in their respective sections.

A. JNU Sanskrit tagset(JPOS)

This POS tagset is developed by Dr. R. Chandrashekar as a part of his Ph.D. thesis ‘Part-of-Speech Tagging for

Sanskrit’ (2007) [19]. This tagset very strictly follows Paninian grammatical rules. This tagset is fine-grained and

captures linguistic features as per the traditional grammar of Sanskrit language.

This tagset can be classified as per the morphological structure of Sanskrit words. This tagset contains three kinds

of tags. Word class main tags, feature sub-tags and punctuation tags [5].

This tagset is having 134 tags. 65 word class tags, 43 feature sub-tags and 25 punctuation tags and one tag AJ to tag

unknown words [18].

Page 2: Sanskrit Tag-sets and Part-Of-Speech Tagging Methods - A survey

International Journal of Innovative and Emerging Research in Engineering

Volume 2, Issue 1, 2015

47

Figure 1. JNU sanskrit tagset (JPOS)[5]

B. IL-POSTS Sanskrit Tagset

This tagset can be derived from IL-POSTS, which is a standard framework for Indian languages developed by

Microsoft Research India Lab (MSRI) [5].

This tagset encodes data at three levels-Categories, Types, and Attributes. As far as we know, tagsets for four

languages have been derived from this framework; Hindi, Bangla, Tamil and Sanskrit [5].

Figure 2. IL-POSTS Sanskrit Tagset categories and their types [5]

Page 3: Sanskrit Tag-sets and Part-Of-Speech Tagging Methods - A survey

International Journal of Innovative and Emerging Research in Engineering

Volume 2, Issue 1, 2015

48

Figure 3. IL-POSTS sanskrit tagset attributes and their values [5]

C. Sanskrit Consortium Tagset (CPOS)

This tagset is developed for tagging sandhi-free Sanskrit corpora. This tagset is based on traditional Sanskrit

grammatical categorization.

This tagset has 28 tags. This is intuitively a flat tagset and allows the tagging of major categories only. It seems that,

most of the categories have been adapted from the JPOS tagset [5].

Figure 4. Sanskrit Consortium Tagset (CPOS) [5]

III. POS TAGGING TECHNIQUES

Various methods exist for POS Tagging. The tagging models can be classified into two types: Supervised and

Unsupervised. Both of these techniques differ in regarding the degree of automation of the training process and the

tagging process.

The supervised models require a pre-tagged corpus which can be used for training to explore information regarding

the tagset, tag sequence probabilities, word-tag frequencies and/or rules, etc. Various taggers are existing based on

supervised models.

Page 4: Sanskrit Tag-sets and Part-Of-Speech Tagging Methods - A survey

International Journal of Innovative and Emerging Research in Engineering

Volume 2, Issue 1, 2015

49

The unsupervised models do not require previously tagged corpora. Instead, they use advanced computational

techniques to automatically generate tagsets, transformation rules, etc. Using this information, they either generate the

contextual rules needed by rule based or transformation based taggers or calculate the probabilistic information

required by the stochastic POS taggers.

Figure 5. Various models for POS tagging [5]

Both the supervised and unsupervised POS taggers can be further classified into three types:

A. Rule based Tagger

Rule-based taggers use rules, which may be hand-written by experienced linguistic people or derived from a tagged

corpus. Rules can be defined from previous experiences and they help in distinguishing the tag ambiguity.

E.g., Brill tagger (Brill, 1995) [11] is a rule based tagger. It includes lexical rules, used for initialization purpose

and contextual rules, defined to correct the wrong tags.

B. Stochastic Tagger

Stochastic taggers are based on statistics i.e., probability or frequency to tag the input words. The simplest stochastic

taggers solve the issue of ambiguity of words based on the probability that the given word occurs with a particular tag.

In the testing data, the ambiguous instance of a word is assigned the tag that is encountered most frequently in the

training set for the same word. The disadvantage of this method is that it might assign a correct tag to a given word

but it could also yield invalid sequences of tags. One alternative way to the word frequency approach is to calculate

the probability of a given sequence of tags taking place together. This is referred to as n-gram technique, which states

that the best suitable tag to the given word can be determined by the probability that it occurs with the preceding n-1

tags. The stochastic model is based on various models such as n-grams, Maximum Likelihood Estimation, HMM.

1) HMM Based Tagger: HMM-based methods require to evaluate argmax formula. It can be very expensive

because, finding the sequence which maximizes the probability, all possible tag sequences must be checked. Therefore,

a dynamic programming technique known as the Viterbi Algorithm can be used to find optimal tag sequence. Also,

there have been several studies to utilise unsupervised method for training a HMM for POS Tagging. Baum-Welch

algorithm [Baum, 1972] is the most widely known for this, that can be used to train HMM from un-annotated corpus.

A POS tagger based on HMM assigns the best tag to a word by calculating the forward and backward probabilities of

tags along with the sequence provided as an input. The following equation explains this phenomenon [12].

P t w P t t P t t Pw t

Here, P t t is the probability of a current tag given the previous tag and P t t is the probability of the

future tag given the current tag. This captures the transition between the tags. These probabilities are computed using

following equation [12].

Figure 6. Tag transition probability [12]

Each tag transition probability can be computed by calculating the frequency count of two tags seen together in the

corpora divided by the frequency count of the previous tag seen independently in the corpus. This is done because we

know that it is more likely for some tags to precede the other tags.

2) CRF Based Tagger: Charles Sutton et al. (Sutton et al., 2005) formulated CRFs as follows. Let G be a factor

graph over Y. Then p(y|x) is a conditional random field if for any fixed x, the distribution p(y|x) factorizes according

to G. Thus, every conditional distribution p(y|x) is a CRF for some, perhaps trivial, factor graph. If F = {A} is the set

of actors in G, and each factor takes the exponential family form, then the conditional distribution can be written as

[14]:

Page 5: Sanskrit Tag-sets and Part-Of-Speech Tagging Methods - A survey

International Journal of Innovative and Emerging Research in Engineering

Volume 2, Issue 1, 2015

50

Figure 7. CRF working formula [14]

X here is a random variable over data sequences to be labeled, and Y is a random variable over corresponding label

sequences. All components Yi of Y are assumed to range over a finite label alphabet Y. For example, X might range

over natural language sentences and Y range over part-of-speech tagging of those sentences, with Y the set of possible

part-of-speech tags. The random variables X and Y are jointly distributed, but in a discriminative framework we

construct a conditional model p(Y|X) from paired observation and label sequences, and do not explicitly model the

marginal p(X).

CRFs define conditional probability distributions P(Y|X) of label sequences given input sequences. Lafferty et al.

defines the probability of a particular label sequence Y given observation sequence X to be a normalized product of

potential functions each of the form [14]:

exp(Σλ j t j (Y i-1 ,Yi ,X , i )+Σμksk(Yi ,X, i ) )

Where tj(Yi-1,Yi,X,i) is a transition feature function of the entire observation sequence and the labels at positions I

and i-1 in the label sequence; sk(Yi,X,i) is a state feature function of the label at position I and the observation

sequence; and λj and μk are parameters to be estimated from training data.

Fj(Y,X)= Σ f j (Yi-1 ,Yi ,X, i )

Where each fj(Yi-1,Yi,X,i) is either a state function s(Yi-1,Yi,X,i) or a transition function t(Yi-1,Yi,X,i). This

allows the probability of a label sequence Y given an observation sequence X to be written as:

P(Y| X , λ ) = (1 /Z(X ) ) exp(Σλ jFj (Y , X ) )

Where Z(X) is a normalization factor.

C. Neural Tagger

Neural taggers are based on neural networks. They learn the parameters of POS tagger from a training data set. The

performance of Neural Taggers can be better than stochastic taggers.

IV. CONCLUSION

Even though Sanskrit is considered as a mother of all Indo-European languages, very less work is done using

stochastic approach for Sanskrit. So, developing NLP applications for Sanskrit using stochastic methods is still an

open area for researchers.

Development of a highly accurate POS tagger for Indian languages is an active research area of NLP. The obstacle

in developing POS tagger for Indian languages including Sanskrit is the none or less availability of large corpora. The

annotated corpora will help researchers to explore the utilization of statistical methods to enhance the existing models

for data analysis.

For a tagger to function as a practical component in a language processing system, a tagger must be Robust,

Efficient, accurate, tunable and reusable. The accuracy of any NLP application depends on the accuracy of a POS

tagger.

ACKNOWLEDGMENT

We are extremely grateful to Mr. Hardik Joshi for their invaluable suggestions and encouragement.

REFERENCES [1] Tapaswi, Namrata, and Suresh Jain. "Treebank based deep grammar acquisition and Part-Of-Speech Tagging for

Sanskrit sentences." Software Engineering (CONSEG), 2012 CSI Sixth International Conference on. IEEE, 2012.

[2] Mishra, Nidhi, and Amit Mishra. "Part of Speech Tagging for Hindi Corpus." Communication Systems and Network Technologies (CSNT), 2011 International Conference on. IEEE, 2011.

[3] Ravindranath, Vaishali. "Sanskrit – the most language suitable for computer linguistics." Proceedings of National level conference CLICK'13 (2013).

[4] Saxena, Shashank, and Raghav Agrawal. "Sanskrit as a Programming Language and Natural Language Processing." Global Journal of Management and Business Studies 3.10 (2013): 1135-1142.

[5] Gopal, Madhav, Diwakar Mishra, and Devi Priyanka Singh. "Evaluating Tagsets for Sanskrit." Sanskrit Computational Linguistics. Springer Berlin Heidelberg, 2010. 150-161.

[6] Huet, Pawan Goyal1 Gérard, and Amba Kulkarni2 Peter Scharf3 Ralph Bunker. "A distributed platform for Sanskrit processing." (2012).

[7] Kumar, Ramakanth. "CURRENT STATE OF THE ART POS TAGGING FOR INDIAN LANGUAGES–A STUDY.”

[8] Kumar, Dinesh, and Gurpreet Singh Josan. "Part of speech taggers for morphologically rich indian languages: a survey." International Journal of Computer Applications 6.5 (2010): 32-41.

[9] Patel, Miral, and Prem Balani. "Clustering Algorithm for Gujarati Language."arXiv preprint arXiv:1307.5393 (2013).

Page 6: Sanskrit Tag-sets and Part-Of-Speech Tagging Methods - A survey

International Journal of Innovative and Emerging Research in Engineering

Volume 2, Issue 1, 2015

51

[10] Prashanthi, Muni, R, Kumar, Sirish, M, Rama Sree R.J. "POS tagger for sanskrit." International Journal of Engineering Sciences Research-IJESR , 2013.

[11] Brill, Eric. "A simple rule-based part of speech tagger." Proceedings of the workshop on Speech and Natural Language. Association for Computational Linguistics, 1992.

[12] Joshi, Nisheeth, Hemant Darbari, and Iti Mathur. "HMM based POS tagger for Hindi." Proceeding of 2013 International Conference on Artificial Intelligence, Soft Computing (AISC-2013). 2013.

[13] Awasthi, Pranjal, Delip Rao, and Balaraman Ravindran. "Part of Speech Tagging and Chunking with HMM and CRF." Proceedings of NLP Association of India (NLPAI) Machine Learning Contest 2006 (2006).

[14] Husain, Mohd Shahid. "An Unsupervised Approach to Develop a Stemmer."International Journal on Natural Language Computing (IJNLC) Vol 1.2 (2012): 15-23.

[15] Singh, Jyoti, Nisheeth Joshi, and Iti Mathur. "Part of Speech Tagging of Marathi Text Using Trigram Method." arXiv preprint arXiv:1307.4299 (2013).

[16] Hasan, Muhammad Fahim, Naushad UzZaman, and Mumit Khan. "Comparison of Unigram, Bigram, HMM and Brill's POS tagging approaches for some South Asian languages." (2007).

[17] http://en.wikipedia.org/wiki/Natural_language_processing

[18] http://sanskrit.jnu.ac.in/corpora/JNU-Sanskrit-Tagset.htm

[19] Chandrashekar, R. Parts-of-Speech Tagging For Sanskrit. Diss. Ph. D. thesis submitted to JNU, New Delhi, 2007.

[20] Patel, Chirag, and Karthik Gali. "Part-Of-Speech Tagging for Gujarati Using Conditional Random Fields." IJCNLP. 2008.

[21] Kudo, Taku. "CRF++: Yet another CRF toolkit." Software available at http://crfpp. sourceforge. net (2005).