POS Tagger and Chunker for Tamil
C E N Amrita Vishwa Vidyapeetham Coimbatore.
POS Tagger and Chunker for Tamil
Guided by
Dr. K.P. Soman, Head, CEN, Amrita University
Dr. S. Rajendaran, Head, Dept. of Linguistics, Tamil University
Presented by
V. Dhanalakshmi
M. Anand Kumar
CEN, Amrita
Overview
Introduction
Tamil POS Tagging
AMRITA Tagset
Tamil POS Tagging: SVMTool
Chunking: Yamcha
Results
Conclusion
Introduction
Part-of-speech (POS) tagging, also called grammatical tagging, is the process of assigning a POS tag to every word in a sentence.
Each tag names the word's grammatical category, such as noun, verb, adjective, or adverb.
The next step after POS tagging is chunking, which divides a sentence into non-recursive, inseparable phrases, i.e. phrases with only one head each.
Introduction
Many tools are available for POS tagging and chunking.
We used SVM-based tools for Tamil POS tagging and chunking:
SVMTool for POS tagging
YamCha for chunking
Introduction
POS tagging and chunking are considered important steps in speech recognition, natural language parsing, information retrieval, and machine translation.
Here the POS tagging problem is converted into a classification problem.
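Treating tagging as classification means that each word becomes one training example: a set of features built from the word and its neighbours, labelled with the word's tag. A minimal Python sketch, assuming a two-word context window and a suffix feature (both are illustrative choices; the slides do not state the feature set). The tokens `w1` to `w6` are hypothetical placeholders, and the tags are the ones shown on the example slide:

```python
def token_features(words, i, window=2):
    """Build a feature dict for words[i] from the word and its neighbours."""
    feats = {"word": words[i], "suffix3": words[i][-3:]}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        # Pad positions that fall outside the sentence.
        feats[f"word[{offset:+d}]"] = words[j] if 0 <= j < len(words) else "<PAD>"
    return feats


def to_examples(words, tags):
    """Turn one tagged sentence into (features, label) classification examples."""
    return [(token_features(words, i), tag) for i, tag in enumerate(tags)]


# Placeholder tokens (hypothetical); the tag sequence is from the example slide.
words = ["w1", "w2", "w3", "w4", "w5", "w6"]
tags = ["NN", "CRD", "NN", "ADJ", "NN", "VF"]
examples = to_examples(words, tags)
```

Each `(features, label)` pair can then be handed to any multi-class classifier; an SVM-based tagger is one such choice.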
POS Tagging
INPUT: a string of words (a sentence)
OUTPUT: a single best tag for each word (a POS-tagged sentence)
Example of Tamil POS Tagging
Assigning a grammatical category to each word in a sentence.
< Six feet tall bell is in the temple>
Example of POS Tagging
NN CRD NN ADJ NN VF
<Six feet tall bell is in the temple>
LEXICAL AMBIGUITY IN TAMIL.
Assign POS tags to the words in a sentence, taking their lexical ambiguity into account.
NN NN NN ADJ NN VF NN CRD VF ADJ NNP VF
<Six feet tall bell is in the temple>
POS Tagging Example
Assigning each word's grammatical category, taking its lexical ambiguity into account.
NN NN NN ADJ NN VF NN CRD VF ADJ NNP VF (Ambiguity tags)
Six feet tall bell is in the temple.
COMPLEXITY IN TAMIL POS TAGGING
Tamil is a morphologically rich, agglutinative language.
We mostly depend on syntactic function or context to decide whether a word is a noun, adjective, adverb, or postposition.
Example: <varum> can be <VF> or <VNAJ>.
This is what makes POS tagging in Tamil complex.
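One way to make this ambiguity explicit is to collect, for every word form, the set of tags it receives in a tagged corpus. A minimal Python sketch; only the `varum` entry (VF or VNAJ) comes from the slide, and the tokens `w1` and `w2` are hypothetical placeholders:

```python
from collections import defaultdict


def ambiguity_classes(tagged_tokens):
    """Map each word form to the sorted list of tags observed for it."""
    seen = defaultdict(set)
    for word, tag in tagged_tokens:
        seen[word].add(tag)
    return {word: sorted(tags) for word, tags in seen.items()}


# "varum" with VF and VNAJ is from the slide; "w1" and "w2" are made-up tokens.
corpus = [("varum", "VF"), ("w1", "NN"), ("varum", "VNAJ"), ("w2", "NN")]
classes = ambiguity_classes(corpus)

# Words with more than one observed tag need context to be disambiguated.
ambiguous = {word: tags for word, tags in classes.items() if len(tags) > 1}
```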
AMRITA TAGSET
Considering the lexical ambiguities and syntactic complexities, we created a new tagset, the AMRITA tagset, to tag our corpus for the SVM-based POS tagger for Tamil.
AMRITA TAGSET
We followed the guidelines from "Annotating Corpora Guidelines for POS and Chunk Annotation for Indian Languages" [IIIT, Hyderabad] while creating our AMRITA tagset:
1. The tags should be simple.
2. Simplicity should be maintained for ease of learning and consistency in annotation.
3. POS tagging is not a replacement for a morph analyzer.
4. A 'word' in a text carries a grammatical category and grammatical features such as case, tense, person, number, gender, etc. The POS tag should be based on the 'category' of the word; the features can be obtained from the morph analyzer.
AMRITA Tagset
The tagset is simple. It is based on the 'category' of the word and does not consider the word's grammatical features.
Tagset size: 32 Tags
AMRITA Tag set for Tamil
Corpus development
We developed a corpus of 2.5 lakh (250,000) words, collected from the Dinamani newspaper, Yahoo Tamil news, That's Tamil, online Tamil short stories, etc.
Three stages in corpus development:
1. Pre-editing
2. Manual tagging
3. Tagging using SVMTagger
Corpus size: 2.5 lakh words
SVM (Support Vector Machine)
A support vector machine is a training algorithm for learning classification and regression rules from data.
SVM is based on the idea of structural risk minimization, a principled technique for selecting a model that minimizes generalization error.
SVMs are increasingly used in NLP tasks.
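For reference, the standard soft-margin SVM for binary classification (a textbook formulation, not taken from the slides) learns a weight vector $w$ and bias $b$ from training pairs $(x_i, y_i)$, with $y_i \in \{-1, +1\}$, by solving:

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
```

The first term keeps the margin wide, the second penalises training errors through the slack variables $\xi_i$, and the constant $C$ trades one against the other. Multi-class problems such as POS tagging are usually handled by combining several binary classifiers of this kind.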
SVMTool
SVMTool is a tagger implementation based on Support Vector Machines (SVM).
The tool was developed by Jesús Giménez and Lluís Màrquez.
It trains efficiently and solves real NLP problems such as POS tagging.
SVMTool is freely available at http://www.lsi.upc.es/~nlp/SVMTool
Training Data Format
…
இந்த <DET> ஆண்டில் <NN>
3500 <CRD> பஸ்கள் <NN>
வாங்கப்படும் <VF> . <DOT>
இதில் <PRP> சென்னை <NNP>
…
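The sample above pairs each token with a tag in angle brackets. A minimal Python sketch that splits such text into (word, tag) pairs and writes them one token per line as "word tag", which is the kind of column layout a token-per-line tagger such as SVMTool is trained on. The regular expression and the function names are illustrative assumptions, not part of the original system:

```python
import re

# A token is a run of non-space characters followed by a tag in angle brackets.
TOKEN_TAG = re.compile(r"(\S+?)\s*<([A-Z]+)>")


def parse_tagged(text):
    """Return a list of (word, tag) pairs from 'word <TAG>' text."""
    return TOKEN_TAG.findall(text)


def to_columns(pairs):
    """Format pairs as one 'word tag' line per token."""
    return "\n".join(f"{word} {tag}" for word, tag in pairs)


# The first two lines of the sample shown on the slide.
sample = "இந்த <DET> ஆண்டில் <NN> 3500 <CRD> பஸ்கள் <NN>"
pairs = parse_tagged(sample)
columns = to_columns(pairs)
```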
Tagger Implementation
[Flow diagram. Training: Corpus → Tokenization → Training → SVMTagger. Tagging: untagged words → SVMTagger → tagged words.]
CHUNKING
The step after tagging identifies basic structural relations between groups of words. This is usually referred to as phrase chunking.
Input: a word sequence and its POS tags
Output: a single best chunk tag for each word, along with its POS tag
Chunking in Tamil
Tamil, being an agglutinative language, has a complex morphological and syntactic structure.
It is a relatively free word order language, but in phrasal and clausal constructions it behaves like a fixed word order language.
Chunking in Tamil is less complex than POS tagging.
EXAMPLE
Assigning chunk tags to the words in a sentence.
B-NP B-NP I-NP B-NP I-NP B-VP
Chunk tagset

| S.No | Chunk Tag | Tag Name | Possible POS Tags |
| --- | --- | --- | --- |
| 1 | NP | Noun Phrase | NN, NNP, NNPC, NNC, NNQ, PRP, QTF, DET, CRD, ORD, ADJ, INT |
| 2 | AJP | Adjectival Phrase | CRD, ADJ |
| 3 | AVP | Adverbial Phrase | ADV, INT, CRD |
| 4 | VFP | Verb Finite Phrase | VF, VAX |
| 5 | VNP | Verb Nonfinite Phrase | VNAJ, VNAV, VINT, CVB |
| 6 | VGP | Verb Gerund Phrase | VBG |
| 7 | CJP | Conjunctional | CNJ |
| 8 | COMP | Complementizer | COM |
| 9 | . ? | Symbols | O |
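The chunk tagset can be read as a lookup from chunk type to the POS tags allowed inside it. A minimal Python sketch that encodes rows 1 to 8 (the symbols row is left out) and lists the chunk types a given POS tag may belong to; the names `CHUNK_POS` and `candidate_chunks` are illustrative:

```python
# Rows 1 to 8 of the chunk tagset: chunk type -> POS tags that may occur in it.
CHUNK_POS = {
    "NP": ["NN", "NNP", "NNPC", "NNC", "NNQ", "PRP", "QTF", "DET",
           "CRD", "ORD", "ADJ", "INT"],
    "AJP": ["CRD", "ADJ"],
    "AVP": ["ADV", "INT", "CRD"],
    "VFP": ["VF", "VAX"],
    "VNP": ["VNAJ", "VNAV", "VINT", "CVB"],
    "VGP": ["VBG"],
    "CJP": ["CNJ"],
    "COMP": ["COM"],
}


def candidate_chunks(pos_tag):
    """Return the chunk types in which the given POS tag may appear."""
    return [chunk for chunk, tags in CHUNK_POS.items() if pos_tag in tags]


crd_chunks = candidate_chunks("CRD")
vf_chunks = candidate_chunks("VF")
```

A tag such as CRD appears under three chunk types, so the POS tag alone does not determine the chunk; the chunker has to use the surrounding context.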
Chunk Tagset
IOB tags: the IOB tags indicate the boundaries of each chunk.
B - the current word is the beginning of a chunk, which may be followed by another chunk.
I - the current word is inside a chunk.
O - indicates the boundary of the sentence.
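To show how B and I tags mark chunk boundaries, here is a minimal Python sketch that groups a tag sequence into (chunk type, start, end) spans. It uses the chunk tags from the example slide; the function name is illustrative, and any O tag is simply treated as belonging to no chunk:

```python
def iob_to_spans(tags):
    """Group IOB tags into (chunk_type, start, end) spans; end is exclusive."""
    spans = []
    start, kind = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == kind
        if not continues and kind is not None:
            spans.append((kind, start, i))  # close the open chunk
            start, kind = None, None
        if prefix == "B" or (prefix == "I" and not continues):
            start, kind = i, label  # open a new chunk (a stray I- opens one too)
    if kind is not None:
        spans.append((kind, start, len(tags)))
    return spans


# Chunk tags from the example slide.
tags = ["B-NP", "B-NP", "I-NP", "B-NP", "I-NP", "B-VP"]
spans = iob_to_spans(tags)
```

For the example sequence this yields three noun-phrase chunks (one, two, and two words long) followed by a one-word verb chunk.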
YamCha
YamCha is a generic, customizable, open-source text chunker.
YamCha uses a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs), first introduced by Vapnik in 1995.
TRAINING AND TEST FILE FORMAT
Both the training file and the test file must be in a particular format for YamCha to work properly.
Each file consists of multiple tokens.
A token consists of multiple columns, and the number of columns is fixed. Tokens simply correspond to words. Each token is written on one line, with the columns separated by white space (spaces or tabs). A sequence of tokens forms a sentence. An empty line marks the boundary between sentences.
TRAINING AND TEST FILE FORMAT
We can give as many columns as we like, but the number of columns must be the same for all tokens.
The columns carry some "semantics". For example, the first column is the word, the second column is the POS tag, the third column is the chunk tag, and so on.
The last column holds the true answer tag that YamCha is trained to predict.
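The column format described here is easy to produce programmatically. A minimal Python sketch that writes sentences as space-separated columns (word, POS tag, chunk tag), puts an empty line between sentences, and rejects tokens with an inconsistent column count. The tokens `w1` to `w4` and their tags are hypothetical placeholders, and the function name is illustrative:

```python
def to_yamcha(sentences):
    """Serialise sentences (lists of column tuples) into the column format."""
    width = None
    blocks = []
    for sentence in sentences:
        lines = []
        for columns in sentence:
            if width is None:
                width = len(columns)  # the first token fixes the column count
            elif len(columns) != width:
                raise ValueError("every token must have the same number of columns")
            lines.append(" ".join(columns))
        blocks.append("\n".join(lines))
    # An empty line separates consecutive sentences.
    return "\n\n".join(blocks) + "\n"


# Hypothetical tokens; columns are word, POS tag, chunk tag (answer column last).
sentences = [
    [("w1", "NN", "B-NP"), ("w2", "VF", "B-VFP")],
    [("w3", "ADJ", "B-NP"), ("w4", "NN", "I-NP")],
]
text = to_yamcha(sentences)
```

For a test file the same layout applies; the chunker fills in the answer column.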
Training data - sample
Tagger Implementation
[Flow diagram. Training: Manual Tagging → POS Tagged Corpus → Yamcha Training → Trained Model. Chunking: POS Tagged Input → Trained Model → Chunked output.]
CONCLUSION
Chunking plays an important role in various natural language processing applications.
A chunked corpus can be used for parsing, which provides important syntactic information for machine translation.
Possible future work is to increase the corpus size, i.e. to build an annotated corpus for Tamil.
REFERENCES
Giménez, J. and Màrquez, L. "Fast and Accurate Part-of-Speech Tagging: The SVM Approach Revisited". In Proceedings of the Fourth RANLP, 2003.
Rajendran, S. "Parsing in Tamil: Present State of Art". Language in India, Volume 6 : 8, August 2006.
Abney, S. "Parsing by Chunks". In Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht, pp. 257-278, 1991.
Sobha, L. and Vijay Sundar Ram, R. "Noun Phrase Chunking in Tamil". In Proceedings of MSPIL-06, Indian Institute of Technology, Bombay, pp. 194-198.
Kudo, T. CRF++: Yet Another CRF Toolkit. 2003. http://chasen.org/~taku/software/CRF++/
நன்றி
THANK YOU