Learning Morphological Disambiguation Rules for Turkish
Deniz Yuret
Ferhan Türe
Koç University, İstanbul
Overview
Turkish morphology
The morphological disambiguation task
The Greedy Prepend Algorithm
Training
Evaluation
Turkish Morphology
Turkish is an agglutinative language: Many syntactic phenomena expressed by function words and word order in
English are expressed by morphology in Turkish.
I will be able to go.
(go) + (able to) + (will) + (I)
git + ebil + ecek + im
Gidebileceğim.
Fun with Turkish Morphology
Avrupa    Europe
lı        European
laş       become
tır       make
ama       not able to
dık       we were
larımız   those that
dan       from
mış       were
sınız     you
Avrupalılaştıramadıklarımızdanmışsınız
So how long can words be?
uyu – sleep
uyut – make X sleep
uyuttur – have Y make X sleep
uyutturt – have Z have Y make X sleep
uyutturttur – have W have Z have Y make X sleep
uyutturtturt – have Q have W have Z …
…
Morphological Analyzer for Turkish
masalı
masal+Noun+A3sg+Pnon+Acc (= the story)
masal+Noun+A3sg+P3sg+Nom (= his story)
masa+Noun+A3sg+Pnon+Nom^DB+Adj+With (= with tables)
Oflazer, K. (1994). Two-level description of Turkish morphology. Literary and Linguistic Computing.
Oflazer, K., Hakkani-Tür, D. Z., and Tür, G. (1999). Design for a Turkish treebank. EACL'99.
Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI Publications.
Features, IGs and Tags
126 unique features
9129 unique IGs
∞ unique tags
11084 distinct tags observed in 1M word training corpus
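masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
stem: masa
features: +Noun, +A3sg, +Pnon, +Nom, +Adj, +With
inflectional groups (IGs): Noun+A3sg+Pnon+Nom and Adj+With
derivational boundary: ^DB
tag: Noun+A3sg+Pnon+Nom^DB+Adj+With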
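The decomposition above can be sketched in a few lines of Python. This is only an illustration of the notation, assuming that "+" separates features and "^DB" separates inflectional groups; the function name parse_tag is hypothetical and not part of any analyzer.

```python
def parse_tag(analysis):
    """Split one analyzer output into (stem, list of IGs).

    Assumes '+' separates features and '^DB' marks a derivational boundary.
    """
    groups = analysis.split("^DB")
    # The first group carries the stem in front of its features.
    stem, *first_ig = groups[0].split("+")
    # Later groups start with '+', so strip it before splitting.
    later_igs = [g.lstrip("+").split("+") for g in groups[1:]]
    return stem, [first_ig] + later_igs


stem, igs = parse_tag("masa+Noun+A3sg+Pnon+Nom^DB+Adj+With")
print(stem)  # masa
print(igs)   # [['Noun', 'A3sg', 'Pnon', 'Nom'], ['Adj', 'With']]
```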
Why not just do POS tagging?
from Oflazer (1999)
Inflectional groups can independently act as heads or modifiers in syntactic dependencies.
Full morphological analysis is essential for further syntactic analysis.
Morphological disambiguation
Ambiguity rare in English: lives = live+s or life+s
More serious in Turkish: 42.1% of the tokens ambiguous
1.8 parses per token on average
3.8 parses for ambiguous tokens
Morphological disambiguation
Task: pick correct parse given context
1. masal+Noun+A3sg+Pnon+Acc
2. masal+Noun+A3sg+P3sg+Nom
3. masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
– Uzun masalı anlat: Tell the long story
– Uzun masalı bitti: His long story ended
– Uzun masalı oda: Room with long table
Key Idea
Build a separate classifier for each feature.
Decision Lists
1. If (W = çok) and (R1 = +DA)
Then W has +Det
2. If (L1 = pek)
Then W has +Det
3. If (W = +AzI)
Then W does not have +Det
4. If (W = çok)
Then W does not have +Det
5. If TRUE
Then W has +Det
“pek çok alanda” (rule 1)
“pek çok insan” (rule 2)
“insan çok daha” (rule 4)
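A decision list is applied by returning the decision of the first rule whose conditions all hold. The sketch below encodes the five +Det rules and runs them on the three example phrases. How the capital letters in suffix patterns such as +DA and +AzI expand into surface letters is an assumption made here for illustration (the ARCHIPHONEMES table), and the names matches and has_det are hypothetical.

```python
import re

# Assumed expansion of capital-letter suffix symbols into surface letters.
ARCHIPHONEMES = {"A": "[ae]", "I": "[ıiuü]", "D": "[dt]"}


def suffix_regex(pattern):
    """Compile a suffix pattern such as 'DA' into a regex anchored at word end."""
    body = "".join(ARCHIPHONEMES.get(ch, re.escape(ch)) for ch in pattern)
    return re.compile(body + "$")


def matches(condition, word):
    """A condition is '+XYZ' for a suffix pattern, otherwise an exact word."""
    if word is None:
        return False
    if condition.startswith("+"):
        return suffix_regex(condition[1:]).search(word) is not None
    return word == condition


# (conditions by window position, "W has +Det"); an empty dict means "If TRUE".
DET_RULES = [
    ({"W": "çok", "R1": "+DA"}, True),
    ({"L1": "pek"}, True),
    ({"W": "+AzI"}, False),
    ({"W": "çok"}, False),
    ({}, True),
]


def has_det(left, word, right):
    """Return (rule number, decision) of the first matching rule."""
    context = {"L1": left, "W": word, "R1": right}
    for number, (conditions, decision) in enumerate(DET_RULES, start=1):
        if all(matches(c, context[pos]) for pos, c in conditions.items()):
            return number, decision


print(has_det("pek", "çok", "alanda"))  # (1, True)
print(has_det("pek", "çok", "insan"))   # (2, True)
print(has_det("insan", "çok", "daha"))  # (4, False)
```

Rule order matters: rule 4 alone would reject every "çok", and the more specific rules in front of it carve out the exceptions.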
Greedy Prepend Algorithm
GPA(data)
  dlist = NIL
  default-class = Most-Common-Class(data)
  rule = [If TRUE Then default-class]
  while Gain(rule, dlist, data) > 0
    do dlist = prepend(rule, dlist)
       rule = Max-Gain-Rule(dlist, data)
  return dlist
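The pseudocode can be made concrete with a small Python sketch. Two points are assumptions rather than something the slides specify: gain is taken to be the net number of training instances that become correctly classified when the rule is prepended, and the candidate search in max_gain_rule only tries single-attribute rules. The toy data is invented to mirror the +Det example.

```python
from collections import Counter


def classify(dlist, attrs):
    """Class of the first rule whose conditions all hold, or None."""
    for conditions, label in dlist:
        if conditions <= attrs:
            return label
    return None


def gain(rule, dlist, data):
    """Net number of instances fixed by prepending `rule` (assumed definition)."""
    conditions, label = rule
    total = 0
    for attrs, gold in data:
        if conditions <= attrs:  # only matching instances change
            total += int(label == gold) - int(classify(dlist, attrs) == gold)
    return total


def max_gain_rule(dlist, data):
    """Best single-attribute rule; a simplification of the real search."""
    classes = sorted({gold for _, gold in data}, key=repr)
    attributes = sorted(set().union(*(attrs for attrs, _ in data)))
    candidates = [(frozenset([a]), c) for a in attributes for c in classes]
    if not candidates:
        return None
    return max(candidates, key=lambda r: gain(r, dlist, data))


def gpa(data):
    """Greedy Prepend Algorithm over (attribute set, class) pairs."""
    if not data:
        return []
    dlist = []
    default = Counter(gold for _, gold in data).most_common(1)[0][0]
    rule = (frozenset(), default)  # [If TRUE Then default-class]
    while rule is not None and gain(rule, dlist, data) > 0:
        dlist = [rule] + dlist
        rule = max_gain_rule(dlist, data)
    return dlist


# Invented toy instances: attribute set, and whether the word has +Det.
TOY_DATA = [
    (frozenset({"W=çok", "L1=pek", "R1=+DA"}), True),
    (frozenset({"W=çok", "L1=pek"}), True),
    (frozenset({"W=çok", "L1=insan"}), False),
    (frozenset({"W=çok", "L1=o"}), False),
    (frozenset({"W=çok", "L1=ve"}), False),
    (frozenset({"W=çok", "L1=daha"}), False),
    (frozenset({"W=bir"}), True),
    (frozenset({"W=bu"}), True),
    (frozenset({"W=her"}), True),
]

for conditions, label in gpa(TOY_DATA):
    print(sorted(conditions) or "TRUE", "->", label)
```

Each prepended rule strictly increases training accuracy, so the loop terminates. On the toy data the learner first adds the default rule, then "W=çok", then the "L1=pek" exception in front of it.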
Training Data
1M words of news material
Semi-automatically disambiguated
Created 126 separate training sets, one for each feature
Each training set only contains instances which have the corresponding feature in at least one of their parses
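A minimal sketch of how such per-feature training sets could be built. It assumes each token comes with its context, its candidate parses and the parse marked correct, that the label is whether the correct parse contains the feature, and that the derivational boundary ^DB counts as a feature named DB. The names features_of and build_training_sets are hypothetical.

```python
from collections import defaultdict


def features_of(parse):
    """Feature symbols of one parse, with ^DB counted as the feature 'DB' (assumption)."""
    return set(parse.replace("^DB", "+DB").split("+")[1:])


def build_training_sets(tokens):
    """tokens: iterable of (context, candidate_parses, correct_parse)."""
    training_sets = defaultdict(list)
    for context, parses, correct in tokens:
        gold = features_of(correct)
        # A token is an instance for every feature seen in any of its parses.
        seen = set().union(*(features_of(p) for p in parses))
        for feature in seen:
            training_sets[feature].append((context, feature in gold))
    return dict(training_sets)


parses = [
    "masal+Noun+A3sg+Pnon+Acc",
    "masal+Noun+A3sg+P3sg+Nom",
    "masa+Noun+A3sg+Pnon+Nom^DB+Adj+With",
]
sets = build_training_sets([("Uzun masalı anlat", parses, parses[0])])
print(sets["Acc"])   # [('Uzun masalı anlat', True)]
print(sets["P3sg"])  # [('Uzun masalı anlat', False)]
```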
Input attributes
For a five word window:
The exact word string (e.g. W=Ali'nin)
The lowercase version (e.g. W=ali'nin)
All suffixes (e.g. W=+n, W=+In, W=+nIn, W=+'nIn, etc.)
Character types (e.g. Ali'nin would be described with W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST)
Average 40 features per instance.
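The attribute types listed above can be generated with a short Python sketch. Several details are assumptions: vowels are abstracted with a/e mapped to A and ı/i/u/ü mapped to I (which reproduces the +n, +In, +nIn, +'nIn example), every proper suffix of the lowercased word is emitted, and the window positions beyond L1 and R1 are named L2 and R2. Python's str.lower() does not follow Turkish casing rules for I and İ, so a real implementation would need locale-aware lowercasing.

```python
# Assumed vowel abstraction for suffix attributes.
VOWEL_CLASSES = str.maketrans({"a": "A", "e": "A", "ı": "I", "i": "I", "u": "I", "ü": "I"})


def char_type(ch):
    if ch.isupper():
        return "UPPER"
    if ch.islower():
        return "LOWER"
    if ch.isdigit():
        return "DIGIT"
    return "APOS" if ch == "'" else "OTHER"


def word_attributes(word, position="W"):
    """Attributes of one word at one window position."""
    lowered = word.lower()  # not Turkish-aware; see the note above
    attrs = [f"{position}={word}", f"{position}={lowered}"]
    # All proper suffixes, shortest first, with vowels abstracted.
    for start in range(len(lowered) - 1, 0, -1):
        attrs.append(f"{position}=+{lowered[start:].translate(VOWEL_CLASSES)}")
    if word:
        attrs.append(f"{position}={char_type(word[0])}-FIRST")
        # One attribute per character type seen in the middle, in order of appearance.
        for kind in dict.fromkeys(char_type(c) for c in word[1:-1]):
            attrs.append(f"{position}={kind}-MID")
        if len(word) > 1:
            attrs.append(f"{position}={char_type(word[-1])}-LAST")
    return attrs


def window_attributes(words, i):
    """Attributes for the five word window centred on words[i]."""
    names = {-2: "L2", -1: "L1", 0: "W", 1: "R1", 2: "R2"}
    attrs = []
    for offset, name in names.items():
        j = i + offset
        if 0 <= j < len(words):
            attrs.extend(word_attributes(words[j], name))
    return attrs


print(word_attributes("Ali'nin"))
```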
Sample decision lists
+Acc
0  (no condition)
1  W=+InI
1  W=+yI
1  W=UPPER0
1  W=+IzI
1  L1=~bu
1  W=~onu
1  R1=+mAK
1  W=~beni
0  W=~günü
1  W=+InlArI
1  W=~onları
0  W=+olAyI
0  W=~sorunu
… (672 rules)
+Prop
1  (no condition)
0  W=STFIRST
0  W==Türk
1  W=STFIRST R1=UCFIRST
0  L1==.
0  W=+AnAl
1  R1==,
0  W=+yAD
1  W=UPPER0
0  W=+lAD
0  W=+AK
1  R1=UPPER0 W==Milli
1  W=STFIRST R1=UPPER0
… (3476 rules)
Models for individual features
[Figure: number of rules (axis 0 to 7000) and accuracy (axis 84 to 100) of the decision list for each feature: A3sg, Noun, Pnon, Nom, DB, Verb, Adj, Pos, P3sg, P2sg, Prop, Zero, Acc, Adverb, A3pl]
Combining models
masal+Noun+A3sg+P3sg+Nom
masal+Noun+A3sg+Pnon+Acc
Decision list results and confidence (only distinguishing features necessary):
P3sg = yes (89.53%)
Nom = no (93.92%)
Pnon = no (95.03%)
Acc = yes (89.24%)
score(P3sg+Nom) = 0.8953 x (1 - 0.9392)
score(Pnon+Acc) = (1 - 0.9503) x 0.8924
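The two scores can be worked out with a short Python sketch. It follows the formulas above: each parse is scored by the product, over its distinguishing features, of the estimated probability that the feature is present, which is the confidence for a "yes" decision and one minus the confidence for a "no" decision. The function names are hypothetical, and treating features shared by all candidates as non-distinguishing is this sketch's reading of the slide.

```python
import math


def features_of(parse):
    return set(parse.split("+")[1:])


def p_present(decision):
    """Probability that the feature is present, from (answer, confidence)."""
    present, confidence = decision
    return confidence if present else 1.0 - confidence


def score(features, decisions):
    return math.prod(p_present(decisions[f]) for f in features)


def pick_parse(parses, decisions):
    # Features shared by every candidate cannot distinguish between them.
    shared = set.intersection(*(features_of(p) for p in parses))
    return max(parses, key=lambda p: score(features_of(p) - shared, decisions))


decisions = {
    "P3sg": (True, 0.8953),
    "Nom": (False, 0.9392),
    "Pnon": (False, 0.9503),
    "Acc": (True, 0.8924),
}
parses = ["masal+Noun+A3sg+P3sg+Nom", "masal+Noun+A3sg+Pnon+Acc"]

print(round(score({"P3sg", "Nom"}, decisions), 4))  # 0.0544
print(round(score({"Pnon", "Acc"}, decisions), 4))  # 0.0444
print(pick_parse(parses, decisions))                # masal+Noun+A3sg+P3sg+Nom
```

With these numbers the P3sg+Nom parse scores about 0.0544 against about 0.0444 for Pnon+Acc, so it is selected.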
Evaluation
Test corpus: 1000 words, hand tagged
Accuracy: 95.87% (conf. int: 94.57-97.08)
Better than the training data!?
Other Experiments
Retraining on own output: 96.03%
Training on unambiguous data: 82.57%
Forget disambiguation, let’s do tagging with a single decision list: 91.23%, 10000 rules
Contributions
Learning morphological disambiguation rules using the GPA decision list learner.
Reducing data sparseness and increasing noise tolerance using separate models for individual output features.
ECOC, WSD, etc.