Finite State Morphology:
The Turkish Nominal Paradigm
A Thesis by
Philip Makedonski
Submitted to
Seminar für Sprachwissenschaft
Eberhard Karls Universität Tübingen,
72074 Tübingen, Germany
In fulfillment of the requirements for the degree
Bachelor of Arts in Computational Linguistics
July 2005
ABSTRACT
Finite State Morphology: The Turkish Nominal Paradigm
Makedonski, Philip
Seminar für Sprachwissenschaft
Eberhard Karls Universität Tübingen
Supervisor: Dr. Dale Gerdemann
July 2005
24 Pages
This thesis presents a finite state approach to the inflectional morphology of Turkish nouns,
with the ultimate goal of building a morphological analyzer for Turkish nouns. It deals
primarily with the principles of vowel harmony across the different inflectional noun suffixes
in Turkish, as the most interesting phenomenon, and with my implementation of these principles
in the Xerox Finite State Toolbox (XFST). It also covers the other morphophonological
alternations that occur both in the stem and in the suffixes attached to it as a result of the
inflectional processes.
Keywords: Natural Language Processing, Finite State Networks, Turkish
Morphology, Computational Linguistics
To my family, to my love
ACKNOWLEDGEMENTS
First, I’d like to thank my supervisor Dr. Dale Gerdemann for his support and
advice on this project. I appreciate the freedom and independence I had in
the choice of topic and approach.
I would also like to thank Dr. Sandra Kübler for her support and understanding
throughout this course of studies, which in many cases turned out to be
crucial for my progress.
Many, many thanks to my family for their support all the time, no matter what
was happening. Thanks to my friends for their understanding.
And most of all, special thanks to Nevin Recep for sparking my interest in the
Turkish language and supporting me all the time.
TABLE OF CONTENTS
ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
1. INTRODUCTION
    1.1 MOTIVATION
    1.2 MORPHOLOGY
    1.3 RELATED WORK
    1.4 OVERVIEW
2. BACKGROUND
    2.1 TURKISH
    2.2 FINITE STATE TECHNOLOGY
        2.2.1 Finite State Automata (FSA)
        2.2.2 Finite State Transducers (FST’s)
    2.3 XFST
3. THE MODEL
    3.1 THE NOMINAL PARADIGM OF TURKISH. MORPHOTACTICS
        3.1.1 Inflection for Number
        3.1.2 Case Inflection
        3.1.3 Inflection for Possession
        3.1.4 Lexical Exceptions – the ‘su’ case
    3.2 PHONOLOGICAL ALTERNATION RULES
        3.2.1 Resolving Vowel Harmony
        3.2.2 Consonant Alternation Rules
            3.2.2.1 Final Consonant (De)Voicing
            3.2.2.2 (De)Gemination
        3.2.3 Other Alternations
            3.2.3.1 Vowel Insertion/Deletion
            3.2.3.2 The Glottal Stop
    3.3 IMPLEMENTATION
        3.3.1 The Lexicon
        3.3.2 The Rules Component
            3.3.2.1 Vowel Harmony Rules
            3.3.2.2 Consonant Alternation Rules
            3.3.2.3 Fixing the Morphotactics
            3.3.2.4 Rule Order
4. CONCLUSIONS
5. FUTURE WORK
APPENDIX A: LIST OF ABBREVIATIONS
APPENDIX B: LEXC CODE SAMPLES
APPENDIX C: ON REPLACEMENT RULES
1. Introduction
1.1 Motivation
In morphologically rich languages like Bulgarian, Turkish, Russian, Spanish and many others,
grammatical features and functions typically assigned to the syntactic structure in
morphologically poor languages like English, are often represented in the morphological
structure. As a consequence, any form of an adequate Natural Language Processing (NLP)
application would require a good morphological component due to the increased role of
morphology in these languages. This in turn would require a rich lexicon, and building up a
lexicon, explicitly listing all the possible forms as separate entries, would quickly explode into
an unmanageable size due to the rich inflectional and derivational possibilities for a single
base (dictionary) form (stem). In Turkish for example, the nominal inflectional paradigm has
three basic types of suffixes – for number, possession and case (the number varies in the
different sources), and the verbal inflectional paradigm is even more complicated with its
eight affixes (again, the number might be different depending on the source). There are
approximately 20,000 stems and 300-400 roots actively used in Turkish, which effectively
amount to millions of inflected and derived forms. This further increases the demand for
automated morphological analysis.
As it turns out, morphological structures are much more regular than syntactic ones. They can
be handled very efficiently and accurately using sets of rules and compact lexicons of base
forms (stems). Furthermore, important semantic and grammatical information could be
encoded in such lexicons as well.
1.2 Morphology
The central concepts of morphology are morphotactics and (morpho)phonological
alternations. Morphotactics (also morphosyntax or word formation) defines the constraints on
possible morpheme combinations. Phonological (also orthographical) alternations define the
changes in morphemes occurring in particular environments. An example from Karttunen
(Karttunen, 2003) illustrates the issue:
(1) pity → piti-less → piti-less-ness
(Karttunen, 2003, p. XVI)
The morphotactic definition accounts for the acceptability of a word like piti-less-ness and the
unacceptability of a word like *piti-ness-less. Phonological alternations, on the other hand,
describe why pity is realized as piti in the context of a following less. These are simple
examples that could be caught easily with a few basic rules. But for a full scale NLP, one
needs a much more sophisticated system. This is especially valid for agglutinative languages
like Turkish where the concept of a word is much wider. Different relations between the
words in a sentence are mostly expressed by affixes. Furthermore, many affixes and roots in
Turkish change their shape depending on the environment and have to obey various
constraints like vowel harmony.
1.3 Related Work
A significant amount of work has been done in the computational modeling of Turkish
morphology already: Köksal’s first approach to a computerized model for automatic
morphological analysis of Turkish (Köksal, 1975); Hankamer’s description in terms of finite
state morphology (Hankamer, 1986); numerous recent works by Kemal Oflazer based on his
Two-Level model of Turkish (Oflazer, 1994); Schaaik’s Studies in Turkish Grammar
(Schaaik, 1996) is a comprehensive guide to building a computational model for full nominal
phrases using the functional grammar formalism (Dik, 1981). Since the earlier works are hard
to find, I will briefly discuss only the more recent works by Oflazer and Schaaik, as they are
closely related to what I am doing in this project.
Oflazer’s work is based primarily on his two-level model for Turkish morphology (Oflazer,
1994). The idea behind the two-level models originates from Koskenniemi (Koskenniemi,
1983). The most significant difference from the ordered linear approach in composed
sequences of rule transducers1 is that all the rules operate in parallel. To illustrate the
difference, a basic two-level model and a cascade-based model relating the languages defining
the lexical and surface forms are presented in Figure 1.1 below:
[Figure 1.1: Cascade-based and two-level (parallel) models in finite state morphology. In the
cascade-based model, FST 1, FST 2, …, FST n apply in sequence between the lexical form and the
surface form, with intermediate forms passed from one transducer to the next. In the two-level
model, FST 1, FST 2, …, FST n apply in parallel between the lexical form and the surface form.]
In the cascade model of composed rule transducers, each transducer operates on its own input
and output, producing an intermediate output to feed the next transducer in the cascade. The
key concept here is “feed”: the major drawback of the two-level models has been that bleeding
or feeding relations between rules (which are common in generative phonology) can hardly be
expressed within this approach, other than by designing the rules very carefully in order to
get the necessary result. But the convenience of the cascade-based model in this respect comes
at a price. In the process of composition, the network can easily “explode” to an unmanageable
size, as many parts of it may need to be copied. Luckily, there are some techniques to restrict
such growth.
My project combines both models, as we shall see later. The advantage is that parallel
operation of rules is used wherever it is needed, and sequential (linear) operation of rules
wherever that is needed.
1 More on transducers and automata follows in the technical background on finite state technology in Section
2.2. For now think of rule transducers simply as a way to implement rules.
1.4 Overview
In the following sections I will present a finite state approach to a part of the Turkish
morphology. I will focus on the nominal morphology only, in particular the different
inflectional paradigms, as the complete nominal morphology of Turkish is a subject too broad
to cover here (let alone the complete Turkish morphology). Once a solution for the nominal
morphology is designed, however, it could easily be extended to cover the other major word
classes of the language. I will try to approach the task as modularly as possible, so that if
changes or extensions are required, all that is needed is to plug in the extension component
and occasionally tune the system a little. The key concept here is modularity.
My work is based primarily on Geoffrey Lewis’ Turkish (Lewis, 1989) and Turkish Grammar
(Lewis, 1967), referred to as the official language guides for Turkish in most papers. For the
purpose of this project I will be using the Xerox Finite State Toolbox (XFST) and the
“manual” to it by Lauri Karttunen (Karttunen, 2003).
In Section 2 I will briefly present the background information needed to proceed through the
paper as follows: Section 2.1 – linguistic background on Turkish; Sections 2.2 and 2.3 provide
some technical background on the technology employed and the particular toolbox I have
chosen to use. The actual model and its implementation will be presented in their full beauty
in Section 3. We conclude in Section 4 and in Section 5 I will present an outlook on possible
future elaborations.
2. Background
In the following sections I will present the basic “technical” properties of the language and the
technology used to model it.
2.1 Turkish
In this subsection I will present the most important features of Turkish that we’ll be dealing
with in the subsequent sections.
Turkish is an agglutinative language from the family of Turkic languages. A Turkish word
consists of a root (base form) and a number of suffixes attached to it, each extending its
meaning or changing its word class:
(2) bilgi – knowledge
bilgisiz – without knowledge
bilgisizlik – lack of knowledge
bilgisizlikleri – their lack of knowledge
bilgisizliklerinden – from their lack of knowledge
bilgisizliklerindenmiş – I gather that it was from their lack of knowledge
(Lewis, 1989, p. 3)
As one might infer, many ideas typically expressed by prepositions or pronouns across
languages are expressed by suffixes in Turkish.
Another important feature of the Turkish language is vowel harmony. Vowel harmony is
basically described as a “progressive sound assimilation” phenomenon. In simple words, the
features of a vowel depend on the features of the preceding vowel. We’ll be dealing
exclusively with the vowel harmony of suffixes in Turkish and as mentioned before, the scope
of this project will be restricted to inflectional noun suffixes only.
Geoffrey Lewis (Lewis, 1989) describes the vowel harmony in Turkish with a general law of
vowel harmony in terms of the feature +/-back of vowels. The Turkish vowel system is
shown in Table 2.1 below:
          Unrounded       Rounded
          Low    High     Low    High
Back      a      ı        o      u
Front     e      i        ö      ü
Table 2.1: The vowel system of Turkish.
As stated in (Lewis, 1989), all the vowels in a word agree with the backness value of the first
vowel of that word:
(3) -Back                       +Back
    sekiz – eight               dokuz – nine
    seksen – eighty             doksan – ninety
    sinir – nerve               sınır – frontier
    sinirler – nerves           sınırlar – frontiers
    sinirlerimiz – our nerves   sınırlarımız – our frontiers
(Lewis, 1989, p. 11)
In cases of disharmony1 in the root, or if an invariable suffix is attached, the harmonic
suffixes harmonize with the vowel of the last preceding syllable. So attaching the plural suffix
-ler/-lar, which harmonizes for backness, to anne (mother) results in anneler (mothers) and
not in *annelar, which would harmonize with the vowel of the first syllable.
1 Exceptions to this principle are: a small number of native Turkish words – elma (apple), anne (mother),
kardeş (brother or sister); eight invariable suffixes; compound words – bilgisayar (computer), from bilgi
(information) and sayar (counter, lister); loanwords. Clements and Sezer account for them in (Clements, 1982)
There is also, as Lewis (Lewis, 1989) refers to it, a “special law of vowel harmony”, that
constrains the occurrence of vowels in terms of roundedness1. Unrounded vowels are typically
followed by unrounded vowels and rounded vowels are typically followed by low unrounded
or high rounded vowels.
Combining the two principles we end up with the following:
(4) a is followed by a or ı
e is followed by e or i
ı is followed by a or ı
i is followed by e or i
o is followed by a or u
ö is followed by e or ü
u is followed by a or u
ü is followed by e or ü
Turkish suffixes, except the eight invariable ones, harmonize with (for the sake of simplicity)
the vowel of the last syllable of the word they are attached to. They can be divided into two
groups: the vowels of the first group alternate between the low unrounded vowels a and e
(the so-called e-type2 suffixes (Pollard, 1996)), and the vowels of the second group alternate
between the high vowels ı, i, u and ü (the so-called i-type2 suffixes (Pollard, 1996)). With one
exception, the present tense verbal suffix –iyor/–ıyor/–uyor/–üyor, no suffix contains o
or ö. (4) above gives a basic idea of this classification. The plural suffix -ler/-lar falls
in the first class, whereas a suffix such as the definite objective case suffix is an i-type
suffix.
(5) ev (house)        evler (houses)        evi (the house)
    kol (arm)         kollar (arms)         kolu (the arm)
    kitap (book)      kitaplar (books)      kitabı (the book)
    köprü (bridge)    köprüler (bridges)    köprüyü (the bridge)
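The two harmony principles summarized in (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the XFST implementation presented later: the function names are hypothetical, and the suffix templates use the archiphoneme symbols E (e-type vowel) and I (i-type vowel) that Section 3 introduces. Buffer consonants and consonant voicing (kitap → kitabı, köprü → köprüyü) are deliberately left out.

```python
# Illustrative sketch only: names and the E/I template convention are
# assumptions made for this example, not the thesis implementation.
VOWELS = set("aeıioöuü")
BACK = set("aıou")       # back vowels
ROUNDED = set("oöuü")    # rounded vowels

def last_vowel(word):
    """Return the last vowel of a lowercase word, or None if there is none."""
    for ch in reversed(word):
        if ch in VOWELS:
            return ch
    return None

def resolve_e(prev):
    """e-type harmony: a after back vowels, e after front vowels."""
    return "a" if prev in BACK else "e"

def resolve_i(prev):
    """i-type harmony: a high vowel agreeing in backness and rounding."""
    if prev in BACK:
        return "u" if prev in ROUNDED else "ı"
    return "ü" if prev in ROUNDED else "i"

def attach(stem, template):
    """Attach a suffix template, resolving E and I from left to right."""
    word = stem
    for ch in template:
        if ch == "E":
            ch = resolve_e(last_vowel(word))
        elif ch == "I":
            ch = resolve_i(last_vowel(word))
        word += ch
    return word
```

For instance, attach("kol", "lEr") yields kollar and attach("ev", "I") yields evi. Because each vowel is resolved against the vowel immediately preceding it, a template with two harmonizing vowels such as lErI (the 3pPl possessive) is handled without extra machinery.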
One might notice a few additional things in (5). First, vowel sequences are generally not
possible in Turkish; exceptions are some loanwords like saat (hour). Typically a buffer y is
inserted if a suffix beginning with a vowel is attached to a word ending in a vowel. In some
cases the buffer is an n or an s. Second, words in Turkish typically end in voiceless
consonants, which change to voiced ones intervocalically. This topic, along with the other
alternations occurring in the process of suffixation, will be further elaborated in Section 3.2.2.
These are the general morphological and phonological features of Turkish that we will pay
attention to. In Sections 3.1 and 3.2 I will present the actual morphotactics of the Turkish
nominal inflectional paradigm and the phonological alternation rules respectively.
1 Exceptions to this principle are: tapu (title-deed), avuç (hollow of the hand), abuk sabuk (nonsensical),
çamur (mud) – in general a can be followed by u if a p, v, b or m intervenes. These exceptions apparently occur
only root-internally and do not seem to affect suffixation: kitap (book) → kitabı (the book, definite objective
case).
2 The e-/i-type distinction is really a distinction between harmonizing vowels, not between suffixes as Pollard
(Pollard, 1996) proposes. Some suffixes like the 3pPl Poss. –leri/–ları feature both types of harmonizing vowels.
2.2 Finite State Technology
Finite state technology was quickly dismissed by linguists in the early stages of its
development because of its weak descriptive power. Later on, however, it proved to be quite
useful for modeling parts of languages that can be considered finite and regular. Various tasks
are nowadays approached using finite state technology: part-of-speech disambiguation,
tokenization, shallow parsing. But the most significant and core application of finite state
technology in NLP remains morphological analysis, which is the basis for any further kind of
natural language processing.
The basic idea behind finite state technology is a set of states with different properties and a
set of arcs that connect these states. Arcs have a direction and an input symbol. That is, for a
particular state there is a set of outgoing arcs with their respective input symbols. The states
and arcs together form networks1.
2.2.1 Finite State Automata (FSA)
Finite state networks typically have one start state and one or more final states. Transitions
between the states are possible only if the required input is recognized. The sequence of
transitions over arcs to a particular state is called a path. In the network in Figure 2.1 there
are two possible paths to the final state 3. In order to accept a string, the network should be
in a final state at the end of the input. Valid inputs for the network in Figure 2.1 are b and
ab, but not a by itself.
[Figure 2.1: A simple three-state network. The state marked with an arrow (1) is the start state,
the state marked with a double circle (3) is the final state. Arc labels: a, b, b.]
For the slightly more complicated network in Figure 2.2, valid input sequences will be: b, ab,
bcb, abcb, bcab, abcab… Because of the looping arc through c, we end up with an infinite set of
acceptable input strings. All the possible input strings in this case follow a particular
(regular) pattern. Enumerating all the inputs is unreasonable; we would rather define some rule
that selects valid inputs. A more compact representation can be defined using regular
expressions. A regular expression (or a regex) is a pattern that matches a set of strings which
obey particular syntax rules. It is an essential concept in Finite State Technology. Regular
expressions describe the languages accepted by Finite State Automata – the regular languages.
The regular expressions of current toolboxes are only partially related to regular expressions
in this strict sense: every particular toolbox defines newer operations that extend its
capabilities and expressive power. The precise syntax varies among applications and toolboxes.
I will describe the necessary syntax basics in further detail, in terms of the toolbox I am
using, in Section 2.3. A model solution for the above networks using the lexc language is
provided in the appendix.
[Figure 2.2: A slightly more complicated three-state model. The arc with input c takes us back
to the start state, creating a loop. Arc labels: a, b, b, c.]
1 We will be talking about networks here as a general term abstracting over transducers and automata. Automata
are finite state machines that only accept a set of given strings (a language), whereas transducers provide a set of
outputs for an accepted input, which might as well be identical to the input. Automata describe languages,
whereas transducers express relations between languages.
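To make the relation between a network and a regular expression concrete, here is a minimal Python sketch of the network in Figure 2.2. It is not the lexc solution from the appendix. The transition table is an assumption reconstructed from the accepted strings listed in the text (b, ab, bcb, abcb, bcab, abcab…), and the pattern uses Python's re syntax, which differs from the XFST regular expression syntax.

```python
import re

# Transition table reconstructed (as an assumption) from the strings the
# text lists as valid: state 1 is the start state, state 3 the final state.
TRANSITIONS = {
    (1, "a"): 2,
    (1, "b"): 3,
    (2, "b"): 3,
    (3, "c"): 1,   # the looping arc back to the start state
}
START, FINALS = 1, {3}

def accepts(string):
    """Follow the arcs symbol by symbol; accept only if we end in a final state."""
    state = START
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:          # no arc for this symbol: reject
            return False
    return state in FINALS

# The same language as a regular expression: an optional a, then b,
# followed by any number of repetitions of c, optional a, b.
PATTERN = re.compile(r"a?b(ca?b)*")

def matches(string):
    """Accept a string only if the whole string fits the pattern."""
    return PATTERN.fullmatch(string) is not None
```

Both functions accept exactly the same strings; the regular expression is simply the more compact description of the infinite set that the loop creates.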
2.2.2 Finite State Transducers (FST’s)
A Finite State Network (or a Finite State Machine), as noted above, is the general term for
Finite State Automata (FSA) and Finite State Transducers (FST’s). Where FSA deal with
acceptance/recognition only, FST’s also provide output(s) for the recognized input. This
major difference is described using symbol pairs in the model in Figure 2.3 below:
For an input string like ab the output will be AB, for abcb it will be ABcB, and so on. It
looks like a simple replacement operation, but there is no such operation involved here. In this
case we have strings from one language (later on referred to as the ‘UPPER’ language1) related
to strings from another language (which will be called the ‘LOWER’ language1). The c, which
remains unchanged, is mapped to itself by the identity relation.
These are the basics. Once we have designed a network describing a language or a relation,
we can apply different operations to it – intersection (&)2, union (|), concatenation ( ),
negation (~), subtraction (-), composition (.o.), etc. The essential terms will be explained as
needed as we proceed. Most important to note here is the composition operation (.o.). A
general feature of Finite State Networks is that they can be composed together, yielding a
sequence of transducers/automata – a modular structure that is essential to our purpose
in this paper. Composition is an operation on two relations. Say we have the transducer in
Figure 2.3, which turns lowercase a’s and b’s into uppercase A’s and B’s respectively.
In terms of relations, this can be described as <a,A> and <b,B>. Say we then have
another transducer that turns capital A’s and B’s into numbers, <A,1> and <B,2>.
Composing the two of them gives us a new transducer that takes the upper side of
the first and the lower side of the second transducer, where the inner symbols match:
(6) [<a,A>, <b,B>] .o. [<A,1>, <B,2>] → [<a,1>, <b,2>]
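Example (6) can be reproduced with a few lines of Python. This is a deliberately simplified sketch: it treats each transducer as a plain set of (upper, lower) symbol pairs and ignores states entirely, whereas real composition in a toolbox such as XFST operates on whole networks. The function and variable names are hypothetical.

```python
def compose(first, second):
    """Compose two relations given as sets of (upper, lower) symbol pairs.

    A pair (u, l) is in the result whenever some inner symbol m exists
    with (u, m) in `first` and (m, l) in `second`.
    """
    return {(u1, l2) for (u1, l1) in first for (u2, l2) in second if l1 == u2}

to_upper = {("a", "A"), ("b", "B")}   # <a,A>, <b,B>
to_digit = {("A", "1"), ("B", "2")}   # <A,1>, <B,2>

composed = compose(to_upper, to_digit)
```

Here composed contains exactly the pairs <a,1> and <b,2>; the inner symbols A and B no longer appear, which is why intermediate forms disappear once a cascade has been composed into a single transducer.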
[Figure 2.3: A Finite State Transducer. It accepts the same strings as the FSA in Figure 2.2, but transforms
the lowercase a’s and b’s into uppercase A’s and B’s respectively. The c’s remain unchanged. Arc labels:
a:A, b:B, b:B, c.]
1 The terms will be explained in more detail in Section 2.3.
2 The operators and their syntax vary among toolboxes. I will be using the ones described in (Karttunen, 2003).
All the operations can be applied multiple times to different networks. For some of them the
order matters, for others it does not. Composition allows us to build a cascade of multiple
transducers into a single transducer; for the task at hand, this means composing multiple rule
transducers into a single lexical transducer that relates strings from the language of surface
forms to strings from the language of lexical (underlying) forms. It was C.D. Johnson
(Johnson, 1972) who first realized that morphophonological knowledge could be modeled
using finite state networks. The most fascinating part is that, once we have constructed a
transducer for morphological generation, we can easily apply it in the other direction for the
task of morphological analysis. This natural feature of finite state networks is what makes
them so suitable for morphological processing.
I will omit the mathematical model behind Finite State Networks, as it is not necessary for
understanding the current paper. For further information on finite state technology and automata
theory refer to (Hopcroft, 1979).
2.3 XFST
The Xerox Finite State Toolbox (XFST) was developed at the Xerox Research Centre Europe
(XRCE) by Kenneth R. Beesley and Lauri Karttunen. It implements the standard finite state
operations such as composition and union as well as several innovative operations like
replacement rules1 and local sequentialization. XFST includes lexc, a compiler for lexicons
in the lexc language, which is specifically designed for handling morphotactics in natural
languages, and xfst, the core tool, which provides an interface to the finite state calculus for
building, accessing and manipulating Finite State Networks, as well as a compiler for regular
expressions and replacement rules, which will be essential to my work. Additionally, there is a compiler for
two-level morphology rules (twolc) as described by Koskenniemi (Koskenniemi, 1983), but
its application is beyond the scope of my work, so I will leave it aside. XFST also provides
two tools, lookup and tokenize, designed for testing and application of larger projects, but
they won’t be discussed any further in this paper.
In the process of implementing a morphological analyzer, the morphotactics will be defined in
lexc, as intended, whereas the phonological/orthographical alternation rules will be defined as
separate transducers (mostly using replacement rules). These rule transducers are composed
together into a single transducer, which is in turn composed with the network derived from the
lexc definition of the lexicon. The result is a lexical transducer, which will be used for our
final purpose. Additional transducers can be composed with the network at hand to impose
restrictions, define alternations or add more content.
XFST defines transducers as relations between two languages. What would be referred to as
upper language, could be thought of as the input and the lower language would then be the
output when we apply an input to a transducer downwards. If we apply input to the transducer
upwards then the roles switch – the input is applied on the lower side and the output comes
from the upper side. Although it seems a bit confusing, the terms upper and lower remain
constant. In the definition of a lexical transducer, the upper side language will describe the
lexical (underlying) forms of the language to be analyzed and the lower side language will
contain the actual surface forms in the standard orthography.
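The roles of the upper and lower sides can be illustrated with a small Python simulation of the transducer in Figure 2.3. This is a sketch under assumptions, not XFST itself: the arc list is reconstructed from the description of the figure, and the function name apply_fst and its down parameter are hypothetical stand-ins for applying a network downwards or upwards.

```python
# Arcs as (source state, upper symbol, lower symbol, target state),
# reconstructed (as an assumption) from the description of Figure 2.3.
ARCS = [
    (1, "a", "A", 2),
    (1, "b", "B", 3),
    (2, "b", "B", 3),
    (3, "c", "c", 1),   # identity pair: c is related to itself
]
START, FINALS = 1, {3}

def apply_fst(string, down=True):
    """Return the set of outputs for `string`.

    down=True reads the upper side and writes the lower side;
    down=False reads the lower side and writes the upper side.
    """
    # Each configuration is (current state, output produced so far).
    configs = {(START, "")}
    for symbol in string:
        next_configs = set()
        for state, out in configs:
            for src, upper, lower, dst in ARCS:
                read, write = (upper, lower) if down else (lower, upper)
                if src == state and read == symbol:
                    next_configs.add((dst, out + write))
        configs = next_configs
    return {out for state, out in configs if state in FINALS}
```

Applied downwards, abcb gives ABcB; applied upwards, ABcB gives abcb back. The arcs themselves never change, only which side is read and which is written, and this is the property that lets one lexical transducer serve for both generation and analysis.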
1 A brief overview of the formalism is available in the appendix.
3. The Model
In this section I will present the nominal paradigm of Turkish and my implementation of it.
There are two modules in the model – the lexicon defining the morphotactics of Turkish
nouns and the morphophonological rules component describing the alternations occurring on
the surface. In Sections 3.1 and 3.2 I will present the theoretical background behind my
model. An important notion in the following sections will be that of archiphonemic
descriptions. As I was implementing the vowel harmony principles using variables for the
alternating vowel segments, I realized that the idea of using variables could be further
employed to describe other phenomena, such as the consonant alternations. My initial
approach, using consonant alternation rules on the surface forms, failed to describe the
exceptional cases, so I had to redesign it using underspecified abstract symbols on the lexical
side for entries that undergo the alternations and fully specified symbols for entries that do
not. The general idea is this: I will be using, both in theory and in practice, so-called
archiphonemes to describe classes of similar phonemes that alternate depending on the
environment. For example, to describe vowel harmony I will be using “I” to generalize over the
class of high vowels that alternate according to the principle of i-type vowel harmony and “E”
to generalize over the class of low unrounded vowels that alternate in accordance with the
principle of e-type vowel harmony. The symbols denoting the particular classes of alternating
phonemes will be defined as needed as we proceed.
3.1 The Nominal Paradigm of Turkish. Morphotactics
The nominal inflectional paradigm is defined in different ways in the various sources. The
basic pattern on which everyone agrees though is:
STEM – NUM – POSS – CASE
Turkish has no distinction of grammatical gender. It is worth mentioning that in some sources
the relativizing suffix –ki is classified as part of the nominal inflectional paradigm. At the
current stage of development, however, I will not be concerned with it. The case suffixes, too,
are defined differently in the various sources – in some of the recent works, the suffix
–(y)la/–(y)le is classified as an instrumental case suffix. We’ll get back to this issue
in the subsequent sections.
So let’s have a closer look at the core of the Turkish noun paradigm. The definition will be
further extended in the subsequent sections.
[Figure: FSA with states 0 to 5; arc labels STEM, NUM, POSS, CASE and three empty (0) arcs.]
Figure 3.1: A simplified FSA model for the nominal morphotactics in Turkish.
3.1.1 Inflection for Number
The basic uninflected dictionary form of Turkish nouns is singular (or as claimed in some
sources – “numberless”). The plural form is derived by attaching the –ler/–lar suffix. It
comes generally before any other inflectional suffix. Its vowel is of e-type harmony, therefore
the compact representation using an archiphonemic description will be –lEr. Ketrez (Ketrez,
2003) provides an extensive study on the multiple readings of the Turkish plural morpheme,
but it is mostly from syntactic and semantic points of view and I won’t go any further
discussing the issue.
3.1.2 Case Inflection
Lewis (Lewis, 1967, 1989) defines six cases in his grammar of Turkish. Table 3.1 below
provides an overview of the case paradigm in Turkish:
Case \ Last preceding vowel        e or i    ö or ü    a or ı    o or u
Absolute (Nominative)              -         -         -         -
Definite Objective (Accusative)    -(y)i     -(y)ü     -(y)ı     -(y)u
Genitive (of)                      -(n)in    -(n)ün    -(n)ın    -(n)un
Dative (to, for)                   -(y)e     -(y)e     -(y)a     -(y)a
Locative (in, on, at)              -de       -de       -da       -da
Ablative (from, out of)            -den      -den      -dan      -dan
Table 3.1: Summary of case suffixes in Turkish.
The bracketed y and n are realized on the surface only if the word the suffix is attached to
ends in a vowel. The locative and ablative suffixes are generally realized as –de/–da and
–den/–dan, but when attached to a word ending in a voiceless consonant (ç, f, h, k, p, s, ş and
t), they are realized as –te/–ta and –ten/–tan respectively. So using archiphonemic
descriptions and the principles of vowel harmony, the case inflection summary will look like:
Case                               Lexical Form of the Suffix
Absolute (Nominative)              -
Definite Objective (Accusative)    -(y)I
Genitive (of)                      -(n)In
Dative (to, for)                   -(y)E
Locative (in, on, at)              -DE
Ablative (from, out of)            -DEn
Table 3.2: Summary of case suffixes in Turkish using archiphonemic descriptions.
A few examples will be:
(7) araba → araba-(y)I → arabayı
(car, Nom.) (car, Acc. / LF) (car, Acc. / SF – “the car”)
ev → ev-DE → evde
(house, Nom.) (house, Loc. / LF) (house, Loc. / SF – “in the house”)
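The interplay of buffer consonants, vowel harmony and the d/t alternation summarized in Table 3.2 can be illustrated with a minimal Python sketch. It is not the xfst implementation: the function name `attach_case`, the dictionary `CASE_SUFFIXES` and the string encoding of the suffixes are assumptions made for this example only, and stem-internal alternations are ignored.

```python
VOWELS = set("aeıioöuü")
BACK = set("aıou")
ROUNDED = set("oöuü")
VOICELESS = set("çfhkpsşt")

# Lexical forms from Table 3.2; a bracketed segment is a buffer consonant.
CASE_SUFFIXES = {
    "Acc": "(y)I",
    "Gen": "(n)In",
    "Dat": "(y)E",
    "Loc": "DE",
    "Abl": "DEn",
}

def last_vowel(word):
    """Return the last vowel of word (assumes the word contains one)."""
    return [ch for ch in word if ch in VOWELS][-1]

def attach_case(stem, case):
    """Attach a case suffix to a plain surface stem and resolve E, I and D."""
    suffix = CASE_SUFFIXES[case]
    # The buffer consonant surfaces only after a vowel-final stem.
    if suffix.startswith("("):
        buffer, rest = suffix[1], suffix[3:]
        suffix = buffer + rest if stem[-1] in VOWELS else rest
    word = stem
    for ch in suffix:
        v = last_vowel(word)
        if ch == "E":      # e-type harmony: backness only
            ch = "a" if v in BACK else "e"
        elif ch == "I":    # i-type harmony: backness and roundedness
            if v in BACK:
                ch = "u" if v in ROUNDED else "ı"
            else:
                ch = "ü" if v in ROUNDED else "i"
        elif ch == "D":    # d/t alternation after a voiceless consonant
            ch = "t" if word[-1] in VOICELESS else "d"
        word += ch
    return word
```

For instance, `attach_case("araba", "Acc")` and `attach_case("ev", "Loc")` reproduce the two derivations in (7).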
As mentioned above, some more recent works treat what used to be (and I believe still is) a
postposition (ilE) following absolute or genitive forms as an additional instrumental/
comitative case suffix (–(y)lE). As far as I know, however, it is still used
both as a postposition and as a cliticized suffix. I will stick to the classic works for now and
treat it as a separate (non-case) suffix1.
3.1.3 Inflection for Possession
Whereas in many languages possession is expressed using pre-/post-posed pronouns (English:
my/mine, your/yours, his, her/hers, etc.; German: mein (my), dein (your), sein (his), ihr
(her), etc.; Bulgarian: pre-posed – мой ([moy] - my), твой ([tvoy] - your), негов ([negov] -
his), неин ([nein] - her)…; post-posed – ми ([mi] – my, “of mine”), ти ([ti] – your, “of
yours”), му ([mu] – his, “of his”), и ([i] – her, “of hers”), etc.), in Turkish possession is
expressed by suffixes. The complexity of the possessives varies across languages, depending
on their overall morphological complexity. In Bulgarian for example, the pre-posed
possessives act pretty much like adjectives and typically precede them, so they carry the
inflection for gender, number and definiteness. In Turkish the possessive suffixes are partially
derived from the present tense forms of the verb to be. A summary of the possessive suffixes
is presented in Table 3.3 below:
Person Suffix Gloss
1pSg -(I)m my
2pSg -(I)n your
3pSg -(s)I his/her/its
1pPl -(I)mIz our
2pPl -(I)nIz your
3pPl -lErI their
Table 3.3: Summary of possessive suffixes in Turkish using archiphonemic descriptions.
Again, the bracketed segments surface only under particular conditions. In contrast to the case
suffixes, where the bracketed segments surface only if the word they are attached to ends in
a vowel, here the optional segments surface either if the word the possessive suffix is attached
to ends in a consonant (for the first and second person singular and plural) or if the word
ends in a vowel (for the third person singular). So we have vowel deletion in one case and
consonant insertion in the other, to avoid vowel sequences2.
(8) ev → ev-(I)m → evim
(house) (house, 1pSg Poss. / LF) (house, 1pSg Poss. / SF – “my house”)
araba → araba-(I)mIz → arabamız
(car) (car, 1pPl Poss. / LF) (car, 1pPl Poss. / SF – “our car”)
araba → araba-(s)I → arabası
(car) (car, 3pSg Poss / LF) (car, 3pSg Poss. / SF – “his/her car”)
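The condition on the bracketed segments of Table 3.3 can be stated compactly: the optional segment is kept exactly when it differs in type (vowel vs. consonant) from the last segment of the stem. The following Python sketch illustrates this under that assumption; `attach_possessive`, `harmonize` and `POSSESSIVE` are hypothetical names, and neither stem alternations nor the pronominal n are handled.

```python
VOWELS = set("aeıioöuü")
BACK, ROUNDED = set("aıou"), set("oöuü")

# Lexical forms from Table 3.3.
POSSESSIVE = {
    "1pSg": "(I)m", "2pSg": "(I)n", "3pSg": "(s)I",
    "1pPl": "(I)mIz", "2pPl": "(I)nIz", "3pPl": "lErI",
}

HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}

def harmonize(stem, suffix):
    """Append suffix to stem, resolving E and I from the last preceding vowel."""
    word = stem
    for ch in suffix:
        last = [c for c in word if c in VOWELS][-1]
        if ch == "E":
            ch = "a" if last in BACK else "e"
        elif ch == "I":
            ch = HIGH[(last in BACK, last in ROUNDED)]
        word += ch
    return word

def attach_possessive(stem, person):
    suffix = POSSESSIVE[person]
    if suffix[0] == "(":
        optional, rest = suffix[1], suffix[3:]
        stem_ends_in_vowel = stem[-1] in VOWELS
        optional_is_vowel = optional == "I"
        # Keep the optional segment only if it separates two segments of the same type.
        keep = stem_ends_in_vowel != optional_is_vowel
        suffix = optional + rest if keep else rest
    return harmonize(stem, suffix)
```

The three derivations in (8) correspond to `attach_possessive("ev", "1pSg")`, `attach_possessive("araba", "1pPl")` and `attach_possessive("araba", "3pSg")`.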
1 Lewis (Lewis, 1967, 1989) states that it is attached to nominative nouns and genitive pronouns; in this sense it
could be considered an additional case suffix. I will leave it aside until I get a clearer view on the issue.
2 More on vowel sequences to come in the description of the rules in the following sections.
Possessive suffixes precede case suffixes. A second look at the two inflectional
paradigms shows that some of the suffixed forms can
overlap on the surface. For example: the underlyingly different ev-(y)I (house – Definite
Objective (Accusative) case, “the house”) and ev-(s)I (house – 3pSg possessive, “his house”)
end up absolutely the same on the surface – evi:
(9) ev → ev-(y)I → evi
(house) (house, Acc / LF) (house, Acc. / SF – “the house”)
ev → ev-(s)I → evi
(house) (house, 3pSg Poss. / LF) (house, 3pSg Poss. / SF – “his/her house”)
Things get further complicated if there are multiple instances of the plural suffix –lEr. In the
case of the 3pPl possessive, for example, if the possessed noun is already plural – evler (houses)
→ *evlerleri → evleri (their houses) – one –lEr gets deleted. So we end up having the single
form evleri for both “their house” and “their houses”. A closer look, however, reveals
even further complications: evleri could also denote the accusative case of the plural of
house (“the houses”) and the 3pSg possessive of the plural of house – “his/her houses”.
Even though Turkish is morphologically highly specified, we often have 2-, 3- or, as in this
case, 4-fold ambiguities. The derivations from the underlying lexical representations of the
four interpretations of evleri are given in (10) below:
(10) Pl.Acc.         Pl.3pPl.Poss.     Sg.3pPl.Poss.    Pl.3pSg.Poss.
     (the houses)    (their houses)    (their house)    (his/her houses)
     ev-lEr-(y)I     ev-lEr-lErI       ev-lErI          ev-lEr-(s)I
     ev-ler-i        ev-ler-leri       ev-leri          ev-ler-i
     evleri          evleri            evleri           evleri
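The four-fold ambiguity can be reproduced mechanically by generating every combination of number, third person possessive and (absolute or accusative) case for ev and grouping the analyses by surface form. The Python sketch below does this under simplifying assumptions: the names `realize`, `NUMBER`, `POSS` and `CASE` are hypothetical, N stands for the pronominal n introduced later in the text, and only the suffixes needed for this example are listed.

```python
import re
from collections import defaultdict
from itertools import product

VOWELS = set("aeıioöuüEI")      # archiphonemes count as vowels here
BACK, ROUNDED = set("aıou"), set("oöuü")

NUMBER = {"Sg": "", "Pl": "-lEr"}
POSS = {"NoPoss": "", "3pSg": "-(s)IN", "3pPl": "-lErIN"}   # N = pronominal n
CASE = {"Abs": "", "Acc": "-(y)I"}

def realize(lexical):
    # Two adjacent plural markers collapse into one.
    lexical = lexical.replace("-lEr-lErI", "-lErI")
    # The pronominal n surfaces only if another suffix follows.
    lexical = re.sub(r"N(?=-)", "n", lexical).replace("N", "")
    out = ""
    for morph in lexical.split("-"):
        m = re.match(r"\((.)\)(.*)", morph)
        if m:
            optional, rest = m.groups()
            keep = (out[-1] in VOWELS) != (optional in VOWELS)
            morph = optional + rest if keep else rest
        out += morph
    surface, last = "", None
    for ch in out:
        if ch == "E":
            ch = "a" if last in BACK else "e"
        elif ch == "I":
            ch = {(True, True): "u", (True, False): "ı",
                  (False, True): "ü", (False, False): "i"}[(last in BACK, last in ROUNDED)]
        if ch in VOWELS:
            last = ch
        surface += ch
    return surface

analyses = defaultdict(list)
for (num, n), (poss, p), (case, c) in product(NUMBER.items(), POSS.items(), CASE.items()):
    analyses[realize("ev" + n + p + c)].append((num, poss, case))
```

After running the loop, `analyses["evleri"]` holds exactly the four readings listed in (10), and `analyses["evi"]` holds the two readings of (9).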
Worth noting, just to make things even more confusing, is that after the third person
possessive suffixes, a so-called “pronominal n” is added when a case suffix follows.
(11) evi (his/her house, also the house)
but:
(12) evinde (in his/her house in our case, but also identical with “in your house”)
Confusing? Typically ambiguities are resolved by looking at the context where the ambiguous
word occurs – ambiguous forms are usually used with the genitive of the personal pronouns to
avoid confusion. In this case the noun itself reverts to accusative case.
(13) evleri (their house) → onların evi (their house, “the house of theirs”)
(house, 3pPl.Poss) (they, Gen.; house, Acc.)
evleri (their houses) → onların evleri (their houses, “the houses of theirs”)
(houses, 3pPl.Poss) (they, Gen.; houses, Acc.)
evleri (his houses) → onun evleri (his houses, “the houses of his”)
(houses, 3pSg.Poss) (he, Gen.; houses, Acc.)
For the purpose of this project, however, I won’t be concerned with morphological
disambiguation, as this task should be performed at a later stage, after examining the already
analyzed context. There are further distinctions in the uses of the possessives in Turkish, but
again, this topic is beyond the scope of my work.
As one might imagine, for a single entry in the lexicon, that is for a single noun stem, there
are plenty of possible inflections: 2 options for number, times 7 for possession (the six possessive
suffixes plus the possession-free form), times 6 for case (or even 7 if the instrumental case is
included), which results in 2 × 7 × 6 = 84 basic forms from inflection only (even though some of them
might be identical), and things get further complicated.
3.1.4 Lexical Exceptions – the ‘su’ case
There is only one pure lexical exception to the paradigm – the noun su (water). There is
however a large number of derived noun roots that end in –su, for example: akarsu (river –
“running water”). For this reason, it deserves a special treatment.
The exception manifests itself as su taking the -yun suffix for the genitive (instead of the
standard –nun suffix) and also, in the possessive forms, there is always a y preceding the
possessive suffix – suyum (my water) instead of *sum, suyu (his/her water) instead of *susu.
In general, the y is inserted whenever a suffix starting in a vowel or dropping consonant is
attached to the word.
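The behaviour of su can be paraphrased as: insert y whenever the suffix starts with a vowel or with a dropping (bracketed) segment, and then treat the stem as consonant-final. The following Python sketch illustrates this reading; the function `inflect_su` is hypothetical and assumes that the caller already knows the stem is su or a compound ending in su.

```python
VOWELS = set("aeıioöuüEI")      # archiphonemes count as vowels here
BACK, ROUNDED = set("aıou"), set("oöuü")
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}

def inflect_su(stem, suffix):
    """Attach a lexical suffix such as "(n)In" or "lEr" to a su-type stem."""
    if suffix and (suffix[0] == "(" or suffix[0] in VOWELS):
        stem += "y"                       # the exceptional y
        if suffix[0] == "(":
            optional, rest = suffix[1], suffix[3:]
            # After y the stem is consonant-final: optional vowels stay,
            # optional consonants drop.
            suffix = optional + rest if optional in VOWELS else rest
    word = stem
    for ch in suffix:
        last = [c for c in word if c in VOWELS][-1]
        if ch == "E":
            ch = "a" if last in BACK else "e"
        elif ch == "I":
            ch = HIGH[(last in BACK, last in ROUNDED)]
        word += ch
    return word
```

With this sketch, `inflect_su("su", "(n)In")` yields the genitive with y instead of n, while consonant-initial suffixes such as the plural attach directly.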
3.2 Phonological Alternation Rules
In the following subsections I will outline theoretically the basics of the phonological
alternation rules in Turkish with respect to the task at hand.
3.2.1 Resolving Vowel Harmony
The vowel harmony principles as described in Section 2.1 are rather simple to implement. I
will present how the basics work and then address some of the exceptional cases.
I split the two harmony classes into two rules – one for e-type harmony and one for i-type harmony. The
e-type harmony rule checks the value of the backness feature of the last preceding vowel – if it is a
back vowel, the underlying E is realized as a; if it is a front vowel, it is realized as e. Since the
system does not provide us with feature specification of phonemes, I had to define the classes
of vowels as sets:
(14) define BackV [a | ı | o | u];
define FrontV [e | i | ö | ü];
define LowV [a | e | o | ö];
define HighV [ı | i | u | ü];
define UnroundedV [a | ı | e | i ];
define RoundedV [o | ö | u | ü];
The intersection (&) of those sets provides us with the sub-classes of vowels having
combined features. So the set of back unrounded vowels will be derived as:
(15) [ BackV & UnroundedV ]
which results in:
(16) [ a | ı ]
This is essential for defining the i-type harmony, as it is based on two features rather than one,
namely backness and roundedness. So, if the last preceding vowel is back and unrounded, the
underlying I is realized as ı (the high back unrounded vowel, obtained, so to say, by intersecting the
set of high vowels with the sets describing the features of the last preceding vowel). The same
holds for the other realizations of the underlying I:
(17) I → [HighV & BackV & RoundedV] || [BackV & RoundedV] Consonant1 _
Which should be read as: I is realized as the high-back-rounded vowel (u) in the context of a
back rounded last preceding vowel (o or u). The other rules are identical:
(18) I → [HighV & BackV & UnroundedV] || [BackV & UnroundedV] Consonant _
I → [HighV & FrontV & RoundedV] || [FrontV & RoundedV] Consonant _
I → [HighV & FrontV & UnroundedV] || [FrontV & UnroundedV] Consonant _
This is only necessary to state clearly the principles governing vowel harmony. One might as
well simply write the rules as: I -> i || [ i | e] Consonant _ , but that won’t have much
descriptive linguistic value.
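The same set-based reasoning can be tried out directly with Python sets, whose `&` operator is also intersection. The sketch below mirrors the class definitions in (14) and derives the realization of I from the features of the last preceding vowel; the function name `resolve_I` is an assumption of this example and not part of the xfst grammar.

```python
BackV, FrontV = set("aıou"), set("eiöü")
LowV, HighV = set("aeoö"), set("ıiuü")
UnroundedV, RoundedV = set("aıei"), set("oöuü")

def resolve_I(last_vowel):
    """Pick the high vowel sharing backness and rounding with last_vowel."""
    for backness in (BackV, FrontV):
        for rounding in (UnroundedV, RoundedV):
            if last_vowel in backness & rounding:
                # Each triple intersection contains exactly one vowel.
                (realization,) = HighV & backness & rounding
                return realization
    raise ValueError(f"not a Turkish vowel: {last_vowel!r}")
```

Here `BackV & UnroundedV` evaluates to the set containing a and ı, as in (16), and `resolve_I("o")` returns u, as stated by rule (17).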
In my solution the rules operate in parallel locally, that is, the e-type rules operate together
among themselves and so do the i-type rules, but the e-type harmony still has precedence over the
i-type. The reason behind it: apart from the backness harmony being the more general
principle and having broader coverage, the abstract symbols have to be resolved in a left-to-right
fashion, and e-type suffixes at the current stage precede i-type suffixes. We need the
exact properties of the last preceding vowel in order to resolve the next variable vowel in the
following (or even in the same) suffix. In this sense, I might need to combine the e-type and
i-type rules into one single rule operating in parallel as the system gets more sophisticated2.
A few words about the exceptions to vowel harmony: We will be concerned with roots whose
last vowel does not have predictive power over the harmonic features of the suffixes attached
to it. Schaaik (Schaaik, 1996) refers to words which induce such exceptions as “disharmonic
roots”. The same term however is used in some sources for roots that do not conform to the
principles of vowel harmony internally – the exceptional cases already mentioned in Section 2.1,
like anne (mother), çamur (mud), etc. Although they often do overlap, it can’t be stated
that this is always the case. The exceptions we will be dealing with are mostly of foreign
origin: alkol (alcohol), rol (role), saat (clock), etc. are realized as alkolü (alcohol, Acc.), rolü
(role, Acc.), saati (clock, Acc.) and alkoller (alcohol, Pl.), roller (role, Pl.), saatler (clock,
Pl.) instead of *alkolu, *rolu, *saatı and *alkollar, *rollar, *saatlar respectively in their
accusative and plural forms.
1 Consonant is also a defined class featuring all the consonants.
2 A small issue that occurred when I accidentally switched the order of the rules was that words
having a round vowel in their last syllable (like katalog (catalogue)) were resolved in an unusual way -
*kataloglarunuz, whereas the correct form would be kataloglarınız (your catalogues). This was due to resolving
the –InIz (2pPl Possessive suffix) as –unuz in concordance with the last (resolved) preceding vowel o (the E in
the plural suffix –lEr was still pending resolution). This is important, because if an e-type suffix is added, all the
following suffixes feature unrounded vowels (unless a suffix with an invariable rounded vowel is added).
3.2.2 Consonant Alternation Rules
As mentioned in Section 2.1, word final consonants undergo particular alternations depending
on the environment. For the purpose of this project, I used archiphonemic descriptions for the
alternating segments. Most of the related works pick the capital letter for the voiced
phoneme (B for b and p, D for d and t, etc.) or a capital for the geminating phoneme (S for ss
and s, etc.). I will stick to the standard notation to avoid unnecessary confusion.
The above abstraction is necessary to model the exceptions to these alternation rules. We will
pay some attention to the exceptions in the end of the section. My approach to this issue is
partially based on the paper by Sharon Inkelas and Orhan Orgun (Inkelas, 1997) in which
lexical exceptions are treated in terms of Optimality Theory. In brief: the alternating word
final consonant in regular roots that undergo the alternations will be unspecified in the lexicon
using a special symbol and the exceptional cases will be underspecified with their non-
alternating surface realizations so that they won’t trigger the alternation rules.
3.2.2.1 Final Consonant (De)Voicing
The final consonant voicing occurs when a suffix starting in a vowel or a dropping consonant
is attached to the stem. It covers the voiceless plosives p, ç, and t, which transform into their
voiced counterparts. Additionally, what is often classified separately as a “K/0”1 alternation
(namely because of the subclass of velar consonants k, g and ğ that exhibit similar behavior),
falls into this category as well. (19) below gives a basic idea of which alternations
occur, where they occur, and what the archiphonemic symbols stand for:
(19) B → b || _ Vowel, otherwise B → p
D → d || _ Vowel, otherwise D → t
C → c || _ Vowel, otherwise C → ç
K → ğ || _ Vowel, otherwise K → k
G → ğ || _ Vowel, otherwise G → g
Q → g || _ Vowel, otherwise Q → k
So far it seems fine as far as alternations in the stems are concerned. But similar alternations
occur in suffixes as well. They are dependent on the preceding phoneme and assimilate the
value of its voicing feature. So we have2:
(20) B → p || VoiceLessCons _ , otherwise B → b
D → t || VoiceLessCons _ , otherwise D → d
C → ç || VoiceLessCons _ , otherwise C → c
1 K/0 because the counterpart of k intervocalically is the so-called “yumuşak ge” (soft g), which is phonologically
realized as lengthening of the preceding vowel.
2 For the purpose of this project only the d/t alternation will actually be used, as it is the only one occurring in
the inflectional suffixes of nouns.
An example for both phenomena where several rules apply, will be the inflection of kitap
(book) in Table 3.4:
Surface Form   Lexical Form     Alternation Rules       Gloss
kitap          kitaB            B→p                     book, Sg, Nom.
kitaplar       kitaB-lEr        B→p, E→a                book, Pl, Nom.
kitabım        kitaB-(I)m       B→b, I→ı                book, Sg, 1pSg Poss, Nom.
kitapta        kitaB-DE         B→p, D→t, E→a           book, Sg, Loc.
kitabımda      kitaB-(I)m-DE    B→b, I→ı, D→d, E→a      book, Sg, 1pSg Poss, Loc.
Table 3.4: Summary of the application of the phonological alternation rules.
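The derivations of kitap can be replayed with a small Python sketch that applies the simplified rules (19) and (20) with regular expressions and then resolves vowel harmony. This is only an approximation of the idea: the function `realize` and its rule order are assumptions of this example, only the archiphonemes B, D, E and I and the optional vowel (I) are covered, and the lexical plural is written with the archiphoneme E.

```python
import re

VOWEL = "aeıioöuüEI"                      # vowels, including archiphonemes
BACK, ROUNDED = set("aıou"), set("oöuü")
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}

def realize(lexical):
    # Optional vowel: dropped after a vowel-final stem, kept otherwise.
    s = re.sub(rf"(?<=[{VOWEL}])-\(I\)", "-", lexical)
    s = s.replace("(I)", "I")
    # Stem-final B: b before a vowel-initial suffix, p elsewhere.
    s = re.sub(rf"B(?=-[{VOWEL}])", "b", s).replace("B", "p")
    # Suffix-initial D: t after a voiceless consonant, d elsewhere.
    s = re.sub(r"(?<=[çfhkpsşt])-D", "-t", s).replace("D", "d")
    # Vowel harmony, resolved left to right on the boundary-free string.
    surface, last = "", None
    for ch in s.replace("-", ""):
        if ch == "E":
            ch = "a" if last in BACK else "e"
        elif ch == "I":
            ch = HIGH[(last in BACK, last in ROUNDED)]
        if ch in "aeıioöuü":
            last = ch
        surface += ch
    return surface
```

Calling `realize` on the lexical forms of the table returns the corresponding surface forms, for example `realize("kitaB-DE")` for the locative.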
The rules in (19) and (20) are oversimplified of course. In the actual implementation they
feature a wider context including morpheme boundaries to make the distinctions clearer.
In linguistic terms we have regressive assimilation in stems and progressive assimilation in
suffixes. The exceptions to these rules include primarily monosyllabic words that preserve the
quality of their final consonant. There are however monosyllabic words that do undergo the
alternation rules, as there are polysyllabic words that do not. Such exceptions will be
underspecified in the lexicon with their unchanging consonant.
3.2.2.2 (De)Gemination
Apart from the final stop voicing/devoicing, which is the most productive type of consonant
alternation, a few other types of alternations are worth mentioning. The final consonant
(de)gemination occurs only in a small number of Arabic loan words. The nature of this
phenomenon is similar to the one of the final consonant (de)voicing – a word final segment
gets doubled if a suffix starting in a vowel (or dropping consonant) is attached to the word:
(21) his (feeling) → hissi (feeling, Acc., “the feeling”) → hisler (feelings)
hat (line) → hattı (line, Acc., “the line”) → hatlar (lines)
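The gemination pattern can be sketched in two steps, in the spirit of the special-symbol approach: a marked consonant stays single before a consonant-initial suffix or at the end of the word, and is doubled everywhere else. In the Python sketch below the symbols S and T, the function name `resolve_gemination` and the assumption that optional suffix segments have already been resolved are all choices made for this illustration; harmony is left unresolved.

```python
import re

NOT_VOWEL = "[^aeıioöuüEI]"   # any segment that is not a vowel or vowel archiphoneme

def resolve_gemination(form):
    """Resolve the geminating symbols S and T in a form like "hiS-lEr"."""
    # Step 1: single consonant before a consonant-initial suffix or word-finally.
    form = re.sub(rf"S(?=-{NOT_VOWEL}|$)", "s", form)
    form = re.sub(rf"T(?=-{NOT_VOWEL}|$)", "t", form)
    # Step 2: every remaining symbol precedes a vowel and is doubled.
    return form.replace("S", "ss").replace("T", "tt")
```

For example, `resolve_gemination("hiS-I")` doubles the s, whereas `resolve_gemination("hiS-lEr")` keeps it single.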
Again, we will have to employ special symbols that will be realized differently on the surface
depending on the context as proposed by Schaaik (Schaaik, 1996)1. He proceeds even further,
investigating the dependence of these alternations on the re-syllabification processes
occurring with the different suffixes. I will not go into detail however, as my project is not
intended to feature a syllabification module in its current stage of development.
3.2.3 Other Alternations
Two other alternations are worth mentioning for the sake of completeness. One of them
involves vowel insertion/deletion and the other describes the status of the glottal stop in
Turkish. The first one is rather common, whereas the second operates on a limited domain of
Arabic loan words. Both of them show some ambiguities.
1 This issue could be approached differently, by underspecifying the geminating stems with their double
consonants in the lexicon and then removing the additional segment if necessary.
3.2.3.1 Vowel Insertion/Deletion
Some stems in Turkish exhibit an interesting property of forming stem final consonant
clusters via vowel epenthesis:
(22) burun (nose) → burnu (nose, Acc. “the nose”)
fikir (idea) → fikri (idea, Acc. “the idea”)
şehir (city) → şehri (city, Acc. “the city”)1
ömür (life) → ömrü (life, Acc. “the life”)
alın (forehead) → alnı (forehead, Acc. “the forehead”)
This phenomenon occurs again whenever a suffix starting in a vowel is attached to the stem
(seems like all the stem-internal alternations in Turkish are conditioned on the same context).
The epenthesized vowel is always a high vowel, but its other features cannot be automatically
determined, so it has to be hard-coded. Such stems will be indicated in the lexicon with a meta
character preceding the vowel which is to be deleted.
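A possible encoding of such stems is sketched below in Python. The meta character is not named at this point in the text, so the symbol @ is purely hypothetical, as is the function name `delete_marked_vowel`; the input is assumed to have vowel harmony already resolved, since the suffix has to harmonize with the vowel before that vowel is deleted.

```python
import re

V = "aeıioöuü"

def delete_marked_vowel(form):
    """Drop an @-marked high vowel if a vowel-initial suffix follows the stem."""
    # "@" + high vowel, followed by stem-final consonants, a boundary and a vowel.
    form = re.sub(rf"@[ıiuü](?=[^{V}-]*-[{V}])", "", form)
    # Otherwise the vowel stays; remove the marker and the boundaries.
    return form.replace("@", "").replace("-", "")
```

With this encoding, `delete_marked_vowel("bur@un-u")` loses the marked vowel, while `delete_marked_vowel("bur@un-lar")` keeps it.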
As for the quality of the consonant clusters that are formed after the epenthesis occurs, there
have been several attempts to define the possible consonant sequences in such cases, but this
is far beyond the scope of this paper.
3.2.3.2 The Glottal Stop
This, along with the gemination rule, is probably the least productive rule in Turkish.
They both concern only a limited number of Arabic loan words. The nature of the glottal stop
is not quite clear to me, however I attempted an approach based on Schaaik’s (Schaaik, 1996)
description and the Turkish Lexical Database Project (TLDP). Schaaik (Schaaik, 1996)
describes two types of glottal stop:
(23) Type 1: ^ -> 0 / ^ (0 if a consonant follows and ^ if a vowel follows)
cami^ (mosque) -> camiler (mosques)
-> cami^i (the mosque / his/her mosque)2
Type 2: ‘ -> i / ‘ (i if a consonant follows and ‘ if a vowel follows)
nev’ (sort) -> neviler (sorts)
-> nev’i (the sort / his/her sort)
(Schaaik, 1996, pp. 114)
Both are supposed to act as consonants if a vowel follows.
In modern Turkish however, the glottal stop is mostly omitted both in speech and writing. It is
preserved only when ambiguities occur – telin (of the wire / your wire) and tel’in
(denunciation). Apparently, in TLDP the glottal stop is not featured either. Both cases are
accepted there – camii and camisi both denote the 3pSg Possessive form (his/her mosque),
and likewise camii and camiyi both denote the accusative case (the mosque). For the second
type though, only yeisi (the despair / his/her despair) and neviyi (the sort) / nevisi (his/her sort)
are recognized. So the first type allows for both realizations, whereas the second type
behaves more or less as if it wasn’t there at all.
1 In modern Turkish, the tendency is to retain the i in şehir (city) – şehiri (the city).
2 The Type 1 glottal marker ^ is not manifesting itself orthographically.
In my solution, I tried to approach the issue as in the TLDP. There are some mismatches
though, and even though it is more likely that the mistake is overgeneration from my side, it is
also possible that the TLDP analyzer has some flaws. The examples I am concerned with are:
(24) camim (mosque, 1pSg Poss. “my mosque”)
vs.
(25) camiim (mosque, 1pSg Poss. “my mosque”)
Analogous to camii and camisi (his/her mosque), they should both denote the same thing, but
the TLDP analyzer provides different solutions, where only the first one (camim) seems to be
proper. I have to investigate the issue further. For now, in my project they will both stand for
“my mosque”.
3.3 Implementation
The model comprises two components – the lexicon, defined in lexc, describing the
morphotactics of Turkish (technically it is implemented as an FSA, but it does include some
transductions for the tags), and a set of rules, that describe the morphophonological
alternations that occur on the surface (implemented naturally by a set of FSTs in xfst, using
the formalism of replacement rules).
3.3.1 The Lexicon
The lexicon network implemented in lexc describes the morphotactics of the Turkish nominal
inflection. First of all, there is a multicharacter symbols definition (26) where a set of
sequences of symbols that should be treated as atomic symbols is defined:
(26) Multichar_Symbols +Noun +Poss +Case +1p +2p +3p +Sg +Pl +DefObj +Gen
+Dat +Loc +Abl +Abs
These are primarily used to define the tags to be used (case marking, possession, number,
etc.). Further on, it contains a sub-lexicon of the noun stems – it is the simplest, but most
important part – it contains the noun stems in their lexical (underlying) form, which could be
automatically extracted from a dictionary. This form includes all the special symbols that
denote alternating segments and trigger the alternation rules. Then on the next stage (the
standard continuation class for all nouns) a tag +Noun is attached on the upper side, that is, it is
visible only if morphological analysis (or lookup) is performed (same for all the other tags).
On the lower (surface) side it is realized as an epsilon. The continuation class from there is the
number lexicon – number suffixes are attached on the lower (surface) side and tags +Sg and
+Pl are attached on the upper (lexical) side (the dash “-” stands for morpheme boundary):
(27) LEXICON Number
+Sg:0 Possessive;
+Pl:-lEr Possessive;
A possessive sub-lexicon follows which defines the inflection for possession as described in
Section 3.1.3 with the appropriate tags. There is an intermediate lexicon however, that
specifies the optionality of the possessive suffixes:
(28) LEXICON Possessive
+Poss:0 PSuff;
+Case:0 CSuff;
That is, either take a possessive tag +Poss and go to the lexicon of possessive suffixes, or take
a +Case tag and go to the lexicon of case suffixes. So the actual sub-lexicon for the
possessive suffixes is called PSuff:
(29) LEXICON PSuff
+1p+Sg:-*Im Case; ! "my"
+2p+Sg:-*In Case; ! "your"
+3p+Sg:-*sIN Case; ! "his/her/its"
+1p+Pl:-*ImIz Case; ! "our"
+2p+Pl:-*InIz Case; ! "your"
+3p+Pl:-lErIN Case; ! "their"
After taking a possessive suffix there is again an intermediate stage that should be passed –
the possessive forms still have to take a +Case tag. In the morphological analysis module of
the Turkish WordNet® the possessive markup is obligatory. It is referred to as possessive
agreement there, and if there is none, then the tag is +Pnon. I don’t find it necessary for now,
but of course it won’t be any problem to tune my system up so that it features the same type
of mark-up.
Two more points to make clear: the optional segments which were marked with brackets in
the theoretical part are prefixed with an optionality marker (*); the pronominal n is denoted
by the capital N. Oflazer (Oflazer, 1995) defines it as a part of the case suffixes. In my case it
is an optional segment that surfaces only if there is a suffix following the third person singular
and plural possessive forms. In his case, there are two copies of each case suffix – one that
follows the third person possessive form and one for all the other possessive and non-
possessive forms. To me it seems more intuitive to have it as a part of the possessive, as it is
indeed a “pronominal n”, and I don’t find much sense in having two instances of every case
inflection.
(30) LEXICON CSuff
+DefObj:-*yI #; ! Definite Objective Case (Accusative)
+Gen:-*nIn #; ! Genitive Case - possessive, "of"
+Dat:-*yE #; ! Dative Case - (indirect object) "to", "for"
+Loc:-DE #; ! Locative Case - "in", "on", "at"
+Abl:-DEn #; ! Ablative Case - "from", "out of", "through"
+Abs:0 #; ! Absolute (dictionary) form (Nominative)
The last component of our lexicon is the case inflection sub-lexicon. It is obligatory, as all
uninflected nouns are in their absolute form (Nominative case). The hash symbol (#) is an
anchor symbol denoting word boundary (in replacement rules it is circumfixed by dots (.#.)).
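To get a feeling for how the continuation classes chain together, the sub-lexicons (27) to (30) can be mirrored as a Python dictionary and walked recursively, collecting the upper (tag) and lower (lexical) string of each path. This is only a sketch and not lexc: the names `LEXICON` and `paths`, the single stem entry ev and the exact shape of the Root, Noun and Case entries are assumptions of this example.

```python
# Each entry: (upper side, lower side, continuation class); "" plays the role of 0.
LEXICON = {
    "Root":       [("ev", "ev", "Noun")],
    "Noun":       [("+Noun", "", "Number")],
    "Number":     [("+Sg", "", "Possessive"), ("+Pl", "-lEr", "Possessive")],
    "Possessive": [("+Poss", "", "PSuff"), ("+Case", "", "CSuff")],
    "PSuff":      [("+1p+Sg", "-*Im", "Case"), ("+2p+Sg", "-*In", "Case"),
                   ("+3p+Sg", "-*sIN", "Case"), ("+1p+Pl", "-*ImIz", "Case"),
                   ("+2p+Pl", "-*InIz", "Case"), ("+3p+Pl", "-lErIN", "Case")],
    "Case":       [("+Case", "", "CSuff")],
    "CSuff":      [("+DefObj", "-*yI", "#"), ("+Gen", "-*nIn", "#"),
                   ("+Dat", "-*yE", "#"), ("+Loc", "-DE", "#"),
                   ("+Abl", "-DEn", "#"), ("+Abs", "", "#")],
}

def paths(state="Root"):
    """Yield every (upper, lower) pair reachable from state down to '#'."""
    if state == "#":
        yield "", ""
        return
    for upper, lower, nxt in LEXICON[state]:
        for rest_upper, rest_lower in paths(nxt):
            yield upper + rest_upper, lower + rest_lower

pairs = list(paths())
```

For the single stem, `len(pairs)` is 84, which agrees with the 2 × 7 × 6 count from Section 3.1.3; each lower string still contains the markers and archiphonemes that the rules component has to resolve.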
To summarize, a visual map of the lexicon network is presented in Figure 3.2 below:
[Figure: lexicon network with the states 0.Root, 1.Noun, 2.NN, 3.Number, 4.Possessive, 5.PSuff, 6.Case, 7.CSuff and 8.#; the arcs carry the noun stems, +Noun:0 and the entries of the Number, Possessive, PSuff and CSuff sub-lexicons given in (27)–(30).]
Figure 3.2: Schematic visualization model of the lexicon network
3.3.2 The Rules Component
The rules component of the system is implemented as a sequence of composed transducers in
xfst using the formalism of replacement rules. It currently features 17 rules, of which 12 are
significant and 5 are just for cleaning up the markers1. The rules are composed in a particular
sequence, as some of them do depend on each other. Full independence is hardly achievable.
In the case of a dropping vowel in the stem for instance, the vowel harmony rules have to
apply before the vowel is deleted, since the suffixes have to harmonize with this vowel. This
is especially true for monosyllabic roots that lose their one and only vowel. The rules are split
(for now) in several groups addressing the different phenomena types that they describe.
1 I prefer to keep them apart in the development stage, as it often happens that I need to preserve some markers in
order to see what exactly has gone wrong in case of an error.
A few classes needed to be defined in order to make the rules operational. I defined a class for
the vowels and consonants initially, where the consonant class had to be extended to feature
all the archiphonemic descriptions used. As already mentioned, the vowels are further divided
into subclasses according to their features for the vowel harmony resolution. Further on, for
the rule of progressive assimilation in suffixes, I had to define a class of voiceless consonants.
3.3.2.1 Vowel Harmony Rules
So far, the rules for e-type and i-type vowel harmony are split into two separate rules (which
operate in parallel among themselves), where the e-type precedes the i-type harmony
resolution, but they might need to be merged into a single rule operating in parallel on all the
harmonizing segments.
As mentioned above in Section 3.2.1, an underlying E is realized as e on the surface when the
last preceding vowel is a front vowel and as a when the last preceding vowel is a back vowel.
This defines the e-type harmony rule:
(31) E -> e || FrontV ~$[Vowel] _ ,,
E -> a || BackV ~$[Vowel] _
The dollar sign ($) has a special meaning in xfst – “contains”. The tilde (~), on the other hand,
stands for the complementation operator – negation (in this case: negation of the language that
contains vowels). In simple words the left context should be read as: there is a front vowel on
the left and between it and the symbol to be resolved (E), there are no other vowels. Same for
the second line, only that it concerns back vowels in the left context. A thing to mention, the
double commas (,,) in xfst replacement rules stand for parallel operation (as opposed to the
composition operator (.o.) which stands for sequential operation). In other words, this is a
two-level rule. Same for the i-type harmony rule, only that it considers two features (backness
and roundedness) of the last preceding vowel.
(32) I -> i || [FrontV & UnroundedV] ~$[Vowel] _ ,,
I -> ü || [FrontV & RoundedV] ~$[Vowel] _ ,,
I -> ı || [BackV & UnroundedV] ~$[Vowel] _ ,,
I -> u || [BackV & RoundedV] ~$[Vowel] _
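The effect of the rule order can be imitated in Python. In the sketch below each harmony step reads its left context from its input string, which is my reading of how the `||` contexts and the `~$[Vowel]` condition behave; it is also assumed that the class Vowel contains only the eight surface vowels and not the archiphonemes E and I. The function names are hypothetical.

```python
FRONT, BACK = set("eiöü"), set("aıou")
ROUNDED = set("oöuü")
VOWELS = FRONT | BACK
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}

def last_real_vowel(prefix):
    """Last surface vowel in prefix; unresolved E and I are skipped."""
    for ch in reversed(prefix):
        if ch in VOWELS:
            return ch
    return None

def e_harmony(form):
    out = []
    for i, ch in enumerate(form):
        if ch == "E":
            v = last_real_vowel(form[:i])     # context taken from the input side
            ch = "a" if v in BACK else "e"
        out.append(ch)
    return "".join(out)

def i_harmony(form):
    out = []
    for i, ch in enumerate(form):
        if ch == "I":
            v = last_real_vowel(form[:i])     # context taken from the input side
            ch = HIGH[(v in BACK, v in ROUNDED)]
        out.append(ch)
    return "".join(out)
```

Composing the steps as `i_harmony(e_harmony("katalog-lEr-InIz"))` gives the expected unrounded suffix vowels, whereas the swapped order `e_harmony(i_harmony("katalog-lEr-InIz"))` reproduces the erroneous *kataloglarunuz pattern described in the footnote to Section 3.2.1.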
As far as vowel disharmony in suffixes is concerned, the stems that induce such disharmony
will be marked as such (again this could be implemented as an automated procedure) by
inserting a (dis-)harmony marker after the last vowel of the stem. The disharmony marker
itself will be nothing more than the vowel that induces the new vowel harmony, prefixed by a
harmony marker (H). For example: alkol (alcohol) which transforms into alkolü (instead of
*alkolu) (the alcohol) and alkoller (instead of *alkollar) (alcohol, Pl) will be lexically
represented as alkoHül.
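The effect of the harmony marker can be sketched in Python as follows. The sketch resolves the harmonizing vowels sequentially from left to right rather than with parallel replacement rules, and it treats the vowel after H as a silent override of the harmonic context; the function name `realize` and the lexical form roHül for rol are assumptions built by analogy with alkoHül.

```python
VOWELS = set("aeıioöuü")
BACK, ROUNDED = set("aıou"), set("oöuü")
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}

def realize(lexical):
    surface, last, i = "", None, 0
    while i < len(lexical):
        ch = lexical[i]
        if ch == "H":
            # Harmony marker: the next vowel sets the harmony but is not pronounced.
            last = lexical[i + 1]
            i += 2
            continue
        if ch == "E":
            ch = "a" if last in BACK else "e"
        elif ch == "I":
            ch = HIGH[(last in BACK, last in ROUNDED)]
        if ch in VOWELS:
            last = ch
        if ch != "-":                 # morpheme boundaries do not surface
            surface += ch
        i += 1
    return surface
```

With the marker in place, `realize("alkoHül-I")` and `realize("alkoHül-lEr")` harmonize with ü instead of o, while a regular stem such as okul is unaffected.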
3.3.2.2 Consonant Alternation Rules
The most productive consonant alternation rule as described in Section 3.2.2.1 is the final stop
devoicing rule. Similar rules (both in operation and conditions) are the K/0 alternation rule
and the consonant gemination rules. The suffix onset (de)voicing rule will also fall in this
category.
For these rules, as already mentioned, I had to use abstract symbols denoting the alternating
phonemes (just as in the vowel harmony rules). As we’ve already had an extensive overview
of the principles behind these rules I will not discuss them any further.
(33) Final Consonant Devoicing Rule:
[ B -> b , C -> c , D -> d || _ %- (%*) Vowel ] .o.
[ B -> p , C -> ç , D -> t || _ [[%- (%*) Cons] | .#.]]
Velar Alternation Rule:
[ G -> ğ, K -> ğ, Q -> g || _ %- (%*) Vowel ] .o.
[ G -> g, K -> k, Q -> k || _ [[%- (%*) Cons] | .#.]]
Suffix Onset Devoicing Rule:
[ C -> ç , D -> t , G -> k || VLCons %- (%*) _ ] .o.
[ C -> c , D -> d , G -> ğ || ~VLCons %- (%*) _ ]
Gemination Rule:
[ S -> s, T -> t || _ [%- Cons | .#. ] ] .o.
[ S -> [ s s ] , T -> [ t t ] ]
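To make the two-step structure of the final consonant devoicing rule in (33) concrete, here is a rough Python approximation using regular expressions. It is only a sketch: the function name is hypothetical, the vowel set is assumed to contain the eight surface vowels, and every non-vowel symbol is treated as a consonant.

```python
import re

VOWELS = "aeıioöuü"
VOICED = {"B": "b", "C": "c", "D": "d"}
VOICELESS = {"B": "p", "C": "ç", "D": "t"}

def final_stop_devoicing(form: str) -> str:
    # Step 1: voiced variant before a boundary (and optional *) plus a vowel.
    form = re.sub(rf"[BCD](?=-\*?[{VOWELS}])",
                  lambda m: VOICED[m.group()], form)
    # Step 2: voiceless variant before boundary plus consonant, or word-finally.
    form = re.sub(rf"[BCD](?=-\*?[^{VOWELS}]|$)",
                  lambda m: VOICELESS[m.group()], form)
    return form

for lexical in ["kitaB", "kitaB-ı", "kitaB-lar"]:
    print(lexical, "->", final_stop_devoicing(lexical))
```

The lookaheads play the role of the right-hand contexts in the xfst rule: they inspect the following material without consuming it.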
The percent sign (%) is used as an escape character in xfst to literalize characters that
otherwise have a special meaning. The anchor marker (.#.) is used to denote word boundaries (the
beginning of string if used on the left and the end of string if used on the right). Round
brackets denote optionality in the regular expression sense: (%*) in a replacement rule means
“there may be a literal * here”. A few notes on the gemination rule: it is a rather radical
approach as far as context is concerned, although it could be refined in case of failure. So far
it covers only the cases of geminating s and t. They were chosen at random out of the set of
eight geminating consonants in Turkish, just to implement the principle. For the remaining six
consonants, special symbols have to be chosen and their transformations need to be inserted in
the rule (a purely mechanical operation). There are, however, some special cases where
gemination and devoicing occur simultaneously:
(34) muhip (friend) → muhibbi (friend, Acc., “the friend”)
and the even more complicated case of serhat (border), which is an exception to vowel
harmony, besides undergoing gemination and voicing: serhaddi (border, Acc., “the
border”). This issue could be fixed using a few minor tricks and the current system is ready to
handle it, but I will leave it for a later stage of development.
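Leaving these special cases aside, the basic gemination rule from (33) can be approximated in Python as follows. This is a sketch only: the function name is hypothetical, and every non-vowel symbol is treated as a consonant.

```python
import re

VOWELS = "aeıioöuü"

def gemination(form: str) -> str:
    # Step 1: single consonant before boundary plus consonant, or word-finally.
    form = re.sub(rf"[ST](?=-[^{VOWELS}]|$)",
                  lambda m: m.group().lower(), form)
    # Step 2: every remaining abstract symbol surfaces as a geminate.
    return form.replace("S", "ss").replace("T", "tt")

for lexical in ["hiS", "hiS-i", "haT-lar", "haT-ı"]:
    print(lexical, "->", gemination(lexical))
```

The second step has no context at all, which is what makes the approach "radical": whatever the first step does not catch is doubled.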
3.3.2.3 Fixing the Morphotactics
The next few rules are used to “fix the morphotactics”: they deal with general phenomena
such as vowel/consonant deletion, the pronominal n, the lexical exception “su” and the
elimination of the multiple plural morpheme.
The rule for multiple plurals simply takes two adjacent plural morphemes and rewrites them
as a single morpheme, nothing unusual.
The rule for the pronominal n simply drops the N word-finally (a tricky solution).
The rule for the glottal stop is again a tricky solution. As we’ve seen, the ^ marker is either
realized as underlying consonant or as nothing at all. My approach was to optionally delete it
if a vowel follows:
(35) [ %^ (->) [.0.] || _ %- %* ]
This way, both camii and camisi/camiyi will be recognized as described in Section 3.2.3.1.
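An optional rule relates one input to several outputs. The Python sketch below imitates this for rule (35) by returning the set of all variants. The function name is hypothetical, and the lexical form cami^-*sI used in the example is an assumed representation, not taken from the actual lexicon.

```python
import re
from itertools import product

def optional_glottal_deletion(form: str) -> set:
    """Return every variant in which ^ before '-*' is kept or deleted."""
    # With a capturing group, re.split keeps the matched ^ at odd indices.
    parts = re.split(r"(\^)(?=-\*)", form)
    slots = [(part, "") if index % 2 == 1 else (part,)
             for index, part in enumerate(parts)]
    return {"".join(choice) for choice in product(*slots)}

print(sorted(optional_glottal_deletion("cami^-*sI")))
```

Both variants are then passed on to the later rules, which is how more than one surface form can be recognized for the same lexical entry.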
Next, the rule for dropping stem vowels has nothing particularly interesting to it. Dropping
segments are prefixed by a literal dollar sign ($), so an underlying koy$un (bosom) will be
realized as: koyun (bosom) and koynu (the bosom) (as opposed to koyun (sheep) which is
realized as: koyunlar (sheep, Pl) and koyunu (the sheep)). The dropping segments remain if
the suffix attached starts in a consonant on the surface.
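The behaviour described for dropping stem vowels can be sketched in Python as below. This is an illustration, not the xfst rule: the function name is hypothetical, and the sketch removes the $ marker in the same step, whereas the actual system uses a separate clean-up rule for that.

```python
import re

VOWELS = set("aeıioöuü")

def stem_vowel_deletion(form: str) -> str:
    stem, boundary, suffix = form.partition("-")
    # Ignore a leading optional-segment marker when inspecting the suffix onset.
    onset = suffix.lstrip("*")[:1]
    if onset in VOWELS:
        # Vowel-initial suffix: drop the marked segment together with its marker.
        stem = re.sub(r"\$.", "", stem)
    else:
        # No suffix or consonant-initial suffix: keep the segment.
        stem = stem.replace("$", "")
    return stem + boundary + suffix

for lexical in ["koy$un", "koy$un-u", "koy$un-lar", "koyun-u"]:
    print(lexical, "->", stem_vowel_deletion(lexical))
```

The unmarked koyun (sheep) passes through unchanged, which is exactly the contrast between the two lexical entries.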
The rule for the exceptional class of words ending in “–su” (water) is again pretty simple: the
words ending in su are specified in the lexicon as suY (this is partially motivated by the origin
of the word, which historically derives from “suw”). This special symbol is then realized as y in
the proper context or as epsilon by default. It took me quite a while to come to this idea. I was
happy to see that others have approached the issue in a similar way¹.
The most complicated rule, and the one that took me the most time to design (and for which it
is still an open question whether it is the best solution), is the rule that manages all the
dropping consonants and vowels in suffixes (except the pronominal n). I called it fixing the
vowel sequences, as this is more or less what it is supposed to do. In the case suffixes we
have y and n insertion if the stem ends in a vowel. On the other hand, in the possessive
inflectional paradigm we have high vowel (I) deletion if the stem ends in a vowel and s
insertion if the stem ends in a vowel. In simple terms, all these phenomena occur to avoid
vowel sequences. After quite a bit of thinking I dealt with all these phenomena in a
single blow:
(36) [ [? - HighV] -> 0 || Cons %- %* _ ] .o.
[ HighV -> 0 || Vowel %- %* _ ]
The above composition of two rules does two things, namely: 1. It deletes every segment that
is not a high vowel (I), marked up as optional, in the context of a preceding consonant across
a morpheme boundary and 2. It deletes every high vowel (I), which is marked up as optional
in the context of a preceding vowel across a morpheme boundary.
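The two steps of (36) can be approximated in Python as follows. This is a sketch under assumptions: the function name is hypothetical, the high vowels are assumed to be already resolved to their surface values (the harmony rules apply earlier), and every symbol that is neither a vowel nor a marker counts as a consonant.

```python
import re

VOWELS = "aeıioöuü"
HIGH = "ıiuü"

def fix_vowel_sequences(form: str) -> str:
    # Step 1: after a consonant, drop an optional segment that is not a high vowel.
    form = re.sub(rf"([^{VOWELS}\-*])-\*[^{HIGH}]", r"\1-*", form)
    # Step 2: after a vowel, drop an optional high vowel.
    form = re.sub(rf"([{VOWELS}])-\*[{HIGH}]", r"\1-*", form)
    return form

for lexical in ["ev-*si", "araba-*sı", "ev-*im", "araba-*ım"]:
    print(lexical, "->", fix_vowel_sequences(lexical))
```

The boundary and optional markers are left in place here, since they are removed by the clean-up rules at the end of the cascade.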
1 See (Schaaik, 1996).
The remaining rules clean up the marker leftovers. The clean up procedures can be
incorporated in the rules themselves, but during the development stage, I prefer to keep them
separated for debugging purposes.
3.3.2.4 Rule Order
A few notes on the current rule ordering.
FixMultiPlural
.o.
eTypeHR
.o.
iTypeHR
.o.
SUexception
.o.
PronominalN
.o.
FixGlottal
.o.
FixVowelSeq
.o.
FinalStopDevoicing
.o.
VelarAlternations
.o.
Gemination
.o.
SuffOnsetDevoicing
.o.
StemVDeletion
.o.
ClearMBMarker
.o.
ClearOptMarker
.o.
ClearSVDMarker
.o.
ClearGlottal
.o.
ClearExHarmony
Figure 3.3: The current rule ordering
First things first: getting rid of the multiple plural morpheme is a good place to start.
There are some local dependencies among the rules. As already mentioned, the e-type
harmony rule has to precede the i-type harmony rule (or the two will probably have to be merged
into a single rule and applied simultaneously as two-level rules). The vowel harmony
resolution also has to precede the stem vowel deletion. If we proceed from left to right (with
parallel rules), the stem vowels will be deleted before the suffix vowels that should
harmonize with the deleted vowel are resolved. There should also be some tendency to go
from simpler and more general to more sophisticated and specific rules (either in upward or
downward direction). No such tendency is present at the current stage of development. The
final stop devoicing, the velar alternations, the gemination, the stem vowel deletion and the
rule for the su exception could all operate at a single stage, as they occur in identical
contexts and serve more or less the same purpose. The suffix onset devoicing rule is partially
dependent on the outcome of the final stop devoicing rule, but if the input is processed left to
right, that outcome will be determined before the suffix onset devoicing rule applies. The
pronominal n rule is also on its own, so, looking at the bigger picture, the rules turn out to
be mostly independent. All that matters is to process the input sequentially, from left to right.
With the wrong rule ordering, rules that apply to segments occurring after unresolved segments
might cause major trouble. This is the reason why most finite state
approaches to Turkish morphology are based on two-level morphological descriptions.
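The dependency between harmony resolution and stem vowel deletion can be demonstrated with a toy cascade in Python. This is only a sketch: both stages are simplified stand-ins for the xfst rules, all names are hypothetical, and the lexical form vak$it-I is an assumed representation of vakit (time), whose accusative is vakti.

```python
import re
from functools import reduce

FRONT, ROUNDED, VOWELS = set("eiöü"), set("oöuü"), set("aeıioöuü")

def i_harmony(form: str) -> str:
    out, last = [], None
    for ch in form:
        if ch == "I" and last is not None:
            if last in FRONT:
                ch = "ü" if last in ROUNDED else "i"
            else:
                ch = "u" if last in ROUNDED else "ı"
        if ch in VOWELS:
            last = ch
        out.append(ch)
    return "".join(out)

def drop_stem_vowel(form: str) -> str:
    # Drop a $-marked segment when the suffix starts with a (possibly abstract) vowel.
    form = re.sub(r"\$.(?=[^-]*-[aeıioöuüI])", "", form)
    return form.replace("$", "")

def compose(*stages):
    """Apply the stages from left to right, like a cascade joined by .o."""
    return lambda form: reduce(lambda current, stage: stage(current), stages, form)

harmony_first = compose(i_harmony, drop_stem_vowel)
deletion_first = compose(drop_stem_vowel, i_harmony)

print(harmony_first("vak$it-I"))   # vakt-i
print(deletion_first("vak$it-I"))  # vakt-ı
```

With deletion first, the suffix vowel harmonizes with the a of the stem instead of the deleted i, which yields the wrong surface form.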
4. Conclusions
In order to analyze the complex and often symbiotic relations between words, one needs first
to determine the exact properties of each and every individual token. Some of the properties
however, could only be determined after examining the environment. The common approach
to this issue is “inside-out” (or bottom-up) – starting from the basic entities and building up
increasingly complex structures out of them. In this paper I presented an approach to part of
the “basic entities” in the Turkish language.
5. Future Work
Where do we go from here? One could come up with various ideas, and I am not yet sure
which direction this project will take. First of all, the model has to be
completed to cover the other major word category in Turkish, as well as the minor word
categories, to result in a full-featured morphological processor. Then, to extend
functionality, a lexicon extraction routine could be implemented that automatically extracts
entries from a dictionary into the morphological processor. This could be combined with a
morphological guesser, and the two could form a symbiotic relation in which the former is
used to train the latter, while the guessing algorithm occasionally provides material for
extending the lexicon. Further on, I am thinking of implementing a syllabification
module, as it seems quite necessary, and perhaps stress markup as well. With a fully
functional morphological processor at hand, there are various ways one could go: integrate
it into a larger NLP system (speech synthesis/recognition, machine translation, language
tutoring, artificial intelligence components, OCR, supplemental linguistic applications);
extend its functionality for different tasks (a major advantage of the modular approach:
simply add a new module for the task at hand and occasionally tune the existing modules);
add a context component for disambiguation (which perhaps falls into the previous category);
try approaching a different language; and numerous other options in the field. As a first
step, however, complete coverage of the language of choice has to be accomplished.
Bibliography:
Dik, Simon C. 1981. Functional Grammar, 3rd ed. Foris. Dordrecht. The Netherlands.
Clements, George N. and Engin Sezer. 1982. Vowel and Consonant Disharmony in Turkish. Linguistic Models:
The Structure of Phonological Representations (Part II), ed. by H. van der Hulst and N. Smith. Foris Publishing,
Dordrecht, Holland.
Hankamer, Jorge. 1986. Finite State Morphology and Left to Right Parsing. Paper, 3rd International Conference
on Turkish Linguistics, August 1986, Tilburg, The Netherlands.
Hopcroft, J.E. and J.D. Ullman. 1979. Introduction to Automata Theory, Languages and Computation.
Addison-Wesley.
Inkelas, Sharon and C. Orhan Orgun. 1997. The Implications of Lexical Exceptions for the Nature of Grammar.
Derivations and Constraints in Phonology, ed. by Iggy Roca. Clarendon Press, Oxford.
Johnson, C. Douglas. 1972. Formal Aspects of Phonological Description. Mouton. The Hague. Paris.
Karttunen, Lauri, with Kenneth R. Beesley. 2003. Finite State Morphology. CSLI Publications. Stanford.
Ketrez, F. Nihan. 2003. Multiple Readings of the Plural Morpheme in Turkish. USC. USA. (online at:
http://www-scf.usc.edu/~ketrez/papers/ADL2003ketrez.pdf - 25.06.2005)
Koskenniemi, Kimmo. 1983. Two-level Morphology. A General Computational Model for Word-Form
Recognition and Production. Department of General Linguistics. University of Helsinki.
Köksal, A. 1975. A First Approach to a Computerized Model for the Automatic Morphological Analysis of
Turkish. Doctoral Dissertation, Hacettepe Universitesi, Ankara.
Lewis, Geoffrey. 1967. Turkish Grammar. Oxford University Press. Oxford.
Lewis, Geoffrey. 1989. Turkish 2nd ed. (Teach Yourself Books). Hodder and Stoughton. London.
Oflazer, Kemal. 1994. Two-level Description of Turkish Morphology. Literary and Linguistic Computing.
(online at: http://acl.ldc.upenn.edu/E/E93/E93-1066.pdf - 25.06.2005).
Oflazer, Kemal. Elvan Göçmen and Cem Bozşahin. 1995. An Outline of Turkish Morphology. Technical Report.
Middle East Technical University (online at: http://www.lcsl.metu.edu.tr/ftp/papers/morphspecs.ps.gz -
18.07.2005).
Pollard, Asuman Çelen; Pollard, David. 1996. Turkish: A complete course for beginners. (Teach Yourself
Books). Hodder and Stoughton. London.
Schaaik, Gerjan van. 1996. Studies in Turkish Grammar. Harrassowitz Verlag, Wiesbaden, Germany.
Sebüktekin, Hikmet I. 1971. Turkish-English Contrastive Analysis. Turkish Morphology and Corresponding
English Structures. Mouton. The Hague. Paris.
Useful links:
http://www.hlst.sabanciuniv.edu/TL/ - The Turkish Lexical Database Project - provides
morphological analysis to verify the results
http://www.turkishdictionary.net/ - Turkish online dictionary – additional glossary
http://www.google.com/ - Everything is there! – using the web as a corpus
Appendix A: List of Abbreviations
CASES:
Nom./+Abs - Nominative/Absolute
Acc./+DefObj - Accusative/Definite Objective
Dat./+Dat - Dative
Gen./+Gen - Genitive
Loc./+Loc - Locative
Abl./+Abl - Ablative
NUMBER/POSSESSIVE:
Sg./+Sg - Singular
Pl./+Pl - Plural
(+)1p/2p/3p - 1/2/3 Person
Poss./+Poss - Possessive
GENERAL:
FST - Finite State Transducer
FSA - Finite State Automaton (-ta)
FSN - Finite State Network
LF - Lexical Form (lexicon entry form)
SF - Surface Form (standard orthographical representation)
Appendix B: lexc Code Samples
!##############A lexc solution to the network in Figure 2.2########################
LEXICON Root !#The start state so to say. Every lexicon needs it.
One; !#A line in lexc has two components:
!#1. An expression (which could be as complex as needed)
!#2. A continuation class
!#Think of the expression as the symbol over the arc and
!#the continuation class as the destination state
LEXICON One !#Figuratively speaking: State 1
a Two; !#The two arcs with the respective input symbols and destinations
b Three;
LEXICON Two !#State 2
b Three;
LEXICON Three !#State 3
#; !#The hash symbol denotes end of input, or a final state
c One; !#The loop back to State 1
!#################A model lexc solution for Figure 2.3###########################
!#Same as above for the most part
LEXICON Root
One;
LEXICON One
a:A Two; !#The colon operator denotes a transduction here
b:B Three; !#Basically the expressions could be regular expressions
!#with varying complexity, combining various operations,
!#but as my key concept is modularity, I will try to keep
!#them as simple as possible.
LEXICON Two
b:B Three;
LEXICON Three
#;
c One;
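For readers without access to lexc, the second sample can be mimicked by a small table-driven transducer in Python. This is a sketch, not a translation produced by any tool: the names ARCS, FINAL and apply_down are hypothetical, and the c arc is assumed to map c to itself, as in the listing above.

```python
# (state, upper symbol) -> (lower symbol, next state)
ARCS = {
    ("One", "a"): ("A", "Two"),
    ("One", "b"): ("B", "Three"),
    ("Two", "b"): ("B", "Three"),
    ("Three", "c"): ("c", "One"),
}
FINAL = {"Three"}

def apply_down(upper: str):
    """Map an upper-side string to its lower side, or None if it is rejected."""
    state, lower = "One", []
    for symbol in upper:
        if (state, symbol) not in ARCS:
            return None
        output, state = ARCS[(state, symbol)]
        lower.append(output)
    return "".join(lower) if state in FINAL else None

print(apply_down("ab"))    # AB
print(apply_down("abcb"))  # ABcB
print(apply_down("a"))     # None
```

Each dictionary entry corresponds to one lexc line: the key holds the sublexicon and the upper symbol, the value holds the lower symbol and the continuation class.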
Appendix C: On Replacement Rules
Replacement rules are simply intuitive and convenient shorthands for more complex regular
expressions. The most general shape of a context-free replacement rule is:
A->B
where A and B are regular languages (which could be arbitrarily complex regular expressions
themselves). In this case, every string from the upper language (the universal language¹) is
mapped to itself, except that whenever a substring from A is encountered, it is related to a
substring from B (as opposed to ordinary transducers, where an input string that doesn’t match
a string from the upper language produces no output at all). This formalism is
further extended to include context:
A->B || L _ R
where A, B, L and R all denote languages and not relations (both L and R are optional). What
happens here is essentially the same as above, only that the languages A and B are further
contextually restricted. A substring from A is related to a substring from B, only if it is
preceded by a substring from L and followed by a substring from R. The double vertical bars
separate the rule(s) from the context. Different rules operating in the same context are
separated by a comma:
A->B , C->D || L _ R
The same is valid for contexts:
A->B || L1 _ R1 , L2 _ R2
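For the simplest case, in which A, B and the contexts are single literal strings rather than full regular languages, a contextual replacement rule can be imitated with lookbehind and lookahead assertions in Python. The function below is a hypothetical helper written for this illustration and covers only that restricted case.

```python
import re

def replace_in_context(text: str, a: str, b: str, contexts) -> str:
    """Imitate  a -> b || L1 _ R1 , L2 _ R2 ...  for literal strings."""
    pattern = "|".join(
        f"(?<={re.escape(left)}){re.escape(a)}(?={re.escape(right)})"
        for left, right in contexts
    )
    # Everything outside the matched contexts is copied through unchanged.
    return re.sub(pattern, lambda match: b, text)

print(replace_in_context("cad xad", "a", "b", [("c", "d")]))
print(replace_in_context("cad xay", "a", "b", [("c", "d"), ("x", "y")]))
```

Because the assertions do not consume the context, they are checked against the original input, in the same way that the contexts of a || rule are matched on the upper side.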
Replacement rules can be constructed to operate in parallel (as in two-level models) using
the double-comma (,,) separator:
A->B || L1 _ R1 ,, C->D || L2 _ R2
Or composed as standard networks:
[A->B || L1 _ R1] .o. [C->D || L2 _ R2]
The difference is crucial if the rules are dependent on each other.
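A toy example makes the difference visible. The Python sketch below uses two context-free single-symbol rules, a -> b and b -> a, and hypothetical helper functions; it only illustrates the principle and does not reproduce the xfst operators themselves.

```python
def apply_parallel(text: str, rules: dict) -> str:
    """All rules look at the original input, as with the ,, separator."""
    return "".join(rules.get(symbol, symbol) for symbol in text)

def apply_sequential(text: str, rules: list) -> str:
    """Each rule sees the output of the previous one, as with .o."""
    for source, target in rules:
        text = text.replace(source, target)
    return text

print(apply_parallel("abba", {"a": "b", "b": "a"}))        # baab
print(apply_sequential("abba", [("a", "b"), ("b", "a")]))  # aaaa
```

In the parallel case the two symbols are swapped, while in the sequential case the second rule undoes the first and also rewrites the original b symbols.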
These are the basics. For more information on XFST and its replacement rules refer to
(Karttunen, 2003).
1 The language of all possible strings.