Finite State Morphology:

The Turkish Nominal Paradigm

A Thesis by

Philip Makedonski

Submitted to

Seminar für Sprachwissenschaft

Eberhard Karls Universität Tübingen,

72074 Tübingen, Germany

In fulfillment of the requirements for the degree

Bachelor of Arts in Computational Linguistics

July 2005

ABSTRACT

Finite State Morphology: The Turkish Nominal Paradigm

Makedonski, Philip

Seminar für Sprachwissenschaft

Eberhard Karls Universität Tübingen

Supervisor: Dr. Dale Gerdemann

July 2005

24 Pages

In this thesis I present a finite state approach to the inflectional morphology of Turkish nouns, with the ultimate goal of building a morphological analyzer for Turkish nouns. We’ll be dealing primarily with the principles of vowel harmony across the different inflectional noun suffixes in Turkish, the most interesting phenomenon involved, and with my implementation of these principles in the Xerox Finite State Toolbox (XFST). We will also pay attention to the other morphophonological alternations that the inflectional processes trigger in the stem and in the suffixes attached to it.

Keywords: Natural Language Processing, Finite State Networks, Turkish

Morphology, Computational Linguistics

To my family, to my love

ACKNOWLEDGEMENTS

First, I’d like to thank my supervisor Dr. Dale Gerdemann for his support and advice on this project. I appreciate the freedom and independence I had in the choice of topic and approach.

I would also like to thank Dr. Sandra Kübler for her support and understanding throughout this course of studies, which in many cases turned out to be crucial for my progress.

Many, many thanks to my family for their support all the time, no matter what was happening. Thanks to my friends for their understanding.

And most of all, special thanks to Nevin Recep for sparking my interest in the Turkish language and supporting me all the time.

TABLE OF CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS

1. INTRODUCTION
    1.1 MOTIVATION
    1.2 MORPHOLOGY
    1.3 RELATED WORK
    1.4 OVERVIEW

2. BACKGROUND
    2.1 TURKISH
    2.2 FINITE STATE TECHNOLOGY
        2.2.1 Finite State Automata (FSA)
        2.2.2 Finite State Transducers (FST’s)
    2.3 XFST

3. THE MODEL
    3.1 THE NOMINAL PARADIGM OF TURKISH. MORPHOTACTICS
        3.1.1 Inflection for Number
        3.1.2 Case Inflection
        3.1.3 Inflection for Possession
        3.1.4 Lexical Exceptions – the ‘su’ case
    3.2 PHONOLOGICAL ALTERNATION RULES
        3.2.1 Resolving Vowel Harmony
        3.2.2 Consonant Alternation Rules
            3.2.2.1 Final Consonant (De)Voicing
            3.2.2.2 (De)Gemination
        3.2.3 Other Alternations
            3.2.3.1 Vowel Insertion/Deletion
            3.2.3.1 The Glottal Stop
    3.3 IMPLEMENTATION
        3.3.1 The Lexicon
        3.3.2 The Rules Component
            3.3.2.1 Vowel Harmony Rules
            3.3.2.2 Consonant Alternation Rules
            3.3.2.3 Fixing the Morphotactics
            3.3.2.4 Rule Order

4. CONCLUSIONS

5. FUTURE WORK

APPENDIX A: LIST OF ABBREVIATIONS

APPENDIX B: LEXC CODE SAMPLES

APPENDIX C: ON REPLACEMENT RULES

1. Introduction

1.1 Motivation

In morphologically rich languages like Bulgarian, Turkish, Russian, Spanish and many others, grammatical features and functions that morphologically poor languages like English typically assign to the syntactic structure are often represented in the morphological structure. As a consequence, any adequate Natural Language Processing (NLP) application for these languages requires a good morphological component. This in turn requires a rich lexicon, and a lexicon that explicitly lists all the possible forms as separate entries would quickly explode to an unmanageable size, owing to the rich inflectional and derivational possibilities for a single base (dictionary) form (stem). In Turkish, for example, the nominal inflectional paradigm has three basic types of suffixes, for number, possession and case (the number varies across sources), and the verbal inflectional paradigm is even more complicated with its eight affixes (again, the number may differ depending on the source). There are approximately 20,000 stems and 300-400 roots actively used in Turkish, which effectively amount to millions of inflected and derived forms. This further increases the demand for automated morphological analysis.

As it turns out, morphological structures are much more regular than syntactic ones. They can

be handled very efficiently and accurately using sets of rules and compact lexicons of base

forms (stems). Furthermore, important semantic and grammatical information could be

encoded in such lexicons as well.

1.2 Morphology

The central concepts of morphology are morphotactics and (morpho)phonological alternations. Morphotactics (also morphosyntax or word formation) defines the constraints on possible morpheme combinations. Phonological (also orthographical) alternations define the changes that morphemes undergo in particular environments. An example from Karttunen (Karttunen, 2003) illustrates the issue:

(1) pity → piti-less → piti-less-ness

(Karttunen, 2003, p. XVI)

The morphotactic definition accounts for the acceptability of a word like piti-less-ness and the unacceptability of a word like *piti-ness-less. Phonological alternations, on the other hand, describe why pity is realized as piti in the context of a following less. These are simple examples that a few basic rules could capture easily, but full-scale NLP needs a much more sophisticated system. This holds especially for agglutinative languages like Turkish, where the concept of a word is much wider: the relations between the words in a sentence are mostly expressed by affixes. Furthermore, many affixes and roots in Turkish change their shape depending on the environment and have to obey various constraints such as vowel harmony.
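The division of labour between the two concepts can be made concrete with a small sketch. The Python below is only an illustration of example (1): the suffix table `ALLOWED_NEXT` and the y-to-i spelling rule are assumptions made for this toy case and are not meant as a description of English morphology.

```python
# Toy illustration of example (1). The suffix inventory and the y -> i rule
# are assumptions made for this sketch only.
ALLOWED_NEXT = {
    "STEM": {"less"},   # -less may attach to the stem
    "less": {"ness"},   # -ness may follow -less
    "ness": set(),      # nothing may follow -ness
}


def is_well_formed(suffixes):
    """Morphotactics: check that each suffix may follow the previous morpheme."""
    previous = "STEM"
    for suffix in suffixes:
        if suffix not in ALLOWED_NEXT[previous]:
            return False
        previous = suffix
    return True


def spell_out(stem, suffixes):
    """Alternation: a stem-final y is realized as i when a suffix follows."""
    if suffixes and stem.endswith("y"):
        stem = stem[:-1] + "i"
    return stem + "".join(suffixes)


print(is_well_formed(["less", "ness"]), spell_out("pity", ["less", "ness"]))
print(is_well_formed(["ness", "less"]))
```

The first function only decides which morpheme sequences are allowed, and the second only decides how an allowed sequence is spelled. The same separation underlies the lexicon and the rules component described later.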

1.3 Related Work

A significant amount of work has already been done on the computational modeling of Turkish morphology: Köksal’s first approach to a computerized model for the automatic morphological analysis of Turkish (Köksal, 1975); Hankamer’s description in terms of finite state morphology (Hankamer, 1986); numerous recent works by Kemal Oflazer based on his two-level model of Turkish (Oflazer, 1994); and Schaaik’s Studies in Turkish Grammar (Schaaik, 1996), a comprehensive guide to building a computational model for full nominal phrases using the functional grammar formalism (Dik, 1981). Since the earlier works are hard to find, I will briefly discuss only the more recent works by Oflazer and Schaaik, which are closely related to what I am doing in this project.

Oflazer’s work is based primarily on his two-level model for Turkish morphology (Oflazer, 1994). The idea behind two-level models originates with Koskenniemi (Koskenniemi, 1983). The most significant difference from the ordered, linear approach of composed sequences of rule transducers¹ is that all the rules operate in parallel. To illustrate the difference, Figure 1.1 below presents a basic two-level model and a cascade-based model, each relating the language of lexical forms to the language of surface forms.

In the cascade model of composed rule transducers, each transducer operates on its own input and output, producing an intermediate output that feeds the next transducer in the cascade. The key concept here is “feed”: the major drawback of the two-level models has been that bleeding or feeding relations between rules (which are common in generative phonology) can hardly be defined within that approach, other than by designing the rules very carefully in order to get the necessary result. But the convenience of the cascade-based model in this respect comes at a price. In the process of composition, the network can easily “explode” to an unmanageable size, as many parts of it may need to be copied. Luckily, there are some techniques to restrict such growth.

My project combines both models, as we shall see later. The advantage is that whenever rules need to operate in parallel, the parallel model is used, and whenever they need to operate sequentially (linearly), the cascade is used.

[Figure 1.1: Cascade-based and two-level (parallel) models in finite state morphology. In the cascade, FST 1, FST 2, …, FST n are chained between the lexical form and the surface form, passing intermediate forms from one to the next. In the two-level model, FST 1, FST 2, …, FST n all apply in parallel between the lexical form and the surface form.]

¹ More on transducers and automata follows in the technical background on finite state technology in Section 2.2. For now, think of rule transducers simply as a way to implement rules.

1.4 Overview

In the following sections I will present a finite state approach to a part of Turkish morphology. I will focus on the nominal morphology only, in particular the different inflectional paradigms, as the complete nominal morphology of Turkish is a subject too broad to cover here (let alone the complete Turkish morphology). Once a solution for the nominal morphology is designed, however, it could easily be extended to cover the other major word classes of the language. I will try to approach the task as modularly as possible, so that if changes or extensions are required, all that is needed is to plug in the extension component and occasionally tune the system a little. The key concept here is modularity.

My work is based primarily on Geoffrey Lewis’ Turkish (Lewis, 1989) and Turkish Grammar (Lewis, 1967), referred to as the official language guides for Turkish in most papers. For the purpose of this project I will be using the Xerox Finite State Toolbox (XFST) and the “manual” to it by Lauri Karttunen (Karttunen, 2003).

Section 2 outlines the background needed to follow the paper: Section 2.1 gives the linguistic background on Turkish, and Sections 2.2 and 2.3 provide some technical background on the technology employed and the particular toolbox I have chosen to use. The actual model and its implementation are presented in full in Section 3. Section 4 concludes, and Section 5 gives an outlook on possible future elaborations.

2. Background

In the following sections I will present the basic “technical” properties of the language and the technology used to model it.

2.1 Turkish

In this subsection I will present the most important features of Turkish that we’ll be dealing

with in the subsequent sections.

Turkish is an agglutinative language from the family of Turkic languages. A Turkish word

consists of a root (base form) and a number of suffixes attached to it, each extending its

meaning or changing its word class:

(2) bilgi – knowledge
    bilgisiz – without knowledge
    bilgisizlik – lack of knowledge
    bilgisizlikleri – their lack of knowledge
    bilgisizliklerinden – from their lack of knowledge
    bilgisizliklerindenmiş – I gather that it was from their lack of knowledge

(Lewis, 1989, p. 3)

As one might infer, many ideas typically expressed by prepositions or pronouns across

languages are expressed by suffixes in Turkish.

Another important feature of the Turkish language is vowel harmony. Vowel harmony is

basically described as a “progressive sound assimilation” phenomenon. In simple words, the

features of a vowel depend on the features of the preceding vowel. We’ll be dealing

exclusively with the vowel harmony of suffixes in Turkish and as mentioned before, the scope

of this project will be restricted to inflectional noun suffixes only.

Geoffrey Lewis (Lewis, 1989) describes the vowel harmony in Turkish with a general law of

vowel harmony in terms of the feature +/-back of vowels. The Turkish vowel system is

shown in table 2.1 below:

         Unrounded        Rounded
         Low     High     Low     High
Back     a       ı        o       u
Front    e       i        ö       ü

Table 2.1: The vowel system of Turkish.

As stated in (Lewis, 1989), all the vowels in a word agree with the backness value of the first

vowel of that word:

(3) -Back                        +Back
    sekiz – eight                dokuz – nine
    seksen – eighty              doksan – ninety
    sinir – nerve                sınır – frontier
    sinirler – nerves            sınırlar – frontiers
    sinirlerimiz – our nerves    sınırlarımız – our frontiers

(Lewis, 1989, p. 11)

In cases of disharmony¹ in the root, or if an invariable suffix is attached, the harmonic suffixes harmonize with the vowel of the last preceding syllable. So attaching the plural suffix -ler/-lar, which harmonizes for backness, to anne (mother) results in anneler (mothers), and not in *annelar, which would harmonize with the vowel of the first syllable.

¹ Exceptions to this principle are: a small number of native Turkish words – elma (apple), anne (mother), kardeş (brother or sister); eight invariable suffixes; compound words – bilgisayar (computer), from bilgi (information) and sayar (counter, lister); loanwords. Clements and Sezer account for them in (Clements, 1982).

There is also what Lewis (Lewis, 1989) calls a “special law of vowel harmony”, which constrains the occurrence of vowels in terms of roundedness¹. Unrounded vowels are typically followed by unrounded vowels, and rounded vowels are typically followed by low unrounded or high rounded vowels.

Combining the two principles we end up with the following:

(4) a is followed by a or ı

e is followed by e or i

ı is followed by a or ı

i is followed by e or i

o is followed by a or u

ö is followed by e or ü

u is followed by a or u

ü is followed by e or ü

Put simply, Turkish suffixes, except the eight invariable ones, harmonize with the vowel of the last syllable of the word they are attached to. They can be divided into two groups: the vowels of the first group alternate between the low unrounded vowels a and e (the so-called e-type² suffixes (Pollard, 1996)), and the vowels of the second group alternate between the high vowels ı, i, u and ü (the so-called i-type² suffixes (Pollard, 1996)). With one exception, the present tense verbal suffix –iyor/–ıyor/–uyor/–üyor, no suffixes contain o or ö. (4) above gives a basic idea of this classification. The plural suffix -ler/-lar falls into the first class, whereas a suffix like the definite objective case suffix is an i-type suffix.

(5) ev (house) evler (houses) evi (the house)

kol (arm) kollar (arms) kolu (the arm)

kitap (book) kitaplar (books) kitabı (the book)

köprü (bridge) köprüler (bridges) köprüyü (the bridge)
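The selection of a harmonizing vowel, as summarized in (4) and illustrated in (5), can be sketched in a few lines of Python. This is a minimal sketch and not the xfst implementation discussed later: the placeholder symbols `E` (for the a/e alternation) and `I` (for the ı/i/u/ü alternation) and the function names are assumptions made for the illustration. The sketch covers vowel selection only, so it does not handle the buffer consonant or the consonant change visible in köprüyü and kitabı.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
ROUNDED_VOWELS = set("oöuü")
VOWELS = BACK_VOWELS | FRONT_VOWELS


def last_vowel(word):
    """Return the vowel of the last syllable, which triggers the harmony."""
    for ch in reversed(word):
        if ch in VOWELS:
            return ch
    raise ValueError("no vowel found in %r" % word)


def resolve(symbol, trigger):
    """Pick the surface vowel for the placeholder E or I after a trigger vowel."""
    back = trigger in BACK_VOWELS
    if symbol == "E":                      # e-type: backness only
        return "a" if back else "e"
    rounded = trigger in ROUNDED_VOWELS    # i-type: backness and rounding
    if back:
        return "u" if rounded else "ı"
    return "ü" if rounded else "i"


def attach(stem, suffix):
    """Attach a suffix template such as 'lEr', resolving E and I left to right."""
    out = stem
    for symbol in suffix:
        if symbol in ("E", "I"):
            out += resolve(symbol, last_vowel(out))
        else:
            out += symbol
    return out


for stem in ("ev", "kol", "kitap", "köprü"):
    print(stem, attach(stem, "lEr"))
```

Because each placeholder is resolved against the output built so far, a suffix with two harmonizing vowels such as the template `lErI` resolves its second vowel against its first, which matches the rule that harmony follows the last preceding syllable.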

One might notice a few additional things from (5). First, vowel sequences are generally not possible in Turkish; exceptions are some loanwords like saat (hour). Typically a buffer y is inserted if a suffix beginning with a vowel is attached to a word ending in a vowel; in some cases the buffer is an n or an s. Second, words in Turkish typically end in voiceless consonants, which change to voiced ones intervocalically. This topic, along with the other alternations occurring in the process of suffixation, will be elaborated further in Section 3.2.2.

These are the general morphological and phonological features of Turkish that we will pay attention to. In Sections 3.1 and 3.2 I will present the actual morphotactics of the Turkish nominal inflectional paradigm and the phonological alternation rules, respectively.

¹ Exceptions to this principle will be: tapu (title-deed), avuç (hollow of the hand), abuk sabuk (nonsensical), çamur (mud) – in general a can be followed by u if a p, v, b or m intervenes. These exceptions occur apparently only root-internally and do not seem to affect suffixation: kitap (book) → kitabı (book, definite objective case – the book).

² The e-/i-type distinction is really a distinction between harmonizing vowels and not suffixes as Pollard (Pollard, 1996) proposes. Some suffixes like the 3pPl Poss. –leri/–ları feature both types of harmonizing vowels.

2.2 Finite State Technology

Finite state technology was quickly condemned by linguists in the early stages of its development because of its weak descriptive power. Later on, however, it proved quite useful for modeling parts of languages that can be considered finite and regular. Various tasks are nowadays approached using finite state technology: part-of-speech disambiguation, tokenization, shallow parsing and others. But the core and most significant application of finite state technology in NLP remains morphological analysis, which is the basis for any further kind of natural language processing.

The basic idea behind finite state technology is a set of states with different properties and a set of arcs that connect these states. Arcs have a direction and an input symbol. That is, each state has a set of outgoing arcs with their respective input symbols. The states and arcs together form networks¹.

2.2.1 Finite State Automata (FSA)

Finite state networks typically have one start state and one or more final states. A transition between two states is possible only if the required input symbol is recognized. The sequence of transitions over arcs to a particular state is called a path. In the network in Figure 2.1 there are two possible paths to the final state 3. For a string to be accepted, the network must be in a final state at the end of the input. Valid inputs for the network in Figure 2.1 are b and ab, but not a by itself.

[Figure 2.1: A simple three-state network. The state marked with an arrow (1) is the start state, the state marked with a double circle (3) is the final state.]

For the slightly more complicated network in Figure 2.2, valid input sequences are b, ab, bcb, abcb, bcab, abcab, and so on. Because of the looping arc through c, we end up with an infinite set of acceptable input strings, all of which follow a particular (regular) pattern. Enumerating all the inputs is not feasible, so we would rather define a rule that selects the valid ones. A more compact representation can be defined using regular expressions. A regular expression (or regex) is a pattern that matches a set of strings which obey particular syntax rules. It is an essential concept in finite state technology: regular expressions describe the languages accepted by Finite State Automata, the regular languages. In their current form, the regular expressions offered by practical tools are only partially related to regular expressions in the strict formal sense. Each toolbox defines newer operations that extend its capabilities and expressive power, and the precise syntax varies among applications and toolboxes. I will describe the necessary syntax basics in more detail in Section 2.3, in terms of the toolbox I am using. A model solution for the above networks using the lexc language is provided in the appendix.

[Figure 2.2: A slightly more complicated three-state model. The arc with input c takes us back to the start state, creating a loop.]

¹ We will be talking about networks here as a general term abstracting over transducers and automata. Automata are finite state machines that only accept a set of given strings (a language), whereas transducers provide a set of outputs for an accepted input, which might as well be identical to the input. Automata describe languages, whereas transducers express relations between languages.
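To make the notions of state, arc and acceptance concrete, the network of Figure 2.2 can be simulated in a few lines of Python. This is only a sketch: the transition table is reconstructed from the strings listed above, and in particular the assumption that the c arc leaves the final state 3 is inferred from the acceptance of bcb. The pattern `a?b(ca?b)*` uses Python's regular expression syntax, not the xfst syntax, and is my own guess at an equivalent expression.

```python
import re

# Transition table reconstructed from the prose (assumption: the c arc
# leads from the final state 3 back to the start state 1).
TRANSITIONS = {
    (1, "a"): 2,
    (1, "b"): 3,
    (2, "b"): 3,
    (3, "c"): 1,
}
START_STATE = 1
FINAL_STATES = {3}


def accepts(string):
    """Follow one arc per input symbol; accept only if we end in a final state."""
    state = START_STATE
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:          # no arc for this symbol: reject
            return False
    return state in FINAL_STATES


# A candidate regular expression for the same language (Python syntax).
PATTERN = re.compile(r"a?b(ca?b)*")


def matches(string):
    """Check the whole string against the candidate regular expression."""
    return PATTERN.fullmatch(string) is not None


for s in ["b", "ab", "bcb", "abcab", "a", "abc"]:
    print(s, accepts(s), matches(s))
```

Under these assumptions the automaton and the regular expression agree on every string tried, which illustrates the point that a regular expression is simply a compact description of the language a network accepts.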

2.2.2 Finite State Transducers (FST’s)

A Finite State Network (or Finite State Machine), as noted above, is the general term for Finite State Automata (FSA) and Finite State Transducers (FST’s). Whereas FSA deal with acceptance/recognition only, FST’s also provide output(s) for the recognized input. This major difference is expressed using symbol pairs in the model in Figure 2.3 below:

For an input string like ab the output will be AB, for abcb it will be ABcB, and so on. This looks like a simple replacement operation, but no such operation is involved here. Rather, strings from one language (later on referred to as the ‘UPPER’ language¹) are related to strings from another language (which will be called the ‘LOWER’ language¹). The c, which remains unchanged, is mapped to itself by the identity relation.

These are the basics. Once we have designed a network describing a language or a relation, we can apply different operations to it: intersection (&)², union (|), concatenation ( ), negation (~), subtraction (-), composition (.o.), etc. The essential terms will be explained as needed as we proceed. The most important one to note here is the composition operation (.o.). A general feature of Finite State Networks is that they can be composed together, yielding a sequence of transducers/automata, a modular structure that is essential to our purpose in this paper. Composition is an operation on two relations. Say we have the transducer in Figure 2.3, which turns lowercase a’s and b’s into uppercase A’s and B’s respectively. In terms of relations, this can be described as <a,A> and <b,B>. Say we then have another transducer that turns capital A’s and B’s into numbers, <A,1> and <B,2>. Composing the two gives us a new transducer that takes the upper side of the first and the lower side of the second transducer wherever the inner symbols match:

(6) [<a,A>, <b,B>] .o. [<A,1>, <B,2>] → [<a,1>, <b,2>]
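The composition in (6) can be reproduced with a small Python sketch. It treats each transducer simply as a set of symbol pairs, which is a simplification: real transducers relate whole strings through states and arcs, and the function name `compose` is chosen here only for illustration.

```python
def compose(first, second):
    """Pair x with z whenever first relates x to y and second relates y to z."""
    return {(x, z) for (x, y) in first for (y2, z) in second if y == y2}


# The two relations from example (6).
lower_to_upper = {("a", "A"), ("b", "B")}
upper_to_digit = {("A", "1"), ("B", "2")}

composed = compose(lower_to_upper, upper_to_digit)
print(sorted(composed))
```

The inner symbols A and B disappear from the result, just as the intermediate forms disappear when a cascade of rule transducers is composed into a single transducer.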

¹ The terms will be explained in more detail in Section 2.3.

² The operators and their syntax vary among toolboxes. I will be using the ones described in (Karttunen, 2003).

[Figure 2.3: A Finite State Transducer. It accepts the same strings as the FSA in Figure 2.2, but transforms the lowercase a’s and b’s into uppercase A’s and B’s respectively. The c’s remain unchanged. Arc labels: a:A, b:B, b:B, c.]

All the operations can be applied multiple times to different networks. For some of them the order matters, for others it does not. Composition allows us to build a cascade of multiple transducers into a single transducer. For the task at hand, this means composing multiple rule transducers into a single lexical transducer that relates strings from the language of surface forms to strings from the language of lexical (underlying) forms. It was C.D. Johnson (Johnson, 1972) who first realized that morphophonological knowledge could be modeled using finite state networks. The most fascinating part is that, once we have constructed a transducer for morphological generation, we can just as easily apply it in the other direction for the task of morphological analysis. This natural feature of finite state networks is what makes them so suitable for morphological processing.

I will spare the reader the mathematical model behind Finite State Networks, as it is not necessary for understanding this paper. For further information on finite state technology and automata theory, refer to (Hopcroft, 1979).

2.3 XFST

The Xerox Finite State Toolbox (XFST) was developed at the Xerox Research Centre Europe (XRCE) by Kenneth R. Beesley and Lauri Karttunen. It implements the standard finite state operations such as composition and union, as well as several innovative operations like replacement rules¹ and local sequentialization. XFST includes lexc, a compiler for lexicons in the lexc language, which is specifically designed for handling morphotactics in natural languages, and xfst, the core tool, which provides an interface to the finite state calculus for building, accessing and manipulating Finite State Networks, together with a compiler for regular expressions and replacement rules, which will be essential to my work. Additionally, there is a compiler for two-level morphology rules (twolc) as described by Koskenniemi (Koskenniemi, 1983), but its application is beyond the scope of my work, so I will leave it aside. XFST also provides two tools, lookup and tokenize, designed for testing and applying larger projects, but they will not be discussed any further in this paper.

In implementing a morphological analyzer, the morphotactics will be defined in lexc, as intended, whereas the phonological/orthographical alternation rules will be defined as separate transducers (mostly using replacement rules) and composed together into a single transducer. This transducer will in turn be composed with the network derived from the lexc definition of the lexicon, resulting in the lexical transducer that serves our final purpose. Additional transducers can be composed with the network at hand to impose restrictions, define alternations or add more content.

XFST defines transducers as relations between two languages. What would be referred to as

upper language, could be thought of as the input and the lower language would then be the

output when we apply an input to a transducer downwards. If we apply input to the transducer

upwards then the roles switch – the input is applied on the lower side and the output comes

from the upper side. Although it seems a bit confusing, the terms upper and lower remain

constant. In the definition of a lexical transducer, the upper side language will describe the

lexical (underlying) forms of the language to be analyzed and the lower side language will

contain the actual surface forms in the standard orthography.
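The two directions of application can be illustrated with a toy relation in Python. This is a sketch under loud assumptions: the relation is a hand-written list of pairs rather than a compiled network, the tag `+Pl` on the upper side is a hypothetical notation invented for this example, and `apply_down` and `apply_up` are plain Python functions named after the two directions, not calls into the toolbox.

```python
# Toy lexical relation as (upper, lower) pairs. The "+Pl" tag is a
# hypothetical notation used for this illustration only.
LEXICAL_RELATION = [
    ("ev", "ev"),
    ("ev+Pl", "evler"),
    ("kol", "kol"),
    ("kol+Pl", "kollar"),
]


def apply_down(upper_string):
    """Generation: input on the upper side, output from the lower side."""
    return sorted(low for (up, low) in LEXICAL_RELATION if up == upper_string)


def apply_up(lower_string):
    """Analysis: input on the lower side, output from the upper side."""
    return sorted(up for (up, low) in LEXICAL_RELATION if low == lower_string)


print(apply_down("ev+Pl"))   # lexical form to surface form
print(apply_up("kollar"))    # surface form to lexical form
```

The same set of pairs serves both functions. Only the side that is treated as input changes, which is why the terms upper and lower stay fixed while the roles of input and output switch.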

1 A brief overview of the formalism is available in the appendix.

3. The Model

In this section I will present the nominal paradigm of Turkish and my implementation of it. The model has two modules: the lexicon, which defines the morphotactics of Turkish nouns, and the morphophonological rules component, which describes the alternations occurring on the surface. In Sections 3.1 and 3.2 I will present the theoretical background behind my model.

An important notion in the following sections is that of archiphonemic descriptions. As I was implementing the vowel harmony principles using variables for the alternating vowel segments, I realized that the idea of using variables could also be employed to describe other phenomena, such as the consonant alternations. My initial approach, which applied consonant alternation rules to the surface forms, failed to describe the exceptional cases. I therefore redesigned it to use underspecified abstract definitions on the lexical side for entries that undergo the alternations and fully specified definitions for entries that do not. The general idea is that I will be using, both in theory and in practice, the so-called archiphonemes to describe classes of similar phonemes that alternate depending on the environment. For example, to describe vowel harmony I will be using “I” to generalize over the class of high vowels that alternate according to the principle of i-type vowel harmony, and “E” to generalize over the class of low unrounded vowels that alternate in accordance with the principle of e-type vowel harmony. The symbols denoting the particular classes of alternating phonemes will be defined as needed as we proceed.

3.1 The Nominal Paradigm of Turkish. Morphotactics

The nominal inflectional paradigm is defined in different ways in the various sources. The

basic pattern on which everyone agrees though is:

STEM – NUM – POSS – CASE

Turkish has no distinction of grammatical gender. Worth mentioning is that in some sources,

the relativising suffix –ki is classified as part of the nominal inflectional paradigm. At the

current stage of development, however, I won’t be concerned with it. On the other hand, case-

type suffixes are also defined differently in the various sources – in some of the recent works,

the suffix –(y)la/–(y)le is classified as an instrumental case suffix. We’ll get back to this issue

in the subsequent sections.

So let’s have a closer look at the core of the Turkish noun paradigm. The definition will be

further extended in the subsequent sections.

Figure 3.1: A simplified FSA model for the nominal morphotactics in Turkish.

[Diagram not reproduced: states 0 to 5, arcs labeled STEM, NUM, POSS and CASE, and three arcs labeled 0 (epsilon).]

3.1.1 Inflection for Number

The basic uninflected dictionary form of Turkish nouns is singular (or as claimed in some

sources – “numberless”). The plural form is derived by attaching the –ler/–lar suffix. It

comes generally before any other inflectional suffix. Its vowel is of e-type harmony, therefore

the compact representation using an archiphonemic description will be –lEr. Ketrez (Ketrez,

2003) provides an extensive study on the multiple readings of the Turkish plural morpheme,

but it is mostly from syntactic and semantic points of view and I won’t go any further

discussing the issue.

3.1.2 Case Inflection

Lewis (Lewis, 1967, 1989) defines six cases in his grammar of Turkish. Table 3.1 below

provides an overview of the case paradigm in Turkish:

Case \ Last preceding vowel        e or i    ö or ü    a or ı    o or u
Absolute (Nominative)              -         -         -         -
Definite Objective (Accusative)    -(y)i     -(y)ü     -(y)ı     -(y)u
Genitive (of)                      -(n)in    -(n)ün    -(n)ın    -(n)un
Dative (to, for)                   -(y)e     -(y)e     -(y)a     -(y)a
Locative (in, on, at)              -de       -de       -da       -da
Ablative (from, out of)            -den      -den      -dan      -dan

Table 3.1: Summary of case suffixes in Turkish.

The bracketed y and n are realized on the surface only if the word the suffix is attached to

ends in a vowel. The locative and ablative suffixes are generally realized as –de/–da and

–den/–dan, but when attached to a word ending in a voiceless consonant (ç, f, h, k, p, s, ş and

t), they are realized as –te/–ta and –ten/–tan respectively. So using archiphonemic

descriptions and the principles of vowel harmony, the case inflection summary will look like:

Case                               Lexical Form of the Suffix
Absolute (Nominative)              -
Definite Objective (Accusative)    -(y)I
Genitive (of)                      -(n)In
Dative (to, for)                   -(y)E
Locative (in, on, at)              -DE
Ablative (from, out of)            -DEn

Table 3.2: Summary of case suffixes in Turkish using archiphonemic descriptions.

A few examples will be:

(7) araba → araba-(y)I → arabayı

(car, Nom.) (car, Acc. / LF) (car, Acc. / SF – “the car”)

ev → ev-DE → evde

(house, Nom.) (house, Loc. / LF) (house, Loc. / SF – “in the house”)
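The step from the lexical forms in Table 3.2 to the surface forms in (7) can also be sketched procedurally. The following Python sketch is only an illustration and not the xfst implementation described later: the names attach_case and CASE_SUFFIXES are hypothetical, and it assumes regular stems without any stem-internal alternations.

```python
# Illustrative sketch only: hypothetical names, regular stems assumed.
VOWELS = set("aeıioöuü")
BACK = set("aıou")
ROUNDED = set("oöuü")
VOICELESS = set("çfhkpsşt")

# Lexical forms of the case suffixes as listed in Table 3.2.
CASE_SUFFIXES = {"Acc": "(y)I", "Gen": "(n)In", "Dat": "(y)E",
                 "Loc": "DE", "Abl": "DEn"}


def attach_case(stem, case):
    """Return the surface form of a regular stem plus a case suffix."""
    suffix = CASE_SUFFIXES[case]
    if suffix.startswith("("):
        # The bracketed y/n surfaces only after a stem-final vowel.
        buffer_consonant, rest = suffix[1], suffix[3:]
        suffix = (buffer_consonant if stem[-1] in VOWELS else "") + rest
    last_vowel = [ch for ch in stem if ch in VOWELS][-1]
    surface = []
    for ch in suffix:
        if ch == "E":      # e-type harmony: backness only
            ch = "a" if last_vowel in BACK else "e"
        elif ch == "I":    # i-type harmony: backness and roundedness
            if last_vowel in BACK:
                ch = "u" if last_vowel in ROUNDED else "ı"
            else:
                ch = "ü" if last_vowel in ROUNDED else "i"
        elif ch == "D":    # d/t alternation after voiceless consonants
            ch = "t" if stem[-1] in VOICELESS else "d"
        if ch in VOWELS:
            last_vowel = ch
        surface.append(ch)
    return stem + "".join(surface)


print(attach_case("araba", "Acc"))  # arabayı
print(attach_case("ev", "Loc"))     # evde
```

The sketch resolves the suffix from left to right, so each resolved vowel becomes the reference vowel for whatever follows it.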

As mentioned above, some more recent works treat what used to be (and I believe still is) a

postposition (ilE) following absolute or genitive forms as an additional instrumental/

comitative case suffix (–(y)lE). As far as I know, however, it is still used

both as a postposition and as a cliticized suffix. I will stick to the classic works for now and

treat it as a separate (non-case) suffix¹.

3.1.3 Inflection for Possession

Whereas in many languages possession is expressed using pre-/post-posed pronouns (English:

my/mine, your/yours, his, her/hers, etc.; German: mein (my), dein (your), sein (his), ihr

(her), etc.; Bulgarian: pre-posed – мой ([moy] - my), твой ([tvoy] - your), негов ([negov] -

his), неин ([nein] - her)…; post-posed: ми ([mi] – my, “of mine”), ти ([ti] – your, “of

yours”), му ([mu] – his, “of his”), и ([i] – her, “of hers”), etc.), in Turkish possession is

expressed by suffixes. The complexity of the possessives varies across languages, depending

on their overall morphological complexity. In Bulgarian for example, the pre-posed

possessives act pretty much like adjectives and typically precede them, so they carry the

inflection for gender, number and definiteness. In Turkish the possessive suffixes are partially

derived from the present tense forms of the verb to be. A summary of the possessive suffixes

is presented in Table 3.3 below:

Person    Suffix     Gloss
1pSg      -(I)m      my
2pSg      -(I)n      your
3pSg      -(s)I      his/her/its
1pPl      -(I)mIz    our
2pPl      -(I)nIz    your
3pPl      -lErI      their

Table 3.3: Summary of possessive suffixes in Turkish using archiphonemic descriptions.

Again, the bracketed segments surface only under particular conditions. In contrast to the case

suffixes, where the bracketed segments surface only if the word they are attached to ends in

a vowel, here the optional segments surface either if the word the possessive suffix is attached

to ends in a consonant (for the first and second person singular and plural) or if the word

ends in a vowel (for the third person singular). So we have vowel deletion in one case and

consonant insertion in the other, to avoid vowel sequences².

(8) ev → ev-(I)m → evim

(house) (house, 1pSg Poss. / LF) (house, 1pSg Poss. / SF – “my house”)

araba → araba-(I)mIz → arabamız

(car) (car, 1pPl Poss. / LF) (car, 1pPl Poss. / SF – “our car”)

araba → araba-(s)I → arabası

(car) (car, 3pSg Poss / LF) (car, 3pSg Poss. / SF – “his/her car”)
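The behaviour of the bracketed segments in both the case and the possessive suffixes can be summarized in one condition: the segment surfaces only if it differs in type (vowel vs. consonant) from the stem-final segment. A minimal Python sketch of this condition follows; resolve_optional is a hypothetical helper name, the archiphonemes are deliberately left unresolved, and only regular stems are assumed.

```python
# Illustrative sketch only: resolve_optional is a hypothetical helper.
VOWELS = set("aeıioöuüIE")  # surface vowels plus the archiphonemes I and E


def resolve_optional(stem, suffix):
    """Keep or drop a bracketed suffix-initial segment, e.g. (I)m or (s)I."""
    if suffix.startswith("("):
        segment, rest = suffix[1], suffix[3:]
        # The optional segment surfaces only if it differs in type from the
        # stem-final segment: a vowel after a consonant, a consonant after a vowel.
        keep = (segment in VOWELS) != (stem[-1] in VOWELS)
        suffix = (segment if keep else "") + rest
    return stem + "-" + suffix


for stem, suffix in [("ev", "(I)m"), ("araba", "(I)mIz"), ("araba", "(s)I"),
                     ("ev", "(s)I"), ("ev", "(y)I")]:
    print(stem, suffix, "->", resolve_optional(stem, suffix))
```

The last two calls return the same string for ev-(s)I and ev-(y)I, since both bracketed consonants are dropped after a consonant-final stem.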

1 Lewis (Lewis, 1967, 1989) states that it is attached to nominative nouns and genitive pronouns, in this sense it

could be considered an additional case suffix. I will leave it aside until I get a clearer view on the issue.

2 More on vowel sequences to come in the description of the rules in the following sections.

Possessive suffixes precede case suffixes. By having another look at the two inflectional

paradigms one might or might not notice that some of the suffixed forms could occasionally

overlap on the surface. For example: the underlyingly different ev-(y)I (house – Definite

Objective (Accusative) case, “the house”) and ev-(s)I (house – 3pSg possessive, “his house”)

end up absolutely the same on the surface – evi:

(9) ev → ev-(y)I → evi

(house) (house, Acc / LF) (house, Acc. / SF – “the house”)

ev → ev-(s)I → evi

(house) (house, 3pSg Poss. / LF) (house, 3pSg Poss. / SF – “his/her house”)

Things get further complicated if there are multiple instances of the plural suffix –lEr – in the

case of 3pPl possessive for example, if the possessed noun is already plural – evler (houses)

→ *evlerleri → evleri (their houses) – one –lEr gets deleted. So we end up having the single

form evleri for both “their house” and “their houses”. A closer look, however, reveals

even further complications: evleri could also denote the accusative case of the plural of

house (“the houses”) and the 3pSg possessive of the plural of house – “his/her houses”.

Even though Turkish is morphologically highly specified, we often have 2-, 3- or, as in this

case, 4-fold ambiguities. The derivations from the underlying lexical representations of the

four interpretations of evleri are given in (10) below:

(10) Pl.Acc.         Pl.3pPl.Poss.     Sg.3pPl.Poss.    Pl.3pSg.Poss.
     (the houses)    (their houses)    (their house)    (his/her houses)
     ev-lEr-(y)I     ev-lEr-lErI       ev-lErI          ev-lEr-(s)I
     ev-ler-i        ev-ler-leri       ev-leri          ev-ler-i
     evleri          evleri            evleri           evleri

It is worth noting, just to make things even more confusing, that after the third person

possessive suffixes a so-called “pronominal n” is added when a case suffix follows.

(11) evi (his/her house, also the house)

but:

(12) evinde (in his/her house in our case, but also identical with in your house)

Confusing? Typically ambiguities are resolved by looking at the context where the ambiguous

word occurs – ambiguous forms are usually used with the genitive of the personal pronouns to

avoid confusion. In this case the noun itself reverts to accusative case.

(13) evleri (their house) → onların evi (their house, “the house of theirs”)

(house, 3pPl.Poss) (they, Gen.; house, Acc.)

evleri (their houses) → onların evleri (their houses, “the houses of theirs”)

(houses, 3pPl.Poss) (they, Gen.; houses, Acc.)

evleri (his houses) → onun evleri (his houses, “the houses of his”)

(houses, 3pSg.Poss) (he, Gen.; houses, Acc.)

For the purpose of this project, however, I won’t be concerned with morphological

disambiguation, as this task should be performed at a later stage, after examining the already

analyzed context. There are further distinctions in the uses of the possessives in Turkish, but

again, this topic is beyond the scope of my work.

As one might imagine, for a single entry in the lexicon, that is for a single noun stem, there

are plenty of possible inflections: 2 for number times 7 (the six possessive suffixes + the

possession-free form) for possession times 6 (or even 7 if the instrumental case is included)

for case inflection, which results in 84 basic forms from inflection only (even though some of them

might be identical), and things get further complicated.

3.1.4 Lexical Exceptions – the ‘su’ case

There is only one pure lexical exception to the paradigm – the noun su (water). There is

however a large number of derived noun roots that end in –su, for example: akarsu (river –

“running water”). For this reason, it deserves a special treatment.

The exception manifests itself as su taking the -yun suffix for the genitive (instead of the

standard –nun suffix) and also, in the possessive forms, there is always a y preceding the

possessive suffix – suyum (my water) instead of *sum, suyu (his/her water) instead of *susu.

In general, the y is inserted whenever a suffix starting in a vowel or dropping consonant is

attached to the word.

3.2 Phonological Alternation Rules

In the following subsections I will outline theoretically the basics of the phonological

alternation rules in Turkish with respect to the task at hand.

3.2.1 Resolving Vowel Harmony

The vowel harmony principles as described in Section 2.1 are rather simple to implement. I

will present how the basics work and then address some of the exceptional cases.

I split the two harmony classes in two rules – for e-type harmony and for i-type harmony. The

e-type harmony rule checks the value of the backness feature of the last preceding vowel – if it is a

back vowel the underlying E is realized as a, if it is a front vowel, it is realized as e. Since the

system does not provide us with feature specification of phonemes, I had to define the classes

of vowels as sets:

(14) define BackV [a | ı | o | u];

define FrontV [e | i | ö | ü];

define LowV [a | e | o | ö];

define HighV [ı | i | u | ü];

define UnroundedV [a | ı | e | i ];

define RoundedV [o | ö | u | ü];

The intersection (&) of those sets provides us with the sub-classes of vowels having

combined features. So the set of back unrounded vowels will be derived as:

(15) [ BackV & UnroundedV ]

which results in:

(16) [ a | ı ]

This is essential for defining the i-type harmony, as it is based on two features rather than one,

namely backness and roundedness. So, if the last preceding vowel is back and unrounded, the

underlying I is realized as ı (the high back unrounded vowel – so to say, intersecting the

set of high vowels with the sets describing the features of the last preceding vowel). The same

holds for the other realizations of the underlying I:

(17) I → [HighV & BackV & RoundedV] || [BackV & RoundedV] Consonant¹ _

Which should be read as: I is realized as the high-back-rounded vowel (u) in the context of a

back rounded last preceding vowel (o or u). The other rules are identical:

(18) I → [HighV & BackV & UnroundedV] || [BackV & UnroundedV] Consonant _

I → [HighV & FrontV & RoundedV] || [FrontV & RoundedV] Consonant _

I → [HighV & FrontV & UnroundedV] || [FrontV & UnroundedV] Consonant _

This is only necessary to state clearly the principles governing vowel harmony. One might as

well simply write the rules as: I -> i || [ i | e] Consonant _ , but that would not have much

descriptive linguistic value.
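The set-based formulation in (14) to (18) carries over directly to any language with a set type. The following Python sketch only illustrates the intersections and is not part of the xfst implementation; the function name resolve_I is hypothetical.

```python
# Illustrative sketch only: Python sets standing in for the definitions in (14).
BACK_V = set("aıou")
FRONT_V = set("eiöü")
HIGH_V = set("ıiuü")
UNROUNDED_V = set("aıei")
ROUNDED_V = set("oöuü")


def resolve_I(last_vowel):
    """Realize the archiphoneme I after the given last preceding vowel."""
    backness = BACK_V if last_vowel in BACK_V else FRONT_V
    rounding = ROUNDED_V if last_vowel in ROUNDED_V else UNROUNDED_V
    # The three-way intersection always contains exactly one high vowel.
    (vowel,) = HIGH_V & backness & rounding
    return vowel


print(sorted(BACK_V & UNROUNDED_V))    # ['a', 'ı'], the set in (16)
print([resolve_I(v) for v in "oaöe"])  # ['u', 'ı', 'ü', 'i']
```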

In my solution the rules operate in parallel locally, that is, the rules for the e-type and the rules

for the i-type each operate in parallel among themselves, but the e-type harmony still has

precedence over the i-type. The reason behind it: apart from the backness harmony being the more

general principle and having broader coverage, the abstract symbols have to be resolved in a

left-to-right fashion, and e-type suffixes at the current stage precede i-type suffixes. We need the

exact properties of the last preceding vowel in order to resolve the next variable vowel in the

following (or even in the same) suffix. In this sense, I might need to combine the e-type and

i-type rules into one single rule operating in parallel as the system gets more sophisticated².

A few words about the exceptions to vowel harmony: We will be concerned with roots whose

last vowel does not have predictive power over the harmonic features of the suffixes attached

to it. Schaaik (Schaaik, 1996) refers to words which induce such exceptions as “disharmonic

roots”. The same term, however, is used in some sources for roots that do not conform to the

principles of vowel harmony internally – the exceptional cases already mentioned in Section 2.1,

like anne (mother), çamur (mud), etc. Although the two classes often overlap, it can’t be stated

that this is always the case. The exceptions we will be dealing with are mostly of foreign

origin: alkol (alcohol), rol (role), saat (clock), etc. are realized as alkolü (alcohol, Acc.), rolü

(role, Acc.), saati (clock, Acc.) and alkoller (alcohol, Pl.), roller (role, Pl.), saatler (clock,

Pl.) instead of *alkolu, *rolu, *saatı and *alkollar, *rollar, *saatlar respectively in their

accusative and plural forms.

1 Consonant is also a defined class featuring all the consonants.

2 A small issue that occurred when I accidentally switched the order of the rules was that words

having a round vowel in their last syllable (like katalog (catalogue)) were resolved in an unusual way -

*kataloglarunuz, whereas the correct form would be kataloglarınız (your catalogues). This was due to resolving

the –InIz (2pPl Possessive suffix) as –unuz in concordance with the last (resolved) preceding vowel o (the E in

the plural suffix –lEr was still pending resolution). This is important, because if an e-type suffix is added, all the

following suffixes feature unrounded vowels (unless a suffix with an invariable rounded vowel is added).

3.2.2 Consonant Alternation Rules

As mentioned in Section 2.1, word final consonants undergo particular alternations depending

on the environment. For the purpose of this project, I used archiphonemic descriptions for the

alternating segments. Most of the related works use the capital letter for the voiced

phoneme (B for b and p, D for d and t, etc.) or a capital for the geminating phoneme (S for ss

and s, etc.). I will stick to the standard notation to avoid unnecessary confusion.

The above abstraction is necessary to model the exceptions to these alternation rules. We will

pay some attention to the exceptions at the end of the section. My approach to this issue is

partially based on the paper by Sharon Inkelas and Orhan Orgun (Inkelas, 1997) in which

lexical exceptions are treated in terms of Optimality Theory. In brief: the alternating word

final consonant in regular roots that undergo the alternations will be unspecified in the lexicon

using a special symbol and the exceptional cases will be underspecified with their non-

alternating surface realizations so that they won’t trigger the alternation rules.

3.2.2.1 Final Consonant (De)Voicing

The final consonant voicing occurs when a suffix starting in a vowel or a dropping consonant

is attached to the stem. It covers the voiceless plosives p, ç, and t, which transform into their

voiced counterparts. Additionally, what is often classified separately as a “K/0”¹ alternation

(namely because of the subclass of velar consonants k, g and ğ that exhibit similar behavior),

falls into this category as well. (19) below gives a basic idea of which alternations

occur, where they occur, and what the archiphonemic symbols stand for:

(19) B → b || _ Vowel, otherwise B → p

D → d || _ Vowel, otherwise D → t

C → c || _ Vowel, otherwise C → ç

K → ğ || _ Vowel, otherwise K → k

G → ğ || _ Vowel, otherwise G → g

Q → g || _ Vowel, otherwise Q → k

So far it seems fine as far as alternations in the stems are concerned. But similar alternations

occur in suffixes as well. They are dependent on the preceding phoneme and assimilate the

value of its voicing feature. So we have²:

(20) B → p || VoiceLessCons _ , otherwise B → b

D → t || VoiceLessCons _ , otherwise D → d

C → ç || VoiceLessCons _ , otherwise C → c

1 K/0 because the counterpart of k intervocalically is the so-called “yumuşak ge” (soft g), which is phonologically

realized as lengthening of the preceding vowel.

2 For the purpose of this project only the d/t alternation will be actually used as it is the only one occurring in

the inflectional suffixes of nouns.

An example for both phenomena where several rules apply, will be the inflection of kitap

(book) in Table 3.4:

Surface Form    Lexical Form     Alternation Rules      Gloss
kitap           kitaB            B→p                    book, Sg, Nom.
kitaplar        kitaB-lEr        B→p, E→a               book, Pl, Nom.
kitabım         kitaB-(I)m       B→b, I→ı               book, Sg, 1pSg Poss, Nom.
kitapta         kitaB-DE         B→p, D→t, E→a          book, Sg, Loc.
kitabımda       kitaB-(I)m-DE    B→b, I→ı, D→d, E→a     book, Sg, 1pSg Poss, Loc.

Table 3.4: Summary of the application of the phonological alternation rules.

The rules in (19) and (20) are oversimplified of course. In the actual implementation they

feature a wider context including morpheme boundaries to make the distinctions clearer.
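The interaction of the stem-final rule (19) and the suffix-onset rule (20) can be imitated with ordinary regular expressions. The Python sketch below is only an approximation under stated assumptions: the dash marks the morpheme boundary, optional segments are taken to be resolved already, the helper names are hypothetical, and the vowel archiphonemes are left unresolved.

```python
import re

# Illustrative sketch only: hypothetical helper names; vowels are left unresolved.
VOWEL = "aeıioöuüIE"   # surface vowels plus the archiphonemes I and E
VOICED = {"B": "b", "D": "d", "C": "c"}
VOICELESS = {"B": "p", "D": "t", "C": "ç"}


def stem_final(form):
    """Resolve B, D, C at the end of a stem, as in (19)."""
    # Voiced before a morpheme boundary followed by a vowel ...
    form = re.sub(rf"[BDC](?=-[{VOWEL}])", lambda m: VOICED[m.group(0)], form)
    # ... voiceless before any other boundary or at the end of the word.
    return re.sub(r"[BDC](?=-|$)", lambda m: VOICELESS[m.group(0)], form)


def suffix_onset(form):
    """Resolve a suffix-initial D, as in (20)."""
    form = re.sub(r"(?<=[çfhkpsşt]-)D", "t", form)
    return form.replace("D", "d")


def consonants(form):
    # The stem-final rule runs first, because the suffix rule looks at the
    # already resolved stem-final consonant.
    return suffix_onset(stem_final(form))


for lexical in ["kitaB", "kitaB-lEr", "kitaB-Im", "kitaB-DE", "kitaB-Im-DE"]:
    print(lexical, "->", consonants(lexical))
```

For kitaB-DE the sketch yields kitap-tE, and for kitaB-Im-DE it yields kitab-Im-dE; vowel harmony would then turn these into kitapta and kitabımda.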

In linguistic terms we have regressive assimilation in stems and progressive assimilation in

suffixes. The exceptions to these rules include primarily monosyllabic words that preserve the

quality of their final consonant. There are however monosyllabic words that do undergo the

alternation rules, as there are polysyllabic words that do not. Such exceptions will be

underspecified in the lexicon with their unchanging consonant.

3.2.2.2 (De)Gemination

Apart from the final stop voicing/devoicing, which is the most productive type of consonant

alternation, a few other types of alternations are worth mentioning. The final consonant

(de)gemination occurs only in a small number of Arabic loan words. The nature of this

phenomenon is similar to the one of the final consonant (de)voicing – a word final segment

gets doubled if a suffix starting in a vowel (or dropping consonant) is attached to the word:

(21) his (feeling) → hissi (feeling, Acc., “the feeling”) → hisler (feelings)

hat (line) → hattı (line, Acc., “the line”) → hatlar (lines)

Again, we will have to employ special symbols that will be realized differently on the surface

depending on the context as proposed by Schaaik (Schaaik, 1996)¹. He proceeds even further,

investigating the dependence of these alternations on the re-syllabification processes

occurring with the different suffixes. I will not go into detail however, as my project is not

intended to feature a syllabification module in its current stage of development.
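Leaving syllabification aside, the context-dependent realization of a geminating symbol can be sketched in a few lines of Python. The sketch uses S and T for s/ss and t/tt; the lexical forms hiS and haT and the function name geminate are illustrative assumptions, and the vowel archiphonemes are left unresolved.

```python
import re

# Illustrative sketch only: S and T stand for the geminating s/ss and t/tt.
SINGLE = {"S": "s", "T": "t"}
VOWEL = "aeıioöuüIE"   # surface vowels plus the archiphonemes I and E


def geminate(form):
    # Single consonant before a consonant-initial suffix or at the end of the word ...
    form = re.sub(rf"[ST](?=-[^{VOWEL}]|$)", lambda m: SINGLE[m.group(0)], form)
    # ... doubled everywhere else, that is before a vowel-initial suffix.
    return re.sub(r"[ST]", lambda m: SINGLE[m.group(0)] * 2, form)


for lexical in ["hiS", "hiS-I", "hiS-lEr", "haT-I"]:
    print(lexical, "->", geminate(lexical))
```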

3.2.3 Other Alternations

Two other alternations are worth mentioning for the sake of completeness. One of them

involves vowel insertion/deletion and the other describes the status of the glottal stop in

Turkish. The first one is rather common, whereas the second operates on a limited domain of

Arabic loan words. Both of them show some ambiguities.

1 This issue could be approached differently, by underspecifying the geminating stems with their double

consonants in the lexicon and then removing the additional segment if necessary.

3.2.3.1 Vowel Insertion/Deletion

Some stems in Turkish exhibit an interesting property of forming stem final consonant

clusters via vowel epenthesis:

(22) burun (nose) → burnu (nose, Acc. “the nose”)

fikir (idea) → fikri (idea, Acc. “the idea”)

şehir (city) → şehri (city, Acc. “the city”)¹

ömür (life) → ömrü (life, Acc. “the life”)

alın (forehead) → alnı (forehead, Acc. “the forehead”)

This phenomenon occurs again whenever a suffix starting in a vowel is attached to the stem

(it seems that all the stem-internal alternations in Turkish are conditioned on the same context).

The epenthesized vowel is always a high vowel, but its other features cannot be automatically

determined, so it has to be hard-coded. Such stems will be indicated in the lexicon with a meta

character preceding the vowel which is to be deleted.
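A minimal Python sketch of such a marked stem is given below. The symbol "@" is purely hypothetical, since the meta character is not named here, and the input is assumed to be a form in which vowel harmony has already been resolved.

```python
import re

# Illustrative sketch only: "@" is a hypothetical meta character, and the
# input is assumed to be already harmonized.


def drop_marked_vowel(form):
    # Delete the marked high vowel when a vowel-initial suffix follows the stem ...
    form = re.sub(r"@[ıiuü](?=[^-]-[aeıioöuü])", "", form)
    # ... otherwise keep the vowel and only remove the marker.
    return form.replace("@", "")


for form in ["bur@un", "bur@un-u", "bur@un-lar", "fik@ir-i"]:
    print(form, "->", drop_marked_vowel(form))
```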

As for the quality of the consonant clusters that are formed after the epenthesis occurs, there

have been several attempts to define the possible consonant sequences in such cases, but this

is far beyond the scope of this paper.

3.2.3.2 The Glottal Stop

This, along with the gemination rule, is probably the least productive rule in Turkish.

They both concern only a limited number of Arabic loan words. The nature of the glottal stop

is not quite clear to me, however I attempted an approach based on Schaaik’s (Schaaik, 1996)

description and the Turkish Lexical Database Project (TLDP). Schaaik (Schaaik, 1996)

describes two types of glottal stop:

(23) Type 1: ^ -> 0 / ^ (0 if a consonant follows and ^ if a vowel follows)

cami^ (mosque) -> camiler (mosques)

-> cami^i (the mosque / his/her mosque)²

Type 2: ‘ -> i / ‘ (i if a consonant follows and ‘ if a vowel follows)

nev’ (sort) -> neviler (sorts)

-> nev’i (the sort / his/her sort)

(Schaaik, 1996, p. 114)

Both are supposed to act as consonants if a vowel follows.

In modern Turkish, however, the glottal stop is mostly omitted both in speech and writing. It is

preserved only when ambiguities occur – telin (of the wire / your wire) and tel’in

(denunciation). Apparently, in TLDP the glottal stop is not featured either. Both cases are

accepted there – camii and camisi both denote the 3pSg Possessive form (his/her mosque);

likewise, camii and camiyi both denote the accusative case (the mosque). For the second

type though, only yeisi (the despair / his/her despair) and neviyi (the sort) / nevisi (his/her sort)

are recognized. So the first type allows for both realizations, whereas the second type

behaves more or less as if it wasn’t there at all.

1 In modern Turkish, the tendency is to retain the i in şehir (city) – şehiri (the city).

2 The Type 1 glottal marker ^ does not manifest itself orthographically.

In my solution, I tried to approach the issue as in the TLDP. There are some mismatches

though, and even though it is more likely that the mistake is overgeneration on my side, it is

also possible that the TLDP analyzer has some flaws. The examples I am concerned with are:

(24) camim (mosque, 1pSg Poss. “my mosque”)

vs.

(25) camiim (mosque, 1pSg Poss. “my mosque”)

Analogous to camii and camisi (his/her mosque), they should both denote the same thing, but

the TLDP analyzer provides different solutions, where only the first one (camim) seems to be

proper. I have to investigate the issue further. For now, in my project they will both stand for

“my mosque”.

3.3 Implementation

The model comprises two components – the lexicon, defined in lexc, describing the

morphotactics of Turkish (technically it is implemented as an FSA, but it does include some

transductions for the tags), and a set of rules, that describe the morphophonological

alternations that occur on the surface (implemented naturally by a set of FSTs in xfst, using

the formalism of replacement rules).

3.3.1 The Lexicon

The lexicon network implemented in lexc describes the morphotactics of the Turkish nominal

inflection. First of all, there is a multicharacter symbols definition (26) where a set of

sequences of symbols that should be treated as atomic symbols is defined:

(26) Multichar_Symbols +Noun +Poss +Case +1p +2p +3p +Sg +Pl +DefObj +Gen

+Dat +Loc +Abl +Abs

These are primarily used to define the tags to be used (case marking, possession, number,

etc.). Further on, it contains a sub-lexicon of the noun stems – it is the simplest, but most

important part – it contains the noun stems in their lexical (underlying) form, which could be

automatically extracted from a dictionary. This form includes all the special symbols that

denote alternating segments and trigger the alternation rules. Then on the next stage (the

standard continuation class for all nouns) a tag +Noun is attached on the upper side, that is, it

is visible only if morphological analysis (or lookup) is performed (same for all the other tags).

On the lower (surface) side it is realized as an epsilon. The continuation class from there is the

number lexicon – number suffixes are attached on the lower (surface) side and tags +Sg and

+Pl are attached on the upper (lexical) side, (the dash “ – “ stands for morpheme boundary):

(27) LEXICON Number

+Sg:0 Possessive;

+Pl:-lEr Possessive;

A possessive sub-lexicon follows which defines the inflection for possession as described in

Section 3.1.3 with the appropriate tags. There is an intermediate lexicon however, that

specifies the optionality of the possessive suffixes:

(28) LEXICON Possessive

+Poss:0 PSuff;

+Case:0 CSuff;

That is, either take a possessive tag +Poss and go to the lexicon of possessive suffixes, or take

a +Case tag and go to the lexicon of case suffixes. So the actual sub-lexicon for the

possessive suffixes is called PSuff:

(29) LEXICON PSuff

+1p+Sg:-*Im Case; ! "my"

+2p+Sg:-*In Case; ! "your"

+3p+Sg:-*sIN Case; ! "his/her/its"

+1p+Pl:-*ImIz Case; ! "our"

+2p+Pl:-*InIz Case; ! "your"

+3p+Pl:-lErIN Case; ! "their"

After taking a possessive suffix there is again an intermediate stage that should be passed –

the possessive forms still have to take a +Case tag. In the morphological analysis module of

the Turkish WordNet® the possessive markup is obligatory. It is referred to as possessive

agreement there, and if there is none, then the tag is +Pnon. I don’t find it necessary for now,

but of course it won’t be any problem to tune my system up so that it features the same type

of mark-up.

Two more points to make clear: the optional segments which were marked with brackets in

the theoretical part are prefixed with an optionality marker (*); the pronominal n is denoted

by the capital N. Oflazer (Oflazer, 1995) defines it as a part of the case suffixes. In my case it

is an optional segment that surfaces only if there is a suffix following the third person singular

and plural possessive forms. In his case, there are two copies of each case suffix – one that

follows the third person possessive form and one for all the other possessive and non-

possessive forms. To me it seems more intuitive to have it as a part of the possessive, as it is

indeed a “pronominal n”, and I don’t find much sense in having two instances of every case

inflection.

(30) LEXICON CSuff

+DefObj:-*yI #; ! Definite Objective Case (Accusative)

+Gen:-*nIn #; ! Genitive Case - possessive, "of"

+Dat:-*yE #; ! Dative Case - (indirect object) "to", "for"

+Loc:-DE #; ! Locative Case - "in", "on", "at"

+Abl:-DEn #; ! Ablative Case - "from", "out of", "through"

+Abs:0 #; ! Absolute (dictionary) form (Nominative)

The last component of our lexicon is the case inflection sub-lexicon. It is obligatory, as all

uninflected nouns are in their absolute form (Nominative case). The hash symbol (#) is an

anchor symbol denoting word boundary (in replacement rules it is circumfixed by dots (.#.)).
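To make the continuation-class mechanism concrete, the sub-lexicons (27) to (30) can be mirrored in a small Python dictionary and walked recursively. This is only a sketch and not lexc itself: the names LEXICON, paths and noun_forms are hypothetical, and the single entry of the intermediate Case lexicon is an assumption based on the description of the step that follows PSuff.

```python
# Illustrative sketch only: the lexc sub-lexicons (27)-(30) as a Python dictionary.
# The intermediate "Case" lexicon is not listed in the text; its single entry is
# an assumption based on the description of the step after PSuff.
LEXICON = {
    "Number": [("+Sg", "", "Possessive"), ("+Pl", "-lEr", "Possessive")],
    "Possessive": [("+Poss", "", "PSuff"), ("+Case", "", "CSuff")],
    "PSuff": [("+1p+Sg", "-*Im", "Case"), ("+2p+Sg", "-*In", "Case"),
              ("+3p+Sg", "-*sIN", "Case"), ("+1p+Pl", "-*ImIz", "Case"),
              ("+2p+Pl", "-*InIz", "Case"), ("+3p+Pl", "-lErIN", "Case")],
    "Case": [("+Case", "", "CSuff")],
    "CSuff": [("+DefObj", "-*yI", "#"), ("+Gen", "-*nIn", "#"),
              ("+Dat", "-*yE", "#"), ("+Loc", "-DE", "#"),
              ("+Abl", "-DEn", "#"), ("+Abs", "", "#")],
}


def paths(lexicon, upper, lower):
    """Yield every (upper, lower) string pair reachable from a sub-lexicon."""
    if lexicon == "#":          # end of word
        yield upper, lower
        return
    for tag, suffix, continuation in LEXICON[lexicon]:
        yield from paths(continuation, upper + tag, lower + suffix)


def noun_forms(stem):
    # Every noun stem receives the +Noun tag before entering the Number lexicon.
    return list(paths("Number", stem + "+Noun", stem))


forms = noun_forms("ev")
print(len(forms))   # 84
print(forms[0])     # ('ev+Noun+Sg+Poss+1p+Sg+Case+DefObj', 'ev-*Im-*yI')
```

The lower strings still contain the boundary, the optionality marker and the archiphonemes, which is exactly the material the rules component has to resolve.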

To summarize, a visual map of the lexicon network is presented in Figure 3.2 below:

[Figure 3.2: Schematic visualization model of the lexicon network. Diagram not reproduced. Nodes: 0.Root, 1.Noun, 2.NN, 3.Number, 4.Possessive, 5.PSuff, 6.Case, 7.CSuff, 8.#. Arc labels: 0, /Noun Stems/, +Noun:0, +Sg:0 and +Pl:lEr, +Poss:0, +Case:0 (twice), the six possessive suffix pairs of (29) and the six case suffix pairs of (30).]

3.3.2 The Rules Component

The rules component of the system is implemented as a sequence of composed transducers in

xfst using the formalism of replacement rules. It currently features 17 rules, of which 12 are

significant and 5 are just for cleaning up the markers¹. The rules are composed in a particular

sequence, as some of them do depend on each other. Full independence is hardly achievable.

In the case of a dropping vowel in the stem for instance, the vowel harmony rules have to

apply before the vowel is deleted, since the suffixes have to harmonize with this vowel. This

is especially true for monosyllabic roots that lose their one and only vowel. The rules are split

(for now) into several groups addressing the different types of phenomena that they describe.

A few classes needed to be defined in order to make the rules operational. I defined a class for

the vowels and consonants initially, where the consonant class had to be extended to feature

all the archiphonemic descriptions used. As already mentioned, the vowels are further divided

into subclasses according to their features for the vowel harmony resolution. Further on, for

the rule of progressive assimilation in suffixes, I had to define a class of voiceless consonants.

1 I prefer to keep them apart in the development stage, as it often happens that I need to preserve some markers in

order to see what exactly has gone wrong in case of an error.

3.3.2.1 Vowel Harmony Rules

So far, the rules for e-type and i-type vowel harmony are split into two separate rules (which

operate in parallel among themselves), where the e-type precedes the i-type harmony

resolution, but they might need to be merged into a single rule operating in parallel on all the

harmonizing segments.

As mentioned above in Section 3.2.1, an underlying E is realized as e on the surface when the

last preceding vowel is a front vowel and as a when the last preceding vowel is a back vowel.

This defines the e-type harmony rule:

(31) E -> e || FrontV ~$[Vowel] _ ,,

E -> a || BackV ~$[Vowel] _

The dollar sign ($) has a special meaning in xfst – “contains”. The tilde (~), on the other hand,

stands for the complementation operator – negation (in this case: negation of the language that

contains vowels). In simple words, the left context should be read as: there is a front vowel on

the left and between it and the symbol to be resolved (E), there are no other vowels. The same holds for

the second line, only that it concerns back vowels in the left context. Note that the

double commas (,,) in xfst replacement rules stand for parallel operation (as opposed to the

composition operator (.o.) which stands for sequential operation). In other words, this is a

two-level rule. The same holds for the i-type harmony rule, only that it considers two features (backness

and roundedness) of the last preceding vowel.

(32) I -> i || [FrontV & UnroundedV] ~$[Vowel] _ ,,

I -> ü || [FrontV & RoundedV] ~$[Vowel] _ ,,

I -> ı || [BackV & UnroundedV] ~$[Vowel] _ ,,

I -> u || [BackV & RoundedV] ~$[Vowel] _

As far as vowel disharmony in suffixes is concerned, the stems that induce such disharmony

will be marked as such (again this could be implemented as an automated procedure) by

inserting a (dis-)harmony marker after the last vowel of the stem. The disharmony marker

itself will be nothing more than the vowel that induces the new vowel harmony, prefixed by a

harmony marker (H). For example: alkol (alcohol) which transforms into alkolü (instead of

*alkolu) (the alcohol) and alkoller (instead of *alkollar) (alcohol, Pl) will be lexically

represented as alkoHül.
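How such a marker could interact with harmony resolution is sketched below in Python. The sketch resolves the vowels in a single left-to-right pass and then strips the marker together with its vowel; this clean-up step and the lexical form saaHit for saat are assumptions made for the illustration, not a description of the actual rules.

```python
import re

# Illustrative sketch only: how the marker is removed is an assumption.
BACK_V = set("aıou")
ROUNDED_V = set("oöuü")
VOWEL = set("aeıioöuü")
HIGH = {(True, True): "u", (True, False): "ı",
        (False, True): "ü", (False, False): "i"}


def harmonize(form):
    """Resolve E and I from left to right against the last vowel seen."""
    out, last = [], None
    for ch in form:
        if ch == "E":
            ch = "a" if last in BACK_V else "e"
        elif ch == "I":
            ch = HIGH[(last in BACK_V, last in ROUNDED_V)]
        if ch in VOWEL:
            last = ch   # the vowel after H counts as the last vowel, too
        out.append(ch)
    return "".join(out)


def surface(form):
    # Remove the marker together with its vowel, and the morpheme boundaries.
    return re.sub(r"H[aeıioöuü]|-", "", harmonize(form))


print(surface("alkoHül-I"))    # alkolü
print(surface("alkoHül-lEr"))  # alkoller
print(surface("alkol-I"))      # alkolu (the starred form, without the marker)
```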

3.3.2.2 Consonant Alternation Rules

The most productive consonant alternation rule as described in Section 3.2.2.1 is the final stop

devoicing rule. Similar rules (both in operation and conditions) are the K/0 alternation rule

and the consonant gemination rules. The suffix onset (de)voicing rule will also fall in this

category.

For these rules, as already mentioned, I had to use abstract symbols denoting the alternating

phonemes (just as in the vowel harmony rules). As we’ve already had an extensive overview

of the principles behind these rules I will not discuss them any further.

(33) Final Consonant Devoicing Rule:

[ B -> b , C -> c , D -> d || _ %- (%*) Vowel ] .o.

[ B -> p , C -> ç , D -> t || _ [[%- (%*) Cons] | .#.]]

Velar Alternation Rule:

[ G -> ğ, K -> ğ, Q -> g || _ %- (%*) Vowel ] .o.

[ G -> g, K -> k, Q -> k || _ [[%- (%*) Cons] | .#.]]

Suffix Onset Devoicing Rule:

[ C -> ç , D -> t , G -> k || VLCons %- (%*) _ ] .o.

[ C -> c , D -> d , G -> ğ || ~VLCons %- (%*) _ ]

Gemination Rule:

[ S -> s, T -> t || _ [%- Cons | .#. ] ] .o.

[ S -> [ s s ] , T -> [ t t ] ]

The percent sign (%) is used as an escape character in xfst to literalize characters that otherwise have a

special meaning. The anchor marker (.#.) is used to denote word boundaries (the

beginning of string if used on the left and the end of string if used on the right). The round brackets

denote optionality in the regular expressions sense – (%*) in a replacement rule means “there

is a possible literal * there”. A few notes on the gemination rule: it is a rather radical

approach as far as context is concerned, it could be improved though in case of failure; so far

it covers only the cases of geminating s and t. They were chosen at random out of the set of

eight geminating consonants in Turkish, just to implement the principle. For the remaining six

consonants, special symbols have to be chosen and their transformations need to be inserted in

the rule (pure mechanical operation). There are however some special cases, where

gemination and devoicing occur simultaneously:

(34) muhip (friend) → muhibbi (friend, Acc., “the friend”)

and the even further complicated case of serhat (border), which is an exception to vowel

harmony, besides undergoing gemination and voicing – serhaddi (border, Acc., “the

border”). This issue could be fixed using a few minor tricks and the current system is ready to

handle it, but I will leave it for a later stage of development.
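To make the two-step logic of the final stop devoicing rule in (33) concrete, here is a small Python sketch. It is an approximation under assumptions, not the xfst rule: the name final_stop is hypothetical, vowels are assumed to be already resolved to surface vowels, and Cons is approximated as any symbol that is neither a vowel nor the * marker.

```python
import re

VOWELS = "aeıioöuü"
VOICED = {"B": "b", "C": "c", "D": "d"}
VOICELESS = {"B": "p", "C": "ç", "D": "t"}


def final_stop(form: str) -> str:
    """Resolve abstract B, C, D at the end of a stem."""
    # Step 1: voiced before a morpheme boundary, an optional *, and a vowel.
    form = re.sub(rf"[BCD](?=-\*?[{VOWELS}])",
                  lambda m: VOICED[m.group()], form)
    # Step 2: voiceless before a boundary plus consonant, or word-finally.
    form = re.sub(rf"[BCD](?=-\*?[^{VOWELS}*]|$)",
                  lambda m: VOICELESS[m.group()], form)
    return form


print(final_stop("kitaB"))      # kitap
print(final_stop("kitaB-ı"))    # kitab-ı
print(final_stop("kitaB-lar"))  # kitap-lar
```

As in the composed xfst rule, the second step only sees the abstract symbols that the first step left untouched.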

3.3.2.3 Fixing the Morphotactics

The next few rules are used to “fix the morphotactics” – they deal with general phenomena

such as vowel/consonant deletion, the pronominal n, the lexical exception “su” and the

elimination of the multiple plural morpheme.

The rule for multiple plurals simply takes two adjacent plural morphemes and rewrites them

as a single morpheme, nothing unusual.

The rule for the pronominal n simply drops the N word-finally (a tricky solution).

The rule for the glottal stop is again a tricky solution. As we’ve seen, the ^ marker is either

realized as an underlying consonant or as nothing at all. My approach was to optionally delete it

if a vowel follows:

(35) [ %^ (->) [.0.] || _ %- %* ]

This way, both camii and camisi/camiyi will be recognized as described in Section 3.2.3.1.

Next, the rule for dropping stem vowels has nothing particularly interesting to it. Dropping

segments are prefixed by a literal dollar sign ($), so an underlying koy$un (bosom) will be

realized as: koyun (bosom) and koynu (the bosom) (as opposed to koyun (sheep) which is

realized as: koyunlar (sheep, Pl) and koyunu (the sheep)). The dropping segments remain if

the suffix attached starts with a consonant on the surface.
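The stem vowel deletion just described can be sketched in Python. This is an illustrative approximation, not the xfst rule: the function name stem_vowel_deletion is hypothetical, suffix vowels are assumed to be resolved already, and the clean-up of the $ marker is folded into the same function.

```python
import re

VOWELS = "aeıioöuü"


def stem_vowel_deletion(form: str) -> str:
    """Drop a $-marked stem vowel when the suffix starts with a vowel."""
    # "$V" is deleted if only stem consonants separate it from a
    # boundary that is followed (optionally via *) by a vowel.
    form = re.sub(rf"\$[{VOWELS}](?=[^{VOWELS}\-]*-\*?[{VOWELS}])", "", form)
    # Otherwise the vowel stays and only the marker is removed.
    return form.replace("$", "")


print(stem_vowel_deletion("koy$un-u"))    # koyn-u
print(stem_vowel_deletion("koy$un-lar"))  # koyun-lar
print(stem_vowel_deletion("koy$un"))      # koyun
```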

The rule for the exceptional class of words ending in “–su” (water) is again pretty simple: these
words are specified in the lexicon as suY (partly motivated by the origin of the word, which
historically derives from “suw”). This special symbol is then realized as y in the
proper context or as epsilon by default. It took me quite a while to come to this idea, and I was
happy to see that others have approached the issue in a similar way.¹

The most complicated rule, and the one that took me the most time to design optimally (and

which is still under consideration whether it is the best solution or not) is the rule that

manages all the dropping consonants and vowels in suffixes (except the pronominal n). I

called it fixing the vowel sequences, as this is more or less what it is supposed to do. In the

case suffixes we have y and n insertion if the stem ends in a vowel. On the other hand, we

have high vowel (I) deletion if the stem ends in a vowel and s insertion if the stem ends in a

vowel in the possessive inflectional paradigm. In simple terms, all these phenomena occur to

avoid vowel sequences. After quite a bit of thinking I dealt with all these phenomena in a

single blow:

(36) [ [? - HighV] -> 0 || Cons %- %* _ ] .o.

[ HighV -> 0 || Vowel %- %* _ ]

The above composition of two rules does two things, namely: 1. It deletes every segment that

is not a high vowel (I), marked up as optional, in the context of a preceding consonant across

a morpheme boundary and 2. It deletes every high vowel (I), which is marked up as optional

in the context of a preceding vowel across a morpheme boundary.
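The two deletions in (36) can be tried out with a short Python sketch. It rests on assumptions that are not spelled out in the rule itself: the name fix_vowel_seq is hypothetical, HighV is taken to be the four surface high vowels (harmony has already applied at this point in the rule order), and the * marker is left in place for the later clean-up rules.

```python
import re

VOWELS = "aeıioöuü"
HIGH = "ıiuü"


def fix_vowel_seq(form: str) -> str:
    """Delete optional (*-marked) suffix-initial segments where needed."""
    # 1. After a consonant-final stem, drop an optional segment
    #    that is not a high vowel (the buffer consonants y, n, s).
    form = re.sub(rf"(?<=[^{VOWELS}]-\*)[^{HIGH}]", "", form)
    # 2. After a vowel-final stem, drop an optional high vowel.
    form = re.sub(rf"(?<=[{VOWELS}]-\*)[{HIGH}]", "", form)
    return form


print(fix_vowel_seq("ev-*yi"))     # ev-*i
print(fix_vowel_seq("araba-*yı"))  # araba-*yı
print(fix_vowel_seq("araba-*ım"))  # araba-*m
```

Both substitutions look at the same left context shape (stem-final segment, boundary, optional marker), which is why the two phenomena fit into one composed rule.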

1 See (Schaaik, 1996).

The remaining rules clean up the marker leftovers. The clean up procedures can be

incorporated in the rules themselves, but during the development stage, I prefer to keep them

separated for debugging purposes.

3.3.2.4 Rule Order

A few notes on the current rule ordering.

FixMultiPlural
.o.
eTypeHR
.o.
iTypeHR
.o.
SUexception
.o.
PronominalN
.o.
FixGlottal
.o.
FixVowelSeq
.o.
FinalStopDevoicing
.o.
VelarAlternations
.o.
Gemination
.o.
SuffOnsetDevoicing
.o.
StemVDeletion
.o.
ClearMBMarker
.o.
ClearOptMarker
.o.
ClearSVDMarker
.o.
ClearGlottal
.o.
ClearExHarmony

Figure 3.3: The current rule ordering

First things first: getting rid of the multiple plural morpheme is a good place to start.
There are some local dependencies among the rules. As already mentioned, the e-type
harmony rule has to precede the i-type harmony rule (or the two may have to be merged
into a single rule and applied simultaneously, as two-level rules are). Vowel harmony
resolution must also precede stem vowel deletion. If we proceed from left to right (with
parallel rules), the stem vowels will be deleted before the suffix vowels that should
harmonize with the deleted vowel are resolved. There should also be some tendency to go
from simpler and more general rules to more sophisticated and specific ones (either in the
upward or the downward direction). No such tendency is present at the current stage of
development. Final stop devoicing, the velar alternations, gemination, stem vowel deletion
and the rule for the su exception could all operate at a single stage, as they occur in
identical contexts and their purpose is more or less the same. The suffix onset devoicing
rule is partially dependent on the outcome of the final stop devoicing rule, but if the input
is processed left to right, this outcome will be determined before the suffix onset devoicing
rule applies. The pronominal n rule is also on its own, so, looking at the bigger picture,
the rules seem to be mostly independent. All that matters is that the input is processed
sequentially, from left to right. Consequently, with the wrong rule ordering, rules that apply
to segments occurring after unresolved segments might cause major trouble. This is the
reason why most finite state approaches to Turkish morphology are based on two-level
morphological descriptions.
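Why the order of composition matters can be shown with a small Python sketch, in which each rule is a function and composition is plain sequential application. Everything here is an assumption made for illustration: the helper names i_harmony, stem_v_deletion and compose are hypothetical, and the example word vakit (time), written as vak$it with a dropping stem vowel, is not taken from the lexicon of this thesis.

```python
import re
from functools import reduce

FRONT, ROUNDED = set("eiöü"), set("oöuü")
VOWELS = "aeıioöuü"


def i_harmony(form):
    """Resolve abstract I against the last preceding vowel."""
    out, last = [], None
    for ch in form:
        if ch == "I" and last is not None:
            if last in FRONT:
                ch = "ü" if last in ROUNDED else "i"
            else:
                ch = "u" if last in ROUNDED else "ı"
        if ch in VOWELS:
            last = ch
        out.append(ch)
    return "".join(out)


def stem_v_deletion(form):
    """Drop a $-marked stem vowel before a vowel-initial suffix."""
    # The abstract I also counts as a vowel here, so the rule can
    # (wrongly) fire before harmony has been resolved.
    form = re.sub(rf"\$[{VOWELS}](?=[^{VOWELS}\-]*-[{VOWELS}I])", "", form)
    return form.replace("$", "")


def compose(*rules):
    """Apply the rules one after another, like .o. in the rule cascade."""
    return lambda form: reduce(lambda acc, rule: rule(acc), rules, form)


good = compose(i_harmony, stem_v_deletion)
bad = compose(stem_v_deletion, i_harmony)
print(good("vak$it-I"))  # vakt-i
print(bad("vak$it-I"))   # vakt-ı
```

In the second ordering the vowel that should trigger harmony is gone by the time the suffix vowel is resolved, so the suffix harmonizes with the wrong stem vowel.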

4. Conclusions

In order to analyze the complex and often symbiotic relations between words, one needs first

to determine the exact properties of each and every individual token. Some of the properties

however, could only be determined after examining the environment. The common approach

to this issue is “inside-out” (or bottom-up) – starting from the basic entities and building up

increasingly complex structures out of them. In this paper I presented an approach to part of

the “basic entities” in the Turkish language.

5. Future Work

Where do we go from here on? One could come up with various ideas. I myself am not so

sure which way this project will take. First, before everything else, the model has to be

completed to cover the other major word category in Turkish, as well as the minor word

categories, to result in a full-featured morphological processor. Then perhaps, to extend

functionality a lexicon extraction routine has to be implemented, that automatically extracts

entities from a dictionary into the morphological processor. This could be combined with a

morphological guesser, and the two could form a symbiotic relation, in which the former will

be used to train the latter, and the guessing algorithm will occasionally provide substance for

the extension of the lexicon. Further on I am thinking of implementing a syllabification

module as it seems quite necessary, as well as perhaps stress markup. Having a fully

functional morphological processor at hand, there are various ways one could take: Integrate

it into a larger NLP system (speech synthesis/recognition applications, automatic machine

translation applications, language tutoring applications, artificial intelligence components,

OCR applications, supplemental linguistic applications); extend its functionality for different

tasks (a major advantage of the modular approach – simply add a new module for the task at

hand and occasionally tune up the existing modules); add a context component for

disambiguation (this falls in the previous category perhaps); try approaching a different

language, and numerous other options in the field. As a first step however, a complete

coverage of the language of choice has to be accomplished.

Bibliography:

Dik, Simon C. 1981. Functional Grammar, 3rd ed. Foris, Dordrecht, The Netherlands.

Clements, George N. and Engin Sezer. 1982. Vowel and Consonant Disharmony in Turkish. Linguistic Models:

The Structure of Phonological Representations (Part II), ed. by H. van der Hulst and N. Smith. Foris Publishing,

Dordrecht, Holland.

Hankamer, Jorge. 1986. Finite State Morphology and Left to Right Parsing. Paper, 3rd International Conference

on Turkish Linguistics, August 1986, Tilburg, The Netherlands.

Hopcroft, J.E. and Ullman, J.D. 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley.

Inkelas, Sharon and C. Orhan Orgun. 1997. The Implications of Lexical Exceptions for the Nature of Grammar.

Derivations and Constraints in Phonology, ed. by Iggy Roca. Clarendon Press, Oxford.

Johnson, C. Douglas. 1972. Formal Aspects of Phonological Description. Mouton. The Hague. Paris.

Karttunen, Lauri, with Kenneth R. Beesley. 2003. Finite State Morphology. CSLI Publications. Stanford.

Ketrez, F. Nihan. 2003. Multiple Readings of the Plural Morpheme in Turkish. USC. USA. (online at:

http://www-scf.usc.edu/~ketrez/papers/ADL2003ketrez.pdf - 25.06.2005)

Koskenniemi, Kimmo. 1983. Two-level Morphology. A General Computational Model for Word-Form

Recognition and Production. Department of General Linguistics. University of Helsinki.

Köksal, A. 1975. A First Approach to a Computerized Model for the Automatic Morphological Analysis of

Turkish. Doctoral Dissertation, Hacettepe Universitesi, Ankara.

Lewis, Geoffrey. 1967. Turkish Grammar. Oxford University Press. Oxford.

Lewis, Geoffrey. 1989. Turkish 2nd ed. (Teach Yourself Books). Hodder and Stoughton. London.

Oflazer, Kemal. 1994. Two-level Description of Turkish Morphology. Literary and Linguistic Computing.

(online at: http://acl.ldc.upenn.edu/E/E93/E93-1066.pdf - 25.06.2005).

Oflazer, Kemal. Elvan Göçmen and Cem Bozşahin. 1995. An Outline of Turkish Morphology. Technical Report.

Middle East Technical University (online at: http://www.lcsl.metu.edu.tr/ftp/papers/morphspecs.ps.gz -

18.07.2005).

Pollard, Asuman Çelen; Pollard, David. 1996. Turkish: A complete course for beginners. (Teach Yourself

Books). Hodder and Stoughton. London.

Schaaik, Gerjan van. 1996. Studies in Turkish Grammar. Harrassowitz Verlag, Wiesbaden, Germany.

Sebüktekin, Hikmet I. 1971. Turkish-English Contrastive Analysis. Turkish Morphology and Corresponding

English Structures. Mouton. The Hague. Paris.

Useful links:

http://www.hlst.sabanciuniv.edu/TL/ - The Turkish Lexical Database Project - provides

morphological analysis to verify the results

http://www.turkishdictionary.net/ - Turkish online dictionary – additional glossary

http://www.google.com/ - Everything is there! – using the web as a corpus

Appendix A: List of Abbreviations

CASES:

Nom./+Abs - Nominative/Absolute

Acc./+DefObj - Accusative/Definite Objective

Dat./+Dat - Dative

Gen./+Gen - Genitive

Loc./+Loc - Locative

Abl./+Abl - Ablative

NUMBER/POSSESSIVE:

Sg./+Sg - Singular

Pl./+Pl - Plural

(+)1p/2p/3p - 1/2/3 Person

Poss./+Poss - Possessive

GENERAL:

FST - Finite State Transducer

FSA - Finite State Automaton (-ta)

FSN - Finite State Network

LF - Lexical Form (lexicon entry form)

SF - Surface Form (standard orthographical representation)

Appendix B: lexc Code Samples

!##############A lexc solution to the network in Figure 2.2########################

LEXICON Root    !#The start state, so to say. Every lexicon needs it.
One;            !#A line in lexc has two components:
                !#1. An expression (which could be as complex as needed)
                !#2. A continuation class
                !#Think of the expression as the symbol over the arc and
                !#the continuation class as the destination state

LEXICON One     !#Figuratively speaking – State 1
a Two;          !#The two arcs with the respective input symbols and destinations
b Three;

LEXICON Two     !#State 2
b Three;

LEXICON Three   !#State 3
#;              !#The hash symbol denotes end of input, or a final state
c One;          !#The loop back to State 1
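The reading of continuation classes as states and expressions as arc labels can be checked with a few lines of Python. This is only a sketch of the network encoded by the listing above, not a lexc tool: the names ARCS, FINAL and accepts are made up for the example.

```python
# Each lexc line "x Target;" inside "LEXICON Source" becomes one arc.
ARCS = {
    ("One", "a"): "Two",
    ("One", "b"): "Three",
    ("Two", "b"): "Three",
    ("Three", "c"): "One",
}
FINAL = {"Three"}  # "#;" in LEXICON Three marks it as a final state


def accepts(s, start="One"):
    """Follow the arcs symbol by symbol; accept if we end in a final state."""
    state = start
    for sym in s:
        state = ARCS.get((state, sym))
        if state is None:
            return False
    return state in FINAL


print(accepts("ab"))    # True
print(accepts("abcb"))  # True
print(accepts("abc"))   # False
```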

!#################A model lexc solution for Figure 2.3###########################
!#Same as above for the most part

LEXICON Root
One;

LEXICON One
a:A Two;        !#The colon operator denotes a transduction here
b:B Three;      !#Basically the expressions could be regular expressions
                !#with varying complexity, combining various operations,
                !#but as my key concept is modularity, I will try to keep
                !#them as simple as possible.

LEXICON Two
b:B Three;

LEXICON Three
#;
c One;

Appendix C: On Replacement Rules

Replacement rules are simply intuitive and convenient shorthands for more complex regular

expressions. The most general shape of a context-free replacement rule is:

A->B

where A and B are regular languages (which could be arbitrarily complex regular expressions

themselves). In this case, every string from the upper language (the universal language¹) is

mapped to itself, except that whenever a substring from A is encountered, it is related to a

substring from B (as opposed to ordinary transducers, where if the input string doesn’t match a

string from the upper language, nothing happens and there is no output). This formalism is

further extended to include context:

A->B || L _ R

where A, B, L and R all denote languages and not relations (both L and R are optional). What

happens here is essentially the same as above, only that the languages A and B are further

contextually restricted. A substring from A is related to a substring from B, only if it is

preceded by a substring from L and followed by a substring from R. The double vertical bars

separate the rule(s) from the context. Different rules operating in the same context are

separated by a comma:

A->B , C->D || L _ R

The same is valid for contexts:

A->B || L1 _ R1 , L2 _ R2

Replacement rules could be constructed to operate in parallel (as in two-level models) using the

double comma (,,) separator:

A->B || L1 _ R1 ,, C->D || L2 _ R2

Or composed as standard networks:

[A->B || L1 _ R1] .o. [C->D || L2 _ R2]

The difference is crucial if the rules are dependent on each other.
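A toy illustration of that difference, written in Python rather than xfst and restricted to context-free single-symbol rules, might look as follows. The function names parallel and sequential are hypothetical; the two rules a->b and b->a are chosen precisely because each feeds the other.

```python
def parallel(rules, s):
    """Apply all rules at once: every symbol is rewritten from the input."""
    return "".join(rules.get(ch, ch) for ch in s)


def sequential(rule_list, s):
    """Apply the rules one after another: later rules see earlier output."""
    for src, dst in rule_list:
        s = s.replace(src, dst)
    return s


print(parallel({"a": "b", "b": "a"}, "abba"))        # baab
print(sequential([("a", "b"), ("b", "a")], "abba"))  # aaaa
```

In the parallel case the two rules swap the symbols; in the sequential case the output of the first rule is consumed by the second, so every symbol ends up as a.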

These are the basics. For more information on XFST and its replacement rules refer to

(Karttunen, 2003).

1 The language of all possible strings.
