finnish, turkish and hungarian

Upload: hakan-yazici

Post on 03-Apr-2018

230 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/28/2019 Finnish, Turkish and Hungarian

    1/12

    arXiv:cs

    .CL/0006044v13

    0Jun2000

    Finite-State Non-Concatenative Morphotactics

    Kenneth R. Beesley and Lauri [email protected], [email protected]

    SIGPHON-2000, Proceedings of the Fifth Workshop of the ACLSpecial Interest Group in Computational Phonology, p. 1-12,

    August 6, 2000, Luxembourg.

    Abstract

    Finite-state morphology in the general traditionof the Two-Level and Xerox implementationshas proved very successful in the productionof robust morphological analyzer-generators, in-cluding many large-scale commercial systems.

    However, it has long been recognized thatthese implementations have serious limitationsin handling non-concatenative phenomena. Wedescribe a new technique for constructing finite-state transducers that involves reapplying theregular-expression compiler to its own output.Implemented in an algorithm called compile-replace, this technique has proved useful forhandling non-concatenative phenomena; and wedemonstrate it on Malay full-stem reduplicationand Arabic stem interdigitation.

    1 IntroductionMost natural languages construct words by con-catenating morphemes together in strict orders.Such concatenative morphotactics can be im-pressively productive, especially in agglutina-tive languages like Aymara (Figure 11) or Turk-ish, and in agglutinative/polysynthetic lan-guages like Inuktitut (Figure 2)(Mallon, 1999,2). In such languages a single word may con-tain as many morphemes as an average-lengthEnglish sentence.

    Finite-state morphology in the tradition of

    the Two-Level (Koskenniemi, 1983) and Xeroximplementations (Karttunen, 1991; Karttunen,1994; Beesley and Karttunen, 2000) has beenvery successful in implementing large-scale,robust and efficient morphological analyzer-generators for concatenative languages, includ-ing the commercially important European lan-guages and non-Indo-European examples like

    1I wish to thank Stuart Newton for this example.

    Finnish, Turkish and Hungarian. However,Koskenniemi himself understood that his initialimplementation had significant limitations inhandling non-concatenative morphotactic pro-cesses:

    Only restricted infixation and redu-

    plication can be handled adequatelywith the present system. Some exten-sions or revisions will be necessary foran adequate description of languagespossessing extensive infixation or redu-plication (Koskenniemi, 1983, 27).

    This limitation has of course not escaped the no-tice of various reviewers, e.g. Sproat(1992). Weshall argue that the morphotactic limitations ofthe traditional implementations are the directresult of relying solely on the concatenation op-

    eration in morphotactic description.We describe a technique, within the Xerox

    implementation of finite-state morphology, thatcorrects the limitations at the source, going be-yond concatenation to allow the full range offinite-state operations to be used in morphotac-tic description. Regular-expression descriptionsare compiled into finite-state automata or trans-ducers (collectively called networks) as usual,and then the compiler is re-applied to its ownoutput, producing a modified but still finite-state network. This technique, implemented

    in an algorithm called compile-replace, hasalready proved useful for handling Malay full-stem reduplication and Arabic stem interdigi-tation, which will be described below. Beforeillustrating these applications, we will first out-line our general approach to finite-state mor-phology.

    2 Finite-State Morphology

    1

  • 7/28/2019 Finnish, Turkish and Hungarian

    2/12

    Lexical: uta+ma+na-ka+p+xa+samacha-i+waSurface: uta ma n ka p xa samach i wa

    uta = house (root)+ma = 2nd person possessive+na = in

    -ka = (locative, verbalizer)+p = plural+xa = perfect aspect+samacha = "apparently"-i = 3rd person+wa = topic marker

    Figure 1: Aymara: utamankapxasamachiwa = it appears that they are in your house

    Lexical: Paris+mut+nngau+juma+niraq+lauq+sima+nngit+jungaSurface: Pari mu nngau juma nira lauq sima nngit tunga

    Paris = (root = Paris)

    +mut = terminalis case ending+nngau = go (verbalizer)+juma = want+niraq = declare (that)+lauq = past+sima = (added to -lauq- indicates "distant past")+nngit = negative+junga = 1st person sing. present indic (nonspecific)

    Figure 2: Inuktitut: Parimunngaujumaniralauqsimanngittunga = I never said I wanted to go toParis

    2.1 Analysis and Generation

    In the most theory- and implementation-neutralform, morphological analysis and generation ofwritten words can be modeled as a relationbetween the words themselves and analyses ofthose words. Computationally, as shown in Fig-ure 3, a black-box module maps from words toanalyses to effect Analysis, and from analysesto words to effect Generation.

    ANALYZER/

    GENERATOR

    ANALYSES

    WORDS

    Figure 3: Morphological Analysis/Generationas a Relation between Analyses and Words

    The basic claim or hope of the finite-state ap-proach to natural-language morphology is thatrelations like that represented in Figure 3 are infact regular relations, i.e. relations between tworegular languages. The surface language con-sists of strings (= words = sequences of sym-bols) written according to some defined orthog-raphy. In a commercial application for a naturallanguage, the surface language to be modeledis usually a given, e.g. the set of valid Frenchwords as written according to standard Frenchorthography. The lexical language again con-sists of strings, but strings designed according

    to the needs and taste of the linguist, represent-ing analyses of the surface words. It is some-times convenient to design these lexical stringsto show all the constituent morphemes in theirmorphophonemic form, separated and identifiedas in Figures 1 and 2. In other applications,it may be useful to design the lexical stringsto contain the traditional dictionary citationform, together with linguist-selected tag sym-

    2

  • 7/28/2019 Finnish, Turkish and Hungarian

    3/12

    Regular

    Expression

    Compiler

    Analysis Strings

    FST

    Word Strings

    Figure 4: Compilation of a Regular Expression into an fst that Maps between Two RegularLanguages

    bols like +Noun, +Verb, +SG, +PL, that conveycategory, person, number, tense, mood, case,etc. Thus the lexical string representing paie,the first-person singular, present indicative form

    of the French verbpayer

    (to pay), might bespelled payer+IndP+SG+P1+Verb . The tag sym-bols are stored and manipulated just like al-phabetic symbols, but they have multicharacterprint names.

    If the relation is finite-state, then it can bedefined using the metalanguage of regular ex-pressions; and, with a suitable compiler, theregular expression source code can be compiledinto a finite-state transducer (fst), as shown inFigure 4, that implements the relation compu-tationally. Following convention, we will oftenrefer to the upper projection of the fst, repre-senting analyses, as the lexical language, a setof lexical strings; and we will refer to the lowerprojection as the surface language, consistingof surface strings. There are compelling advan-tages to computing with such finite-state ma-chines, including mathematical elegance, flexi-bility, and for most natural-language applica-tions, high efficiency and data-compaction.

    One computes with fsts by applying them,in either direction, to an input string. Whenone such fst that was written for French is ap-plied in an upward direction to the surface wordmaisons (houses), it returns the related string

    maison+Fem+PL+Noun, consisting of the citationform and tag symbols chosen by a linguist toconvey that the surface form is a feminine nounin the plural form. A single surface string canbe related to multiple lexical strings, e.g. ap-plying this fst in an upward direction to sur-face string suis produces the four related lexicalstrings shown in Figure 5. Such ambiguity of

    surface strings is very common.

    etre+IndP+SG+P1+Verbsuivre+IndP+SG+P2+Verbsuivre+IndP+SG+P1+Verb

    suivre+Imp+SG+P2+Verb

    Figure 5: Multiple Analyses for suis

    Conversely, the very same fst can be appliedin a downward direction to a lexical string likeetre+IndP+SG+P1+Verb to return the relatedsurface string suis; such transducers are inher-ently bidirectional. Ambiguity in the downwarddirection is also possible, as in the relation ofthe lexical string payer+IndP+SG+P1+Verb (Ipay) to the surface strings paie and paye, whichare in fact valid alternate spellings in standard

    French orthography.

    2.2 Morphotactics and Alternations

    There are two challenges in modeling naturallanguage morphology:

    Morphotactics

    Phonological/Orthographical Alternations

    Finite-state morphology models both usingregular expressions. The source descriptionsmay also be written in higher-level notations(e.g. lexc (Karttunen, 1993), twolc (Kart-tunen and Beesley, 1992) and Replace Rules(Karttunen, 1995; Karttunen, 1996; Kempe andKarttunen, 1996)) that are simply helpful short-hands for regular expressions and that compile,using their dedicated compilers, into finite-statenetworks. In practice, the most commonly sep-arated modules are a lexicon fst, containinglexical strings, and a separately written set of

    3

  • 7/28/2019 Finnish, Turkish and Hungarian

    4/12

    Lexicon

    Regular Expression

    RuleRegular Expression

    Compiler

    Lexicon FST

    .o.

    Rule FST

    Lexical Transducer

    (a single FST)

    Figure 6: Creation of a Lexical Transducer

    rule fsts that map from the strings in the lex-icon to properly spelled surface strings. Thelexicon description defines the morphotactics ofthe language, and the rules define the alterna-tions. The separately compiled lexicon and rulefsts can subsequently be composed together asin Figure 6 to form a single lexical transducer(Karttunen et al., 1992) that could have beendefined equivalently, but perhaps less perspicu-ously and less efficiently, with a single regularexpression.

    In the lexical transducers built at Xerox, thestrings on the lower side of the transducer areinflected surface forms of the language. Thestrings on upper side of the transducer con-tain the citation forms of each morpheme andany number of tag symbols that indicate theinflections and derivations of the correspond-ing surface form. For example, the informationthat the comparative of the adjective big is big-ger might be represented in the English lexicaltransducer by the path (= sequence of statesand arcs) in Figure 7 where the zeros repre-sent epsilon symbols.2 The gemination of g and

    Lexical side:

    b

    b

    i

    i

    g

    g

    g

    0

    0

    +Adj

    e

    0

    r

    +Comp

    Surface side:

    Figure 7: A Path in a Transducer for English

    the epenthetical e in the surface form bigger re-sult from the composition of the original lexiconfst with the rule fst representing the regularmorphological alternations in English.

    2The epsilon symbols and their placement in thestring are not significant. We will ignore them when-ever it is convenient.

    For the sake of clarity, Figure 7 represents theupper (= lexical) and the lower (= surface) sideof the arc label separately on the opposite sidesof the arc. In the remaining diagrams, we usea more compact notation: the upper and thelower symbol are combined into a single labelof the form upper:lower if the symbols are dis-tinct. A single symbol is used for an identitypair. In the standard notation, the path in Fig-ure 7 is labeled as

    b i g 0:g +Adj:0 0:e +Comp:r.Lexical transducers are more efficient for

    analysis and generation than the classical two-level systems (Koskenniemi, 1983) because themorphotactics and the morphological alterna-tions have been precompiled and need not beconsulted at runtime. But it would be possi-ble in principle, and perhaps advantageous forsome purposes, to view the regular expressionsdefining the morphology of a language as an un-compiled virtual network. All the finite-stateoperations (concatenation, union, intersection,composition, etc.) can be simulated by an ap-ply routine at runtime.

    Most languages build words by simply string-ing morphemes (prefixes, roots and suffixes)together in strict orders. The morpho-tactic (word-building) processes of prefixa-tion and suffixation can be straightforwardlymodeled in finite state terms as concatena-tion. But some natural languages also ex-hibit non-concatenative morphotactics. Some-times the languages themselves are called non-concatenative languages, but most employ sig-nificant concatenation as well, so the term notcompletely concatenative (Lavie et al., 1988)is usually more appropriate.

    In Arabic, for example, prefixes and suffixesattach to stems in the usual concatenative way,but stems themselves are formed by a process

    4

  • 7/28/2019 Finnish, Turkish and Hungarian

    5/12

    known informally as interdigitation; while inMalay, noun plurals are formed by a processknown as full-stem reduplication. AlthoughArabic and Malay also include prefixation andsuffixation that are modeled straightforwardlyby concatenation, a complete lexicon cannot be

    obtained without non-concatenative processes.We will proceed with descriptions of how

    Malay reduplication and Semitic stem interdig-itation are handled in finite-state morphologyusing the new compile-replace algorithm.

    3 Compile-Replace

    The central idea in our approach to the mod-eling of non-concatenative processes is to de-fine networks using regular expressions, as be-fore; but we now define the strings of an in-termediate network so that they contain ap-

    propriate substrings that are themselves in theformat of regular expressions. The compile-replace algorithm then reapplies the regular-expression compiler to its own output, compil-ing the regular-expression substrings in the in-termediate network and replacing them with theresult of the compilation.

    To take a simple non-linguistic example, Fig-ure 8 represents a network that maps the regu-lar expression a* into ^[a*^]; that is, the sameexpression enclosed between two special delim-iters, ^[ and ^], that mark it as a regular-

    expression substring.a *0:^[ 0:^]

    Figure 8: A Network with a Regular-ExpressionSubstring on the Lower Side

    The application of the compile-replace algo-rithm to the lower side of the network elimi-nates the markers, compiles the regular expres-sion a* and maps the upper side of the pathto the language resulting from the compilation.The network created by the operation is shownin Figure 9.

    When applied in the upward direction, thetransducer in Figure 9 maps any string of theinfinite a* language into the regular expressionfrom which the language was compiled.

    The compile-replace algorithm is essentiallya variant of a simple recursive-descent copyingroutine. It expects to find delimited regular-expression substrings on a given side (upper or

    a:0

    a

    *:0

    *:0

    0:a*:a

    Figure 9: After the Application of Compile-Replace

    lower) of the network. Until an opening delim-iter ^[ is encountered, the algorithm constructsa copy of the path it is following. If the net-work contains no regular-expression substrings,the result will be a copy of the original network.When a ^[ is encountered, the algorithm looksfor a closing ^] and extracts the path betweenthe delimiters to be handled in a special way:

    1. The symbols along the indicated side of thepath are concatenated into a string andeliminated from the path leaving just thesymbols on the opposite side.

    2. A separate network is created that containsthe modified path.

    3. The extracted string is compiled into asecond network with the standard regular-expression compiler.

    4. The two networks are combined into a sin-gle one using the crossproduct operation.

    5. The result is spliced between the states rep-resenting the origin and the destination ofthe regular-expression path.

    After the special treatment of the regular-expression path is finished, normal processingis resumed in the destination state of the clos-ing ^] arc. For example, the result shown inFigure 9 represents the crossproduct of the twonetworks shown in Figure 10.

    a

    a *

    Figure 10: Networks Illustrating Steps 2 and 3of the Compile-Replace Algorithm

    In this simple example, the upper language ofthe original network in Figure 8 is identical tothe regular expression that is compiled and re-placed. In the linguistic applications presented

    5

  • 7/28/2019 Finnish, Turkish and Hungarian

    6/12

    Lexical: b a g i +Noun +PluralSurface: ^[ { b a g i } ^ 2 ^]

    Lexical: p e l a b u h a n +Noun +PluralSurface: ^[ { p e l a b u h a n } ^ 2 ^]

    Figure 11: Two Paths in the Initial Malay Transducer Defined via Concatenation

    in the next sections, the two sides of a regular-expression path contain different strings. Theupper side contains morphological information;the regular-expression operators appear only onthe lower side and are not present in the finalresult.

    3.1 Reduplication

    Traditional Two-Level implementations are al-ready capable of describing some limitedreduplication and infixation as in Tagalog(Antworth, 1990, 156162). The more chal-lenging phenomenon is variable-length redupli-cation, as found in Malay and the closely relatedIndonesian language.

    An example of variable-length full-stem redu-plication occurs with the Malay stem bagi,which means bag or suitcase; this form isin fact number-neutral and can translate as theplural. Its overt plural is phonologically bagi-bagi,3 formed by repeating the stem twice in arow. Although this pluralization process mayappear concatenative, it does not involve con-catenating a predictable pluralizing morpheme,but rather copying the preceding stem, what-ever it may be and however long it may be.Thus the overt plural of pelabuhan (port), it-self a derived form, is phonologically pelabuhan-pelabuhan.

    Productive reduplication cannot be describedby finite-state or even context-free formalisms.It is well known that the copy language, {ww| w L}, where each word contains two copiesof the same string, is a context-sensitive lan-guage. However, if the base language L is

    finite, we can construct a finite-state networkthat encodes L and the reduplications of all thestrings in L. On the assumption that there areonly a finite number of words subject to redu-plication (no free compounding), it is possibleto construct a lexical transducer for languages

    3In the standard orthography, such reduplicatedwords are written with a hyphen, e.g. bagi-bagi, thatwe will ignore for this example.

    such as Malay. We will show a simple and el-egant way to do this with strictly finite-stateoperations.

    To understand the general solution to full-stem reduplication using the compile-replace al-gorithm requires a bit of background. In theregular expression calculus there are several op-erators that involve concatenation. For exam-ple, if A is a regular expression denoting a lan-guage or a relation, A* denotes zero or more and

    A+ denotes one or more concatenations of A withitself. There are also operators that express afixed number of concatenations. In the Xeroxcalculus, expressions of the form A^n, where n isan integer, denote n concatenations of A. {abc}denotes the concatenation of symbols a, b, andc. We also employ ^[ and ^] as delimiter sym-bols around regular-expression substrings.

    The reduplication of any string w can thenbe notated as {w}^2, and we start by defininga network where the lower-side strings are builtby simple concatenation of a prefix ^[, a root

    enclosed in braces, and an overt-plural suffix ^2followed by the closing ^]. Figure 11 shows thepaths for two Malay plurals in the initial net-work.

    The compile-replace algorithm, applied to thelower-side of this network, recognizes each in-dividual delimited regular-expression substringlike ^[{bagi}^2^], compiles it, and replaces itwith the result of the compilation, here bagi-bagi. The same process applies to the entirelower-side language, resulting in a network thatrelates pairs of strings such as the ones in Fig-

    ure 12. This provides the desired solution, stillfinite-state, for analyzing and generating full-stem reduplication in Malay.4

    4It is well-known (McCarthy and Prince, 1995) thatreduplication can be a more complex phenomenon thanit is in Malay. In some languages only a part of the stemis reduplicated and there may be systematic differencesbetween the reduplicate and the base form. We believethat our approach to reduplication can account for thesecomplex phenomena as well but we cannot discuss the

    6

  • 7/28/2019 Finnish, Turkish and Hungarian

    7/12

    Lexical: b a g i +Noun +PluralS u r f a c e : b a g i b a g i

    Lexical: p e l a b u h a n +Noun +PluralS u r f a c e : p e l a b u h a n p e l a b u h a n

    Figure 12: The Malay fst After the Application of Compile-Replace to the Lower-Side Language

    The special delimiters ^[ and ^] can beused to surround any appropriate regular-expression substring, using any necessaryregular-expression operators, and compile-replace may be applied to the lower-side and/orupper-side of the network as desired. Thereis nothing to stop the linguist from insertingdelimiters multiple times, including via compo-sition, and reapplying compile-replace multipletimes (see the Appendix). The technique im-

    plemented in compile-replace is a general wayof allowing the regular-expression compiler toreapply to and modify its own output.

    3.2 Semitic Stem Interdigitation

    3.2.1 Review of Earlier Work

    Much of the work in non-concatenative finite-state morphotactics has been dedicated to han-dling Semitic stem interdigitation. An exampleof interdigitation occurs with the Arabic stemkatab, which means wrote. According to aninfluential autosegmental analysis (McCarthy,

    1981), this stem consists of an all-consonantroot ktb whose general meaning has to do withwriting, an abstract consonant-vowel templateCVCVC, and a voweling or vocalization that hesymbolized simply as a, signifying perfect as-pect and active voice. The root consonants areassociated with the C slots of the template andthe vowel or vowels with the V slots, producinga complete stem katab. If the root and the vo-calization are thought of as morphemes, neithermorpheme occurs continuously in the stem. Thesame root ktb can combine with the template

    CVCVC and a different vocalization ui, signifyingperfect aspect and passive voice, producing thestem kutib, which means was written. Simi-larly, the root ktb can combine with templateCVVCVC and ui to produce kuutib, the root drscan combine with CVCVC and ui to form duris,and so forth.

    Kay (1987) reformalized the autosegmental

    issue here due to lack of space.

    tiers of McCarthy (1981) as projections of amulti-level transducer and wrote a small Prolog-based prototype that handled the interdigita-tion of roots, CV-templates and vocalizationsinto abstract Arabic stems; this general ap-proach, with multi-tape transducers, has beenexplored and extended by Kiraz in several pa-pers (1994a; 1996; 1994b; 2000) with respect toSyriac and Arabic. The implementation is de-scribed in Kiraz and Grimley-Evans (1999).

    In work more directly related to the currentsolution, it was Kataja and Koskenniemi (1988)who first demonstrated that Semitic (Akkadian)roots and patterns5 could be formalized as reg-ular languages, and that the non-concatenativeinterdigitation of stems could be elegantly for-malized as the intersection of those regular lan-guages. Thus Akkadian words were formalizedas consisting of morphemes, some of which werecombined together by intersection and others ofwhich were combined via concatenation.

    This was the key insight: morphotactic de-

    scription could employ various finite-state op-erations, not just concatenation; and languagesthat required only concatenation were just spe-cial cases. By extension, the widely noticed lim-itations of early finite-state implementations indealing with non-concatenative morphotacticscould be traced to their dependence on the con-catenation operation in morphotactic descrip-tions.

    This insight of Kataja and Koskenniemi wasapplied by Beesley in a large-scale morphologi-cal analyzer for Arabic, first using an implemen-

    tation that simulated the intersection of stemsin code at runtime (Beesley, 1989; Beesley etal., 1989; Beesley, 1990; Beesley, 1991), and ranrather slowly; and later, using Xerox finite-statetechnology (Beesley, 1996; Beesley, 1998a), anew implementation that intersected the stemsat compile time and performed well at runtime.

    5These patterns combine what McCarthy (1981)would call templates and vocalizations.

    7

  • 7/28/2019 Finnish, Turkish and Hungarian

    8/12

    The 1996 algorithm that intersected roots andpatterns into stems, and substituted the originalroots and patterns on just the lower side withthe intersected stem, was admittedly rather adhoc and computationally intensive, taking overtwo hours to handle about 90,000 stems on a

    SUN Ultra workstation. The compile-replacealgorithm is a vast improvement in both gener-ality and efficiency, producing the same resultin a few minutes.

    Following the lines of Kataja and Kosken-niemi (1988), we could define intermediate net-works with regular-expression substrings thatindicate the intersection of suitably encodedroots, templates, and vocalizations (for a for-mal description of what such regular-expressionsubstrings would look like, see Beesley (1998c;1998b)). However, the general-purpose inter-

    section algorithm would be expensive in anynon-trivial application, and the interdigitationof stems represents a special case of intersectionthat we achieve in practice by a much more ef-ficient finite-state algorithm called merge.

    3.2.2 Merge

    The merge algorithm is a pattern-filling oper-ation that combines two regular languages, atemplate and a filler, into a single one. Thestrings of the filler language consist of ordinarysymbols such as d, r, s, u, i. The templateexpressions may contain special class symbolssuch as C (= consonant) or V (= vowel) thatrepresent a predefined set of ordinary symbols.The objective of the merge operation is to alignthe template strings with the filler strings andto instantiate the class symbols of the templateas the matching filler symbols.

    Like intersection, the merge algorithm oper-ates by following two paths, one in the templatenetwork, the other in the filler network, and itconstructs the corresponding single path in theresult network. Every state in the result corre-sponds to two original states, one in template,the other in the filler. If the original states areboth final, the resulting state is also final; oth-erwise it is non-final. In other words, in order toconstruct a successful path, the algorithm mustreach a final state in both of the original net-works. If the new path terminates in a non-finalstate, it represents a failure and will eventuallybe pruned out.

    The operation starts in the initial state of the

    original networks. At each point, the algorithmtries to find all the successful matches betweenthe template arcs and filler arcs. A match issuccessful if the filler arc symbol is included inthe class designated by the template arc sym-bol. The main difference between merge and

    classical intersection is in Conditions 1 and 2below:

    1. If a successful match is found, a new arc isadded to the current result state. The arcis labeled with the filler arc symbol; its des-tination is the result state that correspondsto the two original destinations.

    2. If no successful match is found for a giventemplate arc, the arc is copied into the cur-rent result state. Its destination is the re-sult state that corresponds to the destina-tion of the template arc and the currentfiller state.

    In effect, Condition 2 preserves any templatearc that does not find a match. In that case,the path in the template network advances toa new state while the path in the filler networkstays at the current state.

    We use the networks in Figure 13 to illustratethe effect of the merge algorithm. Figure 13shows a linear template network and two fillernetworks, one of which is cyclic.

    iu

    C V V C V C

    d r s

    Figure 13: A Template Network and Two FillerNetworks

    It is easy to see that the merge of the drsnetwork with the template network yields theresult shown in Figure 14. The three symbols

    of the filler string are instantiated in the threeconsonant slots in the CVVCVC template.

    d V V r V s

    Figure 14: Intermediate Result.

    Figure 15 presents the final result in whichthe second filler network in Figure 13 is mergedwith the intermediate result shown in Figure 14.

    8

  • 7/28/2019 Finnish, Turkish and Hungarian

    9/12

    Lexical: k t b =Root C V C V C =Template a + =VocSurface: ^[ k t b .m>. C V C V C .. C V C V C .. C V V C V C . . C V V C V C . < m . u * i

    As we have defined them, .. areweakly binding left-associative operators. Inthis example, the first merge instantiates thefiller consonants, the second operation fills thevowel slots. However, the order in which the

    merge operations are performed is irrelevant inthis case because the two filler languages do notprovide competing instantiations for the sameclass symbols.

    3.2.3 Merging Roots and Vocalizationswith Templates

    Following the tradition, we can represent thelexical forms of Arabic stems as consisting ofthree components, a consonantal root, a CV tem-plate and a vocalization, possibly preceded andfollowed by additional affixes. In contrast toMcCarthy, Kay, and Kiraz, we combine thethree components into a single projection. In asense, McCarthys three tiers are conflated intoa single one with three distinct parts. In ouropinion, there is no substantive difference froma computational point of view.

    For example, the initial lexical representationof the surface forms katab, kutib, and duuris,may be represented as a concatenation of thethree components shown in Figure 16. We usethe symbols =Root, =Template, and =Voc todesignate the three components of the lexicalform. The corresponding initial surface form isa regular-expression substring, containing twomerge operators, that will be compiled and re-placed by the interdigitated surface form.

    The application of the compile-replace opera-tion to the lower side of the initial lexicon yieldsa transducer that maps the Arabic interdigi-tated forms directly into their corresponding tri-

    partite analyses and vice versa, as illustrated inFigure 17.

    Alternation rules are subsequently composedon the lower side of the result to map the in-terdigitated, but still morphophonemic, stringsinto real surface strings.

    Although many Arabic templates are widelyconsidered to be pure CV-patterns, it hasbeen argued that certain templates also contain

    9

  • 7/28/2019 Finnish, Turkish and Hungarian

    10/12

    Lexical: k t b =Root C V C V C =Template a + =VocSurface: k a t a b

    Lexical: k t b =Root C V C V C =Template u * i =VocSurface: k u t i b

    Lexical: d r s =Root C V V C V C =Template u * i =VocSurface: d u u r i s

    Figure 17: After Applying Compile-Replace to the Lower Side

    hard-wired specific vowels and consonants.6

    For example, the so-called FormVIII templateis considered, by some linguists, to contain anembedded t: CtVCVC.

    The presence of ordinary symbols in the tem-plate does not pose any problem for the anal-ysis adopted here. As we already mentioned

    in discussing the intermediate representation inFigure 14, the merge operation treats ordinarysymbols in a partially filled template in thesame manner as it treats unmatched class sym-bols. The merge of a root such as ktb with thepresumed FormVIII template and the a+ vocal-ism,

    k t b . m > . C t V C V C . < m . a +

    produces the desired result, ktatab, withoutany additional mechanism.

    4 Status of the Implementations

    4.1 Malay MorphologicalAnalyzer/Generator

    Malay and Indonesian are closely-related lan-guages characterized by rich derivation andlittle or nothing that could be called inflec-tion. The Malay morphological analyzer pro-totype, written using lexc, Replace Rules, andcompile-replace, implements approximately 50different derivational processes, including pre-fixation, suffixation, prefix-suffix pairs (circum-fixation), reduplication, some infixation, andcombinations of these processes. Each root is

    marked manually in the source dictionary to in-dicate the idiosyncratic subset of derivationalprocesses that it undergoes.

    The small prototype dictionary, stored inan XML format, contains approximately 1000roots, with about 1500 derivational subentries(i.e. an average of 1.5 derivational processes

    6 See Beesley (1998c) for a discussion of this contro-versial issue.

    per root). At compile time, the XML dictio-nary is parsed and downtranslated into thesource format required for the lexc compiler.The XML dictionary could be expanded by anycompetent Malay lexicographer.

    4.2 Arabic MorphologicalAnalyzer/Generator

    The current Arabic system has been describedin some detail in previous publications (Beesley,1996; Beesley, 1998a; Beesley, 1998b) and isavailable for testing on the Internet.7 The modi-fication of the system to use the compile-replacealgorithm has not changed the size or the behav-ior of the system in any way, but it has reducedthe compilation time from hours to minutes.

    5 Conclusion

    The well-founded criticism of traditional imple-

    mentations of finite-state morphology, that theyare limited to handling concatenative morpho-tactics, is a direct result of their dependenceon the concatenation operation in morphotacticdescription. The technique described here, im-plemented in the compile-replace algorithm, al-lows the regular-expression compiler to reapplyto and modify its own output, effectively freeingmorphotactic description to use any finite-stateoperation. Significant experiments with Malayand a much larger application in Arabic haveshown the value of this technique in handling

    two classic examples of non-concatenative mor-photactics: full-stem reduplication and Semiticstem interdigitation. Work remains to be donein applying the technique to other known vari-eties of non-concatenative morphotactics.

    The compile-replace algorithm and the mergeoperator introduced in this paper are generaltechniques not limited to handling the specific

    7 http://www.xrce.xerox.com/research/mltt/arabic/

    10

    http://www.xrce.xerox.com/research/mltt/arabic/http://www.xrce.xerox.com/research/mltt/arabic/
  • 7/28/2019 Finnish, Turkish and Hungarian

    11/12

    morphotactic problems we have discussed. Weexpect that they will have many other usefulapplications. One illustration is given in theAppendix.

    6 Appendix: Palindrome Extraction

    To demonstrate the power of the compile-replace method, let us show how it can be ap-plied to solve another hard problem: identi-fying and extracting all the palindromes from alexicon. Like reduplication, palindrome identifi-cation appears at first to require more powerfultools than a finite-state calculus. But this taskcan be accomplished, in fact quite efficiently, byusing the compile-replace technique.

    Let us assume that L is a simple network con-structed from an English wordlist. We start by

    extracting fromL

    all the words with a propertythat is necessary but not sufficient for being apalindrome, namely, the words whose inverse isalso an English word. This step can be accom-plished by redefining L as [L & L.r] where &represents intersection and .r is the reverse op-erator. The resulting network contains palin-dromes such as madam as well non-palindromessuch as dog and god.

    The remaining task is to eliminate all thewords like dog that are not identical to theirown inverse. This can be done in threesteps. We first apply the technique used forMalay reduplication. That is, we redefine Las "^[" "[" L XX "]" "^" 2 "^]", and applythe compile-replace operation. At this p ointthe lower-side of L contains strings such asdogXXdogXX and madamXXmadamXX where XX isa specially introduced symbol to mark the mid-dle (and the end) of each string.

    The next, and somewhat delicate, step is toreplace the XX markers by the desired opera-tors, intersection and reverse, and to wrap thespecial regular expression delimiters ^[ and ^]around the whole lexicon. This can be done bycomposing L with one or several replace trans-ducers to yield a network consisting of expres-sions such as ^ [ d o g & [ d o g ] . r ^ ] and^ [ m a d a m & [ m a d a m ] . r ^ ]

    In the third and final step, the applicationof compile-replace eliminates words like dogbecause the intersection of dog with the in-verted form god is empty. Only the palin-dromes survive the operation. The extrac-

    tion of all the palindromes from the 25K Unix/usr/dict/words file by this method takes a cou-ple of seconds.

    References

    Evan L. Antworth. 1990. PC-KIMMO: a two-level processor for morphological analysis.Number 16 in Occasional publications in aca-demic computing. Summer Institute of Lin-guistics, Dallas.

    Kenneth R. Beesley and Lauri Karttunen.2000. Finite-State Morphology: Xerox Toolsand Techniques. Cambridge University Press.Forthcoming.

    Kenneth R. Beesley, Tim Buckwalter, and Stu-art N. Newton. 1989. Two-level finite-stateanalysis of Arabic morphology. In Proceed-ings of the Seminar on Bilingual Computingin Arabic and English, Cambridge, England,September 6-7. No pagination.

    Kenneth R. Beesley. 1989. Computer analysisof Arabic morphology: A two-level approachwith detours. In Third Annual Symposium onArabic Linguistics, Salt Lake City, March 3-4. University of Utah. Published as Beesley,1991.

    Kenneth R. Beesley. 1990. Finite-state de-scription of Arabic morphology. In Proceed-ings of the Second Cambridge Conference onBilingual Computing in Arabic and English,September 5-7. No pagination.

    Kenneth R. Beesley. 1991. Computer analy-sis of Arabic morphology: A two-level ap-proach with detours. In Bernard Comrie andMushira Eid, editors, Perspectives on ArabicLinguistics III: Papers from the Third An-nual Symposium on Arabic Linguistics, pages155172. John Benjamins, Amsterdam. Readoriginally at the Third Annual Symposium onArabic Linguistics, University of Utah, SaltLake City, Utah, 3-4 March 1989.

    Kenneth R. Beesley. 1996. Arabic finite-statemorphological analysis and generation. InCOLING96, volume 1, pages 8994, Copen-hagen, August 5-9. Center for Sprogteknologi.The 16th International Conference on Com-putational Linguistics.

    Kenneth R. Beesley. 1998a. Arabic morphologi-cal analysis on the Internet. In ICEMCO98,Cambridge, April 17-18. Centre for MiddleEastern Studies. Proceedings of the 6th Inter-

    11

  • 7/28/2019 Finnish, Turkish and Hungarian

    12/12

    national Conference and Exhibition on Multi-lingual Computing. Paper number 3.1.1; nopagination.

    Kenneth R. Beesley. 1998b. Arabic morphologyusing only finite-state operations. In MichaelRosner, editor, Computational Approaches toSemitic Languages: Proceedings of the Work-shop, pages 5057, Montreal, Quebec, Au-gust 16. Universite de Montreal.

    Kenneth R. Beesley. 1998c. Arabic stem mor-photactics via finite-state intersection. Paperpresented at the 12th Symposium on Ara-bic Linguistics, Arabic Linguistic Society, 6-7March, 1998, Champaign, IL.

    Lauri Karttunen and Kenneth R. Beesley. 1992.Two-level rule compiler. Technical ReportISTL-92-2, Xerox Palo Alto Research Center,Palo Alto, CA, October.

    Lauri Karttunen, Ronald M. Kaplan, and AnnieZaenen. 1992. Two-level morphology withcomposition. In COLING92, pages 141148,Nantes, France, August 23-28.

    Lauri Karttunen. 1991. Finite-state con-straints. In Proceedings of the Interna-tional Conference on Current Issues in Com-putational Linguistics, Penang, Malaysia,June 10-14. Universiti Sains Malaysia.

    Lauri Karttunen. 1993. Finite-state lexiconcompiler. Technical Report ISTL-NLTT-

    1993-04-02, Xerox Palo Alto Research Center,Palo Alto, CA, April.

    Lauri Karttunen. 1994. Constructing lexicaltransducers. In COLING94, Kyoto, Japan.

    Lauri Karttunen. 1995. The replace oper-ator. In ACL95, Cambridge, MA. cmp-lg/9504032.

    Lauri Karttunen. 1996. Directed replace-ment. In ACL96, Santa Cruz, CA. cmp-lg/9606029.

    Laura Kataja and Kimmo Koskenniemi. 1988.Finite-state description of Semitic morphol-

    ogy: A case study of Ancient Akkadian. InCOLING88, pages 313315.

    Martin Kay. 1987. Nonconcatenative finite-state morphology. In Proceedings of the ThirdConference of the European Chapter of theAssociation for Computational Linguistics,pages 210.

    Andre Kempe and Lauri Karttunen. 1996. Par-allel replacement in finite-state calculus. In

    COLING96, Copenhagen, August 59. cmp-lg/9607007.

    George Anton Kiraz and Edmund Grimley-Evans. 1999. Multi-tape automata for speechand language systems: A Prolog implemen-tation. In Jean-Marc Champarnaud, De-

    nis Maurel, and Djelloul Ziadi, editors, Au-tomata Implementation, volume 1660 of Lec-ture Notes in Computer Science. SpringerVerlag, Berlin, Germany.

    George Anton Kiraz. 1994a. Multi-tape two-level morphology: a case study in Semiticnon-linear morphology. In COLING94, vol-ume 1, pages 180186.

    George Anton Kiraz. 1994b. Multi-tape two-level morphology: a case study in Semiticnon-linear morphology. In Proceedings of the15th International Conference on Computa-

    tional Linguistics, Kyoto, Japan.George Anton Kiraz. 1996. Semhe: A gener-

    alised two-level system. In Proceedings of the34th Annual Meeting of the Association ofComputational Linguistics, Santa Cruz, CA.

    George Anton Kiraz. 2000. Multi-tiered non-linear morphology: A case study on Semitic.Computational Linguistics, 26(1).

    Kimmo Koskenniemi. 1983. Two-level mor-phology: A general computational model forword-form recognition and production. Pub-lication 11, University of Helsinki, Depart-

    ment of General Linguistics, Helsinki.Alon Lavie, Alon Itai, Uzzi Ornan, and Mori Ri-

    mon. 1988. On the applicability of two levelmorphology to the inflection of Hebrew verbs.In Proceedings of ALLC III, pages 246260.

    Mick Mallon. 1999. Inuktitut linguisticsfor technocrats. Technical report, Ittuku-luuk Language Programs, Iqaluit, Nunavut,Canada.

    John J. McCarthy and Alan Prince. 1995.Faithfulness and reduplicative identity. Oc-casional papers in Linguistics 18, University

    of Massachusetts, Amherst, MA. ROA-60.John J. McCarthy. 1981. A prosodic theory of

    nonconcatenative morphology. Linguistic In-quiry, 12(3):373418.

    Richard Sproat. 1992. Morphology and Compu-tation. MIT Press, Cambridge, MA.

    12