
Representing Lexical Knowledge for Bulgarian Inflectional Morphology in DATR

VELISLAVA STOYKOVA
Institute for Bulgarian Language
Bulgarian Academy of Sciences

52, Shipchensky proh. str. bl. 17, 1113 [email protected]

Abstract: The paper analyses the application of the DATR language for lexical knowledge presentation to the interpretation of Bulgarian inflectional morphology. It discusses the semantic network of the feature of definiteness in Bulgarian and compares the lexical knowledge representation for the different parts of speech with respect to the defined grammar rules, the sound alternations, the related formal presentation (both tabular and computational), and its application to lexicography.

Key-Words: Grammar knowledge presentation, semantic networks, computational morphology, DATR language for lexical knowledge presentation, semantic networks visualisation.

1 Introduction

Recent developments in information technology offer methodologies based on artificial intelligence for almost all fields of scientific research, and can result in a wide range of real applications. These applications can significantly improve the effectiveness of research, especially where a researcher has to search large-scale databases and use his or her own knowledge and intuition to make decisions about the results. Modern lexicography makes extensive use of electronic text corpora of different genres instead of the text archives used in the past. Existing electronic text collections of different types (web sources, electronic archives, text corpora, etc.) increase the need for proper software tools that assist different types of search and make the best use of these collections. Different dictionaries offer different conceptions of how a word sense is defined, but the problem of optimizing the search for a word and its context is central. It is specific to flective languages, where different inflected forms share a common meaning. So the problem of creating tools for knowledge-based search in large-scale electronic text corpora, able to deal with both base and inflected word forms, is central to optimizing the compilation of different types of dictionaries. In what follows, we discuss the principles and application of a computationally tractable model of Bulgarian inflectional morphology.

2 Bulgarian language is a flective language

Standard Bulgarian does not use cases for syntactic representation, but it has a very rich morphological system, both derivational and inflectional, and it uses prepositions with a base word form instead of case declension. It is considered a language with relatively free word order, so the subject can take any syntactic position in the sentence (including the last one). Another important grammatical feature of Bulgarian is the definite article, which is an ending morpheme [11]. This fact gives priority to morphological rather than syntactic interpretations of definiteness. At the level of syntax, the definite article marks the subject (when it is not a proper name). Modeling the inflectional morphology of the definite article is therefore an important stage in successful part-of-speech parsing of Bulgarian.

3 The semantics of definiteness and its formal morphological marker

According to traditional academic descriptive grammar works [11], the semantics of definiteness in standard Bulgarian is expressed in three ways: lexical, morphological, and syntactic. The lexical way is closely related to the lexical semantics of a particular lexeme. At the syntactic level, definiteness in Bulgarian expresses various types of semantic relationships, such as case (to mark the subject), part-of-whole, deixis, etc. Also, the definite article can assign individual or quantity definiteness, and it has a generic use as well. The syntactic function of definiteness in Bulgarian is expressed by a formal morphological marker, which is an ending morpheme. It differs by gender: for the masculine gender two types of definite morphemes exist, to determine entities defined in a different way, and each has two phonetic alternations. For the feminine and for the neuter gender only one definite morpheme exists, respectively. For the plural, two definite morphemes are used, depending on the final vowel of the main plural form. The following parts of speech in Bulgarian take the definite article: nouns, adjectives, numerals (both cardinals and ordinals), possessive pronouns (the full forms), and the reflexive-possessive pronoun (its full form). The definite article is the same for all parts of speech, but it has different forms to account for the features of gender and number. In what follows, we analyze the interpretations of definiteness in Bulgarian given in Stoykova [13] for nominal inflectional morphology, Stoykova [14] for the inflectional morphology of adjectives and numerals, and Stoykova [17] for the inflectional morphology of pronouns, focusing mainly on the principles and types of lexical knowledge representation used in these applications.

LATEST TRENDS on COMPUTERS (Volume II)

ISSN: 1792-4251 612 ISBN: 978-960-474-213-4

4 The traditional academic representation and computational morphology formal models representation of inflectional morphology

The traditional interpretation of inflectional morphology given in the academic descriptive grammar works [11] is a presentation in tables. The tables consist of all possible inflected forms of a given word with respect to its grammatical features. Artificial intelligence (AI) techniques offer a computationally tractable encoding preceded by a related semantic analysis, which suggests a subsequent architecture. To represent inflectional morphology in AI frameworks is, in fact, to represent a specific type of grammar knowledge. The standard computational approach to both derivational and inflectional morphology is to represent words as a rule-based concatenation of morphemes, and the main task is to construct relevant rules for their combination. With respect to the number and types of morphemes, different theories offer different approaches, depending on whether stems or suffixes vary:

(i) The conjugational solution offers an invariant stem and variant suffixes, and

(ii) The variant stem solution offers variant stems and an invariant suffix.

However, for Bulgarian a "mixed" approach is the most appropriate, because it treats both stems and suffixes as variables and can also account for the specific phonetic alternations. The DATR language for lexical knowledge presentation is a suitable formal framework for presenting the inflectional morphology of the Bulgarian definite article.

5 The DATR language

The DATR language is a non-monotonic language for defining inheritance networks through path/value equations [10]. It has both an explicit declarative semantics and an explicit theory of inference allowing efficient implementation, and at the same time it has the necessary expressive power to encode the lexical entries presupposed by work in the unification grammar tradition [7], [8], [9]. In DATR, information is organized as a network of nodes, where a node is a collection of related information. Each node has associated with it a set of equations that define partial functions from paths to values, where paths and values are both sequences of atoms. Atoms in paths are sometimes referred to as attributes. DATR is functional: it defines a mapping which assigns a unique value to each node/attribute-path pair, and the recovery of these values is deterministic. It can account for language phenomena such as regularity, irregularity, and subregularity, and allows the use of deterministic parsing. The DATR language has many implementations; the analyzed application was made using QDATR 2.0 (consult URL http://www.cogs.susx.ac.uk/lab/nlp/datr/datrnode49.html for the related file bul_det.dtr). This PROLOG encoding uses Sussex DATR notation [18]. DATR allows the construction of various types of language models (language theories); the application as a whole presents a rule-based formal grammar and a lexical database. The queries to be evaluated are the related inflected word forms, and the implementation can process words in the Cyrillic alphabet.
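To make the lookup regime described here more concrete, the following is a minimal Python sketch (not the QDATR implementation) of how a query against a node could be resolved by longest-prefix matching on paths, with a bare node reference continuing the query at the referenced node under the same path. The node names Base and Special, their paths and values, and the function query are hypothetical and chosen only for illustration; quoted (global) paths and the other DATR constructs are deliberately left out.

```python
from typing import Dict, List, Tuple, Union

Path = Tuple[str, ...]
# A right-hand side is either a value (a list of atoms)
# or the name of another node to inherit from.
Rhs = Union[List[str], str]

# Hypothetical toy network, not taken from bul_det.dtr.
NETWORK: Dict[str, Dict[Path, Rhs]] = {
    "Base": {
        (): [],                            # default: empty value for any path
        ("plur",): ["_i"],                 # more specific equation
    },
    "Special": {
        (): "Base",                        # inherit from Base by default
        ("plur", "def"): ["_a", "_ta"],    # override a single cell
    },
}


def query(node: str, path: Path) -> List[str]:
    """Resolve node:<path> using the longest defined prefix of the path."""
    equations = NETWORK[node]
    for cut in range(len(path), -1, -1):   # try the longest prefix first
        prefix = path[:cut]
        if prefix in equations:
            rhs = equations[prefix]
            if isinstance(rhs, str):       # bare node reference: same path, other node
                return query(rhs, path)
            return rhs
    raise KeyError(f"{node}:<{' '.join(path)}> is undefined")
```

In this sketch, query("Special", ("plur", "def")) uses the overriding equation, while query("Special", ("plur", "undef")) falls back to Base and returns ["_i"]. The override is what makes the inference non-monotonic: adding a more specific equation withdraws a conclusion that the default would otherwise give.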

6 The general architecture of the application

The analyzed application of the inflectional morphology of the Bulgarian definite article is linguistically motivated. In particular, the underlying basic idea is that of a paradigm, since morphemes are defined to be of semantic value and are considered a realisation of a specific morphosyntactic phenomenon. The words are encoded following the traditional notion of the lexeme, and different roots are introduced to account for the related phonetic alternations, which are defined to be of semantic value. Other DATR applications that present Slavonic inflectional morphology are available for Polish, Russian, Czech, and Slovene. They offer different insights for presenting both nominal and verbal inflectional morphology. The interpretations of nominal inflectional morphology deal mostly with the presentation of case morphology and use the paradigm as their underlying idea, like the applications made for Polish [5] and Russian [4]. As for verbal inflectional morphology, the interpretations for Czech [12], Russian [1], and Slovene [6] use conjugation as their underlying idea. Some ideas for the representation of inflectional morphology are indebted to Cahill and Gazdar [2], [3], who account for German inflectional morphology; however, their account of morphotactics was not applied.

Figure 1: The general architecture of the model. (Components shown: (i) definite morphemes, (ii) plural morphemes, (iii) noun type hierarchy, (iv) lexical database, and the query, which yields the inflected forms.)

The architecture is an inheritance network consisting of various nodes, which makes it possible to account for all related inflected word forms within the framework of one grammar theory. Thus, the general architecture of the application is as follows (Fig. 1):

(i) All definite inflecting morphemes for all forms of the definite article are attached to node DET and defined by their values through the paths <masc>, <masc_1>, <femn>, <neut>, and <plur>.

(ii) The 12 inflecting morphemes for generating plural forms are defined at node Suff.

(iii) The inflectional rules, defined as concatenations of morphemes for the generation of all possible inflected forms, are attached to the related inflectional type nodes.

(iv) The words are given as a lexical database and are attached to their respective inflectional type nodes. They are defined by lexical entries through the paths <root>, <root gend>, and <root plur>, so as to account for the different phonological alternations. The non-inflectional features are given as invariables and are defined with their particular values for the related words (only for the pronouns).

(v) The queries to be evaluated are all possible inflected word forms, which are produced after the compilation stage.

7 The principles of lexical knowledge encoding for Bulgarian inflectional morphology

The DATR logical representation framework uses rule-based reasoning with non-monotonic inference and default inheritance to represent the inflectional rules in a semantic network. It suggests a semantic network structure that can employ generalization-capturing rules, in which the grammar knowledge is encoded by attaching inflectional rules to the related nodes. In principle, DATR permits multiple default inheritance and prioritized inheritance enforced by orthogonal representation, and suggests that the lexicon be structured mostly by inheritance. This technique makes it possible to account for grammatical irregularities and to use compilation rules that can generate all possible inflected forms within one application. General morphological theory offers a segmentation of the word into a root to which prefixes, suffixes, or endings are attached. In Bulgarian, all three types of morphemes are present. However, different morphemes are used to account for the features of gender and number, which suggests a hierarchical structure of the lexical representation in which the feature of gender is a trigger that changes the values of the inflected forms. During inflection, various phonetic alternations also take place. The phonetic alternations at the morpheme boundary are interpreted by defining either new grammar rules or new nodes, and the phonetic alternations inside morphemes are interpreted by introducing different roots. It is also possible to use the technique of finite state transducers [15]. The application [16] also interprets more complicated cases of inflection, where both prefixes and suffixes can be processed by defining different nodes of the network. The semantic principle is also used for the encoding of nouns, adjectives, numerals, and pronouns, since the inflectional rules are represented by taking into account the semantics of the grammar features of the related part of speech and their internal hierarchy. In general, the interpretation presents a tabular conceptualisation inference task in which the conceptual representation is based on accounting for the orthographic, phonetic, morphological, and semantic properties.


7.1 Inflectional rules and lexical representation examples

We now analyze fragments of the DATR encoding [13], [14], [17]. We start the analysis of noun inflectional morphology with node DET, which defines all inflecting morphemes of the definite article for all parts of speech as follows (see footnote 1):

    DET:
        <sing undef> ==
        <sing def_2 masc> == _ja
        <sing def_2 masc_1> == _a
        <sing def_1 masc> == _jat
        <sing def_1 masc_1> == _ut
        <sing def_1 femn> == _ta
        <sing def_1 neut> == _to
        <plur undef> ==
        <plur def_1> == _te.
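As an illustration only, the DET node can be read as a lookup table keyed by paths. The Python sketch below assumes that a query with a longer path falls back to the longest defined prefix, in line with DATR's default mechanism, so that for example a gender atom after <plur def_1> is simply ignored. The function name definite_morpheme is hypothetical, and the sketch is not part of bul_det.dtr.

```python
# The DET equations as a Python dictionary: path (tuple of atoms) -> morpheme.
DET = {
    ("sing", "undef"): "",
    ("sing", "def_2", "masc"): "_ja",
    ("sing", "def_2", "masc_1"): "_a",
    ("sing", "def_1", "masc"): "_jat",
    ("sing", "def_1", "masc_1"): "_ut",
    ("sing", "def_1", "femn"): "_ta",
    ("sing", "def_1", "neut"): "_to",
    ("plur", "undef"): "",
    ("plur", "def_1"): "_te",
}


def definite_morpheme(number: str, definiteness: str, gender: str) -> str:
    """Return the DET morpheme for the longest defined prefix of the path."""
    path = (number, definiteness, gender)
    for cut in range(len(path), 0, -1):    # longest prefix first
        prefix = path[:cut]
        if prefix in DET:
            return DET[prefix]
    raise KeyError(f"DET:<{' '.join(path)}> is undefined")
```

For instance, definite_morpheme("sing", "def_1", "masc_1") gives "_ut", while definite_morpheme("plur", "def_1", "femn") gives "_te" because only the prefix <plur def_1> is defined. A path such as <sing def_2 femn> has no defined prefix in the table and is reported as undefined.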

Node Suff defines all inflected morphemes for the plural and is as follows:

    Suff:
        <suff_11> == _i
        <suff_111> == _ovci
        <suff_12> == _e
        <suff_121> == _ove
        <suff_122> == _eve
        <suff_123> == _ovce
        <suff_21> == _a
        <suff_22> == _ja
        <suff_211> == _ishta
        <suff_212> == _ta
        <suff_213> == _ena
        <suff_214> == _esa.

Node Noun defines the inflectional rules for nouns. It takes the information given through the paths <root> and <root plur> (defined at the related lexeme) for the <stem>, <gender> (for the inflected morphemes of the related gender defined at node DET), and <plur> (for the related plural inflected morphemes defined at node Suff). The node consists of grammar rules for generating all inflected noun word forms for the features of number and definiteness.

    Noun:
        <suff> == suff_11
        <gender> == masc_1
        <> == <stem> DET:<Idem "<gender>">
        <stem sing> == "<root sing>"
        <stem plur> == "<root plur>" Suff:<"<suff>">.

1 Here and elsewhere in the description we use the Latin alphabet to present morphemes instead of the Cyrillic alphabet normally used. Because of the mismatch between the two, some typically Bulgarian phonological alternations are written with two letters, whereas in the Cyrillic alphabet they are marked by one.

The example lexeme of the word for 'newspaper' is defined by <root> and <root plur>.

    Vestnik:
        <> == Noun
        <root> == vestnik
        <root plur> == vestnic.

The following inflected word forms are generated according to the defined grammar rules:

    Vestnik: <gender> == masc_1.
    Vestnik: <sing undef> == vestnik.
    Vestnik: <plur undef> == vestnic_i.
    Vestnik: <sing def_1> == vestnik_ut.
    Vestnik: <sing def_2> == vestnik_a.
    Vestnik: <plur def_1> == vestnic_i_te.
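The generated forms can be reproduced with a small Python sketch of the concatenation performed by node Noun. It assumes that the stem is the <root> in the singular and the <root plur> followed by the plural suffix in the plural, and that the definite morpheme is then looked up in DET with fallback to a shorter path. The class NounLexeme and the function inflect are hypothetical names, and only the DET and Suff entries needed for this lexeme are copied.

```python
from dataclasses import dataclass

# Subset of the DET and Suff equations needed for the example lexeme.
DET = {
    ("sing", "undef"): "",
    ("sing", "def_2", "masc_1"): "_a",
    ("sing", "def_1", "masc_1"): "_ut",
    ("plur", "undef"): "",
    ("plur", "def_1"): "_te",
}
SUFF = {"suff_11": "_i"}


@dataclass(frozen=True)
class NounLexeme:
    root: str                 # <root>
    root_plur: str            # <root plur>
    gender: str = "masc_1"    # default stated at node Noun
    suff: str = "suff_11"     # default stated at node Noun


def inflect(lex: NounLexeme, number: str, definiteness: str) -> str:
    """Concatenate stem and definite morpheme for one paradigm cell."""
    if number == "sing":
        stem = lex.root
    else:
        stem = lex.root_plur + SUFF[lex.suff]
    # Try the gender-specific path first, then the shorter path.
    det = DET.get((number, definiteness, lex.gender))
    if det is None:
        det = DET.get((number, definiteness))
    if det is None:
        raise KeyError(f"no definite morpheme for {number} {definiteness}")
    return stem + det


vestnik = NounLexeme(root="vestnik", root_plur="vestnic")
```

With these definitions, inflect(vestnik, "plur", "def_1") yields vestnic_i_te and inflect(vestnik, "sing", "def_1") yields vestnik_ut, matching the listing. The alternation k/c is handled purely by the choice of root, with no phonological rule.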

The adjectival inflectional type hierarchy uses the definite morphemes of node DET. Node AdjG defines the grammar rules for generating all inflected word forms for the features of gender, number, and definiteness as follows:

    AdjG:
        <sing undef masc> == "<root>"
        <sing undef femn> == "<root gend>" _a
        <sing undef neut> == "<root gend>" _o
        <sing def_2 masc> == "<plur undef masc>" DET
        <sing def_1 masc> == "<plur undef masc>" DET
        <sing def_1> == "<sing undef>" DET
        <plur undef> == "<root gend>" _i
        <plur def_1> == "<plur undef>" DET.

Node Adj provides the grammar rules for generating the degrees of comparison, which are formed with the preposed morphemes 'po-' and 'naj-'.

    Adj:
        <> == AdjG
        <compar> == po_ "<>"
        <superl> == naj_ "<>".

Node Adj_2 defines adjectives which realise two phonetic alternations during inflection. The additional base <root plur> is introduced to account for the additional alternation.

    Adj_2:
        <> == Adj
        <plur undef> == "<root plur>" _i.

The example lexeme for the word 'tesen' ('narrow') is defined as follows:

    Tesen:
        <> == Adj_2
        <root> == tesen
        <root gend> == tjasn
        <root plur> == tesn.
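The inheritance chain Tesen, Adj_2, Adj, AdjG can be mirrored, as a rough analogy only, by class inheritance in Python, where a subclass overrides a single rule and inherits the rest. The sketch below covers only the indefinite forms and the degrees of comparison; the class and method names are hypothetical, and keeping the "_" boundary marker in the output is an assumption, since no generated forms for this lexeme are listed.

```python
class AdjG:
    """Gender and number rules for the indefinite forms (after node AdjG)."""

    def __init__(self, root, root_gend, root_plur=None):
        self.root = root            # <root>: masculine singular base
        self.root_gend = root_gend  # <root gend>: base for the other forms
        self.root_plur = root_plur  # <root plur>: extra base, used by Adj2 only

    def sing_undef(self, gender):
        if gender == "masc":
            return self.root
        return self.root_gend + {"femn": "_a", "neut": "_o"}[gender]

    def plur_undef(self):
        return self.root_gend + "_i"


class Adj(AdjG):
    """Adds the degrees of comparison (after node Adj)."""

    def compar(self, form):
        return "po_" + form

    def superl(self, form):
        return "naj_" + form


class Adj2(Adj):
    """Overrides only the plural rule (after node Adj_2)."""

    def plur_undef(self):
        return self.root_plur + "_i"


tesen = Adj2("tesen", "tjasn", "tesn")
```

Under these assumptions, tesen.sing_undef("femn") gives tjasn_a, tesen.plur_undef() gives tesn_i through the overridden rule, and tesen.superl(tesen.sing_undef("masc")) gives naj_tesen. An adjective with a single alternation would be created from Adj and would form its plural from the <root gend> base instead.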


The possessive and reflexive-possessive pronouns use the adjectival inflectional rules [17]. In addition, they have the non-inflectional agreement features of gender and number and the feature of person, which are encoded as lexical information and are defined at the related lexemes as invariables. The example lexeme for the word 'negov' ('his') is as follows:

    Negov:
        <> == Adj
        <person> == third
        <number> == sing
        <gender> == masc
        <root> == negov
        <root gend> == negov.

The generated inflected forms are as follows:

    Negov: <sing undef masc> == negov.
    Negov: <sing undef femn> == negova.
    Negov: <sing undef neut> == negovo.
    Negov: <plur undef> == negovi.
    Negov: <sing def_1 masc> == negovijat.
    Negov: <sing def_2 masc> == negovija.
    Negov: <sing def_1 femn> == negovata.
    Negov: <sing def_1 neut> == negovoto.
    Negov: <plur def_1> == negovite.
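As a cross-check of the listing, the following Python sketch applies the AdjG rules to the lexeme for 'negov' and carries the invariant features as plain fields. It assumes that the masculine definite forms are built on the plural indefinite form, as the AdjG equations state, and that the listed surface forms correspond to the concatenated morphemes with the "_" boundary marker removed. The class Pronoun and the functions undef, definite, and surface are hypothetical names.

```python
from dataclasses import dataclass

# DET entries used by the adjectival rules.
DET = {
    ("sing", "def_2", "masc"): "_ja",
    ("sing", "def_1", "masc"): "_jat",
    ("sing", "def_1", "femn"): "_ta",
    ("sing", "def_1", "neut"): "_to",
    ("plur", "def_1"): "_te",
}


@dataclass(frozen=True)
class Pronoun:
    root: str        # <root>
    root_gend: str   # <root gend>
    person: str      # invariant lexical feature
    number: str      # invariant lexical feature
    gender: str      # invariant lexical feature


def undef(lex, number, gender=None):
    """Indefinite forms, following the AdjG equations."""
    if number == "plur":
        return lex.root_gend + "_i"
    if gender == "masc":
        return lex.root
    return lex.root_gend + {"femn": "_a", "neut": "_o"}[gender]


def definite(lex, number, definiteness, gender=None):
    """Definite forms: an indefinite base followed by the DET morpheme."""
    if number == "plur":
        return undef(lex, "plur") + DET[("plur", definiteness)]
    # Masculine definite forms take the plural indefinite form as their base.
    base = undef(lex, "plur") if gender == "masc" else undef(lex, "sing", gender)
    return base + DET[("sing", definiteness, gender)]


def surface(form):
    """Drop the morpheme boundary marker (an assumption about the listing)."""
    return form.replace("_", "")


negov = Pronoun("negov", "negov", person="third", number="sing", gender="masc")
```

For example, surface(definite(negov, "sing", "def_1", "masc")) gives negovijat, built as negov_i plus _jat, and surface(definite(negov, "plur", "def_1")) gives negovite. The invariant fields person, number, and gender play no part in the concatenation, which reflects their status as non-inflectional lexical information.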

8 Conclusion

The analyzed application of Bulgarian inflectional morphology encodes the lexical information using a hierarchical, linguistically motivated representation based on the traditional notion of the lexeme. It is a tabular conceptualization inference presentation which accounts for orthographic principles, phonetic alternations, and morphological dependencies. The model can be used in lexicography for different types of context search, since different word forms share the same meaning. It can also be useful for the automatic compilation of orthographic dictionaries. As further work, it would be useful to offer a syntactic interpretation.

References:

[1] Brown, D. (1998). Stem indexing and morphophonological selection in the Russian verb. In: R. Fabri, A. Ortmann, and T. Parodi, eds., Models of Inflection, Tuebingen: Niemeyer, 196–221.

[2] Cahill, L. and Gazdar, G. (1997). The inflectional phonology of German adjectives, determiners and pronouns. Linguistics, 35(2):211–245.

[3] Cahill, L. and Gazdar, G. (1999). German noun inflection. Journal of Linguistics 35.1, 1–42.

[4] Corbett, G. and Fraser, N. (1993). Network morphology: a DATR account of Russian nominal inflection. Journal of Linguistics 29, 113–142.

[5] Czuba, K. (1994). The DATR Web Pages at Sussex. http://www.cogs.susx.ac.uk/lab/nlp/datr/datrnode49, file polish_n.dtr.

[6] Erjavec, T. (1992). Treatments of Slovene verb morphology in inheritance models. MSc Thesis, Edinburgh: CCS, University of Edinburgh.

[7] Evans, R. and Gazdar, G. (1989a). Inference in DATR. Fourth Conference of the European Chapter of the Association for Computational Linguistics, 66–71.

[8] Evans, R. and Gazdar, G. (1989b). The semantics of DATR. In: Anthony G. Cohn, ed., Proceedings of the Seventh Conference of the Society for the Study of Artificial Intelligence and Simulation of Behaviour, London, 79–87.

[9] Evans, R. and Gazdar, G. (1990). The DATR papers. CSRP 139, Research Report, U. of Sussex.

[10] Evans, R. and Gazdar, G. (1996). DATR: A language for lexical knowledge representation. Computational Linguistics 22.2, 167–216.

[11] Gramatika na suvremennia bulgarski knizoven ezik, tom. 2, Morphologia. 1983.

[12] Skoumalova, H. (1996). Czech hierarchical lexicon, URL http://utkl.ff.cuni.cz/ skoumal/ file czech_verb.html.

[13] Stoykova, V. (2002). Bulgarian noun – definite article in DATR. In: D. Scott, ed., Artificial Intelligence: Methodology, Systems, and Applications. LNAI 2443, Springer-Verlag, 152–161.

[14] Stoykova, V. (2004a). The definite article of Bulgarian adjectives and numerals in DATR. In: C. Bussler and D. Fensel, eds., Artificial Intelligence: Methodology, Systems, and Applications, LNCS 3192, Springer-Verlag, 256–266.

[15] Stoykova, V. (2004b). Modeling sound alternations of Bulgarian language in DATR. In: E. Buchberger, ed., Proceedings of KONVENS 2004, Schriftenreihe der Oesterreichischen Gesellschaft fur Artificial Intelligence, Band 5, Wien, 201–204.

[16] Stoykova, V. (2009). The Inflectional Morphology of Bulgarian Definite Article in DATR. In: Z. Vetulani, ed., Proceedings of the 4th Language and Technology Conference, Poznan, 472–476.

[17] Stoykova, V. (2010). Bulgarian Possessive and Reflexive-possessive Pronouns in DATR. In: R. Trappl, ed., Cybernetics and Systems 2010, Austrian Society for Cybernetic Studies, 386–392.

[18] The DATR Web Pages at Sussex. (1977). URL http://www.cogs.susx.ac.uk/lab/nlp/datr/datrnode49.html
