grammar development in xle - unideb.huweb.unideb.hu/...linguistics_2012/.../morphology1.pdf ·...

Gábor Csernyi Department of English Linguistics University of Debrecen [email protected] http://ieas.unideb.hu/csernyi


Post on 21-Oct-2020



  • „The quest for an efficient method for the analysis and generation of word-forms is no longer an academic research topic, although morphological analyzers still remain to be written for all but the commercially most important languages.”

    (Karlsson & Karttunen 1997)

    Morphological processing, covering both analysis and generation, is an important component of many (sub)fields of natural language processing, e.g. text-to-speech systems, machine translation and information retrieval.

  • A linguistic field concerned with the study of the internal structure of words.

    What is a word? A finite sequence of letters built over a finite set of symbols (i.e. the alphabet).

    Word form vs. lexeme: opens, opener → OPEN

  • Form variations of the same word; comes with changes in grammatical features:

    number [singular / plural], e.g. house - houses;

    person [1st / 2nd / 3rd], e.g. think - thinks;

    tense [past / present ( / future)], e.g. call - called;

    gender [feminine / masculine / neuter];

    case [accusative / dative / genitive / locative / etc.]

  • Creating new words: teach [V] → teacher [N]; record [V] → record [N]

  • Words are built from morphemes. Morpheme:

    the smallest meaningful unit of language;

    carries lexical meaning, or indicates grammatical features (e.g. present, 3rd person, singular).

  • Types of morphemes:

    Free: can stand alone as a word (e.g. cat, book).

    Bound: must be attached to another morpheme (root/base/stem); carries no (lexical) meaning on its own (e.g. -ed, -s, un-).

    Another classification of morphemes:

    Lexical morphemes: open-class. Lexeme: all the forms with the same meaning; lemma: the form that conventionally represents the lexeme (source: Wikipedia).

    Grammatical morphemes (function words): closed-class; carry grammatical meaning/function (source: Wordnet).

  • Root: „The primary lexical unit of a word, which carries the most significant aspects of semantic content and cannot be reduced into smaller constituents” (source: Wikipedia)

    Stem: the unit to which an inflectional ending is added.

    misconducts

    stem: misconduct

    inflection(al suffix): -s

    root: conduct

    derivational prefix: mis-

  • Base: an element to which affixes (inflectional/derivational) are added.

    misconducts

    base: misconduct

    inflection(al suffix): -s

    base / root: conduct

    derivational prefix: mis-

  • Any process whereby a new form is produced from a base. Characterising factors: productivity, regularity.

  • Attaching an affix to a base. Types of affix:

    prefix: attached to the beginning of the base, e.g. undo;

    suffix: attached after the base, e.g. dogs;

    infix: inserted inside the base, e.g. ?passer-by;

    circumfix: attached around the base.

  • Forming compounds by putting two or more words together. Note the orthography issues:

    words separated (white spaces between words), e.g. red wine;

    words hyphenated, e.g. time-consuming;

    words joined without white space, e.g. bedtime.

  • Forming a new word that differs in grammatical category and/or meaning from the original base. Note: conversion (i.e. zero derivation) is a subtype. Characteristics:

    rule-based process;

    productive (certain derivational affixes connect to certain base forms);

    output open to other (derivational) processes, e.g. product → productive → productivity.

  • Cliticisation, e.g. they’ve

    Reduplication, e.g. ?very very expensive

    Internal change, e.g. sink - sank - sunk

    Suppletion, e.g. good - better/best

  • Clipping, abbreviation, acronymy: advertisement → ad; manuscript → MS; frequently asked questions → FAQ

    Blending, e.g. motor + hotel → motel

    Backformation, e.g. beggar → beg

  • Analysis: identifying morphemes and deriving morphosyntactic features of word forms, e.g. elephants: elephant + s → elephant +Noun +Pl. Performed by a morphological analyzer/parser.

    Generation: providing all possible word forms of a root/stem. Performed by a morphological generator.

  • Segmentation: how to identify the morphemes in a word form, and how to segment a word form?

    Morphographemics: how to identify and account for alternation forms (spelling changes, e.g. carry vs. carried)?

    Morphotactics: are there any constraints on the position/order of morphemes (to constitute a valid word form)?
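
    The carry vs. carried alternation can be illustrated with a small spelling rule. This is only a sketch: the function name `attach` and the restriction to the suffixes -ed and -s are assumptions made for illustration, and the rule covers just the English "consonant + y" case.

```python
VOWELS = "aeiou"


def attach(stem, suffix):
    """Join stem and suffix, applying one English spelling rule.

    Rule (illustrative only): a stem ending in consonant + y changes
    y to i before -ed and y to ie before -s.
    """
    ends_in_consonant_y = (
        len(stem) >= 2 and stem[-1] == "y" and stem[-2] not in VOWELS
    )
    if ends_in_consonant_y and suffix == "ed":
        return stem[:-1] + "ied"
    if ends_in_consonant_y and suffix == "s":
        return stem[:-1] + "ies"
    # No alternation: plain concatenation.
    return stem + suffix


print(attach("carry", "ed"))  # carried
print(attach("play", "ed"))   # played (vowel + y, no change)
print(attach("call", "ed"))   # called
```

    A real system needs many such rules plus their exceptions, which is why morphographemics is treated as a problem of its own.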


  • Basic techniques:

    Lexicon with full forms: all possible word forms are stored in the lexicon; pattern matching as lexical lookup. Problems: large lexicon, slow processing; redundancy issues; language creativity and productivity?

    Lemma lexicon: a lexicon for lemmas (+ morphosyntactic interpretations) and a lexicon for affixes (affixes + morphotactics); lexical lookup means finding lemma + affix(es) sequences. Problems: morphographemics, suppletion.
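
    The two techniques can be contrasted in a few lines. This is a minimal sketch with made-up toy data: the tables `FULL_FORMS`, `LEMMAS` and `SUFFIXES`, the tag names and the function `analyze_by_stripping` are all assumptions for illustration, not part of any existing tool.

```python
# Technique 1: full-form lexicon. Every word form is listed explicitly.
FULL_FORMS = {
    "call": "call+Verb",
    "calls": "call+Verb+3Sg",
    "called": "call+Verb+Past",
    "carried": "carry+Verb+Past",
}

# Technique 2: lemma lexicon plus affix lexicon.
LEMMAS = {"call": "Verb", "carry": "Verb", "house": "Noun"}
# Which suffix may follow which category (a crude form of morphotactics).
SUFFIXES = {
    "Verb": {"": "", "s": "+3Sg", "ed": "+Past"},
    "Noun": {"": "+Sg", "s": "+Pl"},
}


def analyze_by_stripping(word):
    """Return every lemma + affix analysis that matches the word exactly."""
    analyses = []
    for lemma, category in LEMMAS.items():
        if not word.startswith(lemma):
            continue
        ending = word[len(lemma):]
        tag = SUFFIXES[category].get(ending)
        if tag is not None:
            analyses.append(f"{lemma}+{category}{tag}")
    return analyses


print(FULL_FORMS.get("carried"))        # carry+Verb+Past
print(analyze_by_stripping("called"))   # ['call+Verb+Past']
print(analyze_by_stripping("carried"))  # [] (spelling change not handled)
```

    The empty result for "carried" shows the morphographemics problem of the lemma lexicon: plain concatenation cannot see that carry and carri- are the same stem.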


  • Two-level morphology: introduced by Kimmo Koskenniemi (Koskenniemi (1983) Two-level morphology: a general computational model for word-form recognition and production). Features:

    language-independent;

    can be used for analysis and generation as well;

    two parallel levels: surface level, lexical level;

    symbol-to-symbol correspondence between the two levels, e.g. lexical level: elephant +Noun +Pl, surface level: elephants;

    transducers responsible for mapping between surface and lexical level;

    main components: rule component with two-level rules; lexical component (lemmas and affixes in the form of continuation classes).
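
    The mapping between the lexical and the surface level can be sketched as a tiny hand-built transducer. This is an illustration under stated assumptions, not Koskenniemi's formalism: the names `build_arcs` and `transduce`, the toy lexicon and the tags +Sg/+Pl are invented here, tags are written without spaces, and the rule component is left out entirely. Each arc pairs one lexical symbol with one surface symbol; the empty string means that a level has no symbol at that point.

```python
EPS = ""  # empty string: no symbol on this level


def build_arcs(stems):
    """Build arcs (source, lexical symbol, surface symbol, target)."""
    arcs = []
    next_state = 1
    for stem in stems:
        state = 0
        for ch in stem:
            # Stem letters are identical on both levels.
            arcs.append((state, ch, ch, next_state))
            state = next_state
            next_state += 1
        # Continuation class: every stem here continues as a noun.
        arcs.append((state, "+Noun", EPS, "N"))
    arcs.append(("N", "+Sg", EPS, "END"))
    arcs.append(("N", "+Pl", "s", "END"))
    return arcs


def transduce(arcs, text, analyze):
    """Map surface to lexical (analyze=True) or lexical to surface."""
    results = []

    def walk(state, pos, out):
        if state == "END" and pos == len(text):
            results.append("".join(out))
        for src, lex, surf, dst in arcs:
            if src != state:
                continue
            inp, outp = (surf, lex) if analyze else (lex, surf)
            if text.startswith(inp, pos):
                walk(dst, pos + len(inp), out + [outp])

    walk(0, 0, [])
    return results


ARCS = build_arcs(["elephant", "cat"])
print(transduce(ARCS, "elephants", analyze=True))          # ['elephant+Noun+Pl']
print(transduce(ARCS, "elephant+Noun+Pl", analyze=False))  # ['elephants']
```

    The same arc list serves both directions, which is the property that makes the approach usable for analysis and generation alike.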


  • Approaches making use of two-level morphology:

    Finite state morphological transducers: popular because of their speed and small size; used both for analysis and generation.

    Unification-based morphology: lexicon with allomorphs of the same base form; sequences of affixes analyzed as a whole (Gestalt-view); two-level morphology used for accounting for spelling changes; also, continuation classes; unifiability checks.

  • Machine learning approaches to morphological processing

    Making generalizations: the machine finds out the morphological rules automatically.

    Two types:

    Supervised: morphological rule learning with the help of examples.

    Unsupervised: structure identification of word forms in large, unannotated texts.
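
    The supervised case can be sketched as follows. This is a deliberately naive illustration, not a method taken from the slides: it assumes training examples given as (lemma, word form) pairs and learns only suffix-replacement rules by stripping the longest common prefix. The names `suffix_rule`, `learn_rules` and `PAIRS` are hypothetical.

```python
import os
from collections import Counter


def suffix_rule(lemma, form):
    """Return (lemma ending, form ending) left after the shared prefix."""
    prefix = os.path.commonprefix([lemma, form])
    return lemma[len(prefix):], form[len(prefix):]


def learn_rules(pairs):
    """Count how often each suffix-replacement rule occurs in the examples."""
    return Counter(suffix_rule(lemma, form) for lemma, form in pairs)


# Toy training examples: (lemma, past tense form).
PAIRS = [("call", "called"), ("walk", "walked"), ("carry", "carried")]

for (old, new), count in learn_rules(PAIRS).most_common():
    print(f"replace '{old}' with '{new}': seen {count} time(s)")
```

    Counting the rules gives a crude measure of their productivity; an unsupervised system would have to discover the pairs themselves from raw text first.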


  • 1. Humor (High-speed Unification Morphology)

    unification-based system

    developed by MorphoLogic

  • 2. HunMorph

    open-source tool

    C/C++ runtime layer toolkit as an extension to MySpell (reimplementation of the Ispell spellchecker)

    affix stripping methods

    language-independent runtime environment

    language-specific dictionaries (dictionary, affix)

  • 3. Xerox FST for Hungarian
