cs11-737: multilingual natural language processing

Post on 03-Dec-2021

CS11-737: Multilingual Natural Language Processing

Yulia Tsvetkov

Morphological Analysis and Inflection

What is a word

Bob’s handy man is a do-it-yourself kinda guy, isn’t he?

Morphology

The study of the formation and internal structure of words

Morpheme

Image from Lori Levin and David R. Mortensen’s draft book “Human Languages for Artificial Intelligence”

Words are made of morphemes

Bob’s handy man is a do-it-yourself kinda guy, isn’t he?

free morpheme

bound morphemes

Example by Austin Matthews

Morphological processes

● concatenation
● affixation = stem+affix
  ○ prefix
  ○ suffix
● non-concatenative affixation
  ○ infix
● compounding = stem+stem

stem
prefix + stem
prefix + stem + suffix = circumfixation
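The processes listed above can be sketched as plain string operations. This is a minimal illustration, not a morphological model: the function names are hypothetical, and the example words (English unlock, walked, handbook; German gespielt; Tagalog sumulat) are added here and do not come from the slides.

```python
def prefix(stem, affix):
    """Affixation with the affix before the stem."""
    return affix + stem

def suffix(stem, affix):
    """Affixation with the affix after the stem."""
    return stem + affix

def circumfix(stem, pre, suf):
    """Circumfixation: prefix + stem + suffix applied as one unit."""
    return pre + stem + suf

def infix(stem, affix, position):
    """Non-concatenative affixation: the affix is inserted inside the stem."""
    return stem[:position] + affix + stem[position:]

def compound(stem1, stem2):
    """Compounding: stem + stem."""
    return stem1 + stem2

print(prefix("lock", "un"))           # unlock
print(suffix("walk", "ed"))           # walked
print(circumfix("spiel", "ge", "t"))  # gespielt
print(infix("sulat", "um", 1))        # sumulat
print(compound("hand", "book"))       # handbook
```

Real affixation usually also triggers spelling or sound changes (happy + ness → happiness), which this sketch deliberately ignores.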

Tagalog

● Tagalog
  ○ stem: bundok
  ○ singular: mabundok
  ○ plural: mabubundok
  ○ gloss: ‘mountainous’

Example from Lori Levin and David R. Mortensen’s draft book “Human Languages for Artificial Intelligence”
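The Tagalog forms above combine a prefix (ma-) with reduplication of the start of the stem in the plural (bu + bundok). A minimal sketch that reproduces just these forms follows. It assumes the reduplicated material is everything up to and including the first vowel letter of the stem; that rule and the function names are assumptions for illustration, not a full account of Tagalog reduplication.

```python
VOWELS = set("aeiou")

def first_cv(stem):
    """Return the stem up to and including its first vowel, e.g. 'bundok' -> 'bu'."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return stem[: i + 1]
    return stem  # no vowel found: copy the whole stem (arbitrary fallback)

def ma_adjective(stem, plural=False):
    """Build the ma- adjective; the plural reduplicates the first CV of the stem."""
    base = first_cv(stem) + stem if plural else stem
    return "ma" + base

print(ma_adjective("bundok"))               # mabundok
print(ma_adjective("bundok", plural=True))  # mabubundok
```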

Arabic, Chinese

● Arabic
  ○ root and pattern morphology

● Chinese
  ○ compound words
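Root and pattern morphology can be pictured as slotting the consonants of a root into a vowel template. The sketch below uses the well-known Arabic root k-t-b in a simplified romanization. The patterns, the "C" placeholder convention, and the function name are illustrative assumptions, not taken from the slides.

```python
def interdigitate(root, pattern):
    """Fill each 'C' slot in the pattern with the next root consonant."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

root = "ktb"  # root associated with writing
print(interdigitate(root, "CaCaCa"))  # kataba  'he wrote'
print(interdigitate(root, "CiCaaC"))  # kitaab  'book'
print(interdigitate(root, "CaaCiC"))  # kaatib  'writer'
print(interdigitate(root, "maCCaC"))  # maktab  'office, desk'
```

Because the stem is not a contiguous substring of the word, segmentation methods that look for prefixes and suffixes do not apply directly to this kind of morphology.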

Morphological functions

● Derivational morphemes
  ○ bound morphemes used to create new words
  ○ if these affixes are attached to a new base, the resulting combination yields a word with a new meaning
  ○ the derived word often belongs to a different syntactic class

● Inflectional morphemes
  ○ bound morphemes used to mark grammatical distinctions
  ○ change the form but not the POS tag or the key meaning of the word

Interlinear glossed text (IGT)

● https://www.eva.mpg.de/lingua/resources/glossing-rules.php
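In interlinear glossed text, the object-language line and the gloss line are aligned word by word, and segmentable morphemes are separated by hyphens in both lines, with the same number of hyphens in each aligned pair. A minimal sketch of that alignment check follows. The function name and the English example are invented for illustration, and the code ignores other conventions such as clitic boundaries (=) and multi-part glosses.

```python
def align_igt(morpheme_line, gloss_line):
    """Pair each morpheme with its gloss; raise if the two lines do not align."""
    words = morpheme_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError("the two lines have different numbers of words")
    aligned = []
    for word, gloss in zip(words, glosses):
        morphs, labels = word.split("-"), gloss.split("-")
        if len(morphs) != len(labels):
            raise ValueError(f"hyphen count mismatch in {word!r} / {gloss!r}")
        aligned.append(list(zip(morphs, labels)))
    return aligned

for word in align_igt("dog-s bark-ed", "dog-PL bark-PST"):
    print(word)
```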

Types of morphological categories and functions

1. Nouns
   a. NUMBER: Singular, Dual, Plural
   b. GENDER (natural & grammatical): Masculine, Feminine, Neuter (Animate, Vegetable); AND AGREEMENT
   c. DEFINITENESS: Definite, Indefinite
   d. POSSESSION: 1st, 2nd, 3rd; Singular & Plural
   e. NOUN CLASS (Grammatical gender): Declension types I, II, III, etc.
   f. CASE PARADIGM (DECLENSION)

2. Adjectives
   a. RELATIONAL : QUALITATIVE : DEFECTIVE
   b. DEGREE: Comparative and Superlative

3. Verbs
   a. TRANSITIVITY: Transitive, Intransitive
   b. ASPECT: Perfective, Imperfective
   c. TENSE: Distant Past, Past, Present, Future, Distant Future
   d. VOICE: Active, Passive
   e. MOOD: Indicative, Imperative, Subjunctive
   f. Conjugation Class: I, II, III Conjugations and Conjugations: 1st, 2nd, 3rd Person, Sg, Pl Agreement

Morphological typology

● Isolating or Analytic
  ○ Vietnamese, Chinese, English

● Synthetic
  ○ Fusional or Flexional
    ■ German, Greek, Russian
    ■ Templatic: Hebrew and Arabic
  ○ Agglutinative or Agglutinating
    ■ Finnish, Turkish, Malayalam, Swahili
  ○ Polysynthetic
    ■ Inuit, Yupik

(Cettolo, Girardi, & Federico, 2012)

Type-token curves

Why is rich morphology a challenge for NLP?

● High type-token ratio due to the large variety of grammatical features expressed with morphology

○ This leads to lexical sparsity and out-of-vocabulary words

● In language generation, long-range relations between words need to be enforced to model morphological agreement

○ This leads to agreement errors

● Morphological properties vary across languages and language families, and mapping of morphological features across languages is a challenge

○ This is exacerbated by the variability of morphological rules and irregularities (e.g. dance → danced → danced but eat → ate → eaten)

○ This leads to problems in transfer learning, translation errors, and biases in translation
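The type-token ratio mentioned above is the number of distinct word forms (types) divided by the number of running words (tokens); a type-token curve tracks how the type count grows as more tokens are read. A minimal sketch, assuming whitespace tokenization and an invented toy sentence:

```python
def type_token_ratio(tokens):
    """Distinct word forms divided by running words."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)

def type_token_curve(tokens, step=1):
    """Return (tokens_seen, types_seen) pairs, sampled every `step` tokens."""
    seen, curve = set(), []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve

tokens = "the dog sees the dog".split()
print(type_token_ratio(tokens))  # 0.6
print(type_token_curve(tokens))  # [(1, 1), (2, 2), (3, 3), (4, 3), (5, 3)]
```

On comparable corpora, a morphologically rich language is expected to show a steeper curve, because each lemma surfaces in many inflected forms and new types keep appearing.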

Types of morphological processing

● Analysis
  ○ morphological parsing
  ○ morphological segmentation

● Generation
  ○ inflection generation
  ○ paradigm completion

● Acquisition of inflectional morphology

Morphological analysis

Morphological analysis with FSTs
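As a rough illustration of the idea, a finite-state transducer reads a surface form symbol by symbol and emits a lexical analysis (lemma plus tags). The toy transducer below handles only English noun stems with an optional plural -s. The state encoding, the `+N+SG` / `+N+PL` tag strings, and the function names are assumptions made for this sketch; real analyzers are compiled from much larger lexicons and rule sets.

```python
def build_toy_fst(stems):
    """Build transitions (state, input) -> {(output, next_state)} and final-state outputs."""
    trans, finals = {}, {}
    for stem in stems:
        # Spell out the stem one character at a time, copying input to output.
        for i, ch in enumerate(stem):
            trans.setdefault((stem[:i], ch), set()).add((ch, stem[: i + 1]))
        finals[stem] = "+N+SG"  # stopping after the bare stem means singular
        # Reading 's' after a complete stem emits the plural tags instead of 's'.
        trans.setdefault((stem, "s"), set()).add(("+N+PL", stem + "#pl"))
        finals[stem + "#pl"] = ""
    return trans, finals

def analyze(word, trans, finals):
    """Return all analyses of `word`, following every matching path."""
    paths = [("", "")]  # (current state, output so far); "" is the start state
    for ch in word:
        paths = [(dst, out + emitted)
                 for state, out in paths
                 for emitted, dst in trans.get((state, ch), ())]
    return sorted(out + finals[state] for state, out in paths if state in finals)

trans, finals = build_toy_fst(["cat", "dog"])
print(analyze("cats", trans, finals))  # ['cat+N+PL']
print(analyze("dog", trans, finals))   # ['dog+N+SG']
print(analyze("cas", trans, finals))   # []
```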

Morphological analysis with RNNs

Canonical segmentation

a. surface segmentation: achievability → achiev+abil+ity

b. canonical segmentation: achievability → achieve+able+ity

1. Character bidirectional GRU encoder with attention
2. GRU decoder produces output characters
3. Neural reranker for segments to identify canonical segments
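The difference between the two segmentation types can be checked mechanically: surface segments concatenate back to the original word, while canonical segments restore underlying forms and generally do not. A minimal sketch using the example above (the function names are hypothetical):

```python
def is_surface_segmentation(word, segments):
    """True if the segments are substrings that concatenate back to the word."""
    return "".join(segments) == word

def boundary_positions(segments):
    """Character offsets of the boundaries between consecutive segments."""
    positions, offset = [], 0
    for seg in segments[:-1]:
        offset += len(seg)
        positions.append(offset)
    return positions

word = "achievability"
surface = ["achiev", "abil", "ity"]
canonical = ["achieve", "able", "ity"]

print(is_surface_segmentation(word, surface))    # True
print(is_surface_segmentation(word, canonical))  # False
print(boundary_positions(surface))               # [6, 10]
```

This is why surface segmentation can be framed as labeling boundary positions in the word, whereas canonical segmentation needs a model that can generate characters absent from the input, such as the encoder-decoder listed above.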

Evaluation of morphological analysis

● Error rate
● Edit distance
● Morpheme F1
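The three metrics can be sketched as follows. Exact definitions vary between papers; this version assumes error rate is the share of words whose predicted analysis is not an exact match, edit distance is character-level Levenshtein distance, and morpheme F1 is computed over the multiset of predicted and gold morphemes.

```python
from collections import Counter

def error_rate(predictions, golds):
    """Fraction of items whose prediction differs from the gold analysis."""
    wrong = sum(p != g for p, g in zip(predictions, golds))
    return wrong / len(golds)

def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def morpheme_f1(predicted, gold):
    """Harmonic mean of morpheme precision and recall (multiset overlap)."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(predicted), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

pred = ["achiev", "able", "ity"]
gold = ["achieve", "able", "ity"]
print(error_rate(["+".join(pred)], ["+".join(gold)]))  # 1.0
print(edit_distance("+".join(pred), "+".join(gold)))   # 1
print(round(morpheme_f1(pred, gold), 3))               # 0.667
```

The example shows why the metrics complement each other: the prediction counts as fully wrong under error rate, yet it is one character away from the gold analysis and gets two of three morphemes right.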

UniMorph

https://unimorph.github.io

Morphological generation

1. Inflection generation

2. Paradigm completion
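Both tasks can be pictured on UniMorph-style data, where each row holds a lemma, an inflected form, and a semicolon-separated feature bundle, separated by tabs. In the sketch below, inflection generation is reduced to a table lookup and paradigm completion to finding the empty slots; in practice a learned model has to predict the forms. The rows are toy examples written for illustration, not copied from a UniMorph release, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy rows in lemma<TAB>form<TAB>features layout (illustrative only).
TOY_ROWS = """\
eat\tate\tV;PST
eat\teaten\tV;V.PTCP;PST
dance\tdanced\tV;PST
"""

def load_paradigms(text):
    """Map lemma -> {feature set -> inflected form}."""
    paradigms = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip():
            continue
        lemma, form, feats = line.split("\t")
        paradigms[lemma][frozenset(feats.split(";"))] = form
    return paradigms

def inflect(paradigms, lemma, feats):
    """Inflection generation as lookup: (lemma, features) -> form, or None."""
    return paradigms.get(lemma, {}).get(frozenset(feats.split(";")))

def missing_slots(paradigms, lemma):
    """Slots attested for some lemma but not filled for this one."""
    all_slots = set().union(*(p.keys() for p in paradigms.values()))
    return all_slots - set(paradigms[lemma])

paradigms = load_paradigms(TOY_ROWS)
print(inflect(paradigms, "eat", "V;PST"))  # ate
print(missing_slots(paradigms, "dance"))   # the past participle slot is unfilled
```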

The SIGMORPHON shared tasks

● Cross-lingual transfer for morphological inflection
● Morphological analysis in context
● Morphological paradigm completion

Morphological inflection generation

Paper for class discussion

● https://www.aclweb.org/anthology/D19-1091.pdf
● Read the paper
● Provide a critique of a part of the paper (e.g., focusing on an individual component of the proposed model architecture or a part of the experimental setup)
● Propose directions for follow-up work
