Treebanks, Linguistic Theories and Applications
Lecture Two: Annotation Schemes

Petya Osenova and Kiril Simov
Sofia University “St. Kliment Ohridski”, Bulgaria
Bulgarian Academy of Sciences, Bulgaria

ESSLLI 2018
30th European Summer School in Logic, Language and Information (6 August – 17 August 2018)

Plan of the Lecture

• Motivation for Text Annotation
• General Architecture of the Annotation
• Participants in the Annotation Flow
• Characteristics of the Annotation

Motivation

Syntactically annotated texts enable:
• Linguistic research
• Parsers: high-quality, able to parse large amounts of text and texts in various languages, making them structured
• Information extraction: getting relevant information from texts, based not only on strings but also on structures
• More precise machine translation
• and much more

General Annotation Architecture

• Selection of texts (with respect to some task)
• Selection of the Annotation Tool
• Design of an Annotation Scheme with respect to a Linguistic Theory
• Training Annotators
• Annotation
  – measuring inter-annotator agreement
  – adjudication
• Re-Annotation
• Validation
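The slides do not say which agreement measure was used. As one common option, the sketch below computes Cohen's kappa for two annotators who labelled the same items. The label sequences are invented for illustration, and the function name `cohen_kappa` is an assumption, not part of any tool mentioned in the lecture.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items that received identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Invented phrase labels for six nodes, one list per annotator.
annotator_1 = ["NPA", "VPC", "VPS", "NPA", "PP", "VPC"]
annotator_2 = ["NPA", "VPC", "VPA", "NPA", "PP", "VPS"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Items on which the two annotators disagree are the natural input to the adjudication step.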

Participants in the Annotation Cycle

• Annotation Scheme Designer(s)
• Annotation Tool creator/maintainer
• Annotator Trainer
• Annotator(s)
• Adjudicator
• User(s)

The BulTreeBank Experience

• The treebank as a ‘botanical garden’

• The parsebank with error checking as a ‘forest’

• The cashbank with automatically parsed texts as a ‘jungle’

BulTreeBank Experience: Goals

• A set of Bulgarian sentences marked up with detailed syntactic information
• A core set of sentences designated inside the treebank
• Reliable partial grammar for automatic parsing of phrases in Bulgarian
• Software modules for compiling, manipulating and exploring the treebank

Requirements

• Adequate representation of the linguistic facts
  – Theory dependency
• Adequate representation of partial and complete analysis
  – Easy transfer of the information
• Convenience for manual annotation
  – Minimal information input

Why Theory Dependency? (1)

• At a certain level of granularity the annotation scheme becomes too complicated to be applied consistently
• At a certain level of granularity some linguistic theory has to be exploited
• Two choices:
  – A new “annotation” linguistic theory to be developed, or
  – A well-established existing theory to be adopted

We have chosen HPSG as a base for our treebank

Why Theory Dependency? (2)

• HPSG is one of the major linguistic theories based on rigorous formal grounds
• HPSG allows for a consistent description of linguistic facts on every linguistic level: syntactic, semantic and others
• HPSG allows for different levels of generalisation and therefore enables different experts to work on different levels of analysis
• The formal basis of HPSG allows translation to other formalisms
• There are universal HPSG principles that can be used to support the work of the annotators

Core Set of Sentences

In the process of treebank compilation, the core set plays a double role:
  – Gold standard: this set has to cover the basic linguistic phenomena in Bulgarian
  – Basis for HPSG grammar development

Here we present and discuss the annotation scheme for the treebank

HPSG Language Model

• Linguistic objects
  – Represented as directed graphs (feature structures)
• Sort hierarchy (linguistic ontology)
  – Represents the main types of linguistic objects and their characteristics
• Grammar (theory)
  – HPSG Universal and Bulgarian Specific Principles
  – Bulgarian Lexicon

HPSG Linguistic Objects

• The main linguistic object is of sort sign with three main attributes: PHON, SYNSEM and DTRS (for phrases)

• The co-reference (structure sharing) is the basic mechanism for ensuring the correct object structure

• The attribute DTRS determines the variety of constituent structures and the grammatical functions
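A minimal sketch of these three attributes, with feature structures modelled as nested Python dicts. PHON, SYNSEM and DTRS come from the slide. The attributes HEAD, AGR, HEAD-DTR and SUBJ-DTR are simplified, hypothetical names chosen for illustration and do not reproduce the full HPSG feature geometry. Structure sharing is modelled as two paths leading to the same Python object.

```python
# Feature structures sketched as nested dicts (a directed graph of attributes).
# Structure sharing is modelled by letting two paths point at the SAME object.

agreement = {"PERSON": "3", "NUMBER": "sg"}  # AGR is a hypothetical attribute

subject = {"PHON": ["Мъжът"], "SYNSEM": {"HEAD": "noun", "AGR": agreement}}
verb = {"PHON": ["целува"], "SYNSEM": {"HEAD": "verb", "AGR": agreement}}

# A phrase additionally carries DTRS, which records its daughters and their roles.
phrase = {
    "PHON": subject["PHON"] + verb["PHON"],
    "SYNSEM": {"HEAD": "verb"},
    "DTRS": {"HEAD-DTR": verb, "SUBJ-DTR": subject},
}

def follow(fs, path):
    """Walk a feature structure along a sequence of attribute names."""
    for attribute in path:
        fs = fs[attribute]
    return fs

verb_agr = follow(phrase, ["DTRS", "HEAD-DTR", "SYNSEM", "AGR"])
subj_agr = follow(phrase, ["DTRS", "SUBJ-DTR", "SYNSEM", "AGR"])
is_shared = verb_agr is subj_agr  # token identity, not merely equal values
```

Because both paths end in one object, a constraint placed on the agreement value through the subject automatically holds for the verb as well.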

The Hierarchy of Phrases

headed-phrase
    head-complement
    head-subject
    head-adjunct
        head-sem-adjunct
        head-pragmatic-adjunct
    head-filler
non-headed-phrase
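A sort hierarchy of this kind can be sketched as a child-to-parent map with a subsumption check. The nesting used here is an assumption, since the transcript lost the slide's indentation: the two adjunct sorts are placed under head-adjunct, and head-filler under headed-phrase. The root sort `phrase` is also assumed.

```python
# Child -> parent links; nesting and the root "phrase" are assumptions.
PARENT = {
    "headed-phrase": "phrase",
    "non-headed-phrase": "phrase",
    "head-complement": "headed-phrase",
    "head-subject": "headed-phrase",
    "head-adjunct": "headed-phrase",
    "head-filler": "headed-phrase",
    "head-sem-adjunct": "head-adjunct",
    "head-pragmatic-adjunct": "head-adjunct",
}

def supersorts(sort):
    """Return the sort itself followed by all of its ancestors."""
    chain = [sort]
    while sort in PARENT:
        sort = PARENT[sort]
        chain.append(sort)
    return chain

def subsumes(general, specific):
    """True if `general` is `specific` or one of its ancestors."""
    return general in supersorts(specific)

chain = supersorts("head-sem-adjunct")
```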

Constituency and Dependency (1)

• HPSG separates the linear order from the constituent structure
• Each constituent structure reflects the dependency between its immediate constituents
• The realization of the dependants follows the sequence: complements > subject > adjuncts

Constituency and Dependency (2)

[Figure: tree diagram; surviving node text: “Support-the”, “of”, “readiness”, “soldier”, “of”, “includes”, “restriction”, “of”, “his freedom”. The tree structure is not recoverable from this transcript.]

Linguistic Object Representation

The representation of the linguistic objects (of sort sign) in the core set of sentences is based on:

• Context-free-like trees
• Coreferential relations over the trees
• Node labels reflecting the synsem information

Word Order and Discontinuity

• Continuous realisation of daughters
• Head dependants permutation
  – a constituent from an upper level of the hierarchy is realised between constituents of a lower level
• Mixture of two saturated constituents
  – the constituents of two saturated phrases are mixed with each other
• External realisation of an inner constituent
  – extraction

Head Dependants Permutation

Мъжът целува момичето.
Man-the kisses girl-the.

Целува момичето мъжът.
Kisses girl-the man-the.

Момичето целува мъжът.
Girl-the kisses man-the.

Мъжът момичето целува.
Man-the girl-the kisses.

Момичето мъжът целува.
Girl-the man-the kisses.

Head Dependants Permutation

[Figure: tree diagram; surviving node text: “man-the”, “girl-the”, “kisses”. The tree structure is not recoverable from this transcript.]

External Realisation of an Inner Constituent

На къщата той поправи покрива
Of house-the he repaired roof-the

He repaired the roof of the house.

[Figure: tree diagram; surviving node text: “Of house-the”, “repaired”, “he”, “roof-the”. The tree structure is not recoverable from this transcript.]

Mixture of Two Saturated Constituents

малки го моми беряха
young it.MASC girls pick.IMPERF

Young girls were picking it.

[Figure: tree diagram; surviving node text: “young”, “it”, “girls”, “were picking”. The tree structure is not recoverable from this transcript.]

Realisation of the Dependants

With this case I was acquainted in time

[Figure: tree diagram; surviving node text: “With”, “this case”, “I”, “was”, “in time”, “acquainted”. The tree structure is not recoverable from this transcript.]

Linguistic Parameters

• We rely on two basic assumptions:
  – We use a domain-phenomena cross-classification, where the main syntactic domains are defined and the phenomena are analyzed
  – We analyze the data according to the following HPSG-oriented criteria:
    • the type of the sign
    • headedness
    • the typology of words and phrases
    • the saturation condition

Core Domains: NP

• Bare Bulgarian NPs are always functionally complete (lexical category N)

• NP dependency structures: head-complement (NPC), head-adjunct (NPA)

• Classification criteria: ontological features (mainly for the named entities), ellipsis, substantivization, nominalization

Core Domains: VP (1)

• VPs are classified as lexical and phrasal
• The lexical (V) includes:
  – Bare verbs
  – Verbs with clitics
  – Da-constructions
  – Analytical verb forms
  – Elliptical verbs

Core Domains: VP (2)

• The phrasal category is recursive:
  – First, the verb with its full-fledged complement(s) forms a head-complement phrase (VPC)
  – Then, the head-complement VP takes the subject and forms a head-subject phrase (VPS)
  – The adjuncts are attached last and form head-adjunct (VPA) projections
  – Each extracted element without a structural parent is attached to a head-filler phrase (VPF)
  – CL stands for a saturated verb phrase
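The attachment order complements > subject > adjuncts can be sketched as a small function that wraps a verb in successive projections. The labels V, VPC, VPS and VPA come from the slides. The tuple representation and the function names are assumptions made for this illustration, and the sketch ignores linear order, head-filler phrases (VPF) and the CL label.

```python
def project_vp(verb, complements=(), subject=None, adjuncts=()):
    """Build nested (label, head, dependants) tuples for a verbal projection."""
    node = ("V", verb, [])
    if complements:
        # The verb and all of its complements form one head-complement phrase.
        node = ("VPC", node, list(complements))
    if subject is not None:
        # The subject attaches above the complements.
        node = ("VPS", node, [subject])
    for adjunct in adjuncts:
        # Adjuncts attach last, one head-adjunct projection each.
        node = ("VPA", node, [adjunct])
    return node

def projection_labels(node):
    """Labels from the outermost projection down to the lexical head."""
    labels = []
    while isinstance(node, tuple):
        labels.append(node[0])
        node = node[1]
    return labels

# Glosses from the lecture's example sentence; "today" is an invented adjunct.
tree = project_vp("kisses", complements=["girl-the"], subject="man-the",
                  adjuncts=["today"])
labels = projection_labels(tree)
```

Since the structure records only which dependant attaches at which level, the same tree serves every word order of the sentence.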

Da Clause Example

Не може да не сме се разминали на някой светофар.
Not may to not to be.PL.1P REFLEX-PART pass.PRES.PERF on some traffic lights.

It is not possible that we have not passed each other at some traffic light.

Da Clause

It is not possible that we have not passed each other at some traffic light.

[Figure: tree diagram; surviving node text: “Not possible”, “to”, “not”, “are”, “REFLEX-PART”, “passed each other”, “of”, “some traffic light”. The tree structure is not recoverable from this transcript.]

Core Domains: AP, AdvP, PP

• Lexical adjective (A) can be combined with a possessive clitic
• AP can be a head-complement phrase (APC) or a head-adjunct phrase (APA)
• Non-modified adverb is marked lexically (Adv)
• AdvP can be a head-adjunct phrase (AdvPA) or a head-complement phrase with a gerund head (AdvPC)
• PP is always a head-complement phrase

Coordination

Coordination is treated as a non-headed phrase with the following requirements:

• The conjuncts have to agree in their valency potential: Valency lists and Mod feature

• They can be underspecified with respect to the category: coord
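One possible reading of these two requirements, sketched as a check over simplified signs. The feature names SUBJ and COMPS stand in for the slide's "valency lists" and are assumptions, as is the rule that the mother receives the label `coord` exactly when the conjuncts differ in category.

```python
def valency_potential(sign):
    """Valency lists plus MOD, as a hashable value for comparison."""
    return (tuple(sign.get("SUBJ", ())), tuple(sign.get("COMPS", ())),
            sign.get("MOD"))

def coordinate(conjuncts):
    """Build a non-headed coordination if the conjuncts agree in valency."""
    if len(conjuncts) < 2:
        raise ValueError("coordination needs at least two conjuncts")
    potentials = {valency_potential(c) for c in conjuncts}
    if len(potentials) != 1:
        raise ValueError("conjuncts differ in valency potential")
    categories = {c["CAT"] for c in conjuncts}
    # Unlike categories leave the mother's category underspecified.
    category = categories.pop() if len(categories) == 1 else "coord"
    subj, comps, mod = potentials.pop()
    return {"CAT": category, "SUBJ": list(subj), "COMPS": list(comps),
            "MOD": mod, "DTRS": list(conjuncts)}

# Two transitive verbs, as in lexical coordination.
prepare = {"CAT": "V", "SUBJ": ["NP"], "COMPS": ["NP"], "MOD": None}
adjust = {"CAT": "V", "SUBJ": ["NP"], "COMPS": ["NP"], "MOD": None}
# An adverb and an adjective that both modify a verb (unlike categories).
quickly = {"CAT": "Adv", "SUBJ": [], "COMPS": [], "MOD": "V"}
excited = {"CAT": "A", "SUBJ": [], "COMPS": [], "MOD": "V"}

like_result = coordinate([prepare, adjust])
unlike_result = coordinate([quickly, excited])
```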

Lexical Coordination

Тя умееше да приготви и натъкми всичко много добре.
She knows.IMPERF to prepare and adjust everything very well.

She knew how to prepare and adjust everything very well.

Lexical Coordination

She knew how to prepare and adjust everything very well.

[Figure: tree diagram; surviving node text: “She”, “knew”, “to”, “prepare and adjust”, “everything very well”. The tree structure is not recoverable from this transcript.]

Clausal (1)

Той щеше да стои в Копривщица и да се обучава в стрелба.
He would to stay in Koprivshtitsa (town) and to REFL.PART train in shooting.

He would stay in Koprivshtitsa and be trained in shooting.

Clausal

He would stay in Koprivshtitsa and be trained in shooting.

[Figure: tree diagram; surviving node text: “He”, “would”, “to stay in Koprivshtitsa”, “and”, “to be trained”, “REFLEX-PART”, “in shooting”. The tree structure is not recoverable from this transcript.]

Clausal (2)

Момчето хвана стареца за ръка и го поведе.
Boy-the took old man-the for hand and him.MASC led.

The boy took the old man by the hand and led him.

Clausal

The boy took the old man by the hand and led him.

[Figure: tree diagram; surviving node text: “Boy-the”, “took old man”, “by hand”, “and”, “him led”. The tree structure is not recoverable from this transcript.]

NP Internal

В тишина се носеше равният и самотен глас на кукувицата.
In silence REFL.PART spread even-the and lonely voice of cuckoo-the.

The monotonic and lonely voice of the cuckoo was floating in the silence.

NP Internal

The monotonic and lonely voice of the cuckoo was floating in the silence.

[Figure: tree diagram; surviving node text: “In silence”, “floating”, “REFLEX-PART”, “monotonic and lonely”, “voice of cuckoo”. The tree structure is not recoverable from this transcript.]

NP Coordination

измамата и плагиатството са незаконни
cheating-the and plagiarism-the are illegal

The cheating and the plagiarism are illegal.

NP Coordination

The cheating and the plagiarism are illegal.

[Figure: tree diagram; surviving node text: “cheating and plagiarism”, “are illegal”. The tree structure is not recoverable from this transcript.]

Adjunct Coordination – Unlike Categories

Тя отиде дома бързо и развълнувана като младо момиче.
She went home quickly and excited.FEM as young girl.

She went home quickly and excited as a young girl.

Adjunct Coordination (Unlike Categories)

She went home quickly and excited as a young girl.

[Figure: tree diagram; surviving node text: “She”, “went home quickly and”, “excited”, “as”, “young girl”. The tree structure is not recoverable from this transcript.]

Pragmatic constituents

Elements of the sentence structure with primarily pragmatic impact

Here we include different kinds of parenthetical expressions (of course, on the other hand, etc.) and vocative phrases

They are attached as adjuncts to the phrases which they modify pragmatically

Pragmatic Adjunct

Освен тебе, мале, никого нямам.
Except you mother nobody not-have-I.

Besides you, mother, I have nobody.

Pragmatic Adjunct

Besides you, mother, I have nobody.

[Figure: tree diagram; surviving node text: “Besides you mother nobody do not have-I”. The tree structure is not recoverable from this transcript.]

Preference Rules

• Coordination
  – Prefer constituent coordination to ellipsis!
• Ellipsis
  – If the sentence contains an anchoring element for restoring the ellipsis, prefer it to the discourse one!
• Modal verbs
  – In sentences with two possible readings, personal and impersonal, prefer the personal reading!

Core Phenomena (1)

Unexpressed Elements:
• Pro-dropness
  [kazah mu] [da prochete knigata]
  ‘I told him to read the book.’
• Ellipsis
  [Ivan pie bira,] [a Maria vino]
  ‘John drinks beer, but Maria wine.’
• Frame alternation
  [kazah mu] [da chete]
  ‘I told him to read.’

Core Phenomena (2)

Co-referential Relations:
• Agreement
• Binding
• Anaphora resolution
• Definiteness
• Control

Core Phenomena (3)

• Relative clauses
• Secondary predication

Type-shifting:
• Substantivization
• Nominalization

Secondary Predication

Планинският вятър духаше вече доста свеж и силен.
Mountain wind blow.IMPERF already very fresh and strong.

The mountain wind was already blowing quite fresh and strong.

Secondary Predication

The mountain wind was already blowing quite fresh and strong.

[Figure: tree diagram; surviving node text: “mountain wind”, “blowing”, “already quite”, “fresh and strong”. The tree structure is not recoverable from this transcript.]

Substantivization

Заговори бързо, защото младите приближаваха.
Start-to-speak.PAST quickly because young-the come-near.PAST

He spoke quickly, because the young were coming.

Substantivization

He spoke quickly, because the young were coming.

[Figure: tree diagram; surviving node text: “Spoke quickly”, “because”, “young”, “coming”. The tree structure is not recoverable from this transcript.]

Verb Ellipsis

Вол се връзва за рогата, а човек за езика.
Ox REFLEX-PART tied for horns, but person for tongue

An ox is tied by its horns, while a person [is tied] by the tongue.

Verb Ellipsis

An ox is tied by its horns, while a person [is tied] by the tongue.

[Figure: tree diagram; surviving node text: “Ox”, “REFLEX-PART tied”, “for horns”, “person”, “for tongue”. The tree structure is not recoverable from this transcript.]

Verb Ellipsis – 2 (with Polarity Change)

Нямаше заплаха, а само радост
Was-no threat, but only joy

There was no threat, but only joy.

Verb Ellipsis – 2 (with Polarity Change)

There was no threat, but [there was] only joy.

Change from negative to positive polarity for the elided verb.

[Figure: tree diagram; surviving node text: “there was no threat”, “only joy”, and the restored “there was”. The tree structure is not recoverable from this transcript.]

Noun Ellipsis

Българското и гръцкото правителство
Bulgarian-the and Greek-the government

The Bulgarian and the Greek governments.

Noun Ellipsis

The Bulgarian and the Greek governments.

[Figure: tree diagram; surviving node text: “Bulgarian”, “Greek”, “government”, “and”. The tree structure is not recoverable from this transcript.]

More Ellipses

Радиацията в Косово била по-слаба от София
Radiation-the in Kosovo was-RENARRATIVE more weak from Sofia

They say that radiation in Kosovo was weaker than in Sofia.

More Ellipses

They say that radiation in Kosovo was weaker than [the radiation] in Sofia.

[Figure: tree diagram; surviving node text: “Radiation”, “was”, “in Kosovo”, “weaker”, “from (than)”, “Sofia”. The tree structure is not recoverable from this transcript.]

Annotation Process (1)

• Sentence extraction
• Automatic pre-processing
  – Morphosyntactic tagging
  – Part-of-speech disambiguation
  – Partial parsing
  – Adding syntactic information from the HPSG grammar (ideally automatic)
• Manual annotation
  – XML mark-up and constraints over XML documents
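To illustrate what "constraints over XML documents" can mean, the sketch below encodes a tree as XML and checks two simple constraints with Python's standard library. It is not the CLaRK constraint language. The phrase labels are taken from the lecture, while the element name `Prep`, the XML encoding itself and the chosen analysis are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Labels from the lecture, plus the hypothetical lexical label "Prep".
ALLOWED = {"VPS", "VPC", "VPA", "VPF", "CL", "V", "NPA", "NPC", "N",
           "A", "APA", "APC", "Adv", "AdvPA", "AdvPC", "PP", "Prep"}

# Illustrative encoding of "pretty girl listened short story of boy".
GOOD = """
<VPS>
  <NPA><A>pretty</A><N>girl</N></NPA>
  <VPC>
    <V>listened</V>
    <NPA>
      <NPA><A>short</A><N>story</N></NPA>
      <PP><Prep>of</Prep><N>boy</N></PP>
    </NPA>
  </VPC>
</VPS>
""".strip()

def violations(root):
    """Return a list of human-readable constraint violations."""
    problems = []
    for node in root.iter():
        # Constraint 1: every element name must be a known label.
        if node.tag not in ALLOWED:
            problems.append(f"unknown label: {node.tag}")
        # Constraint 2: a PP is a preposition followed by exactly one complement.
        if node.tag == "PP":
            kids = [child.tag for child in node]
            if len(kids) != 2 or kids[0] != "Prep":
                problems.append(f"PP must be Prep + complement, got {kids}")
    return problems

good_report = violations(ET.fromstring(GOOD))
bad_report = violations(ET.fromstring("<PP><N>boy</N></PP>"))
```

Checks of this kind can run after every manual edit, so that an annotator sees a violation while the sentence is still open.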

Page 64: Treebanks, Linguistic Theories and Applicationsesslli2018.folli.info/wp-content/uploads/ESSLLI2018Tree... · 2018-08-14 · 30th European Summer School in Logic, Language and Information

30th European Summer School in Logic, Language and Information (6 August – 17 August 2018)

Annotation Process (2)

We use two systems:
• CLaRK system for:
  – Language resource creation, management and exploration
  – Minimisation of human work
  – Facilities for semantic validation of the information in language resources (content checking in addition to structure checking)
• TRALE system for:
  – Representation of the HPSG grammar
  – Logical inference


Annotation Process Example (1)

Sentence extraction (after morphological analysis): The pretty girl listened to the boy’s short story.

[Figure: the analysed tokens, glossed as "pretty girl listened short story of boy"]


Annotation Process Example (2)

Automatic pre-processing (after the partial parsing)

[Figure: result of partial parsing, with the chunks "pretty girl", "listened", "short story" and "of boy"]


Annotation Process Example (3)

Adding syntactic information from the HPSG grammar (1)

[Figure: partial analysis with the groups "pretty girl listened" and "short story of boy"]


Annotation Process Example (4)

Adding syntactic information from the HPSG grammar (2)

[Figure: partial analysis with the groups "pretty girl listened", "short story" and "of boy"]


Annotation Process Example (5)

Manual annotation

The intersection of all possible analyses is calculated:

[Figure: intersection of the analyses, with the groups "pretty girl", "listened" and "short story of boy"]

The information from the possible analyses that lies outside the intersection is encoded as constraints over this document
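The intersection step can be illustrated with a minimal sketch. It assumes that each candidate analysis is a set of labelled spans over token positions; this encoding, the labels and the function names are hypothetical and are not taken from CLaRK or BulTreeBank.

```python
from typing import List, Set, Tuple

# A constituent is (label, start, end) over token positions (hypothetical encoding).
Constituent = Tuple[str, int, int]


def intersect_analyses(analyses: List[Set[Constituent]]) -> Set[Constituent]:
    """Return the constituents shared by every candidate analysis."""
    if not analyses:
        return set()
    common = set(analyses[0])
    for analysis in analyses[1:]:
        common &= analysis
    return common


def open_choices(analyses: List[Set[Constituent]]) -> List[Set[Constituent]]:
    """Per analysis, the constituents outside the intersection.

    In the slides this residue is what gets encoded as constraints.
    """
    common = intersect_analyses(analyses)
    return [analysis - common for analysis in analyses]


# Tokens (glossed): 0 pretty, 1 girl, 2 listened, 3 short, 4 story, 5 of, 6 boy
# Analysis A attaches "of boy" inside the noun phrase; analysis B does not.
analysis_a = {("NP", 0, 2), ("NP", 3, 5), ("PP", 5, 7), ("VP", 2, 7), ("NP", 3, 7)}
analysis_b = {("NP", 0, 2), ("NP", 3, 5), ("PP", 5, 7), ("VP", 2, 7)}

shared = intersect_analyses([analysis_a, analysis_b])
residue = open_choices([analysis_a, analysis_b])
```

Here `shared` holds the part an annotator would see as already decided, and `residue` holds the choices still left open.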


XML Implementation

Two basic principles were adopted during the DTD design:
– the XML tree model is used to represent as much as possible of the structure of the sentence analysis
– the order of the lexical elements corresponds to the word order in the sentence
  – no empty elements are inserted in the structure (no traces)


XML Elements

• Structural elements corresponding to the linguistic domains, additionally classified with respect to their sign and constituent structure:
  – Lexical level: V, N, A, … elements
  – Phrasal level: VPC, VPS, VPA, NPC, NPA, …
• Discontinuous elements: DiscA, DiscE, …


Example of a Partial DTD Declaration

<!ENTITY % vpcomp
  "N|Subst|Nomin|Pron|NPC|NPA|A|APA|APC|PP|Adv|AdvPA|%CLALL;" >

<!ENTITY % vpchead "VD-Elip|V|V-Elip|Participle|Verbalised" >

<!ELEMENT VPC (
    ( (%vpchead;), (%vpcomp;)+ ) |
    ( (%vpcomp;)+, (%vpchead;), (%vpcomp;)+ ) |
    ( (%vpcomp;)+, (%vpchead;) )
) >
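Read together, the three alternatives of the VPC content model say that a VPC has exactly one head and at least one complement, on either side of the head. A minimal sketch of that check follows. The function name is hypothetical, and the elements covered by %CLALL; are left out because that entity is not defined in this excerpt.

```python
from typing import List

# Element names copied from the %vpcomp; and %vpchead; entities above.
# The %CLALL; members are omitted (assumption: not defined in this excerpt).
VP_COMP = {"N", "Subst", "Nomin", "Pron", "NPC", "NPA",
           "A", "APA", "APC", "PP", "Adv", "AdvPA"}
VPC_HEAD = {"VD-Elip", "V", "V-Elip", "Participle", "Verbalised"}


def is_valid_vpc(children: List[str]) -> bool:
    """Check a sequence of child element names against the VPC content model.

    The model allows head + complements, complements + head + complements,
    or complements + head: one head and one or more complements in total.
    """
    head_positions = [i for i, name in enumerate(children) if name in VPC_HEAD]
    if len(head_positions) != 1:
        return False
    head = head_positions[0]
    complements = children[:head] + children[head + 1:]
    return len(complements) >= 1 and all(name in VP_COMP for name in complements)
```

For instance, `is_valid_vpc(["NPA", "V", "PP"])` holds, while a bare `["V"]` is rejected because the model requires at least one complement.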


Co-reference Relations

• Co-reference is represented via variables
• Each variable is declared as a value of an ID attribute in the CoIndex element
• The variable is used as a value of an IDREF attribute
• Two elements are co-referent if they point to the same variable
• Each co-reference relation has a type
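The ID/IDREF mechanism can be sketched as follows. Only the CoIndex element name comes from the slides. The attribute names `id`, `ref` and `type`, the type value `equal`, the placement of the type on CoIndex, and the sample sentence are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical fragment: attribute names and the type value are assumptions.
SAMPLE = """
<S>
  <CoIndex id="x1" type="equal"/>
  <NPA ref="x1">the girl</NPA>
  <V>saw</V>
  <Pron ref="x1">herself</Pron>
</S>
""".strip()


def coreference_chains(xml_text: str) -> Dict[str, Tuple[str, List[str]]]:
    """Group elements by the variable they point to.

    Returns {variable: (relation type, texts of the co-referent elements)}.
    """
    root = ET.fromstring(xml_text)
    # Variables are declared as ID values on CoIndex elements.
    declared = {c.get("id"): c.get("type") for c in root.iter("CoIndex")}
    chains: Dict[str, List[str]] = defaultdict(list)
    for element in root.iter():
        ref = element.get("ref")
        if ref is None:
            continue
        if ref not in declared:
            raise ValueError(f"undeclared co-reference variable: {ref}")
        chains[ref].append(element.text or "")
    return {var: (declared[var], members) for var, members in chains.items()}


chains = coreference_chains(SAMPLE)
```

Two elements count as co-referent here exactly when they end up in the same chain, that is, when they point to the same declared variable.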


Additional Levels

In addition to the lexical and phrasal levels, we also represent:
• Multi Lexical level:
  – for the result of lexical rules, and
  – for idiosyncratic expressions
• Discourse level:
  – for representation of co-reference relations to other sentences within the discourse: InDiscourse and OutDiscourse variables


Bulgarian Grammar: Introducing the Background (1)

• Towards the design and the application of:
  – wide-coverage grammars, based on
  – deep linguistic knowledge
• In HPSG there already exist quite extensive implemented formal grammars
  – English (Flickinger 2000), German (Müller and Kasper 2000), Japanese (Siegel 2000, Siegel and Bender 2002)
  – Use of Minimal Recursion Semantics (Copestake et al. 2005).


Introducing the Background (2)

• The international initiative LinGO Grammar Matrix (Bender et al. 2010; Bender et al. 2002)
• Customized grammars for: Norwegian, French, Korean, Italian, Modern Greek, Spanish, Portuguese (http://www.delph-in.net/index.php?page=3)
• Open source software system, which supports the grammar and lexicon development – LKB (Linguistic Knowledge Builder) (http://wiki.delph-in.net/moin/LkbTop)


Grammar Matrix Architecture (1)

• Intended as a typological core for initiating the grammar writing on a specific language (Bender et al. 2002)
• Provides a customization web interface (Bender et al. 2010)
• Aims at:
  – a common basis for comparing grammars of various languages, to speed up the process of grammar development


Grammar Matrix Architecture (2)

• Supplies the skeleton of the grammar – the type hierarchy with basic types and features

• Based on the experience with several languages (predominantly English, German and Japanese)

• Aims at semantic modeling of a language:
  – referential entities and events
  – semantic relations
  – semantic encoding and contribution of the linguistic phenomena


Example

individual := semarg & [ SORT semsort ].
event-or-ref-index := individual.
ref-ind := index & event-or-ref-index & [ PNG png ].
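The three definitions above place each new type under one or more supertypes. A minimal sketch of the resulting inheritance relation follows. It models only the supertype links stated in the example, ignores the feature constraints (SORT, PNG), and uses hypothetical helper names.

```python
from typing import Dict, List, Set

# Supertype links read off the three definitions above; features are ignored.
PARENTS: Dict[str, List[str]] = {
    "individual": ["semarg"],
    "event-or-ref-index": ["individual"],
    "ref-ind": ["index", "event-or-ref-index"],
}


def supertypes(type_name: str) -> Set[str]:
    """All direct and inherited supertypes of a type."""
    seen: Set[str] = set()
    stack = list(PARENTS.get(type_name, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return seen


def is_subtype(type_name: str, ancestor: str) -> bool:
    """True if type_name equals ancestor or inherits from it."""
    return type_name == ancestor or ancestor in supertypes(type_name)
```

Under these links, `ref-ind` inherits from both `index` and `semarg`, which shows how a single type can combine two branches of the hierarchy.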


Challenges (1)

• Languages like Bulgarian grammaticalize a lot of linguistic phenomena. Thus, the most common level of description would be the morphosyntactic level rather than the semantic one

• Grammar Matrix is implemented in accordance with some version of the HPSG theory – thus it implies certain decisions with respect to the possible analyses


Challenges (2)

• The Grammar Matrix predefines some phenomena too strictly; for others, it offers possibilities for generalization

• Depending on the preference of the modelled linguistic phenomena, the grammar developers might have to extend and/or change the core grammar

• How much information should be encoded within the grammar, and which steps should be handled outside the grammar


Localization for Bulgarian

• The Multilingual Testset

• The Language Specific Phenomena


The Multilingual Testset

• 100 sentences, which in the Bulgarian translated set became 178
• The translated set also incorporated a number of language-specific phenomena

Result:
• 193 positive sentences
• 20 negative sentences

Comparable to the testset for Portuguese in the first phase of the grammar development


81
1. Aбрамс знаеше, че вали (дъжд).
   Abrams knew that rains-it (rain).
2. Aбрамс знаеше, че е валяло (дъжд).
   Abrams knew that is-rained-it (rain).
Abrams knew that it rained.

91
Aбрамс възнамеряваше/се канеше да лае/да излае.
Abrams(masc) was-intending/reflexive was-about to bark/to bark one or more times.
Abrams intended to bark.

101
Aбрамс искаше Браун да лае/да излае.
Abrams wanted Browne to bark/to bark one or more times.
Abrams intended Browne to bark.

111
Всяка котка лаеше/излая.
Every(fem) cat(fem) was-barking/barked one or more times.
Every cat barked.

121
Всяка котка преследваше някакво куче.
Every(fem) cat(fem) was-chasing some(neut) dog(neut).
Every cat chased some dog.


Core Phenomena

• Complementation
• Modification
• Coordination
• Agreement
• Control
• Quantification
• Negation
• Illocutionary force
• Passivization
• Nominalization
• Relative clauses
• Light verb constructions
• Other


Type Hierarchy

• The initial grammar has 297 types, owing to the small lexicon

• Examples:
n_-_cf_le := basic-common-noun-intr &
  [ SYNSEM.LOCAL.AGR.PNG.GENDER feminine ].
nom-pers-pro-noun-le := reg-pers-pro-noun &
  [ SYNSEM.LOCAL.CAT.HEAD.CASE nom ].


Some Specificities

• Bulgarian is a pro-drop language
• Bulgarian verbs encode aspect lexically
• Several Bulgarian verb synonyms have been provided for a single English verb
  Abrams handed the cigarette to Browne: дам (give), подам (pass), връча (deliver), предам (hand in)

• Bulgarian has clitic counterparts to the complements as well as a clitic reduplication mechanism


The Language Specific Phenomena: Semantics vs. Morphosyntax

• The information often has to be split between the semantic phenomenon and its realization

• For example, adjectives, participles and numerals can have morphologically definite forms, although definiteness is not a semantic property of these categories.


Definiteness

The event selects for a semantically definite [SYNSEM.LOCAL.HOOK.INDEX.DEF +]
but morphologically indefinite noun [SYNSEM.LOCAL.AGR.DEF -]

Example:
старото куче ‘old-the dog’


Semantic Information

Grammar Matrix provides several possibilities to get the semantic information:
• For tense and mood, the aggregated encoding is chosen (AGR.E.TENSE)
• For aspect, the separated encoding is chosen (HEAD.TAM.ASPECT; TAM = Tense, Aspect, Mood)
• The aggregated way is a better choice for a unified syntactic-semantic analysis
• The separated representation leaves open the possibility of handling the syntactic and the semantic contribution differently


Generalized Types

Very often, in Bulgarian the generalization cannot be kept at higher levels because of the variety in the morphosyntactic behaviour types within the Bulgarian constructions


Copula Constructions

Adjectives, adverbs and prepositions have an event index, but they cannot share the same generalized type, because:
• Adjectives structure-share their PNG (person, number and gender) characteristics with the copula’s XARG – the subject
• The adverbs have to be restricted to intersective modifiers when taken as complements

All these heads raise their semantic index to the copula, which is semantically vacuous itself


Noun Phrases

• The nouns, however, have a referential index
• No index is raised from the noun complement up to the copula
• 8 lexical types are introduced:
  • Two for present and past copula forms (*Am-I there; OK: Was-I there).
  • Each of the two is then divided into four subtypes depending on the complement (e.g. present copula – noun; present copula – adjective)


Relatively Free Word Order

Most of the rules include all the possible orders in spite of the canonical readings

Example (rules for):
• head-modifier and modifier-head
• clitic-head and head-clitic

BulTreeBank is used as a discriminative tool, because it comprises the canonical and most preferred analysis per sentence


Bulgarian Argument-Related Clitics

• Clitics are viewed as lexical projections of the head (i.e. operated by special rules)

• While the regular forms are treated as head arguments (complements) (i.e. operated by head-complement principles)

• The clitic does not contribute its separate semantics, because it is not a full-fledged complement


The Rich Morphology

Two ways of morphology incorporation were possible:

• Re-design of the whole systematic and unsystematic morphology within the grammar, which would be a linguistically sound, but time-consuming step

• Opportunistic: the inflection classes of the morphological dictionary for Bulgarian (Popov et al. 2003) have been transferred into the grammar


Morphological Rules

• The rich morphology requires 2600 rules

Thus:
The morphological work has been suppressed in the name of syntactic and semantic modeling

But:
A large lexicon could not operate without the complete set of the morphosyntactic types and rules


Encoding of Morphological Rules

ima_v1 := v_there-is_le &
  [ STEM < "има" >,
    SYNSEM.LKEYS.KEYREL.PRED "има_v_1_rel" ].

ima_v1 := v_there-is_le &
  [ STEM < "има" >,
    SYNSEM [ LKEYS.KEYREL.PRED "има_v_1_rel",
             LOCAL.CAT.HEAD.MCLASS [ FIN-PRESENT finite-present-101,
                                     FIN-AORIST finite-aorist-080,
                                     FIN-IMPERF finite-imperf-025,
                                     PART-IMPERF participle-imperf-024,
                                     PART-AORIST participle-aorist-095 ] ] ].


[Figure: analysis of an example glossed as "it rained"]


[Figure: analysis of an example glossed as "buy him house"]


Evaluation

• Within the system [incr tsdb()] (Oepen 2001)
• The average number of distinct analyses is 3.73
• The ambiguity of analyses is mainly due to the following factors:
  1. Morphological homonymy of the word forms
  2. More than one possible word order
  3. More than one possible attachment
  4. Competing rules in the grammar


BulTreeBank for the Selection of the Correct Parses

Task: To use the analyses in BulTreeBank for extracting the necessary discriminating properties for disambiguation

Motivation: of the 654 analyses in BURGER, only 81 are unique. 348 analyses have been rejected, which is more than 50 %. Hence, a disambiguation mechanism is needed.
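The "more than 50 %" figure can be checked directly from the two counts on the slide. The variable names below are illustrative only.

```python
# Counts taken from the slide; variable names are illustrative.
total_analyses = 654
rejected_analyses = 348

rejected_share = rejected_analyses / total_analyses
print(f"rejected: {rejected_share:.1%}")  # prints "rejected: 53.2%"
```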


Main Sources of Ambiguity

• Morphological ambiguity
• Various places of attachment
• Neutral vs. focused ordering of constituents
• Proliferation of several competing rules for the same item


Ambiguities (1)

• Morphological ambiguity:
  1. Абрамс даде цигара на Браун.
     Abrams gives/gave cigarette to Browne.
     Abrams is giving/gave a cigarette to Browne.

• Various places of attachment:
  2. Котката е в градината.
     Cat-the is in garden-the
     The cat is in the garden.


Ambiguities (2)

• Neutral vs. focused ordering of constituents:
  3. Онова куче преследваше Браун.
     That dog was-chasing Browne.
     That dog was chasing Browne.

• Proliferation of several competing rules for the same item:
  4 (1). Абрамс даде цигара на Браун.
     Abrams gives/gave cigarette to Browne.
     Abrams is giving/gave a cigarette to Browne.


Implementation

A common target format to represent the important knowledge from both treebanks:
• The analyses from both treebanks are transformed into a new format;
• The knowledge within the new representations is unified on the basis of correspondence rules;
• The parse selection is done on the basis of comparing both unified representations.
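The comparison step can be sketched under strong simplifying assumptions. Each analysis is reduced to a set of word-position pairs, and a candidate parse is kept when it shares the most pairs with the treebank analysis. This overlap score is a hypothetical stand-in; the slides do not specify the actual comparison or the correspondence rules, and the function name is invented.

```python
from typing import FrozenSet, List, Tuple

# A pair of word positions standing for one head-dependent relation (assumed encoding).
Pair = Tuple[int, int]


def select_parses(candidates: List[FrozenSet[Pair]],
                  treebank: FrozenSet[Pair]) -> List[int]:
    """Indices of the candidate parses sharing the most pairs with the treebank.

    Returns several indices when candidates tie on the overlap score.
    """
    if not candidates:
        return []
    scores = [len(candidate & treebank) for candidate in candidates]
    best = max(scores)
    return [i for i, score in enumerate(scores) if score == best]


# Three-word sentence; the treebank analysis links positions 1-2 and 2-3.
treebank_pairs = frozenset({(1, 2), (2, 3)})
grammar_parses = [
    frozenset({(1, 2), (2, 3)}),  # agrees with the treebank on both pairs
    frozenset({(1, 2), (1, 3)}),  # attaches the third word differently
]
selected = select_parses(grammar_parses, treebank_pairs)
```

With these toy inputs only the first candidate is selected, since it matches both treebank pairs.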


New Format

Sentence
w1 w2 w3 … wn

Lexical elements
wk:pos list-of-categories

Head-dependent pair
<wi:posi, wj:posj> list-of-categories


Example: Output Representation

• Lexical elements
  кога (when):1 (ADV)
  лаеше (barked):2 (FINITE-IMPERF-THIRD-SG00476-ORULE)
  кучето (the dog):3 (THIRD_SG_NEUTER_NOUN_IRULE, DEF THIRD_SG_NEUTER_NOUN_ORULE)

• Head-dependent pair
  <кога (when):1 лаеше (barked):2> (MOD-INT-OTHER-PHRASE)
  <лаеше (barked):2 кучето (the dog):3> (HEAD-SUBJ)
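One possible in-memory form of this output is sketched below. The class and method names are hypothetical, the English glosses in parentheses are left out, and only the first two words and the first pair of the example are reproduced.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LexicalElement:
    """A word, its position in the sentence, and its list of categories."""
    word: str
    position: int
    categories: Tuple[str, ...]

    def render(self) -> str:
        return f"{self.word}:{self.position} ({', '.join(self.categories)})"


@dataclass(frozen=True)
class HeadDependentPair:
    """Two lexical elements in a head-dependent relation, with categories."""
    first: LexicalElement
    second: LexicalElement
    categories: Tuple[str, ...]

    def render(self) -> str:
        left = f"{self.first.word}:{self.first.position}"
        right = f"{self.second.word}:{self.second.position}"
        return f"<{left} {right}> ({', '.join(self.categories)})"


koga = LexicalElement("кога", 1, ("ADV",))
laeshe = LexicalElement("лаеше", 2, ("FINITE-IMPERF-THIRD-SG00476-ORULE",))
pair = HeadDependentPair(koga, laeshe, ("MOD-INT-OTHER-PHRASE",))
```

Calling `pair.render()` gives a line in the same shape as the head-dependent entries shown above, minus the glosses.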


Results

• 86 % of the correct analyses produced by BURGER were successfully selected by the discriminating properties extracted from BulTreeBank

• Therefore, our idea of using BulTreeBank as a discriminator of the analyses is justified


Problematic Cases

• Coordination (BulTreeBank accepts only one interpretation among more possible ones)

• Complementation in NP (in BulTreeBank some dependants are viewed as modifiers, while in BURGER – as complements)


Conclusions

The presented annotation scheme is designed to support the building of the core set of sentences

The annotation scheme combines pure linguistic features with metafeatures like recursivity, heaviness, specificity

The scheme is robust and annotator-friendly

The annotated data could be used for selecting the correct analyses when creating a grammar-based version of the treebank


Some References

Kiril Simov, Petya Osenova, Alexander Simov, Milen Kouylekov. Design and Implementation of the Bulgarian HPSG-based Treebank. Special Issue on Treebanks and Linguistic Theories. Research on Language & Computation. Springer Science+Business Media B.V. Volume 2, Number 4.

Petya Osenova. Localizing a Core HPSG-based Grammar for Bulgarian. In: Hanna Hedeland, Thomas Schmidt, Kai Worner (eds.) Multilingual Resources and Multilingual Applications, Proceedings of GSCL 2011, ISSN 0176-599X, Hamburg, pp. 175–180.
