Verb Valency Frame Extraction Using Morphological and Syntactic Features of Croatian

Krešimir Šojat, Željko Agić, Marko Tadić

Department of Linguistics, Department of Information Sciences, Faculty of Humanities and Social Sciences, University of Zagreb

{ksojat, zagic, marko.tadic}@ffzg.hr

FASSBL 7 Conference, Dubrovnik, Croatia

2010-10-05

Overview

What?

extraction and semi-automatic construction of verb valency frames

How?

rule-based extraction procedure run on the Croatian Dependency Treebank

manual assignment of tectogrammatical functors

inference of rules for assigning functors to unseen text

Why?

creation of a treebank-based verb valency lexicon

enhancement and enrichment of existing resources

Valency frames

valency frame extraction means detecting all possible environments of a particular verb as found in the treebank

such an approach aims at fast construction of valency frames

extraction is automatic: no elements of frames are added manually by human annotators

such an automatically acquired verb valency lexicon can serve as a basis for further enrichment and enhancement of manually constructed resources, either existing or constructed from scratch

The treebank

Croatian Dependency Treebank (HOBS)

follows the guidelines of the Prague Dependency Treebank (PDT)

taken from the Croatia Weekly 100 kw sub-corpus of the Croatian National Corpus (HNK)

XCES-encoded up to the word level

sentence-delimited, tokenized, manually lemmatized and MSD-tagged

serves as the morphological layer of the treebank

annotated on the syntactic layer

approximately 2,700 sentences, 67,000 tokens

manually assigned syntactic functions

ca. 1,300 sentences double-checked and used in this experiment

The treebank

HR: Unija je već dogovorila neke mjere kako bi pomogla Hrvatskoj.

EN: The Union has already arranged some measures in order to help Croatia.

Extraction algorithm

the algorithm aims at extraction of verb valency frame instances

for each verb in the treebank sample, it descends:

one level down the dependency tree to retrieve subjects (Sb), objects (Obj), adverbs (Adv) and nominal predicates (Pnom)

two levels down to retrieve the same kinds of tokens when they are introduced by subordinating conjunctions (AuxC) or prepositions (AuxP)
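The descending rules can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the `extract_frame` function and the `->` joining of slots reached through a conjunction or preposition are assumptions made for this sketch. The sample tree reproduces the `imati` frame that appears among the extracted frames in these slides.

```python
from __future__ import annotations
from dataclasses import dataclass, field


# Hypothetical in-memory node; not the treebank's actual storage format.
@dataclass
class Node:
    form: str   # word form
    msd: str    # morphosyntactic tag, e.g. "Ncfsa"
    afun: str   # analytical function, e.g. "Obj"
    children: list[Node] = field(default_factory=list)


ARGUMENT_AFUNS = {"Sb", "Obj", "Adv", "Pnom"}   # retrieved one level down
INTRODUCER_AFUNS = {"AuxC", "AuxP"}             # trigger a second descent


def extract_frame(verb):
    """Return (form, MSD, afun) slots for a single verb node."""
    slots = []
    for child in verb.children:
        if child.afun in ARGUMENT_AFUNS:
            slots.append((child.form, child.msd, child.afun))
        elif child.afun in INTRODUCER_AFUNS:
            # Two levels down: keep the introducer together with its dependent.
            for dep in child.children:
                if dep.afun in ARGUMENT_AFUNS:
                    slots.append((
                        f"{child.form}->{dep.form}",
                        f"{child.msd}->{dep.msd}",
                        f"{child.afun}->{dep.afun}",
                    ))
    return slots


imati = Node("imati", "Vmn", "Obj", [
    Node("stanovništvo", "Ncnsn", "Sb"),
    Node("korist", "Ncfsa", "Obj"),
    Node("od", "Spsg", "AuxP", [Node("projekta", "Ncmsg", "Adv")]),
    Node("kroz", "Spsa", "AuxP", [Node("ekoturizam", "Ncmsa", "Adv")]),
])

frame = extract_frame(imati)
for slot in frame:
    print("[" + " ".join(slot) + "]")
```

Keeping the rule set this small is what the slides describe as preserving precision; anything deeper in the tree is simply ignored by the sketch.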

Extraction algorithm

algorithm illustration

dogovorila (dogovoriti Pred) [Unija Ncfsn Sb] [mjere Ncfpa Obj] [već Rt Adv] [kako Css AuxC]

Extraction algorithm

the first version retrieved predicates only and was later expanded to retrieve all the verbs from the treebank sample

the algorithm was adapted to retrieve any verb found in the dependency structure, regardless of its analytical function and position within the dependency tree

the adaptation is meant to raise the recall of the algorithm while maintaining its precision, since the simple set of descending rules is left unchanged

i.e. to retrieve as many verbs as possible given the limited size of the treebank sample used in the experiment

Extraction algorithm

the verb “imati” (en. to have), tagged Vmn, is annotated as object (Obj)

Extraction algorithm

thus, the number of frames extracted from each sentence corresponds to the number of its verbs:

one frame for the main clause that captures the whole syntactic structure of the sentence

frames extracted from dependent clauses

naglasio (naglasiti Vmps-sma Pred) [Mikuška Np-sn Sb] [kako->imati Css AuxC->Obj]

imati (imati Vmn Obj) [stanovništvo Ncnsn Sb] [korist Ncfsa Obj] [od->projekta Spsg->Ncmsg AuxP->Adv] [kroz->ekoturizam Spsa->Ncmsa AuxP->Adv]
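A hedged sketch of the one-frame-per-verb behaviour: collect every verb node in a sentence, whatever its analytical function, so that each one can be given its own frame. The `Node` class and `collect_verbs` are names invented for this illustration, and recognising verbs by an MSD that starts with `V` is an assumption based on the tags shown above (`Vmps-sma`, `Vmn`). Only part of the sentence tree is built.

```python
from __future__ import annotations
from dataclasses import dataclass, field


# Hypothetical node type used only for this sketch.
@dataclass
class Node:
    form: str
    msd: str
    afun: str
    children: list[Node] = field(default_factory=list)


def collect_verbs(node):
    """Depth-first list of all verb nodes, regardless of afun or depth."""
    found = [node] if node.msd.startswith("V") else []
    for child in node.children:
        found.extend(collect_verbs(child))
    return found


# Partial tree for the example sentence: "imati" hangs under the
# conjunction "kako" and carries the analytical function Obj, not Pred.
sentence = Node("naglasio", "Vmps-sma", "Pred", [
    Node("Mikuška", "Np-sn", "Sb"),
    Node("kako", "Css", "AuxC", [
        Node("imati", "Vmn", "Obj", [
            Node("stanovništvo", "Ncnsn", "Sb"),
            Node("korist", "Ncfsa", "Obj"),
        ]),
    ]),
])

verbs = collect_verbs(sentence)
print([(v.form, v.afun) for v in verbs])
```

A predicate-only version would stop at `naglasio`; walking the whole tree also reaches `imati`, which is how the expanded algorithm yields one frame per verb.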

Functor assignment

in order to annotate verbal frames we used a set of 5 argument functors and 32 free modification functors:

Argument functors: ACT, PAT, ADDR, ORIG, EFF

Temporal functors: TWHEN, TFHL, TFRWH, THL, THO, TOWH, TPAR, TSIN, TTILL

Locative and directional functors: DIR1, DIR2, DIR3, LOC

Functors for causal relations: AIM, CAUS, CNCS, COND, INTT

Functors for expressing manner: ACMP, CPR, CRIT, DIFF, EXT, MANN, MEANS, REG, RESL, RESTR

Functors for specific modifications: BEN, CONTRD, HER, SUBS

936 frame instances were manually annotated for 424 different verbs
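The functor inventory can be written down as plain data, which makes it easy to check manual annotations against it. The sketch below is an illustration only: the group keys and the `unknown_functors` helper are names chosen here, not part of the annotation tooling used in the experiment.

```python
# Functor inventory as listed on this slide.
ARGUMENT_FUNCTORS = ["ACT", "PAT", "ADDR", "ORIG", "EFF"]

FREE_MODIFICATIONS = {
    "temporal": ["TWHEN", "TFHL", "TFRWH", "THL", "THO",
                 "TOWH", "TPAR", "TSIN", "TTILL"],
    "locative_directional": ["DIR1", "DIR2", "DIR3", "LOC"],
    "causal": ["AIM", "CAUS", "CNCS", "COND", "INTT"],
    "manner": ["ACMP", "CPR", "CRIT", "DIFF", "EXT",
               "MANN", "MEANS", "REG", "RESL", "RESTR"],
    "specific": ["BEN", "CONTRD", "HER", "SUBS"],
}

KNOWN = set(ARGUMENT_FUNCTORS) | {
    functor for group in FREE_MODIFICATIONS.values() for functor in group
}


def unknown_functors(frame):
    """Return the labels in a space-separated frame that are not in the inventory."""
    return [label for label in frame.split() if label not in KNOWN]


free_total = sum(len(group) for group in FREE_MODIFICATIONS.values())
print(len(ARGUMENT_FUNCTORS), free_total)      # 5 argument, 32 free modification
print(unknown_functors("ACT ADDR PAT"))        # all labels known
print(unknown_functors("ACT ADDDR PAT"))       # flags the mistyped label
```

A check of this kind is cheap to run over manually annotated frames and catches mistyped labels before they are counted as separate frames.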

Results

valency frame frequency across verb lemmas

Verb        Frequency
biti        188
imati       23
reći        15
dobiti      12
raditi      10
kazati      9
pokazati    8
postati     8
vidjeti     8
dati        7

raditi (en. to work, to do)

Valency frame         Frequency
ACT PAT               2
ACT CRIT LOC THL      1
ACT MANN TWHEN        1
ACT MEANS TWHEN       1
ACT PAT TSIN          1

dati (en. to give)

Valency frame         Frequency
ACT ADDR PAT          4
ACT ADDDR PAT         1
ACT ADDR AIM PAT      1
ACT PAT               1

Results

frequency of verb valency frames, i.e. n-tuples of tectogrammatical functors

Frame             Count   Percent
ACT PAT           250     26.71
PAT*              157     16.77
ACT PAT TWHEN     30      3.21
ACT MANN PAT      23      2.46
ACT ADDR PAT      20      2.14
ACT LOC           20      2.14
ACT LOC PAT       20      2.14
MANN PAT          17      1.82
ACT CAUS PAT      16      1.71
ACT MANN          13      1.39
LOC PAT           12      1.28
ADDR PAT          11      1.18
Other             347     37.07
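The percentages in the table are each frame's share of the 936 annotated instances, e.g. 250 / 936 = 26.71 %. A minimal sketch of how such a table could be computed is given below; `frame_distribution` is a hypothetical helper, the input list is toy data invented for the example, and sorting the functors alphabetically inside a frame is an assumption suggested by how the frames are printed above.

```python
from collections import Counter


def frame_distribution(frames):
    """Return (frame, count, percent) rows, most frequent frame first.

    Each input frame is a list of functors; the functors are sorted so that
    the same set of functors always maps to the same frame label.
    """
    counts = Counter(" ".join(sorted(frame)) for frame in frames)
    total = sum(counts.values())
    return [
        (frame, n, round(100 * n / total, 2))
        for frame, n in counts.most_common()
    ]


# Toy input, NOT the treebank data.
toy_frames = [["PAT", "ACT"], ["ACT", "PAT"], ["ACT", "LOC"], ["PAT"]]
rows = frame_distribution(toy_frames)
for frame, n, pct in rows:
    print(f"{frame:<10}{n:<4}{pct}")

# Arithmetic check against the first row of the table above.
top_share = round(100 * 250 / 936, 2)
print(top_share)
```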

Results

frames annotated with MSD, analytical functions and tectogrammatical functors

djelovati (djeluje Pred)

[neozbiljno Neozbiljno Rnp Adv MANN] [odustajanje odustajanje Ncnsn Sb ACT]

osloboditi (oslobodili Pred)

[ACT] [nikada Nikada Rt Adv THL] [zloduh zloduha Ncmsg Obj PAT]

postati (postali Pred)

[studij studiji Ncmpn Sb ACT] [fakultet fakultet Ncmsn Obj PAT]

zaustaviti (zaustavio Atr)

[ACT] [oni ih Pp3-pa--y-n-- Obj PAT] [dolina u->dolini Spsl->Ncfsl AuxP->Adv LOC]
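The bracketed notation is regular enough to be read back mechanically. The sketch below assumes that a full slot has five fields in the order lemma, form, MSD, analytical function, functor, and that a one-field slot such as `[ACT]` carries a functor with no surface token; both readings are inferred from the examples above rather than from a documented format, and `Slot` and `parse_slots` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Slot(NamedTuple):
    lemma: Optional[str]
    form: Optional[str]
    msd: Optional[str]
    afun: Optional[str]
    functor: str


def parse_slots(line):
    """Parse every [...] group on a line into a Slot."""
    slots = []
    for body in re.findall(r"\[([^\]]*)\]", line):
        fields = body.split()
        if len(fields) == 5:
            slots.append(Slot(*fields))
        elif len(fields) == 1:
            # Functor only: no surface token is given for this slot.
            slots.append(Slot(None, None, None, None, fields[0]))
        else:
            raise ValueError(f"unexpected slot: [{body}]")
    return slots


line = ("[ACT] [oni ih Pp3-pa--y-n-- Obj PAT] "
        "[dolina u->dolini Spsl->Ncfsl AuxP->Adv LOC]")
slots = parse_slots(line)
frame = " ".join(sorted(slot.functor for slot in slots))
print(frame)
```

Joining the sorted functors gives the frame label under which an instance would be counted in a frequency table.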

Results

distribution of (MSD, analytical function) pairs across tectogrammatical functors

serves as a basis for defining functor assignment rules from MSD and analytical function

ACT (Actor)                   PAT (Patient)                     LOC (Locative)
A-fun  MSD            %       A-fun       MSD           %       A-fun       MSD           %
Sb     Ncmsn          14.91   Obj         Ncfsa         11.25   (AuxP) Adv  (Spsl) Ncfsl  21.88
Sb     Np-sn          13.50   Obj         Ncmsa         9.18    (AuxP) Adv  (Spsl) Ncmsl  16.41
Sb     Ncfsn          12.87   Pnom        Ncmsn         5.69    (AuxP) Adv  (Spsl) Npmsl  10.16
Sb     Ncmpn          9.89    Obj         Ncmpa         4.53    (AuxP) Adv  (Spsl) Ncnsl  8.59
Sb     Npfsn          5.65    Obj         Vmn*          4.40    (AuxP) Adv  (Spsl) Npfsl  8.59
Sb     Pi-mpn--n-a--  4.71    Obj         Ncnsa         3.75    (AuxP) Adv  (Spsl) Ncmpl  5.47
Sb     Ncfpn          3.30    Obj         Ncfpa         3.49    (AuxP) Adv  (Spsl) Ncfpl  3.91
Sb     Ncnsn          2.98    Pnom        Ncfsn         2.72    Adv         Rl            3.13
Sb     Pi-msn--n-a--  2.51    (AuxC) Obj  (Css) Vmip3s  2.07    Adv         Css           1.56
Sb     Pi-fsn--n-a--  1.88    Obj         Ncmsn         1.81    (AuxP) Adv  (Spsg) Ncmsg  1.56
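One naive way to turn such a distribution into assignment rules is a majority vote: for each (analytical function, MSD) pair seen in the annotated frames, remember the functor it received most often. This is only a sketch of the idea and not the rule system envisaged by the authors. Note that it looks at the data in the opposite direction from the table, which lists pairs per functor. `learn_rules` and `assign` are hypothetical names, and the training triples are toy data.

```python
from collections import Counter, defaultdict


def learn_rules(annotated):
    """Map each (afun, MSD) pair to its most frequent functor.

    `annotated` is an iterable of (afun, msd, functor) triples.
    """
    counts = defaultdict(Counter)
    for afun, msd, functor in annotated:
        counts[(afun, msd)][functor] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}


def assign(rules, afun, msd, default=None):
    """Look up a functor for an unseen token; fall back to `default`."""
    return rules.get((afun, msd), default)


# Toy annotations, NOT the treebank data.
toy = [
    ("Sb", "Ncmsn", "ACT"),
    ("Sb", "Ncmsn", "ACT"),
    ("Sb", "Ncmsn", "PAT"),
    ("Obj", "Ncfsa", "PAT"),
    ("AuxP->Adv", "Spsl->Ncfsl", "LOC"),
]
rules = learn_rules(toy)
print(assign(rules, "Sb", "Ncmsn"))
print(assign(rules, "Obj", "Ncfsa"))
print(assign(rules, "Atr", "Afpmsn"))   # unseen pair, no rule applies
```

Pairs that never occur in the annotated sample get no rule, so a real system would need a fallback strategy for them.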

Conclusions

in this experiment we have designed and implemented one possible approach:

to semi-automatic extraction of a valency frame lexicon for Croatian verbs

to the refinement of existing lexicons by using the Croatian Dependency Treebank as an underlying resource

we have automatically extracted 2,930 verb valency frame instances and annotated 936 frames, obtaining:

the distribution of valency frames for each of the encountered verbs

the distribution of analytical functions and morphosyntactic tags for each of the tectogrammatical functors

Future work

the first result enables the enrichment of existing valency lexicons, such as CROVALLEX

the second result enables the implementation of a rule-based system for automatic assignment of tectogrammatical functors to morphosyntactically tagged and dependency-parsed unseen text

this procedure of automatic detection of valency frames will also be used in several other projects dealing with factored SMT (e.g. ACCURAT)

regarding dependency parsing of Croatian using the Croatian Dependency Treebank, we shall pursue various research directions in order to increase overall parsing accuracy

Thank you for your attention.

The research within the project ACCURAT leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no 248347.

www.accurat-project.eu
