Verb Valency Frame Extraction Using Morphological and Syntactic Features of
Croatian
Krešimir Šojat, Željko Agić, Marko Tadić
Department of Linguistics, Department of Information Sciences
Faculty of Humanities and Social Sciences, University of Zagreb
{ksojat, zagic, marko.tadic}@ffzg.hr
FASSBL 7 Conference, Dubrovnik, Croatia
2010-10-05
Overview
What?
  extraction and semi-automatic construction of verb valency frames
How?
  rule-based extraction procedure run on the Croatian Dependency Treebank
  manual assignment of tectogrammatical functors
  inference of rules for assigning functors to unseen text
Why?
  creation of a treebank-based verb valency lexicon
  enhancement and enrichment of existing resources
Valency frames
valency frame extraction means detecting all possible environments of a particular verb as found in the treebank
such an approach aims at fast construction of valency frames
extraction is automatic; no frame elements are added manually by human annotators
such an automatically acquired verb valency lexicon can serve as a basis for further enrichment and enhancement of manually constructed resources, either existing or built from scratch
The treebank
Croatian Dependency Treebank (HOBS)
  follows the guidelines of the Prague Dependency Treebank
  taken from the Croatia Weekly 100 kw sub-corpus of the Croatian National Corpus (HNK)
    XCES-encoded up to the word level
    sentence-delimited, tokenized, manually lemmatized and MSD-tagged
    serves as the morphological layer of the treebank
  annotated on the syntactic layer
    approximately 2,700 sentences, 67,000 tokens
    manually assigned syntactic functions
    ca. 1,300 sentences double-checked and used in this experiment
The treebank
HR: Unija je već dogovorila neke mjere kako bi pomogla Hrvatskoj.
EN: The Union has already arranged some measures in order to help Croatia.
Extraction algorithm
the algorithm aims at extracting verb valency frame instances
for each verb in the treebank sample, it descends
  one level down the dependency tree to retrieve subjects (Sb), objects (Obj), adverbials (Adv) and nominal predicates (Pnom)
  two levels down to retrieve tokens from the previous step that are introduced by subordinate conjunctions (AuxC) or prepositions (AuxP)
Extraction algorithm
algorithm illustration
dogovorila (dogovoriti Pred) [Unija Ncfsn Sb] [mjere Ncfpa Obj] [već Rt Adv] [kako Css AuxC]
Extraction algorithm
the first version retrieved predicates only and was expanded to retrieve all the verbs from the treebank sample
  the algorithm was adapted to retrieve any verb found in the dependency structure, regardless of its analytical function and position within the dependency tree
  the adaptation was implemented in order to raise the recall of the algorithm while maintaining its precision, since the simple set of descending rules is left unchanged
  i.e. to retrieve as many verbs as possible given the limited size of the treebank sample used in the experiment
Extraction algorithm
the verb “imati” (Vmn) is annotated as object (Obj)
Extraction algorithm
thus, the number of frames extracted from each sentence corresponds to the number of verbs in it:
  one frame for the main clause that captures the whole syntactic structure of the sentence
  frames extracted from dependent clauses
naglasio (naglasiti Vmps-sma Pred) [Mikuška Np-sn Sb] [kako->imati Css AuxC->Obj]
imati (imati Vmn Obj) [stanovništvo Ncnsn Sb] [korist Ncfsa Obj] [od->projekta Spsg->Ncmsg AuxP->Adv] [kroz->ekoturizam Spsa->Ncmsa AuxP->Adv]
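The descending rules and the one-frame-per-verb behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Token` class, the function names and the head indices are assumptions reconstructed from the bracket notation of the example, the token order is arbitrary rather than taken from the treebank, and verbs are assumed to be identifiable by an MSD tag starting with `V`.

```python
from dataclasses import dataclass

@dataclass
class Token:
    idx: int        # arbitrary identifier within the sentence (not the real word order)
    form: str
    msd: str        # morphosyntactic description tag
    afun: str       # analytical function
    head: int       # idx of the governing token, 0 for the root
    lemma: str = ""

ARGUMENT_AFUNS = {"Sb", "Obj", "Adv", "Pnom"}   # retrieved one level down
CONNECTOR_AFUNS = {"AuxC", "AuxP"}              # trigger a second level of descent

def children(tokens, head_idx):
    return [t for t in tokens if t.head == head_idx]

def extract_frame(tokens, verb):
    """Collect (form, MSD, afun) triples for the dependents of one verb."""
    frame = []
    for child in children(tokens, verb.idx):
        if child.afun in ARGUMENT_AFUNS:
            frame.append((child.form, child.msd, child.afun))
        elif child.afun in CONNECTOR_AFUNS:
            # Two levels down: the dependent introduced by a conjunction or preposition.
            for grandchild in children(tokens, child.idx):
                if grandchild.afun in ARGUMENT_AFUNS:
                    frame.append((f"{child.form}->{grandchild.form}",
                                  f"{child.msd}->{grandchild.msd}",
                                  f"{child.afun}->{grandchild.afun}"))
    return frame

def extract_all_frames(tokens):
    """One frame per verb, regardless of the verb's own analytical function."""
    return [(t.lemma, extract_frame(tokens, t)) for t in tokens if t.msd.startswith("V")]

# Tree reconstructed from the example frames; head links are an assumption.
SENTENCE = [
    Token(1, "naglasio", "Vmps-sma", "Pred", 0, "naglasiti"),
    Token(2, "Mikuška", "Np-sn", "Sb", 1),
    Token(3, "kako", "Css", "AuxC", 1),
    Token(4, "imati", "Vmn", "Obj", 3, "imati"),
    Token(5, "stanovništvo", "Ncnsn", "Sb", 4),
    Token(6, "korist", "Ncfsa", "Obj", 4),
    Token(7, "od", "Spsg", "AuxP", 4),
    Token(8, "projekta", "Ncmsg", "Adv", 7),
    Token(9, "kroz", "Spsa", "AuxP", 4),
    Token(10, "ekoturizam", "Ncmsa", "Adv", 9),
]

for lemma, frame in extract_all_frames(SENTENCE):
    print(lemma, frame)
```

Because the verb test looks only at the MSD tag, the infinitive “imati” yields its own frame even though it is annotated as Obj rather than Pred, so the sketch returns two frames for two verbs.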
Functor assignment
in order to annotate verbal frames we used a set of 5 argument functors and 32 free-modification functors:
  argument functors: ACT, PAT, ADDR, ORIG, EFF
  temporal functors: TWHEN, TFHL, TFRWH, THL, THO, TOWH, TPAR, TSIN, TTILL
  locative and directional functors: DIR1, DIR2, DIR3, LOC
  functors for causal relations: AIM, CAUS, CNCS, COND, INTT
  functors for expressing manner: ACMP, CPR, CRIT, DIFF, EXT, MANN, MEANS, REG, RESL, RESTR
  functors for specific modifications: BEN, CONTRD, HER, SUBS
936 frame instances were manually annotated for 424 different verbs
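The functor inventory listed above could be kept as a small lookup structure so that manually assigned labels can be checked against it. The sketch below is illustrative only: the group keys and the helper names `functor_group` and `unknown_functors` are hypothetical, while the functor labels themselves are copied from the slide.

```python
# Functor inventory as listed on the slide; the grouping keys are illustrative names.
ARGUMENT_FUNCTORS = ("ACT", "PAT", "ADDR", "ORIG", "EFF")

FREE_MODIFICATION_FUNCTORS = {
    "temporal": ("TWHEN", "TFHL", "TFRWH", "THL", "THO", "TOWH", "TPAR", "TSIN", "TTILL"),
    "locative_directional": ("DIR1", "DIR2", "DIR3", "LOC"),
    "causal": ("AIM", "CAUS", "CNCS", "COND", "INTT"),
    "manner": ("ACMP", "CPR", "CRIT", "DIFF", "EXT", "MANN", "MEANS", "REG", "RESL", "RESTR"),
    "specific": ("BEN", "CONTRD", "HER", "SUBS"),
}

ALL_FUNCTORS = set(ARGUMENT_FUNCTORS).union(*FREE_MODIFICATION_FUNCTORS.values())

def functor_group(functor):
    """Return the group name of a functor, or None if it is not in the inventory."""
    if functor in ARGUMENT_FUNCTORS:
        return "argument"
    for group, members in FREE_MODIFICATION_FUNCTORS.items():
        if functor in members:
            return group
    return None

def unknown_functors(frame):
    """List the labels in a frame that are missing from the inventory."""
    return [f for f in frame if f not in ALL_FUNCTORS]

print(len(ARGUMENT_FUNCTORS), sum(len(m) for m in FREE_MODIFICATION_FUNCTORS.values()))
print(unknown_functors(["ACT", "ADDDR", "PAT"]))
```

A check of this kind flags any label that falls outside the 5 + 32 inventory before frame frequencies are counted.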
Results
valency frame frequency across verb lemmas

Verb        Frequency
biti        188
imati       23
reći        15
dobiti      12
raditi      10
kazati      9
pokazati    8
postati     8
vidjeti     8
dati        7

raditi (en. to work, to do)
Valency frame        Frequency
ACT PAT              2
ACT CRIT LOC THL     1
ACT MANN TWHEN       1
ACT MEANS TWHEN      1
ACT PAT TSIN         1

dati (en. to give)
Valency frame        Frequency
ACT ADDR PAT         4
ACT ADDDR PAT        1
ACT ADDR AIM PAT     1
ACT PAT              1
Results
frequency of verb valency frames, i.e. n-tuples of tectogrammatical functors

Frame             Count    Percent
ACT PAT           250      26.71
PAT*              157      16.77
ACT PAT TWHEN     30       3.21
ACT MANN PAT      23       2.46
ACT ADDR PAT      20       2.14
ACT LOC           20       2.14
ACT LOC PAT       20       2.14
MANN PAT          17       1.82
ACT CAUS PAT      16       1.71
ACT MANN          13       1.39
LOC PAT           12       1.28
ADDR PAT          11       1.18
Other             347      37.07
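The counts and percentages in the table above are consistent with a total of 936 annotated frames (for example, 250 / 936 ≈ 26.71 %). A minimal sketch of how such a distribution might be computed is given below. It assumes that a frame is treated as an order-independent set of functors written in alphabetical order, which matches the frames shown on the slides but is not stated by the authors; the function names are hypothetical, and the sample input reproduces the frames reported for “dati”.

```python
from collections import Counter

def percent(count, total):
    """Share of `count` in `total`, in percent, rounded to two decimals."""
    return round(100 * count / total, 2)

def frame_key(functors):
    # Assumption: functor order inside a frame is irrelevant, so sort alphabetically.
    return " ".join(sorted(functors))

def frame_distribution(annotated_frames):
    """Return (frame, count, percent) triples, most frequent frame first."""
    counts = Counter(frame_key(f) for f in annotated_frames)
    total = sum(counts.values())
    return [(frame, n, percent(n, total)) for frame, n in counts.most_common()]

# Frames reported for "dati" on the slide, given here in arbitrary functor order.
DATI_FRAMES = (
    [["PAT", "ACT", "ADDR"]] * 4
    + [["ACT", "ADDDR", "PAT"], ["ACT", "ADDR", "AIM", "PAT"], ["PAT", "ACT"]]
)

for row in frame_distribution(DATI_FRAMES):
    print(row)
print(percent(250, 936))
```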
Results
frames annotated with MSD, analytical functions and tectogrammatical functors
djelovati (djeluje Pred)
[ neozbiljno Neozbiljno Rnp Adv MANN ] [ odustajanje odustajanje Ncnsn Sb ACT ]
osloboditi (oslobodili Pred)
[ ACT ] [ nikada Nikada Rt Adv THL ] [ zloduh zloduha Ncmsg Obj PAT ]
postati (postali Pred)
[ studij studiji Ncmpn Sb ACT ] [ fakultet fakultet Ncmsn Obj PAT ]
zaustaviti (zaustavio Atr)
[ ACT ] [ oni ih Pp3-pa--y-n-- Obj PAT ] [ dolina u->dolini Spsl->Ncfsl AuxP->Adv LOC ]
Results
distribution of (MSD, analytical function) pairs across tectogrammatical functors
serves as a basis for defining functor assignment rules from MSD and analytical function

ACT (Actor)
A-fun          MSD               %
Sb             Ncmsn             14.91
Sb             Np-sn             13.50
Sb             Ncfsn             12.87
Sb             Ncmpn             9.89
Sb             Npfsn             5.65
Sb             Pi-mpn--n-a--     4.71
Sb             Ncfpn             3.30
Sb             Ncnsn             2.98
Sb             Pi-msn--n-a--     2.51
Sb             Pi-fsn--n-a--     1.88

PAT (Patient)
A-fun          MSD               %
Obj            Ncfsa             11.25
Obj            Ncmsa             9.18
Pnom           Ncmsn             5.69
Obj            Ncmpa             4.53
Obj            Vmn*              4.40
Obj            Ncnsa             3.75
Obj            Ncfpa             3.49
Pnom           Ncfsn             2.72
(AuxC) Obj     (Css) Vmip3s      2.07
Obj            Ncmsn             1.81

LOC (Locative)
A-fun          MSD               %
(AuxP) Adv     (Spsl) Ncfsl      21.88
(AuxP) Adv     (Spsl) Ncmsl      16.41
(AuxP) Adv     (Spsl) Npmsl      10.16
(AuxP) Adv     (Spsl) Ncnsl      8.59
(AuxP) Adv     (Spsl) Npfsl      8.59
(AuxP) Adv     (Spsl) Ncmpl      5.47
(AuxP) Adv     (Spsl) Ncfpl      3.91
Adv            Rl                3.13
Adv            Css               1.56
(AuxP) Adv     (Spsg) Ncmsg      1.56
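One conceivable way to turn such a distribution into functor assignment rules is to count how often each (analytical function, MSD) pair co-occurs with each functor and to pick the most frequent functor per pair. The sketch below illustrates that idea only; it is not the authors' rule set, the function names are hypothetical, and the training triples are limited to the annotated frame elements shown on the earlier slide.

```python
from collections import Counter, defaultdict

def learn_rules(annotated):
    """Map each (afun, MSD) pair to the functor it most often carries.

    `annotated` is an iterable of (afun, msd, functor) triples.
    """
    table = defaultdict(Counter)
    for afun, msd, functor in annotated:
        table[(afun, msd)][functor] += 1
    return {pair: counts.most_common(1)[0][0] for pair, counts in table.items()}

def assign_functor(rules, afun, msd, default=None):
    """Look up the functor for an unseen token; fall back to `default` if no rule fits."""
    return rules.get((afun, msd), default)

# Annotated frame elements taken from the example frames on the slides.
TRAINING = [
    ("Adv", "Rnp", "MANN"),
    ("Sb", "Ncnsn", "ACT"),
    ("Adv", "Rt", "THL"),
    ("Obj", "Ncmsg", "PAT"),
    ("Sb", "Ncmpn", "ACT"),
    ("Obj", "Ncmsn", "PAT"),
    ("Obj", "Pp3-pa--y-n--", "PAT"),
    ("AuxP->Adv", "Spsl->Ncfsl", "LOC"),
]

rules = learn_rules(TRAINING)
print(assign_functor(rules, "Sb", "Ncmpn"))
print(assign_functor(rules, "AuxP->Adv", "Spsl->Ncfsl"))
print(assign_functor(rules, "Obj", "Ncfsa"))
```

With only eight training triples most pairs remain uncovered, so a real system would presumably need the full 936 annotated frames and some back-off strategy for unseen pairs.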
Conclusions
in this experiment we have designed and implemented one possible approach
  to semi-automatic extraction of a valency frame lexicon for Croatian verbs
  to the refinement of existing lexicons by using the Croatian Dependency Treebank as an underlying resource
we have automatically extracted 2,930 verb valency frame instances and annotated 936 frames, which yielded two results:
  the distribution of valency frames for each of the encountered verbs
  the distribution of analytical functions and morphosyntactic tags for each of the tectogrammatical functors
Future work
the first result enables the enrichment of existing valency lexicons, such as CROVALLEX
the second result enables the implementation of a rule-based system for automatic assignment of tectogrammatical functors to morphosyntactically tagged and dependency-parsed unseen text
this procedure for automatic detection of valency frames will also be used in several other projects dealing with factored SMT (e.g. ACCURAT)
regarding dependency parsing of Croatian using the Croatian Dependency Treebank, we shall pursue various research directions in order to increase overall parsing accuracy
Thank you for your attention.
The research within the project ACCURAT leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no. 248347.
www.accurat-project.eu