RA – ICS UAIC Romanian Semantic Role Resource Diana Trandabăţ 1,3 and Maria Husarciuc 1,2 1 Faculty of Computer Science, “Al. I. Cuza” University of Iaşi, Romania 2 Faculty of Letters, “Al. I. Cuza” University of Iaşi, Romania 3 Institute for Computer Science, Romanian Academy

Upload: rachel-gaines

Post on 11-Jan-2016



Romanian Semantic Role Resource

Diana Trandabăţ 1,3 and Maria Husarciuc 1,2

1 Faculty of Computer Science, “Al. I. Cuza” University of Iaşi, Romania
2 Faculty of Letters, “Al. I. Cuza” University of Iaşi, Romania
3 Institute for Computer Science, Romanian Academy

Motivation

Annotated language resources have become a must in natural language processing, for:
o supervised learning: training and evaluation
o unsupervised learning: evaluation
o hand-crafted systems: evaluation.

Quality control is an important issue, since annotations, in order to be used as gold standard for evaluation, need to be very accurate.

What if we have short deadlines and limited human and financial resources?

A good solution would be to take existing language resources (built with considerable effort for a specific language) and import them into a new language.

Predication

Predicational word – a word that demands a specific argument structure in order to express its sense.

Each predicational word has:
o Arguments:
   Verbs with one argument: John leaves.
   Verbs with two arguments: John reads a book.
   Verbs with three arguments: John gives a book to Mary.
o Adjuncts: John leaves New York.

Predication

Besides verbs, there are also predicational nouns (also called nominalizations) and predicational adjectives:

o predicational verbs: John wrote the paper on time.
o predicational nouns: John’s writing of the paper was difficult.
o predicational adjectives: The paper written by John was the best one.

Case Grammar

Language representation:
o Surface Structure (the syntactic knowledge)
o Deep Structure (the semantic knowledge).

Case Roles (Ch. Fillmore): representations, at a semantic level, of the lexical arguments.

Examples (the role bearer is in brackets):
o AGENT: [Columbus] discovered America.
o PATIENT: Columbus discovered [America].
o INSTRUMENT: The window was broken [by the storm].
o Temporal LOCALISATION: They dined [at 5 a.m.]
o Spatial LOCALISATION: John goes [to London].
o etc.

Semantic Frames Databases

o FrameNet: http://framenet.icsi.berkeley.edu/
   135,000 examples from the British National Corpus
   10,000 lexical units over 800 semantic frames
o PropBank: http://www.cs.rochester.edu/~gildea/PropBank/Sort
   Semantic annotation for the Penn Treebank
o Salsa: http://www.coli.uni-saarland.de/projects/salsa
o VerbNet: http://verbs.colorado.edu/~kipper/verbnet.html

FrameNet

FrameNet is a lexicographic research project developed at ICSI in Berkeley, California, which produced a lexicon containing very detailed information about English predicational words (verbs, nouns and adjectives).

A frame structure consists of:
o a definition;
o a set of frame elements (FEs, i.e. semantic roles): valences for a target predicational word.
   core frame elements: mandatory for the lexico-semantic realization of the verb > arguments
   non-core frame elements: optional > adjuncts
o a set of lexical units (LUs): the predicational words to which the combinatory properties (the semantic frame) apply.

FrameNet: frame example for “sell”

Frame elements (semantic roles):
o Core FEs: Buyer, Seller, Goods
o Non-core FEs: Duration, Manner, Means, Money, Place, Purpose, Rate, Reason, Time, Unit

Lexical units:
o Verbs: retail, sell, vend
o Nouns: retailer, vendor.

Example: [He]Seller will probably [sell]Target [her]Buyer [the book]Goods [for $15]Money.
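
As a rough illustration, the frame description above can be held in a small data structure. The sketch below is in Python and is only one possible representation: the class name `Frame`, its field names, the method `role_status` and the frame name "sell" are hypothetical and are not taken from the FrameNet distribution.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A semantic frame: core and non-core frame elements plus lexical units."""
    name: str
    core_fes: set = field(default_factory=set)
    non_core_fes: set = field(default_factory=set)
    lexical_units: dict = field(default_factory=dict)  # part of speech -> lemmas

    def role_status(self, role):
        """Return 'core', 'non-core', or None if the role is not in this frame."""
        if role in self.core_fes:
            return "core"
        if role in self.non_core_fes:
            return "non-core"
        return None


# Content copied from the slide; only the container is an assumption.
sell = Frame(
    name="sell",
    core_fes={"Buyer", "Seller", "Goods"},
    non_core_fes={"Duration", "Manner", "Means", "Money", "Place",
                  "Purpose", "Rate", "Reason", "Time", "Unit"},
    lexical_units={"verb": ["retail", "sell", "vend"],
                   "noun": ["retailer", "vendor"]},
)

# The annotated example sentence as (text span, role) pairs, Target excluded.
example = [("He", "Seller"), ("her", "Buyer"),
           ("the book", "Goods"), ("for $15", "Money")]
statuses = {role: sell.role_status(role) for _, role in example}
```

In this sentence the Seller, Buyer and Goods come out as core roles, while Money is a non-core role.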


Semantic Role Resource: Building from Scratch or Importing?

Semantic Role Resource: Building from Scratch or Importing?
Annotation of a new corpus

Assume that we already have the corpus, the schema, the software and two very well trained annotators with good knowledge of semantic frames, so that we only need to worry about the annotation process itself.

Our tests revealed that a person can annotate an average of 30 medium-sized sentences per hour. For a target of 100,000 sentences, we computed around 3500 hours, i.e. 20 months, assuming a working time of 8 hours a day, 5 days a week.

The main problem with this approach is the lack of a definitive list of possible semantic roles. Different annotators can therefore give different names (agent, seller or vendor, for instance) to the same role, which distorts the corpus quality metrics.
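
The 20-month estimate can be re-derived with a few lines of Python. This is only a back-of-the-envelope check: the corpus size, the annotation speed and the working hours come from the slide, while the average month length of 52/12 weeks is an assumption.

```python
SENTENCES = 100_000        # target corpus size (from the slide)
RATE_PER_HOUR = 30         # measured annotation speed (from the slide)
HOURS_PER_DAY = 8
DAYS_PER_WEEK = 5
WEEKS_PER_MONTH = 52 / 12  # assumption: average month length in weeks


def months_needed(hours):
    """Convert working hours into calendar months for one full-time annotator."""
    return hours / HOURS_PER_DAY / DAYS_PER_WEEK / WEEKS_PER_MONTH


exact_hours = SENTENCES / RATE_PER_HOUR  # about 3333 hours before rounding up
rounded_hours = 3500                     # the rounded figure quoted on the slide

print(round(exact_hours), round(months_needed(exact_hours), 1))
print(rounded_hours, round(months_needed(rounded_hours), 1))
```

Under these assumptions both figures land between 19 and 21 months per annotator, which is consistent with the estimate given on the slide.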

Semantic Role Resource: Building from Scratch or Importing?
Import of the annotation

For the import method, the most time-consuming task is the translation.

A professional translator can translate up to 40-50 sentences an hour, and even more if a translation memory is used.

But the real gain is that the corpus can be split among several translators (who are cheaper and easier to find than semantic annotators).

After the automatic alignment and import, a single annotator is needed to validate the created corpus, focusing on the cases where the alignment was not 1:1 (~ 15% of the total number of sentences).
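
A comparable rough estimate can be made for the import route. The sketch below is illustrative only: the function name and parameters are hypothetical, the translation speed is the midpoint of the 40-50 range above, and the number of translators (4) and the validation speed (30 sentences per hour, borrowed from the manual annotation speed) are assumptions that the slides do not state.

```python
def import_hours(sentences, translators, sentences_per_hour=45,
                 not_one_to_one=0.15, validation_per_hour=30):
    """Rough wall-clock estimate, in hours, for the import route.

    sentences_per_hour: midpoint of the 40-50 range quoted above.
    validation_per_hour: ASSUMPTION, reuses the manual annotation speed.
    """
    translation_total = sentences / sentences_per_hour
    # Translators work in parallel, so wall-clock time is divided among them.
    translation_wall_clock = translation_total / translators
    # A single annotator checks only the sentences whose alignment is not 1:1.
    validation = sentences * not_one_to_one / validation_per_hour
    return translation_wall_clock, validation


# ASSUMPTION: four translators share the 100,000 sentences.
translation, validation = import_hours(100_000, translators=4)
print(round(translation), round(validation))
```

With these assumed values, translation and validation each take a few hundred hours, well below the roughly 3500 hours per annotator of the from-scratch route.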


Towards a Romanian Semantic Frames database

Considering these calculations, the fact that we did not have two annotators available to work for 20 months solely on semantic annotation, and the belief that, once the import program exists, other languages could use it to transfer the annotations for themselves, we created a Romanian FrameNet based on the English annotation.

The intuition:
o Most of the frames defined in the English FrameNet are likely to be valid cross-linguistically.
o Semantic frames express conceptual structures, which are language independent at the deep structure level.
o The surface realization follows the syntactic constraints of each language.


Steps towards a parallel Romanian/English FrameNet

o manual translation, by professional translators, of 1094 sentences from the English FrameNet: 110 randomly selected sentences and the Event frame;

o word-level alignment of the Romanian sentences with the English ones, using the aligner developed by the Institute of Research in Artificial Intelligence;

o automatic import of the English annotation, followed by a manual verification to detect the mismatching cases;

o an optimization process which, based on inference rules, corrects the automatic annotation.


EUROLAN 2005 Summer School

Automatic import

[Diagram: English annotated FrameNet files > Translation > Romanian translated sentences collection > Word level alignment between EN and RO files > Frame Elements import > Romanian annotated sentences]


Automatic import

The algorithm:

o reading the English XML files and the alignment files;

o labeling each English word with the corresponding semantic role (FE), converting the character offsets into a word-level annotation;

o mapping the English words to their aligned Romanian counterparts;

o writing an output XML file containing the Romanian annotated corpus.
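
The middle two steps of the algorithm can be sketched in Python on a single sentence pair. This is a minimal sketch under stated assumptions, not the authors' import program: the function names, the whitespace tokenization and the hand-written word alignment `ALIGNMENT` are hypothetical, and XML reading and writing are left out. The sentence pair and the English character offsets are the ones used in this presentation.

```python
import unicodedata


def token_spans(text):
    """Split on single spaces; return inclusive (start, end) offsets per token."""
    spans, pos = [], 0
    for tok in text.split(" "):
        spans.append((pos, pos + len(tok) - 1))
        pos += len(tok) + 1
    return spans


def import_labels(en_text, ro_text, en_labels, alignment):
    """Project character-level English labels onto the Romanian sentence."""
    en_spans, ro_spans = token_spans(en_text), token_spans(ro_text)
    ro_labels = {}
    for name, (start, end) in en_labels.items():
        # Character offsets -> indexes of the English words inside the label.
        en_idx = [i for i, (s, e) in enumerate(en_spans) if s >= start and e <= end]
        # English words -> aligned Romanian words (unaligned words are skipped).
        ro_idx = sorted({j for i in en_idx for j in alignment.get(i, [])})
        if ro_idx:
            ro_labels[name] = (ro_spans[ro_idx[0]][0], ro_spans[ro_idx[-1]][1])
    return ro_labels


EN = ("The incident occurred after a dispute between the man and staff "
      "at a branch of the Bank of Ireland in Cahir .")
RO = unicodedata.normalize(
    "NFC",
    "Incidentul a apărut după o dispută între individ şi personal "
    "la o filială a Băncii Irlandeze din Cahir .")
EN_LABELS = {"Event": (0, 11), "Target": (13, 20),
             "Time": (22, 62), "Place": (64, 106)}
# HYPOTHETICAL hand-made alignment: English word index -> Romanian word indexes.
ALIGNMENT = {0: [0], 1: [0], 2: [1, 2], 3: [3], 4: [4], 5: [5], 6: [6],
             7: [7], 8: [7], 9: [8], 10: [9], 11: [10], 12: [11], 13: [12],
             14: [13], 16: [14], 18: [15], 19: [16], 20: [17], 21: [18]}

ro_labels = import_labels(EN, RO, EN_LABELS, ALIGNMENT)
```

The sketch takes the smallest contiguous Romanian span that covers all aligned words of a label. That is a simplification: it cannot represent discontinuous frame elements.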


English semantic roles

<annotationSet ID="1052804" status="MANUAL">
  <layers>
    <layer ID="6375447" name="FE">
      <labels>
        <label name="Event" start="0" end="11" />
        <label name="Time" start="22" end="62" />
        <label name="Place" start="64" end="106" />
      </labels>
    </layer>
    <layer ID="6375452" name="Target">
      <labels>
        <label name="Target" start="13" end="20" />
      </labels>
    </layer>
    <layer ID="6375453" name="Verb" />
  </layers>
  <sentence ID="797186" aPos="103724676">
    <text>The incident occurred after a dispute between the man and staff at a branch of the Bank of Ireland in Cahir . </text>
  </sentence>
</annotationSet>
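
Reading such an annotation set takes only a few lines with Python's standard `xml.etree.ElementTree` module. The sketch below is illustrative: the function name `labelled_spans` is hypothetical, the empty Verb layer is omitted for brevity, and the `end` offset is treated as inclusive, which is what the offsets in this example imply.

```python
import xml.etree.ElementTree as ET

XML = """<annotationSet ID="1052804" status="MANUAL">
  <layers>
    <layer ID="6375447" name="FE">
      <labels>
        <label name="Event" start="0" end="11" />
        <label name="Time" start="22" end="62" />
        <label name="Place" start="64" end="106" />
      </labels>
    </layer>
    <layer ID="6375452" name="Target">
      <labels>
        <label name="Target" start="13" end="20" />
      </labels>
    </layer>
  </layers>
  <sentence ID="797186" aPos="103724676">
    <text>The incident occurred after a dispute between the man and staff at a branch of the Bank of Ireland in Cahir . </text>
  </sentence>
</annotationSet>"""


def labelled_spans(xml_string):
    """Map each label name to the stretch of sentence text it covers."""
    root = ET.fromstring(xml_string)
    text = root.find("sentence/text").text
    spans = {}
    for label in root.iter("label"):
        start, end = int(label.get("start")), int(label.get("end"))
        spans[label.get("name")] = text[start:end + 1]  # end offset is inclusive
    return spans


spans = labelled_spans(XML)
for name, covered in spans.items():
    print(f"{name}: {covered}")
```

A label that is double annotated would need a list of values per span rather than this simple name-to-text dictionary.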


Romanian semantic roles

<annotationSet ID="1" status="AUTOMATIC">
  <layers>
    <layer ID="6375447" name="FE">
      <labels>
        <label name="Event" start="0" end="9" />
        <label name="Time" start="20" end="59" />
        <label name="Place" start="61" end="101" />
      </labels>
    </layer>
    <layer ID="6375452" name="Target">
      <labels>
        <label name="Target" start="11" end="18" />
      </labels>
    </layer>
    <layer ID="6375453" name="Verb" />
  </layers>
  <sentence ID="671" aPos="103724676">
    <text>Incidentul a apărut după o dispută între individ şi personal la o filială a Băncii Irlandeze din Cahir . </text>
  </sentence>
</annotationSet>


Optimization of the obtained Romanian database

Translations:
o carried out by professional translators to minimize errors;
o problems arise mainly from the lack of context in the English sentences;
o however, if the English semantic frame is taken into account, this problem is surmountable.

Alignment:
o performed with the aligner developed by the Institute of Research in Artificial Intelligence, which is reported to have a precision of 87.17% and a recall of 70.25%;
o however, the aligner results were manually validated before being passed to the annotation import program.
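
For readers who prefer a single figure, the precision and recall quoted above can be combined into the usual balanced F-measure. The value computed below is derived here for illustration and is not reported in the presentation; the function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of the aligner, as quoted on the slide.
f1 = f_measure(0.8717, 0.7025)
print(f"F1 = {f1:.4f}")
```

The result is roughly 0.78. The lower recall means that a fair share of word links is missed, which is one reason why manual validation of the alignment matters before the import.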


Optimization of the obtained Romanian database

The correctness of the resulting Romanian corpus is assessed manually.

The first results of the annotation import show an overall accuracy of approximately 83%.

The validation focuses on detecting the cases where the import has failed, and on determining whether the problems are due to the translation or to the semantic or syntactic specificities of Romanian.

Import difficulties:
o double annotation;
o the existence of imbricated (nested) frame elements;
o unexpressed semantic frames;
o the lack of total correspondence between English and Romanian frames.

Double annotation

Double annotation applies only to non-core frame elements, because the same phrase can refer to several circumstances (peripheral roles) of an event.

When a frame element is double annotated in English, the same generally holds for Romanian.

The most frequent case of double annotation involves the Time/Cause roles, since almost any temporal specification involves a cause and/or a goal.

[The incident]Event OCCURRED [after a dispute between the man and staff]Time/Cause [at a branch of the Bank of Ireland in Cahir]Place

[Incidentul]Event A APĂRUT [după o dispută între individ şi personal]Time/Cause [la o filială a Băncii Irlandeze din Cahir]Place.

Imbrications

A word can be part of two semantic elements without being double annotated.

Imbrication (nesting) is common in the English annotations, mainly in possessive noun phrases. It does not occur in Romanian.

Even if there is no absolute correspondence for the whole BodyPart FE from English into Romanian, the noun mâna (hand) is correctly annotated in Romanian as representing the BodyPart frame element.

[When she got over the stroke]Time/Cause [she]Exp fell and BROKE [[her]Exp hand]BodyPart.

[Când şi-a revenit după atac]Time/Cause , a căzut şi [şi]Exp-a RUPT [mâna]BodyPart .
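
The difference between double annotation (one span, two roles) and imbrication (one span nested inside another) can be checked mechanically on character offsets. The Python sketch below is only an illustration: the function names are hypothetical, offsets are computed with `str.index` rather than read from FrameNet files, and the first [she]Exp label of the English sentence is left out to keep the example short.

```python
def classify_overlaps(labels):
    """labels: list of (role, start, end) with inclusive character offsets.

    Returns (double, nested): pairs of roles sharing exactly the same span,
    and (inner, outer) pairs where one span lies inside a larger one.
    """
    double, nested = [], []
    for i, (r1, s1, e1) in enumerate(labels):
        for r2, s2, e2 in labels[i + 1:]:
            if (s1, e1) == (s2, e2):
                double.append((r1, r2))
            elif s1 <= s2 and e2 <= e1:
                nested.append((r2, r1))
            elif s2 <= s1 and e1 <= e2:
                nested.append((r1, r2))
    return double, nested


def span(text, phrase, role):
    """Build a label from the first occurrence of phrase in text."""
    start = text.index(phrase)
    return (role, start, start + len(phrase) - 1)


en = "When she got over the stroke she fell and broke her hand ."
labels = [
    span(en, "When she got over the stroke", "Time"),
    span(en, "When she got over the stroke", "Cause"),
    span(en, "her hand", "BodyPart"),
    span(en, "her", "Exp"),
]
double, nested = classify_overlaps(labels)
print(double, nested)
```

On the English sentence this reports Time/Cause as a double annotation and Exp as nested inside BodyPart. On the Romanian translation, where the labels are successive, both lists would be expected to contain only the Time/Cause pair.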

Imbrications

The import of the annotation also works when the Romanian target word is a gerund followed by a reflexive pronoun and a noun phrase:

[Josef Jakobs]Prot landed in a potato field in North Stifford , Essex , falling heavily and BREAKING [[his]Prot ankle]BodyP .

[Josef Jakobs]Prot a aterizat într-un câmp de cartofi în North Stifford , Essex , căzând greu şi RUPÂNDU-[şi]Prot [glezna]BodyP .

Although the Romanian sentence is apparently similar to the English structure, its frame elements are not imbricated but successive, since the head (regent) of the pronoun şi is not the noun glezna (ankle) but the gerund.

Unexpressed Semantic Frames

An FE can be expressed in English but implicit in Romanian, or vice versa. While the first case poses no problem for the transfer, the second requires adding roles that are unexpressed in English.

[Blood]Undergoer had CONGEALED [thickly]Manner [on the end of the smashed fibula]Place .
[Sângele]Undergoer se ÎNGROŞĂ [spre capătul fibulei zdrobite]Place .

QUIT [smoking]Process .
LĂSAŢI-[vă]Protagonist [de fumat]Process .


The lack of total correspondence between frames

In the English FrameNet, similar sentences can serve as examples for different, related frames.

The relation between the Communication and Contacting frames is illustrated by two sentences that are apparently semantically equivalent:

Contacting frame: I e-mailed him my new phone number.
Communication frame: I communicated my new phone number to him by e-mail.

Both sentences receive the same Romanian translation, because Romanian has no simple verb corresponding to e-mail:

Communication frame: I-am trimis prin e-mail noul meu număr de telefon.


Conclusions and Further work

The import method was preferred to the ‘classical’ creation by hand of a manually annotated corpus because it can be automated. We are currently investigating the possibility of using a translation engine for the most time-consuming task, namely the translation of the English sentences.

The resulting resource can also be used to verify the syntactic annotation.

Besides frame elements, FrameNet comes with a syntactic analysis of each of the sentences. This annotation can also be imported, but it is not representative, since syntax belongs to the surface level, which is where language-specific features appear.

Therefore, the Romanian sentences are syntactically parsed at the alignment stage. Comparing the two annotations is very useful for creating a syntax transfer model.


Thank you!

[email protected]@gmail.com