CBA-08, Barcelona, November 13th-15th 2008
Slide 1
Centre for Language Technology
Co-referential chains and discourse topic shifts in parallel and comparable corpora
Costanza Navarretta
Slide 2
Outline
• Motivation
• Preceding studies/projects
• Background
• The data
• The annotation
• Problems
• Some results
Slide 3
Motivation
1. to provide a corpus of parallel and comparable Danish and Italian texts annotated with (co)reference and with discourse topic shifts (language studies, anaphora resolution, MT, generation)
2. to investigate whether there is a systematic relation between the use of various types of referring expression and different discourse transition states in the two languages
3. to identify similarities and differences in the use of various referring expressions in Danish and Italian
Slide 4
Previous work
• Study on the use and resolution of pronouns in Danish
• MULINCO project (Maegaard et al. 2006)
• DAD project (Navarretta & Olsen 2008)
• Annotation seminar at the University of Copenhagen (September 2008)
Slide 5
Issues to investigate
• Referring expressions are used differently in English, Danish and Italian (theoretical and practical problems)
• Differences in the way the three languages use various types of pronoun in abstract reference
• Impression that Danish and Italian use different reference strategies, especially in relation to topic shifts
Slide 6
Background: relation between reference and discourse structure
• Kuno 1972, Halliday and Hasan 1976
• Hobbs 1982 (coherence relations + reference resolution in an abductive framework)
• Givón 1983 (major and minor junctures in dialogue transcriptions)
• Cristea et al. 1998: Veins Theory inside Rhetorical Structure Theory (Mann and Thompson 1987)
Slide 7
Background – continued
Centering framework (Grosz et al. 1995):
• presupposes global coherence: Grosz and Sidner 1986
• is about local coherence
• mainly regards pronouns
• compatible with cognitive models of reference of nominal expressions, i.a. Givón 1983, Gundel et al. 1993, Prince 1981: the use of referring expressions reflects the assumptions speakers make about the addressee's mental state at that point in the discourse
Slide 8
Local transitions (Brennan et al. 1987, Fais 2004, Poesio et al. 2004)
                   Cb(Un) = Cb(Un-1) or Cb(Un-1) = NIL    Cb(Un) ≠ Cb(Un-1)
Cb(Un) = Cp(Un)    Continue                               Smooth Shift
Cb(Un) ≠ Cp(Un)    Retain                                 Rough Shift
• Presence/absence of backward-looking center (Cb)
• Nature of instantiation of discourse entities
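The transition table above can be sketched as a small Python function. This is only an illustrative sketch: the function name and the use of `None` for a missing center are assumptions, and returning "NULL" when the current utterance has no Cb is an assumed reading of the NULL/TNULL label used elsewhere in these slides.

```python
from typing import Optional


def classify_transition(cb_prev: Optional[str],
                        cb_cur: Optional[str],
                        cp_cur: Optional[str]) -> str:
    """Classify the transition into utterance Un.

    cb_prev: Cb(Un-1), or None if it is NIL
    cb_cur:  Cb(Un), or None if Un has no backward-looking center
    cp_cur:  Cp(Un), the preferred center of Un
    """
    if cb_cur is None:
        # Assumption: no Cb at all is labelled as a NULL transition.
        return "NULL"
    # Column of the table: same Cb as before, or no previous Cb.
    same_cb = cb_prev is None or cb_cur == cb_prev
    # Row of the table: does the Cb coincide with the preferred center?
    cb_is_cp = cb_cur == cp_cur
    if same_cb:
        return "Continue" if cb_is_cp else "Retain"
    return "Smooth Shift" if cb_is_cp else "Rough Shift"


if __name__ == "__main__":
    print(classify_transition("Acqua Marcia", "Acqua Marcia", "Acqua Marcia"))
    print(classify_transition("Acqua Marcia", "Romagnoli", "fallimento"))
```

The centers are plain strings here; in an annotated corpus they would more plausibly be identifiers of discourse entities.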
Slide 9
Background: Salience and nominal expressions
in focus  >  activated           >  familiar  >  uniquely identifiable  >  type identifiable
it           that, this, this N     that N       the N                     a N
Gundel et al. (1993)
zero pronouns < cliticized pronouns < unstressed pronouns < stressed pronouns < stressed pronouns + gestures < proximal demonstratives < distal demonstratives < first name or last name < definite description < full name
Ariel (1988, 1994)
Slide 10
The annotated corpora
Parallel corpora
• European law texts
• Short stories and translations (Pirandello)
• Short stories and translations (Villy Sørensen?)
Comparable corpora
• Financial newspapers (Il Sole 24 Ore, Børsen)
• Newspaper articles
So far:
approx. 24,000 words for Italian
approx. 19,000 words for Danish
Slide 11
The annotation
(Co)reference
Annotation of (co)reference by nouns added to a small subset of the DAD corpus (annotated with pronominal abstract anaphora and 3rd person singular neuter pronouns)
Annotators: Italian (6 on a first subset of the data, then divided into groups of 2)
Annotators: Danish (4, then 2)
Slide 12
The annotation – continued
• builds upon the MATE/GNOME annotation (Poesio 2004)
• includes both reference to objects introduced in discourse by nominal phrases and reference to objects introduced by i.a. verbal phrases, clauses, discourse segments, predicates in copula constructions
Slide 13
The annotation – continued
• function of pronouns (pleonastic, cataphoric, deictic, anaphoric, individual, abstract, textual deictic, vague, abandoned)
• information about type of referring expression (type of NP, see also Poesio et al. 2004)
• type of relation between referring expression and antecedent (identity/non-identity/other?)
• syntactic type of antecedent (e.g. type of clause, discourse segment, other…)
• semantic type of abstract referents (Asher 1993, Gundel et al. 2003, Navarretta & Olsen 2008)
Slide 14
Some problems
• pronouns (and referring expressions in general) can be multifunctional (e.g. both anaphoric and cataphoric)
• definition of clauses in Danish and Italian
• reference relations: non-identity is too general (antecedent and referring expression are related, or the context determines a semantic difference between the referents)
• possessives
• granularity of semantic types
• direct speech, deictic I, you…
Slide 15
Discourse topics
Global level (all files): paragraphs are considered to start a topic, then subtopics and sub-subtopics (Rocha 1997)
Local level (only part of the data): continue/retain/smooth shift/rough shift
Slide 16
The annotation schemes
• two slightly different annotation schemes; the Italian scheme accounts for zero anaphora (Italian is a subject pro-drop language), clitic pronouns and reference to PPs
• de, seg elements as in MATE/GNOME
• added explet, abandoned, chunk
• added seg1 for clitics and zero anaphora (Italian)
• added a number of extra attributes
• tool PALinkA (Orasan 2003): anchor+ref replaced by link (attributes identity/non-identity and dislink to annotate discontinuous elements)
Slide 17
Interannotator agreement: Italian
6 annotators on the first 4000 words
Weighted kappa statistics (Cohen 1968): PRAM
http://www.geocities.com/skymegsoftware/pram.html
• Between 0.75 (abstract reference by NPs) and 0.95
On the rest of the data, agreement varies (depending on annotators, data, etc.)
Humans are not machines: a number of referring expressions are “forgotten” by one or both annotators, and there are other distraction errors.
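For illustration, a weighted kappa for one pair of annotators could be computed roughly as follows. This is a minimal sketch and not the PRAM implementation: the function name, the default linear disagreement weights over the category order, and the example labels are all assumptions. With more than two annotators, a score like this would have to be computed per pair.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def weighted_kappa(a: Sequence[Hashable],
                   b: Sequence[Hashable],
                   categories: Sequence[Hashable],
                   weight: Callable[[int, int], float] = lambda i, j: abs(i - j)
                   ) -> float:
    """Weighted kappa for two annotators labelling the same items.

    weight(i, j) is the disagreement weight between categories i and j
    (0 on the diagonal). Default: linear weights, an assumption.
    """
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)
    k = len(categories)
    observed = Counter((index[x], index[y]) for x, y in zip(a, b))
    row = Counter(index[x] for x in a)   # marginals of annotator a
    col = Counter(index[y] for y in b)   # marginals of annotator b
    # Weighted observed and chance-expected disagreement.
    obs = sum(weight(i, j) * observed[(i, j)] / n
              for i in range(k) for j in range(k))
    exp = sum(weight(i, j) * row[i] * col[j] / (n * n)
              for i in range(k) for j in range(k))
    if exp == 0:
        raise ValueError("kappa is undefined: no expected disagreement")
    return 1.0 - obs / exp


if __name__ == "__main__":
    ann1 = ["ident", "ident", "non-ident", "non-ident"]
    ann2 = ["ident", "non-ident", "non-ident", "non-ident"]
    print(weighted_kappa(ann1, ann2, ["ident", "non-ident"]))
```

With only two categories the linear weights reduce to 0/1, so the result coincides with unweighted Cohen's kappa.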
Slide 18
<P id="p35" topic="t35.1">
  <S id="s35.1">
    <transition ttype="TNULL"/>
    <de ID="n173" syn-type="NPR">
      <link Ltype="ident" POINT-BACK="n172"/>
      <W id="w35.1.1">La</W><W id="w35.1.2">Acqua</W><W id="w35.1.3">Marcia</W>
    </de>
    <W id="w35.1.4">può</W><W id="w35.1.5">evitare</W>
    <de ID="n521" syn-type="DNP">
      <W id="w35.1.6">il</W><W id="w35.1.7">fallimento</W>
    </de>
    <W id="w35.1.8">.</W>
  </S>
  <S id="s35.2">
    <transition ttype="CONTINUE"/>
    <de ID="n174" syn-type="DNP+GP">
      <link Ltype="ident" POINT-BACK="n173"/>
      <W id="w35.2.1">La</W><W id="w35.2.2">finanziaria</W> <W id="w35.2.3">di</W>
      <de ID="n522" syn-type="NPR">
        <W id="w35.2.4">Vincenzo</W> <W id="w35.2.5">Romagnoli</W>
      </de>
    </de>
    ....
  </S>
  ...
</P>
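As an illustration of how such an annotated fragment might be processed, the sketch below reads the de elements and follows the POINT-BACK identity links with Python's standard xml.etree.ElementTree. It assumes a simplified, well-formed copy of the fragment above with the elided parts left out; the function names are hypothetical, and the antecedent n172 lies outside the fragment, so the chain simply stops there.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the slide's fragment (elisions dropped), for illustration.
SAMPLE = """<P id="p35" topic="t35.1">
<S id="s35.1"><transition ttype="TNULL"/>
<de ID="n173" syn-type="NPR"><link Ltype="ident" POINT-BACK="n172"/>
<W id="w35.1.1">La</W><W id="w35.1.2">Acqua</W><W id="w35.1.3">Marcia</W></de>
<W id="w35.1.4">può</W><W id="w35.1.5">evitare</W>
<de ID="n521" syn-type="DNP"><W id="w35.1.6">il</W><W id="w35.1.7">fallimento</W></de>
<W id="w35.1.8">.</W></S>
<S id="s35.2"><transition ttype="CONTINUE"/>
<de ID="n174" syn-type="DNP+GP"><link Ltype="ident" POINT-BACK="n173"/>
<W id="w35.2.1">La</W><W id="w35.2.2">finanziaria</W><W id="w35.2.3">di</W>
<de ID="n522" syn-type="NPR"><W id="w35.2.4">Vincenzo</W><W id="w35.2.5">Romagnoli</W></de>
</de></S>
</P>"""


def read_mentions(xml_text):
    """Map each de ID to its words, syntactic type and antecedent link."""
    root = ET.fromstring(xml_text)
    mentions = {}
    for de in root.iter("de"):
        link = de.find("link")  # only the link that is a direct child
        mentions[de.get("ID")] = {
            # Words of nested de elements are included in the outer one.
            "text": " ".join(w.text for w in de.iter("W")),
            "syn_type": de.get("syn-type"),
            "antecedent": link.get("POINT-BACK") if link is not None else None,
            "ltype": link.get("Ltype") if link is not None else None,
        }
    return mentions


def chain_of(mention_id, mentions):
    """Follow identity links backwards, newest mention first."""
    chain = [mention_id]
    while True:
        current = mentions.get(chain[-1])
        if current is None or current["ltype"] != "ident":
            break  # antecedent outside the fragment, or no identity link
        antecedent = current["antecedent"]
        if antecedent is None or antecedent in chain:
            break  # end of chain, or guard against a cyclic link
        chain.append(antecedent)
    return chain


if __name__ == "__main__":
    found = read_mentions(SAMPLE)
    for mention_id in chain_of("n174", found):
        print(mention_id, found.get(mention_id, {}).get("text", "(outside fragment)"))
```

The transition labels could be read in the same way, for example by iterating over the S elements and taking the ttype attribute of their transition child.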
Slide 19
First results – genre differences
• (co)referential chains in literary texts much longer than in the financial articles where coherence is often given by domain knowledge
• pronouns more frequently used in literary texts
• distance between referring expression and antecedent extremely high in literary texts (is there coreference when the distance is more than 50 clauses?)
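Chain length and antecedent distance of the kind discussed above could be measured along the following lines. This is a sketch under stated assumptions: each mention is given as a hypothetical (chain id, clause index) pair, distance is counted in clauses to the nearest preceding mention of the same chain, and the default threshold of 50 clauses is taken from the question on this slide.

```python
from collections import defaultdict


def chain_statistics(mentions):
    """Compute length and largest antecedent distance per chain.

    mentions: iterable of (chain_id, clause_index) pairs, in any order.
    """
    positions = defaultdict(list)
    for chain_id, clause_index in mentions:
        positions[chain_id].append(clause_index)
    stats = {}
    for chain_id, clauses in positions.items():
        clauses.sort()
        # Distance from each mention to the previous mention in the chain.
        gaps = [later - earlier for earlier, later in zip(clauses, clauses[1:])]
        stats[chain_id] = {
            "length": len(clauses),
            "max_distance": max(gaps, default=0),
        }
    return stats


def chains_exceeding(stats, threshold=50):
    """Chains containing at least one link longer than the threshold."""
    return sorted(c for c, s in stats.items() if s["max_distance"] > threshold)


if __name__ == "__main__":
    example = [("c1", 1), ("c1", 3), ("c1", 60), ("c2", 5)]
    result = chain_statistics(example)
    print(result)
    print(chains_exceeding(result))
```

The chain ids and clause indices in the example are invented for illustration and do not come from the corpus.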
Slide 20
Differences between the two languages
Inferable entities are more often anchored to known entities by genitives in Danish
• Fin dal primo giorno, Bartolino Fiorenzo s’era sentito dire dalla promessa sposa…
• Fra første dag havde Bartolino Fiorenzo hørt sin tilkommende sige…
(From the very first day Bartolino Fiorenzo had heard (his/the) fiancée say)
Pirandello, La buona anima
Slide 21
Differences in the use of proximal/distal demonstratives + N'
Italian quel/quello/quella (that) + N' used if:
• there are other clauses or nominals in between referring expression and antecedent
• there is a temporal or spatial distance from the antecedent
Danish denne (this) + N': in the same contexts (there can be clauses and nominals in between, but no competing antecedents, i.e. antecedents of the same semantic type)
quella donna / denne kvinde (woman)
quella sciagura / denne ulykke (calamity)
quella gioia / denne glæde (joy)
questo ragionamento / dette argument (this reasoning / this argument) when the antecedent is the immediately preceding discourse segment
Slide 22
Transition states and referring expressions - Italian
Continue: Zero > Pronoun > the N
Retain: Pronoun > Proper Name > … > Zero
Smooth Shift: Proper Name > the N > Pronoun
Rough Shift: the N > genitive N > Proper Name (+ the N) > distal N > a N > Pronoun
NULL: Proper Name > the N