Attribution and the PDTB
Silvia Pareti
The University of Edinburgh School of Informatics
Outline
• Introduction
• Attribution in the PDTB
• Annotation schema extension
• Resources development
• Preliminary achievements
• Future directions
Introduction - Attribution
(wsj 0961)
PDTB - Attribution Annotation
Mr. Nemeth said in parliament that Czechoslovakia and Hungary would suffer environmental damage if the twin dams were built as planned. (wsj_0037)
____Explicit____
9163..9165
#### Text ####
if
#### Features ####
Ot, Comm, Null, Null
9067..9096
#### Text ####
Mr. Nemeth said in parliament
##############
if, Contingency.Condition.Unreal present
____Arg1____
9097..9162
#### Text ####
that Czechoslovakia and Hungary would suffer environmental damage
#### Features ####
Inh, Null, Null, Null
____Arg2____
9166..9201
#### Text ####
the twin dams were built as planned
#### Features ####
Inh, Null, Null, Null
(Prasad et al., 2008)
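The record above can be read mechanically. The following is a minimal sketch, not the official PDTB tooling: it parses a `start..end` character span, checks that the span length matches the quoted text, and glosses a `Features` line. It assumes the four features are ordered source, type, scopal polarity, determinacy, and the glosses for the abbreviations are an interpretation aligned with the feature names used in these slides. The function names are hypothetical.

```python
import re

# Assumed glosses for the PDTB feature abbreviations (an interpretation,
# not taken from the slides). Assumed order: source, type, polarity, determinacy.
SOURCE = {"Wr": "writer", "Ot": "other", "Arb": "arbitrary", "Inh": "inherited"}
TYPE = {"Comm": "assertion", "PAtt": "belief", "Ftv": "fact",
        "Ctrl": "eventuality", "Null": "none"}
POLARITY = {"Neg": "scopal change", "Null": "none"}
DETERMINACY = {"Indet": "non-factual", "Null": "factual"}


def parse_span(span):
    """Turn a string such as '9067..9096' into a (start, end) pair of ints."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", span.strip())
    if match is None:
        raise ValueError(f"not a span: {span!r}")
    return int(match.group(1)), int(match.group(2))


def span_matches_text(span, text):
    """Check that the span covers exactly len(text) characters."""
    start, end = parse_span(span)
    return end - start == len(text)


def decode_features(line):
    """Gloss a Features line such as 'Ot, Comm, Null, Null'."""
    source, type_, polarity, determinacy = [f.strip() for f in line.split(",")]
    return {
        "source": SOURCE[source],
        "type": TYPE[type_],
        "scopal_change": POLARITY[polarity],
        "factuality": DETERMINACY[determinacy],
    }


if __name__ == "__main__":
    print(parse_span("9163..9165"))
    print(span_matches_text("9067..9096", "Mr. Nemeth said in parliament"))
    print(decode_features("Ot, Comm, Null, Null"))
```

Under these assumptions, the attribution span of the wsj_0037 example decodes to an assertion by a source other than the writer, while both arguments inherit their attribution.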
Other corpora with attribution
• MPQA Opinion Corpus (Wiebe et al., 2002)
– 692 articles
– intra-sentential annotation
• RST Discourse Treebank (Carlson & Marcu, 2001)
– 385 articles
– intra-sentential, only explicit sources, verb cues or "according to"
• GraphBank (Wolf & Gibson, 2005)
– 135 articles
– only attributions not overlapping with other discourse relations
• Other smaller or low-coverage projects
– Sydney Morning Herald Corpus (O’Keefe et al., submitted)
– Corpus TCC and RHETALHO (Pardo et al., 2004)
PDTB - Advantages
Large corpus
less frequent structures and strategies are better observed, e.g.:
Groused Robert Antolini, head of over-the-counter trading at Donaldson, Lufkin & Jenrette: "It's making it tough for traders to make money." (wsj_1142)
For some at the SEC, an agency that covets its independence, Mr. Breeden may be too much of a Washington insider. (wsj_0955)
PDTB - Advantages
The range of attributions covered is not pre-defined
• Attributions are not limited to the sentence level
• A wide range of attributions are annotated:
– direct, indirect and mixed
– with named or unnamed, explicit or implicit sources (e.g. it is believed…)
– with verb and non-verb cues (e.g. idea, for)
• Includes some relevant features
PDTB - Extensions
• Finer grained annotation of the attribution span: source, cue, circumstantial information
• Completing content spans of some direct or mixed attributions
"It's just sort of a one-upsmanship thing with some people," added Larry Shapiro. "They like to talk about having the new Red Rock Terrace one of Diamond Creek's Cabernets or the Dunn 1985 Cabernet, or the Petrus.
Producers have seen this market opening up and they're now creating wines that appeal to these people."
(wsj 0071)
• Annotation of attributions not overlapping with discourse relations
• Annotation of nested attributions
["The Caterpillar people aren't too happy when they see their equipment used like that,"]
[shrugs] [Mr. George].
["They figure it's not a very good advert."] (wsj 1121)
PDTB - Extensions
[They] [figure] [it's not a very good advert]
Annotation Schema
source
cue
SUPPLEMENT
content
[Mr. Nemeth said IN PARLIAMENT] that Czechoslovakia and Hungary would suffer environmental damage if the twin dams were built as planned. (wsj_0037)
[PDTB attribution span]
PDTB discourse connective /Arg1/Arg2 text spans
PDTB Attribution Features
attribution type
• assertion (e.g. say, mention)
• belief (e.g. think, doubt)
• fact (e.g. remember, know)
• eventuality (e.g. allow)
source type
• writer (if explicit, e.g. I think...)
• other (e.g. Mr. Brown, a witness)
• arbitrary (e.g. one, people)
• mixed (e.g. My assessment and everyone's assessment is… (wsj_2012))
PDTB Attribution Features
factuality (determinacy)
• factual
• non-factual
scopal change (scopal polarity)
• none
• scopal change
Se c’è, cioè, una maggioranza in Parlamento in grado di affrontare seriamente una fase di riforme anche elettorali, Ø penso che la legislatura possa utilmente proseguire. (re075)
If, that is, there is a majority in Parliament able to seriously tackle a phase of reforms, including electoral ones, (I) think that the legislature could usefully continue.
New Attribution Features
source attitude
• neutral (e.g. say, add)
• positive (e.g. welcome, beam)
• critical (e.g. lament, fume)
• tentative (e.g. believe, suggest)
• other (e.g. joke)
authorial stance
• committed (e.g. admit, know)
• not-committed (e.g. lie, claim)
• neutral (e.g. say, suggest)
New Attribution Features
Mr. Nemeth said IN PARLIAMENT that Czechoslovakia and Hungary would suffer environmental damage if the twin dams were built as planned. (wsj_0037)
Attribution type: assertion
Source type: other
Factuality: factual
Scopal change: none
Source attitude: neutral
Authorial stance: neutral
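The six features and their values can be held in a small validated record. This is only an illustrative sketch: the class name, field names and the `ALLOWED` table are hypothetical, and the value sets are copied from the feature lists in these slides rather than from any released corpus format.

```python
from dataclasses import dataclass

# Value sets as listed on the feature slides (hypothetical field names).
ALLOWED = {
    "attribution_type": {"assertion", "belief", "fact", "eventuality"},
    "source_type": {"writer", "other", "arbitrary", "mixed"},
    "factuality": {"factual", "non-factual"},
    "scopal_change": {"none", "scopal change"},
    "source_attitude": {"neutral", "positive", "critical", "tentative", "other"},
    "authorial_stance": {"committed", "not-committed", "neutral"},
}


@dataclass(frozen=True)
class AttributionFeatures:
    attribution_type: str
    source_type: str
    factuality: str
    scopal_change: str
    source_attitude: str
    authorial_stance: str

    def __post_init__(self):
        # Reject any value that is not in the schema for its feature.
        for name, allowed in ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}: {value!r} not in {sorted(allowed)}")


# The wsj_0037 example, with the feature values given above.
nemeth = AttributionFeatures(
    attribution_type="assertion",
    source_type="other",
    factuality="factual",
    scopal_change="none",
    source_attitude="neutral",
    authorial_stance="neutral",
)
```

Validating at construction time means a mistyped label fails immediately instead of silently entering the agreement counts.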
Confronted, Mrs. Yeargin admitted she had given the questions and answers two days before the examination to two low-ability geography classes. (wsj 0044)
Authorial stance: committed
"I think that this magazine is not only called Garbage, but it is practicing journalistic garbage," fumes a spokesman for Campbell Soup. (wsj 0062)
Source attitude: critical
Inter-Annotator Agreement
• 2 annotators
• 14 articles (PDTB)
• annotation manual
• training on an article
• MMAX2 annotation tool (Müller & Strube, 2006)
• complete annotation schema
Data:
• 491 attributions
(22% are nested)
(Pareti, 2012 submitted)
Results - Existence of Attribution
agr = 0.87 (proportion of commonly annotated relations with respect to the annotations identified overall by Annotator A and Annotator B)
NOTE: writer attributions were annotated only if explicit
Span selection tasks (agr metric):
Cue    Source   Content   Supplement
0.97   0.94     0.95      0.37
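The slides do not give the formula behind agr, so the following is a sketch under stated assumptions: agr is read as a directional measure (the share of one annotator's annotations that the other annotator also made), averaged over both directions, and two annotations count as the same only if their spans match exactly. The toy spans are invented for illustration and are not corpus data.

```python
def agr(a, b):
    """Directional agreement: share of annotations in `a` also present in `b`.

    Assumes exact matching of annotation identifiers (here: span tuples).
    """
    a, b = set(a), set(b)
    if not a:
        raise ValueError("annotator produced no annotations")
    return len(a & b) / len(a)


def mean_agr(a, b):
    """Average of the two directional scores (an assumed way to combine them)."""
    return (agr(a, b) + agr(b, a)) / 2


# Invented toy data: each annotation is a (start, end) character span.
annotator_a = {(0, 10), (20, 35), (40, 60), (70, 90)}
annotator_b = {(0, 10), (20, 35), (40, 60)}

print(agr(annotator_a, annotator_b))       # 3 of A's 4 annotations are shared
print(agr(annotator_b, annotator_a))       # all of B's annotations are shared
print(mean_agr(annotator_a, annotator_b))
```

A looser matching criterion (for example, overlapping rather than identical spans) would change only how the two sets are compared, not the ratio itself.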
Results - Features
FEATURE            PERCENT AGREEMENT   COHEN'S KAPPA
TYPE               83.42 (317)         0.63
SOURCE             95 (361)            0.71
SCOPAL CHANGE      98.68 (375)         0.60
AUTHORIAL STANCE   94.47 (359)         0.20
SOURCE ATTITUDE    82.36 (313)         0.48
FACTUALITY         97.63 (371)         0.73
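Percent agreement and Cohen's kappa can diverge sharply when one label dominates, which may be what happens for AUTHORIAL STANCE above. The sketch below implements the standard two-annotator kappa, (p_o - p_e) / (1 - p_e); the label sequences are invented toy data, not the study's annotations.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented toy data: 20 items, almost all "neutral"; each annotator marks
# exactly one item as "committed", but not the same one.
labels_a = ["committed"] + ["neutral"] * 19
labels_b = ["neutral", "committed"] + ["neutral"] * 18

print(percent_agreement(labels_a, labels_b))  # 18 of 20 items agree
print(cohens_kappa(labels_a, labels_b))       # slightly below zero
```

With such a skewed distribution, nearly all of the observed agreement is what chance alone would produce, so kappa stays low even though raw agreement is high.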
Italian Attribution Corpus-ItAC
• 50 articles (37,000 tokens) from Italian newspaper corpora (e.g. La Repubblica)
• 460 attribution relations
• Freely available from: http://homepages.inf.ed.ac.uk/s1052974/resources.php
(Pareti and Prodanof, 2010)
PDTB Attribution Corpus
Stand-off annotation of attribution based on the PDTB:
• Comprises all attribution relations annotated in the PDTB (reconstructed from the current annotation)
• The annotation is further extended according to the revised annotation schema
(Pareti, 2012)
9868 attributions
PDTB Attribution Corpus
Annotation of the attribution span: source cue SUPPLEMENT
80% annotated automatically, then manually revised, using 48 matching rules, e.g.:
– (NP-SBJ)(VP): one person said
– (PP-LOC)(NP)(VB): IN DALLAS, LTV said
– (NP-SBJ)(VBP)(JJ): I am sure
20% had rarer syntax and was manually annotated, e.g.:
Judge Curry ordered the refunds to begin Feb. 1 and said (wsj 0015)
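The rule-based step can be pictured as a lookup from a sequence of syntactic labels to the role of each constituent. The sketch below is a guess at the general idea, not the author's implementation: the `RULES` table holds only the three patterns quoted above, the role assigned to each label is an assumption, and the input is assumed to be already chunked into (label, text) pairs.

```python
# Hypothetical rule table: label sequence -> role of each constituent.
RULES = {
    ("NP-SBJ", "VP"): ("source", "cue"),
    ("PP-LOC", "NP", "VB"): ("supplement", "source", "cue"),
    ("NP-SBJ", "VBP", "JJ"): ("source", "cue", "cue"),
}


def split_attribution_span(constituents):
    """Split an attribution span into source, cue and supplement.

    `constituents` is a list of (label, text) pairs. Returns None when no
    rule matches, i.e. the span is left for manual annotation.
    """
    labels = tuple(label for label, _ in constituents)
    roles = RULES.get(labels)
    if roles is None:
        return None
    parts = {"source": [], "cue": [], "supplement": []}
    for (_, text), role in zip(constituents, roles):
        parts[role].append(text)
    return {role: " ".join(texts) for role, texts in parts.items()}


print(split_attribution_span([("PP-LOC", "IN DALLAS"), ("NP", "LTV"), ("VB", "said")]))
print(split_attribution_span([("NP-SBJ", "I"), ("VBP", "am"), ("JJ", "sure")]))
print(split_attribution_span([("NP-SBJ", "Judge Curry"), ("VP", "ordered"), ("CC", "and")]))
```

Returning None for unmatched sequences mirrors the split described above, where spans with rarer syntax fall through to manual annotation.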
PDTB Attribution Corpus
Further annotation of the content span:
– adding punctuation (direct quotation marks)
– completing content spans that had only been partially annotated
– annotating the quote status of the attribution based on the position of quote span QS and content span CS:
• direct QS = CS
• indirect CS outside or contained in QS
• mixed CS overlaps QS or QS contained in CS
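The three quote-status rules translate directly into span comparisons. The sketch below follows the rules as listed, under a few simplifying assumptions: each attribution has one content span and at most one quote span, spans are half-open (start, end) character offsets, and a missing quote span is treated as indirect. The function name is hypothetical.

```python
def quote_status(cs, qs):
    """Classify an attribution as 'direct', 'indirect' or 'mixed'.

    cs: content span as (start, end); qs: quote span as (start, end),
    or None when the text contains no quotation marks (assumed indirect).
    """
    if qs is None:
        return "indirect"
    if cs == qs:
        return "direct"  # QS = CS
    cs_start, cs_end = cs
    qs_start, qs_end = qs
    no_overlap = cs_end <= qs_start or qs_end <= cs_start
    cs_inside_qs = qs_start <= cs_start and cs_end <= qs_end
    if no_overlap or cs_inside_qs:
        return "indirect"  # CS outside or contained in QS
    return "mixed"  # CS overlaps QS, or QS contained in CS


print(quote_status((10, 50), (10, 50)))
print(quote_status((10, 50), (60, 80)))
print(quote_status((10, 50), (30, 70)))
```

Attributions whose content is spread over several quotes would need the comparison applied to a set of spans rather than a single pair.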
PDTB Attribution Corpus
ATTRIBUTION ID: wsj_0003.pdtb_05
SOURCE SPAN: Darrell Phillips, vice president of human resources for Hollingsworth & Vose
CUE SPAN: said
CONTENT SPAN: “There’s no question that some of those workers and managers contracted asbestos-related diseases,” “But you have to recognize that these events took place 35 years ago. It has no bearing on our work force today.”
SUPPLEMENT SPAN: None
FEATURES: Ot, Comm, Null, Null
QUOTE STATUS: Direct
Use of PDTB Attribution Corpus
Independent analysis of attribution:
• cue composition
– several cues other than verbs (prepositions, nouns, adverbs)
– wide range of attributional verbs (266 types in the corpus)
• source composition
– named entities (NEs) make up only about 50% of the sources
• attribution structures
Use of PDTB Attribution Corpus
Testing a system for the identification of direct quotes and their speaker in the literature and news domains. University of Sydney and Sydney Morning Herald
(O’Keefe et al. 2012, submitted).
• Rule-based and machine-learning-based approaches have been tested on 3 corpora.
• Results show that direct quotes differ by domain and style
Future
• Development of an attribution extraction system using the data to train a classifier
• Semi-automatic extension of the annotation to comprise all attributions in the corpus
• Annotation of the level of nesting of each attribution
• Release of the corpus for development, testing and use in shared tasks
Conclusion
• Advantages of attribution in the PDTB
• Development of a finer-grained annotation schema and its inter-annotator agreement results
• Application of the schema to a small corpus of Italian
• Collection and further annotation of attribution in the PDTB
• Importance of this resource for the analysis of attribution and its ‘long tail’ and for testing and developing attribution extraction systems
Bibliography
Carlson, L. and Marcu, D. Discourse tagging reference manual. Technical report ISI-TR-545, ISI, University of Southern California, September 2001.
Müller, C. and Strube, M., Multi-Level Annotation of Linguistic Data with MMAX2. In: Sabine Braun, Kurt Kohn, Joybrato Mukherjee (Eds.): Corpus Technology and Language Pedagogy. New Resources, New Tools, New Methods. Frankfurt: Peter Lang, pp. 197-214. (English Corpus Linguistics, Vol.3 ), 2006.
O’Keefe, T., Pareti, S., Curran, J., Koprinska, I. and Honnibal, M., A sequence labelling approach to quote attribution. Manuscript submitted for publication, 2012.
Pardo, T., das Graças Volpe Nunes, M. and Rino, L. Dizer: An automatic discourse analyzer for Brazilian Portuguese. In Ana Bazzan and Sofiane Labidi, editors, Advances in Artificial Intelligence – SBIA 2004, volume 3171 of Lecture Notes in Computer Science, pages 224–234. Springer Berlin / Heidelberg, 2004.
Pareti, S. and Prodanof, I. Annotating attribution relations: Towards an Italian discourse treebank. In Proceedings of LREC10, 2010.
Pareti, S. A database of attribution relations. In Proceedings of LREC12, Istanbul, 23-25 May 2012 (to appear).
Pareti, S., Theory and practise of annotating attributions. Manuscript submitted for publication, 2012.
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A. and Webber, B. The Penn Discourse Treebank 2.0. In Proceedings of LREC08, 2008.
Wiebe, J. Instructions for annotating opinions in newspaper articles. Technical report, University of Pittsburgh, 2002.
Wolf, F. and Gibson, E. Representing discourse coherence: A corpus-based study. Comput. Linguist., 31:249–288, June 2005.