Evaluating Algorithms for GRE. Kees van Deemter (work with Albert Gatt, Ielka van der Sluis, and Richard Power), University of Aberdeen, Scotland, UK. Outline: GRE (Generation of Referring Expressions); TUNA project: Corpus and Annotation; Evaluation of Algorithms: Furniture Domain
SELLC Winter School 2010
Evaluating Algorithms for GRE
Kees van Deemter (work with Albert Gatt, Ielka van der Sluis, and Richard Power)
University of Aberdeen, Scotland, UK
Outline
• GRE: Generation of Referring Expressions
• TUNA project: Corpus and Annotation
• Evaluation of Algorithms
  – Furniture Domain
  – People Domain
• [ Evaluation in the real world: STEC ]
TUNA project (ended Feb. 2007)
• TUNA: Towards a UNified Algorithm for Generating Referring Expressions.
1. Extend coverage of GRE algorithms (plurals, negation, gradable properties,…)
2. Improve empirical foundations of GRE
• Focus on
  – Content Determination
  – “First mention” NPs (no anaphora!)
Background
• Dale and Reiter hypothesised that the Incremental Algorithm (IA) led to “better” output than other algorithms
  – “better”: more human-like
  – other algorithms: see below
Other GRE Algorithms
• Full Brevity (FB; Dale 1989)
  – Generation of minimal descriptions
  – For example, by first trying all descriptions of length 1, then length 2, and so on.
• Greedy Algorithm (GR; Dale 1989)
  – Always add the property that removes the most distractors
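The two strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-object domain and the names `DOMAIN`, `distractors_left`, `full_brevity` and `greedy` are hypothetical and are not taken from the TUNA materials or from Dale's original formulation.

```python
from itertools import combinations

# Hypothetical toy domain (not from the TUNA corpus): entity -> {attribute: value}.
DOMAIN = {
    "e1": {"type": "desk", "colour": "green", "size": "large"},
    "e2": {"type": "desk", "colour": "red", "size": "large"},
    "e3": {"type": "chair", "colour": "green", "size": "small"},
    "e4": {"type": "desk", "colour": "green", "size": "small"},
}


def distractors_left(target, props):
    """Entities other than the target that still match every (attribute, value) pair."""
    return {
        entity
        for entity, attrs in DOMAIN.items()
        if entity != target and all(attrs.get(a) == v for a, v in props)
    }


def full_brevity(target):
    """Try all descriptions of length 1, then length 2, and so on."""
    target_props = sorted(DOMAIN[target].items())
    for length in range(1, len(target_props) + 1):
        for combo in combinations(target_props, length):
            if not distractors_left(target, combo):
                return set(combo)
    return None  # the target cannot be distinguished


def greedy(target):
    """Repeatedly add the property that leaves the fewest distractors."""
    chosen = []
    remaining = sorted(DOMAIN[target].items())
    while distractors_left(target, chosen):
        if not remaining:
            return None  # properties exhausted, distractors remain
        best = min(
            remaining,
            key=lambda p: len(distractors_left(target, chosen + [p])),
        )
        chosen.append(best)
        remaining.remove(best)
    return set(chosen)


print(full_brevity("e1"))
print(greedy("e1"))
```

In this small domain both functions select colour and size for `e1`; in general the greedy strategy is not guaranteed to find a minimal description, whereas the exhaustive search is.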
Elicitation experiment
• Participants were told that we wanted to test an AI program that interprets referring expressions
• Participants were shown a series of domains
• Each domain included 1 or 2 target objects
• Participants entered their descriptions, then the referents were removed
• To make the interaction seem real, we sometimes removed the wrong object! (25% of trials)
  – The experiment was later repeated without this feature
  – Essentially the same outcomes were found
• For generality: two types of domains (furniture, people)
Furniture trial
People trial
Method (overview)
• Experiment leads to a transparent corpus of referring expressions:
  – Referent and distractors are known
  – Domain attributes are known
• Transparent corpora can be used for many purposes
• This talk: Compare some classic algorithms
  – giving each algorithm the same input as subjects
  – computing how similar the algorithm’s output is to the subjects’ output
  – We count semantic content only
Elicitation Experiment
• Furniture (simple domain)
  – TYPE, COLOUR, SIZE, ORIENTATION
  – Examples: “the green desk facing backwards”; “the sofa and the desk which are red”
• People (complex domain)
  – Nine annotated properties in total
  – Examples: “the young man with a white shirt”; “the man with the funny haircut”
• Location:
  – Vertical location (Y-DIMENSION)
  – Horizontal location (X-DIMENSION)
  – Examples: “the man on the left”; “the chair in the top right”
Corpus setup
• Each corpus was carefully balanced, e.g. between singulars and plurals.
• Between-subjects design:
  -Location: Subjects discouraged from using locative expressions.
  +Location: Subjects not discouraged.
  -FaultCritical: Subjects could correct their utterances
  +FaultCritical: Subjects could not correct their utterances
• After discounting outliers and (self-reported) non-fluent speakers, 45 subjects were left
Experiment design: Furniture (-Location)
• 18 trials (C=colour, O=orientation, S=size)
  – 1 referent: minimal identification uses {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials]
  – 2 “similar” referents: {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials]
  – 2 “dissimilar” referents: {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials]
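The count of 18 trials follows from crossing three referent conditions with the six minimally distinguishing property sets. The short sketch below only reproduces that arithmetic; the names `ATTRIBUTES`, `CONDITIONS`, `minimal_property_sets` and `build_trials` are illustrative assumptions, not part of the experiment software.

```python
from itertools import combinations

ATTRIBUTES = ["c", "o", "s"]  # colour, orientation, size
CONDITIONS = ["1 referent", "2 similar referents", "2 dissimilar referents"]


def minimal_property_sets(attributes, max_size=2):
    """All attribute sets of size 1 up to max_size: {c}, {o}, {s}, {c,o}, {c,s}, {o,s}."""
    return [
        frozenset(combo)
        for size in range(1, max_size + 1)
        for combo in combinations(attributes, size)
    ]


def build_trials():
    """Cross every referent condition with every minimal property set."""
    return [
        (condition, props)
        for condition in CONDITIONS
        for props in minimal_property_sets(ATTRIBUTES)
    ]


print(len(minimal_property_sets(ATTRIBUTES)))  # 6 property sets per condition
print(len(build_trials()))                     # 18 trials in total
```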
Other evaluation studies
Limitations:
• Limited numbers of subjects/referents
• Few attempts at balancing the corpus
• IA: no teasing apart of preference orders
NB Some of these studies were more ambitious in some respects, looking at context, and going beyond identification
Other evaluation studies
• Jordan 2000, Jordan & Walker 2005
  – More than just identification (Jordan 2000)
• Siddharthan & Copestake 2004
  – References in linguistic context
• Gupta & Stent 2005
  – Realisation mixed with Content Determination
• Viethen & Dale 2006
  – Only Colour and Location
Extensions to the classics
• Plurality: (van Deemter 2002)
  – Extend each algorithm to search through disjunctions of increasing length
• Location: (van Deemter 2006)
  – Locatives treated as gradable: “the leftmost table/person”
  – E.g., suppose the referent x is located in column 3
    => “x is left of column 4”, “x is left of column 5” …
    => “x is right of column 2”, “x is right of column 1” …
• Type:
  – People tend to use TYPE (Dale & Reiter 1995)
  – Here: All algorithms added TYPE.
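The column example in the Location bullet can be written out as a small function that turns a grid position into comparative properties. This is a sketch only: the function name `locative_properties` and the five-column grid width are assumptions made for illustration (the slide mentions columns up to 5, so the grid has at least that many).

```python
def locative_properties(column, n_columns):
    """Derive 'left of' / 'right of' properties for a referent in the given column.

    Columns are numbered 1..n_columns from left to right.
    """
    if not 1 <= column <= n_columns:
        raise ValueError("column must lie within the grid")
    # The referent is left of every column to its right ...
    left_of = [f"left of column {c}" for c in range(column + 1, n_columns + 1)]
    # ... and right of every column to its left (nearest first).
    right_of = [f"right of column {c}" for c in range(column - 1, 0, -1)]
    return left_of + right_of


# A referent in column 3 of an assumed five-column grid.
for prop in locative_properties(3, 5):
    print(prop)
```

The vertical dimension could be treated the same way with rows in place of columns.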
Evaluation aims
• Hypothesis in Dale & Reiter 1995:
  – IA resembles human output most
• Our main questions:
  – Is this true?
  – How important are parameters (PO) for the IA?
• More generally:
  – assess ‘quality’ of classic GRE algorithms:
  – calculate average match between the description generated by an algorithm and the descriptions produced by people (for the same referent)
Evaluation metric
• Dice Coefficient: Dice = (2 x |common properties|) / |total properties|
  corpus: {A,B,C}, algorithm: {B,C}: Dice = …
  corpus: {A,B,C}, algorithm: {A,B,C,D}: Dice = …
Evaluation metric
• Dice Coefficient: Dice = (2 x |common properties|) / |total properties|, where |total properties| is the sum of the sizes of the two property sets
  corpus: {A,B,C}, algorithm: {B,C}: Dice = (2*2)/5 = 4/5
  corpus: {A,B,C}, algorithm: {A,B,C,D}: Dice = (2*3)/7 = 6/7
Evaluation metric
• Dice Coefficient: Dice = (2 x |common properties|) / |total properties|
• A coefficient of 1 indicates identical sets; 0 means no common properties
• We also used this to measure agreement between annotators of the corpus
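As a worked illustration, the Dice computation on property sets can be sketched as follows. The function names `dice` and `mean_dice` are hypothetical, and scoring two empty descriptions as 1.0 is an assumption made here to avoid division by zero, not something stated in the talk.

```python
def dice(corpus_props, algorithm_props):
    """Dice coefficient between two sets of properties."""
    a, b = set(corpus_props), set(algorithm_props)
    if not a and not b:
        return 1.0  # assumption: two empty descriptions count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def mean_dice(pairs):
    """Average Dice over (human description, algorithm description) pairs."""
    scores = [dice(human, generated) for human, generated in pairs]
    return sum(scores) / len(scores)


# The two cases worked through on the slides.
print(dice({"A", "B", "C"}, {"B", "C"}))            # (2*2)/5
print(dice({"A", "B", "C"}, {"A", "B", "C", "D"}))  # (2*3)/7
```

Averaging `dice` over all human descriptions of the same referents gives the kind of per-algorithm score that the evaluation compares.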
Assumptions behind DICE
• The discriminatory power of a description does not matter
• All properties are equidistant
See Gatt & Van Deemter 2007, “Content Determination in GRE: evaluating the evaluator”
Evaluation (I): Furniture
• Which preference orders for the IA?
  – Psycholinguistic evidence:
    • COLOUR >> {ORIENTATION, SIZE} (Pechmann 89; Eikmeyer & Ahlsen 96; Belke & Meyer 02)
    • Y-DIMENSION >> X-DIMENSION (Bryant et al, 1992; Arts 2004)
• Split data: +LOCATION vs -LOCATION
  – This talk: focus on -LOCATION
  – -LOCATION = approx. 800 descriptions
• Compare algorithms to a randomized IA (RAND)
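To make the role of the preference order concrete, here is a minimal sketch of the Incremental Algorithm in its usual textbook form (Dale & Reiter 1995): attributes are visited in a fixed order and an attribute is kept whenever it rules out at least one remaining distractor. The three-object `FURNITURE` domain and the function names are hypothetical, TYPE is left out for brevity, and `randomized_incremental` is only one plausible reading of the RAND baseline (a randomly shuffled order per call).

```python
import random

# Hypothetical mini-domain, not taken from the TUNA corpus.
FURNITURE = {
    "f1": {"colour": "green", "orientation": "back", "size": "large"},
    "f2": {"colour": "green", "orientation": "front", "size": "large"},
    "f3": {"colour": "red", "orientation": "back", "size": "small"},
}


def incremental(target, domain, preference_order):
    """Walk through the attributes in preference order, keeping the useful ones."""
    description = {}
    distractors = {entity for entity in domain if entity != target}
    for attribute in preference_order:
        value = domain[target][attribute]
        ruled_out = {e for e in distractors if domain[e][attribute] != value}
        if ruled_out:  # keep the attribute only if it removes some distractor
            description[attribute] = value
            distractors -= ruled_out
        if not distractors:
            break
    return description if not distractors else None


def randomized_incremental(target, domain, attributes, rng):
    """Assumed RAND baseline: run the IA with a randomly shuffled order."""
    order = list(attributes)
    rng.shuffle(order)
    return incremental(target, domain, order)


print(incremental("f1", FURNITURE, ["colour", "orientation", "size"]))
print(incremental("f1", FURNITURE, ["size", "colour", "orientation"]))
print(randomized_incremental("f1", FURNITURE, ["colour", "orientation", "size"], random.Random(0)))
```

The first two calls describe the same object with different attribute sets, which is why the choice of preference order matters when the output is compared with human descriptions.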
Furniture: -LOCATION
[Results chart not preserved; surviving labels: “Significant”, “Significant”, “FB/GR”]
Beyond Toy Domains
• More on Furniture corpus: Gatt et al. (ENLG-2007)
• With complex real-world objects:
  – Many different attributes can be used
  – Number of POs explodes
  – Few psycholinguistic precedents
• People domain attributes:
  – { hasBeard, hasGlasses, age, hasTie, hasSuit, hasShirt, hasHair, hairColour, orientation }
  – 9 attributes, so 9! = 362880 possible POs
IA: Preference Orders for People Domain
• Little psycholinguistic evidence for choosing between all 362880 possible PO’s
• Focus on the most frequent attributes: G=hasGlasses, B=hasBeard, H=hasHair, C=hairColour
  – Assumption: H and B must precede C
  – This leaves us with eight POs: { GBHC, GHBC, HBGC, HBCG, HGBC, BHGC, BHCG, BGHC }
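The counts on this slide can be checked mechanically: nine attributes give 9! orderings, and of the 24 orderings of G, B, H and C only those with both H and B before C survive. The function name `constrained_orders` is an illustrative assumption.

```python
from itertools import permutations
from math import factorial


def constrained_orders(attributes="GBHC"):
    """Orderings of the four attributes in which H and B both precede C."""
    orders = []
    for perm in permutations(attributes):
        order = "".join(perm)
        if order.index("H") < order.index("C") and order.index("B") < order.index("C"):
            orders.append(order)
    return orders


print(factorial(9))                  # number of POs over nine attributes
print(sorted(constrained_orders()))  # the eight POs that satisfy the constraint
```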
Preference Orders and frequency
Attribute     Mean   Sum
type          1.39   475
hasGlasses    0.68   231
hasBeard      0.66   226
hairColour    0.61   210
hasHair       0.46   158
orientation   0.21    73
age           0.10    34
hasTie        0.04    12
hasSuit       0.01     4
hasShirt      0.01     3
• For attributes other than {G,C,H,B}, we let corpus frequency determine the order
• E.g., IA-GBHC uses type, G, B, H, C, age, hasTie, hasSuit, hasShirt as its PO
Results People Domain
[Results chart not preserved; surviving labels: “IA-BASE”, “Significant”, “Significant by subjects”, “GR”]
Results People domain
• IA-BASE performs very badly now
• So far we have looked at the best IAs, whose POs start with {B,H,G,C} (in some order) and end with <age, hasTie, hasSuit, hasShirt>
• Some of these did much worse:
  – IA-BHCG had DICE=0.6, making it significantly worse (by subjects) than GR!
Summary
• People domain gives much lower DICE scores than Furniture domain
• Difference between “good” and “bad” POs was
  – small (but significant) in the Furniture domain,
  – big (and significant) in the People domain
Summary
• The “Incremental Algorithm” (IA):
  – not an algorithm but a class of algorithms
• The best IA beats all other algorithms, but the worst is very bad ...
• GR performs remarkably well.
• How to choose a suitable PO?
  – Furniture: few attributes; psycholinguistic precedent
    • Still, there is variation.
  – People: more attributes; no precedents
    • Variation even greater!
Discussion
• Suppose you want to build a GRE algorithm for a new and complex domain, for which no transparent corpus is available.
• Psycholinguistic principles are unlikely to help you much
• If the corpus is also not balanced, then frequency may not say much either …
Other uses of this method: STEC
• Summer 2007: First NLG Shared Task Evaluation Challenge (STEC)
• STEC involved GRE only, focussing on Content Determination
• 22 GRE algorithms were submitted and evaluated (6 teams)
• Reported at the UCNLG+MT workshop, Copenhagen, Sept 2007
Other uses of this corpus: STEC
• An even bigger STEC one year later
• Each algorithm was compared with the TUNA corpus (minus 40% training set)
  – Both Furniture and People domain
  – DICE measured “humanlikeness”
  – Singulars only
• Each algorithm was also tested in terms of identification time (by human reader)
Some STEC results
1. The more minimal the descriptions generated by these 22 systems were, the worse their DICE scores were
2. No relation between humanlikeness and identification time
– Best system in terms of DICE was worst-but-one in terms of identification time
• More research needed on the different criteria for judging NLG output
Thank you
Annotator agreement
• Semantic markup was applied manually to all descriptions in the corpus.
• 2 annotators were given a stratified random sample
• Comparison used Dice.
               mean          mode
Furniture      0.89 (A/B)    1 (71.1%)
  Annotator A  0.93 (A/us)   1 (74.4%)
  Annotator B  0.92 (B/us)   1 (73%)
People         0.89 (A/B)    1 (70%)
  Annotator A  0.84 (A/us)   1 (41.1%)
  Annotator B  0.78 (B/us)   1 (36.3%)