Lessons Learned from Evaluation of Summarization Systems: Nightmares and Pleasant Surprises
TRANSCRIPT
1
Lessons Learned from Evaluation of Summarization Systems: Nightmares and Pleasant Surprises
Kathleen McKeown
Department of Computer Science
Columbia University
Major contributors: Ani Nenkova, Becky Passonneau
2
3
Questions
What kinds of evaluation are possible?
What are the pitfalls? Are evaluation metrics fair? Is real research progress possible?
What are the benefits?
Should we evaluate our systems?
4
What is the feel of the evaluation?
Is it competitive?
Does it foster a feeling of community?
Are the guidelines clearly established ahead of time?
Are the metrics fair? Do they measure what you want to measure?
5
6
The night Max wore his wolf suit and made mischief of one kind
7
and another and another
8
His mother called him “WILD THING” and he said “I’LL EAT YOU UP!” so he was sent to bed without eating anything.
9
DARPA GALE: Global Autonomous Language Environment
Three large teams: BBN, IBM, SRI
SRI team: UC Berkeley, U Washington, UCSD, Columbia, NYU, UMASS, NCRI, Systran, Fair Isaac, Ohio State
Generate responses to open-ended questions
17 templates: definitions, biographies, events, relationships, reactions, etc.
Using English, Chinese, and Arabic text and speech, blogs to news
Find all instances when a fact is mentioned (redundancy)
10
GALE Evaluation
Can systems do at least 50% as well as a human? If not, the GALE program will not continue. The team that does worst may be cut.
Independent evaluator: BAE. Has never done text evaluation before; has experience with task-based evaluation.
Gold standard: system responses graded by two judges; relevant facts added to the pool.
Granularity of scoring: nuggets.
Metrics: variants of weighted precision/recall, document citations, redundancy.
11
Year 1: Sample Q&A
LIST FACTS ABOUT [The Trial of Saddam Hussein]
The judge , however, that all people should have heard voices, the order of a court to solve technical problems. (Chi)
His account of events surrounding the torture and execution of more than 140 men and teenage boys from Dujail appeared to do little to advance the prosecution's goal of establishing Saddam's "command responsibility" for the deaths.
A trial without Saddam could be an embarrassment for the U.S. government, which has worked hard to help create a tribunal that would be perceived by Iraqis as independent and fair.
As the trial got under way, a former secret police officer testified that he had not received any orders from Saddam during the investigations that followed an assassination attempt against him in Dujail in 1982.
12
Year 1: Results
F-value (Beta of 1)
Machine average: 0.230 Human average: 0.353
Machine to Human average: 0.678
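The F-value with beta of 1 is the balanced harmonic mean of precision and recall. A minimal sketch of the formula, with illustrative inputs (not actual GALE numbers):

```python
def f_value(precision, recall, beta=1.0):
    """F-measure combining precision and recall.
    beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With beta = 1 this reduces to the familiar F1 = 2PR / (P + R):
print(f_value(0.5, 0.25))  # 0.3333...
```

Note that an average of per-question ratios (as reported above) need not equal the ratio of the averages, which is why 0.678 is not simply 0.230 / 0.353.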
13
DUC – Document Understanding Conference
Established and funded by DARPA TIDES; run by independent evaluator NIST
Open to summarization community; annual evaluations on common datasets, 2001-present
Tasks: single document summarization, headline summarization, multi-document summarization, multi-lingual summarization, focused summarization, update summarization
14
DUC is changing direction again
DARPA GALE effort cutting back participation in DUC
Considering co-locating with TREC QA
Considering new data sources and tasks
15
DUC Evaluation
Gold standard: human summaries written by NIST, from 2 to 9 summaries per input set
Multiple metrics
Manual: coverage (early years), pyramids (later years), responsiveness (later years), quality questions
Automatic: Rouge (-1, -2, skip-bigrams, LCS, BE)
Granularity: manual at sub-sentential elements, automatic at sentences
16
TREC definition pilot
Long answer to request for a definition
As a pilot, less emphasis on results
Part of TREC QA
17
Evaluation Methods
Pool system responses and break into nuggets
A judge scores nuggets as vital, OK or invalid
Measure information precision and recall
Can a judge reliably determine which facts belong in a definition?
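A hedged sketch of nugget-style scoring. The nugget ids are hypothetical, and the actual TREC pilot approximated precision from answer length rather than by counting matched nuggets, so this count-based precision is a simplification:

```python
def nugget_scores(returned, vital, okay):
    """Nugget-style precision/recall, loosely following the TREC
    definition pilot: recall counts only vital nuggets; precision
    counts any matched nugget (vital or OK) over all returned ones."""
    returned = set(returned)
    matched_vital = returned & set(vital)
    matched_any = returned & (set(vital) | set(okay))
    recall = len(matched_vital) / len(vital) if vital else 0.0
    precision = len(matched_any) / len(returned) if returned else 0.0
    return precision, recall

# Hypothetical nugget ids: system returned n1, n2, n5;
# judges marked n1, n3 vital and n2 OK.
p, r = nugget_scores({"n1", "n2", "n5"}, vital={"n1", "n3"}, okay={"n2"})
# n1 and n2 match (precision 2/3); only vital n1 found (recall 1/2)
```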
18
Considerations Across Evaluations
Independent evaluator: not always as knowledgeable as researchers; impartial determination of approach; extensive collection of resources
Determination of task: appealing to a broad cross-section of community; changes over time
DUC 2001-2002: single and multi-document. DUC 2003: headlines, multi-document. DUC 2004: headlines, multilingual and multi-document, focused. DUC 2005: focused summarization. DUC 2006: focused and a new task, up for discussion
How long do participants have to prepare? When is a task dropped?
Scoring of text at the sub-sentential level
19
Task-based Evaluation
Use the summarization system as browser to do another task
Newsblaster: write a report given a broad prompt
DARPA utility evaluation: given a request for information, use question answering to write report
20
Task Evaluation
Hypothesis: multi-document summaries enable users to find information efficiently
Task: fact-gathering given topic and questions Resembles intelligence analyst task
21
User Study: Objectives
Does multi-document summarization help?
Do summaries help the user find information needed to perform a report writing task?
Do users use information from summaries in gathering their facts?
Do summaries increase user satisfaction with the online news system?
Do users create better quality reports with summaries?
How do full multi-document summaries compare with minimal 1-sentence summaries such as Google News?
22
User Study: Design
Compared 4 parallel news browsing systems:
Level 1: source documents only
Level 2: one-sentence multi-document summaries (e.g., Google News) linked to documents
Level 3: Newsblaster multi-document summaries linked to documents
Level 4: human-written multi-document summaries linked to documents
All groups write reports given four scenarios: a task similar to analysts'; can only use Newsblaster for research; time-restricted
23
User Study: Execution
4 scenarios: 4 event clusters each (2 directly relevant, 2 peripherally relevant), average 10 documents/cluster
45 participants, balanced between liberal arts and engineering; 138 reports
Exit survey: multiple-choice and open-ended questions
Usage tracking: each click logged, on or off-site
24
“Geneva” Prompt
The conflict between Israel and the Palestinians has been difficult for government negotiators to settle. Most recently, implementation of the “road map for peace”, a diplomatic effort sponsored by ……
Who participated in the negotiations that produced the Geneva Accord?
Apart from direct participants, who supported the Geneva Accord preparations and how?
What has the response been to the Geneva Accord by the Palestinians?
25
Measuring Effectiveness
Score report content and compare across summary conditions
Compare user satisfaction per summary condition
Compare where subjects took report content from
26
Newsblaster
27
User Satisfaction
Users rated the system more effective than a web search with Newsblaster summaries; not true with documents only or single-sentence summaries
Easier to complete the task with summaries than with documents only
Users reported having enough time with summaries, but not with documents only
Which summaries helped most: single-sentence summaries 5%, Newsblaster summaries 24%, human summaries 43%
28
User Study: Conclusions
Summaries measurably improve a news browser’s effectiveness for research
Users are more satisfied with Newsblaster summaries than with single-sentence summaries like those of Google News
Users want search (not included in the evaluation)
29
Potential Problems
30
That very night in Max’s room a forest grew
31
And grew
32
And grew until the ceiling hung with vines and the walls became the world all around
33
And an ocean tumbled by with a private boat for Max and he sailed all through the night and day
34
And he sailed in and out of weeks and almost over a year to where the wild things are
35
And when he came to where the wild things are they roared their terrible roars and gnashed their terrible teeth
36
Comparing Text Against Text
Which human summary makes a good gold standard? Many summaries are good
At what granularity is the comparison made?
When can we say that two pieces of text match?
37
Measuring variation
Types of variation between humans, by application:
Translation: same content, different wording
Summarization: different content??, different wording
Generation: different content??, different wording
38
Human variation: content words (Ani Nenkova)
• Summaries differ in vocabulary; differences cannot be explained by paraphrase
• Compared 7 translations of 20 documents with 7 summaries of 20 document sets
• Faster vocabulary growth in summarization
39
Variation impacts evaluation
Comparing content is hard: all kinds of judgment calls
Paraphrases, VP vs. NP: "Ministers have been exchanged" vs. "Reciprocal ministerial visits"
Length and constituent type: "Robotics assists doctors in the medical operating theater" vs. "Surgeons started using robotic assistants"
40
Nightmare: only one gold standard
System may have chosen an equally good sentence, but not the one in the gold standard:
Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile.
Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government.
In DUC 2001 (one gold standard), human model had significant impact on scores (McKeown et al)
Five human summaries needed to avoid changes in rank (Nenkova and Passonneau): DUC 2003 data, 3 topic sets (1 highest scoring and 2 lowest scoring), 10 model summaries
41
How many summaries are enough?
42
Scoring
Two main approaches used in DUC
ROUGE (Lin and Hovy)
Pyramids (Nenkova and Passonneau)
Problems: Are the results stable? How difficult is it to do the scoring?
43
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
Rouge – n-gram co-occurrence metrics measuring content overlap:

    Rouge-n = (count of n-gram overlaps between candidate and model summaries)
              / (total n-grams in the model summaries)
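A toy sketch of that ratio, assuming simple whitespace tokenization; the official ROUGE toolkit adds options such as stemming, stopword removal, and jackknifing that this omits:

```python
from collections import Counter

def rouge_n(candidate, models, n=1):
    """Toy ROUGE-n: clipped n-gram overlap between a candidate summary
    and a set of model (reference) summaries, divided by the total
    number of n-grams in the models -- a recall-oriented score."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.lower().split())
    overlap = total = 0
    for model in models:
        ref = ngrams(model.lower().split())
        overlap += sum(min(cnt, cand[g]) for g, cnt in ref.items())
        total += sum(ref.values())
    return overlap / total if total else 0.0

# Every model unigram appears in the candidate, so ROUGE-1 recall is 1.0:
print(rouge_n("pinochet was arrested in london",
              ["pinochet arrested in london"], n=1))  # 1.0
```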
44
ROUGE
Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements
Automatic and thus easy to apply
Important to consider confidence intervals when determining differences between systems: scores falling within the same interval are not significantly different; Rouge scores place systems into large groups, so it can be hard to definitively say one is better than another
Sometimes results are unintuitive: multilingual scores as high as English scores; use in speech summarization shows no discrimination
Good for training regardless of intervals: can see trends
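One common way to obtain such confidence intervals is to bootstrap over topics. This is a sketch under that assumption (the function name and inputs are illustrative, not the official ROUGE procedure):

```python
import random

def bootstrap_ci(per_topic_scores, trials=10000, alpha=0.05, seed=0):
    """Bootstrap a confidence interval for a system's mean score by
    resampling topics with replacement. Two systems whose intervals
    overlap should not be called significantly different."""
    rng = random.Random(seed)
    n = len(per_topic_scores)
    means = sorted(
        sum(rng.choices(per_topic_scores, k=n)) / n for _ in range(trials)
    )
    lo = means[int((alpha / 2) * trials)]
    hi = means[int((1 - alpha / 2) * trials) - 1]
    return lo, hi

# Hypothetical per-topic ROUGE scores for one system:
scores = [0.10, 0.12, 0.14, 0.11, 0.13, 0.09, 0.15, 0.10]
lo, hi = bootstrap_ci(scores, trials=2000)
```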
45
Pyramids
Uses multiple human summaries
Information is ranked by its importance
Allows for multiple good summaries
A pyramid is created from the human summaries
Elements of the pyramid are content units
System summaries are scored by comparison with the pyramid
46
Content units: better study of variation than sentences
Semantic units
Link different surface realizations with the same meaning
Emerge from the comparison of several texts
47
Content unit example
S1 Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile.
S2 Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government.
S3 Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London.
48
SCU: A cable car caught fire (Weight = 4)
A. The cause of the fire was unknown.
B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000.
C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people.
D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
49
SCU: The cause of the fire is unknown (Weight = 1)
A. The cause of the fire was unknown.
B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000.
C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people.
D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
50
Idealized representation
Tiers of differentially weighted SCUs
Top: few SCUs, high weight
Bottom: many SCUs, low weight
[Pyramid diagram: tiers labeled W=3 (top), W=2, W=1 (bottom)]
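The original pyramid score can be sketched as the total SCU weight a peer summary expresses, divided by the best weight attainable with the same number of SCUs drawn from the top tiers. The SCU labels and weights below are hypothetical:

```python
def pyramid_score(peer_scus, pyramid):
    """Original pyramid score (after Nenkova & Passonneau): observed
    SCU weight over the optimal weight for the same number of SCUs."""
    observed = sum(pyramid[scu] for scu in peer_scus if scu in pyramid)
    best = sum(sorted(pyramid.values(), reverse=True)[:len(peer_scus)])
    return observed / best if best else 0.0

# Hypothetical pyramid built from four human summaries:
pyramid = {"fire": 4, "toll": 3, "date": 2, "cause": 1}
print(pyramid_score({"fire", "cause"}, pyramid))  # (4+1)/(4+3) = 5/7
```

A peer that picks the two heaviest SCUs ("fire" and "toll") would score 1.0, which is how the metric allows many different good summaries.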
51
Comparison of Scoring Methods in DUC05
Analysis of scores for the 20 pyramid sets: Columbia prepared pyramids; participants scored systems against pyramids
Comparisons between Pyramid (original, modified), responsiveness, and Rouge-SU4
Pyramid score computed from multiple humans; responsiveness is just one human's judgment; Rouge-SU4 equivalent to Rouge-2
52
Creation of pyramids
Done at Columbia for each of 20 out of 50 sets
Primary annotator, secondary checker
Held round-table discussions of problematic constructions that occurred in this data set:
Comma-separated lists: "Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medical plants without deforestation."
General vs. specific: "Eastern Europe" vs. "Hungary, Poland, Lithuania, and Turkey"
53
Characteristics of the Responses
Proportion of SCUs of Weight 1 is large: 44% (D324) to 81% (D695)
Mean SCU weight: 1.9
Agreement among human responders is quite low
54
[Figure: SCU weights, showing the number of SCUs at each weight]
55
Preview of Results
Manual metrics: large differences between humans and machines
No single system the clear winner, but a top group identified by all metrics
Significant differences: different predictions from manual and automatic metrics
Correlations between metrics: some correlation, but one cannot be substituted for another. This is good
56
Human performance / Best system

Pyramid      Modified     Resp       ROUGE-SU4
B: 0.5472    B: 0.4814    A: 4.895   A: 0.1722
A: 0.4969    A: 0.4617    B: 4.526   B: 0.1552
----------------------------------------------
14: 0.2587   10: 0.2052   4: 2.85    15: 0.139

Best system ~50% of human performance on manual metrics
Best system ~80% of human performance on ROUGE
57
Pyramid original  Modified     Resp       Rouge-SU4
14: 0.2587        10: 0.2052   4: 2.85    15: 0.139
17: 0.2492        17: 0.1972   14: 2.8    4: 0.134
15: 0.2423        14: 0.1908   10: 2.65   17: 0.1346
10: 0.2379        7: 0.1852    15: 2.6    19: 0.1275
4: 0.2321         15: 0.1808   17: 2.55   11: 0.1259
7: 0.2297         4: 0.177     11: 2.5    10: 0.1278
16: 0.2265        16: 0.1722   28: 2.45   6: 0.1239
6: 0.2197         11: 0.1703   21: 2.45   7: 0.1213
32: 0.2145        6: 0.1671    6: 2.4     14: 0.1264
21: 0.2127        12: 0.1664   24: 2.4    25: 0.1188
12: 0.2126        19: 0.1636   19: 2.4    21: 0.1183
11: 0.2116        21: 0.1613   6: 2.4     16: 0.1218
26: 0.2106        32: 0.1601   27: 2.35   24: 0.118
19: 0.2072        26: 0.1464   12: 2.35   12: 0.116
28: 0.2048        3: 0.145     7: 2.3     3: 0.1198
13: 0.1983        28: 0.1427   25: 2.2    28: 0.1203
3: 0.1949         13: 0.1424   32: 2.15   27: 0.110
1: 0.1747         25: 0.1406   3: 2.1     13: 0.1097
61
Significant Differences
Manual metrics: few differences between systems
Pyramid: 23 is worse. Responsiveness: 23 and 31 are worse
Both humans better than all systems
Automatic (Rouge-SU4): more differences between systems; one human indistinguishable from 5 systems
62
Correlations: Pearson’s, 25 systems
          Pyr-mod  Resp-1  Resp-2  R-2   R-SU4
Pyr-orig  0.96     0.77    0.86    0.84  0.80
Pyr-mod            0.81    0.90    0.90  0.86
Resp-1                     0.83    0.92  0.92
Resp-2                             0.88  0.87
R-2                                      0.98
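The table's entries are pairwise Pearson correlations over the participating systems' scores. For reference, a minimal sketch of the computation (the example score lists are illustrative):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two lists of system scores,
    e.g. pyramid scores vs. ROUGE scores over the same systems."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Perfectly linearly related scores correlate at 1.0:
print(pearson_r([0.20, 0.22, 0.25], [0.10, 0.11, 0.125]))  # 1.0
```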
63
Correlations: Pearson’s, 25 systems
Questionable that responsiveness could be a gold standard
64
Pyramid and responsiveness
High correlation, but the metrics are not mutually substitutable
65
Pyramid and Rouge
High correlation, but the metrics are not mutually substitutable
66
Correlations
Original and modified can substitute for each other
High correlation between manual and automatic, but automatic not yet a substitute
Similar patterns between pyramid and responsiveness
67
Nightmare
A scoring metric that is not stable is used to decide funding
Insignificant differences between systems determine funding
68
Is Task Evaluation Nightmare Free?
Impact of user interface issues: can have more impact than the summary
Controlling for the proper mix of subjects
Quantity of subjects and time to carry out is large
69
Till Max said “Be still!” and tamed them with the magic trick
70
Of staring into their yellow eyes without blinking once. And they were frightened and called him the most wild thing of all
71
And made him king of all wild things
72
“And now,” cried Max, “Let the wild rumpus start!”
73
74
75
76
Are we having fun yet? Benefits of evaluation
Emergence of evaluation methods: ROUGE, Pyramids, Nuggetteer
Research into characteristics of metrics
Analyses of sub-sentential units
Paraphrase as a research issue
77
Available Data
DUC data sets: 4 years of summary/document set pairs; multi-document summarization training data not available beforehand
4 years of scoring patterns: led to analysis of human summaries
Pyramids: pyramids and peers for 40 topics (DUC04, DUC05); many more from Nenkova and Passonneau; training data for paraphrase; training data for abstraction -> systems moving away from pure sentence extraction
78
Wrapping up
79
Lessons Learned
Evaluation environment is important: find a task with broad appeal; use an independent evaluator, or at least a committee
Use multiple gold standards; compare text at the content unit level; evaluate the metrics
Look at significant differences
80
Is Evaluation Worth It?
DUC: creation of a community; from ~15 participants in year 1 -> 30 participants in year 5; no longer impacts funding
Enables research into evaluation: at the start, no idea how to evaluate summaries
But results do not tell us everything
81
And he sailed back over a year, in and out of weeks and through a day
82
And into the night of his very own room where he found his supper waiting for him… and it was still warm.