
Lessons Learned from Evaluation of Summarization Systems: Nightmares and Pleasant Surprises

Kathleen McKeown

Department of Computer Science

Columbia University

Major contributors: Ani Nenkova, Becky Passonneau


Questions

What kinds of evaluation are possible?

What are the pitfalls? Are evaluation metrics fair? Is real research progress possible?

What are the benefits?

Should we evaluate our systems?


What is the feel of the evaluation?

Is it competitive?

Does it foster a feeling of community?

Are the guidelines clearly established ahead of time?

Are the metrics fair? Do they measure what you want to measure?


The night Max wore his wolf suit and made mischief of one kind


and another and another


His mother called him “WILD THING” and he said “I’LL EAT YOU UP!” so he was sent to bed without eating anything.


DARPA GALE: Global Autonomous Language Exploitation

Three large teams: BBN, IBM, SRI

SRI: UC Berkeley, U Washington, UCSD, Columbia, NYU, UMASS, NCRI, Systran, Fair Isaacs, Ohio State

Generate responses to open-ended questions.
17 templates: definitions, biographies, events, relationships, reactions, etc.

Using English, Chinese, and Arabic text and speech, blogs to news

Find all instances when a fact is mentioned (redundancy)


GALE Evaluation

Can systems do at least 50% as well as a human? If not, the GALE program will not continue. The team that does worst may be cut.

Independent evaluator: BAE. Has never done text evaluation before; has experience with task-based evaluation.

Gold standard: system responses graded by two judges; relevant facts added to the pool.

Granularity of scoring: nuggets.

Metrics: variants of weighted precision/recall, document citations, redundancy.


Year 1: Sample Q&A

LIST FACTS ABOUT [The Trial of Saddam Hussein]

The judge , however, that all people should have heard voices, the order of a court to solve technical problems. (Chi)

His account of events surrounding the torture and execution of more than 140 men and teenage boys from the Dujail , appeared to do little to advance the prosecution's goal of establishing Saddam 's "command responsibility" for the deaths.

A trial without Saddam could be an embarrassment for the U.S. government, which has worked hard to help create a tribunal that would be perceived by Iraqis as independent and fair.

As the trial got under way, a former secret police officer testified that he had not received any orders from Saddam during the investigations that followed an assassination attempt against him in Dujail in 1982 .


Year 1: Results

F-value (Beta of 1)

Machine average: 0.230 Human average: 0.353

Machine to Human average: 0.678
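To make the metric concrete: with beta = 1, the F-value is the harmonic mean of nugget precision and recall. A minimal sketch in Python, using made-up nugget counts rather than the official GALE figures:

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of nugget precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical response: 12 of the 20 returned nuggets judged relevant,
# covering 6 of the 20 nuggets in the pooled gold standard.
precision = 12 / 20
recall = 6 / 20
print(round(f_beta(precision, recall), 3))  # 0.4 with beta = 1
```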


DUC – Document Understanding Conference

Established and funded by DARPA TIDES; run by independent evaluator NIST.

Open to the summarization community; annual evaluations on common datasets, 2001-present.

Tasks: single-document summarization, headline summarization, multi-document summarization, multilingual summarization, focused summarization, update summarization.


DUC is changing direction again

DARPA GALE effort cutting back participation in DUC

Considering co-locating with TREC QA

Considering new data sources and tasks


DUC Evaluation

Gold standard: human summaries written by NIST; from 2 to 9 summaries per input set.

Multiple metrics:
Manual: coverage (early years), pyramids (later years), responsiveness (later years), quality questions
Automatic: ROUGE (-1, -2, skip-bigrams, LCS, BE)

Granularity: manual at sub-sentential elements; automatic at sentences.


TREC definition pilot

Long answer to request for a definition

As a pilot, less emphasis on results

Part of TREC QA


Evaluation Methods

Pool system responses and break into nuggets

A judge scores nuggets as vital, OK or invalid

Measure information precision and recall

Can a judge reliably determine which facts belong in a definition?
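A rough sketch of nugget-based scoring in this style (the exact constants and weighting used in the TREC pilot differ; the values below are purely illustrative): recall is computed over vital nuggets, while precision is approximated from a length allowance per matched nugget.

```python
def nugget_f(vital_returned, vital_total, nuggets_matched, answer_chars,
             allowance_per_nugget=100, beta=1.0):
    """Illustrative nugget F-score: recall over vital nuggets, precision
    approximated by a character allowance per matched nugget."""
    recall = vital_returned / vital_total if vital_total else 0.0
    allowance = allowance_per_nugget * nuggets_matched
    if answer_chars <= allowance:
        precision = 1.0
    else:
        precision = max(0.0, 1 - (answer_chars - allowance) / answer_chars)
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical answer: 4 of 6 vital nuggets returned, 7 nuggets matched,
# 900 characters of response text.
print(round(nugget_f(4, 6, 7, 900), 3))
```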


Considerations Across Evaluations

Independent evaluator: not always as knowledgeable as researchers; impartial determination of approach; extensive collection of resources.

Determination of task: appealing to a broad cross-section of the community; changes over time.

DUC 2001-2002: single- and multi-document
DUC 2003: headlines, multi-document
DUC 2004: headlines, multilingual and multi-document, focused
DUC 2005: focused summarization
DUC 2006: focused and a new task, up for discussion

How long do participants have to prepare? When is a task dropped?

Scoring of text at the sub-sentential level


Task-based Evaluation

Use the summarization system as a browser to do another task

Newsblaster: write a report given a broad prompt

DARPA utility evaluation: given a request for information, use question answering to write report


Task Evaluation

Hypothesis: multi-document summaries enable users to find information efficiently

Task: fact-gathering given a topic and questions; resembles an intelligence analyst's task.


User Study: Objectives

Does multi-document summarization help?

Do summaries help the user find information needed to perform a report writing task?

Do users use information from summaries in gathering their facts?

Do summaries increase user satisfaction with the online news system?

Do users create better quality reports with summaries?

How do full multi-document summaries compare with minimal one-sentence summaries such as Google News?


User Study: Design

Compared 4 parallel news browsing systems:
Level 1: source documents only
Level 2: one-sentence multi-document summaries (e.g., Google News) linked to documents
Level 3: Newsblaster multi-document summaries linked to documents
Level 4: human-written multi-document summaries linked to documents

All groups write reports given four scenarios: a task similar to analysts'; can only use Newsblaster for research; time-restricted.


User Study: Execution

4 scenarios; 4 event clusters each (2 directly relevant, 2 peripherally relevant); average 10 documents/cluster.

45 participants, balanced between liberal arts and engineering; 138 reports.

Exit survey with multiple-choice and open-ended questions.

Usage tracking: each click logged, on- or off-site.


“Geneva” Prompt

The conflict between Israel and the Palestinians has been difficult for government negotiators to settle. Most recently, implementation of the “road map for peace”, a diplomatic effort sponsored by ……

Who participated in the negotiations that produced the Geneva Accord?

Apart from direct participants, who supported the Geneva Accord preparations and how?

What has the response been to the Geneva Accord by the Palestinians?


Measuring Effectiveness

Score report content and compare across summary conditions

Compare user satisfaction per summary condition

Compare where subjects took report content from


Newsblaster


User Satisfaction

More effective than a web search with Newsblaster

Not true with documents only or single-sentence summaries

Easier to complete the task with summaries than with documents only

More likely to report having enough time with summaries than with documents only

Which summaries helped most: 5% single-sentence summaries, 24% Newsblaster summaries, 43% human summaries


User Study: Conclusions

Summaries measurably improve a news browser’s effectiveness for research

Users are more satisfied with Newsblaster summaries than with single-sentence summaries like those of Google News

Users want search (not included in the evaluation)


Potential Problems


That very night in Max’s room a forest grew


And grew


And grew until the ceiling hung with vines and the walls became the world all around


And an ocean tumbled by with a private boat for Max and he sailed all through the night and day


And he sailed in and out of weeks and almost over a year to where the wild things are


And when he came to where the wild things are they roared their terrible roars and gnashed their terrible teeth


Comparing Text Against Text

Which human summary makes a good gold standard? Many summaries are good

At what granularity is the comparison made?

When can we say that two pieces of text match?


Measuring variation

Types of variation between humans, by application:
Translation: same content, different wording
Summarization: different content??, different wording
Generation: different content??, different wording


Human variation: content words (Ani Nenkova)

• Summaries differ in vocabulary; differences cannot be explained by paraphrase
• 7 translations of 20 documents; 7 summaries of 20 document sets
• Faster vocabulary growth in summarization


Variation impacts evaluation

Comparing content is hard; all kinds of judgment calls.

Paraphrases, VP vs. NP:
“Ministers have been exchanged” vs. “Reciprocal ministerial visits”

Length and constituent type:
“Robotics assists doctors in the medical operating theater” vs. “Surgeons started using robotic assistants”


Nightmare: only one gold standard

The system may have chosen an equally good sentence that is not in the one gold standard:
“Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile.”
“Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government.”

In DUC 2001 (one gold standard), the human model had a significant impact on scores (McKeown et al.)

Five human summaries needed to avoid changes in rank (Nenkova and Passonneau)

DUC 2003 data: 3 topic sets (1 highest scoring and 2 lowest scoring), 10 model summaries


How many summaries are enough?


Scoring

Two main approaches used in DUC

ROUGE (Lin and Hovy)

Pyramids (Nenkova and Passonneau)

Problems: Are the results stable? How difficult is it to do the scoring?


ROUGE: Recall-Oriented Understudy for Gisting Evaluation

ROUGE: n-gram co-occurrence metrics measuring content overlap.

ROUGE-N = (count of n-gram overlaps between candidate and model summaries) / (total n-grams in the model summaries)
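Not the official ROUGE toolkit, just a minimal sketch of the underlying ROUGE-N recall computation (clipped n-gram overlap over total model n-grams); the texts are invented:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, models, n=1):
    """ROUGE-N recall: clipped n-gram overlap / total n-grams in the models."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    overlap = total = 0
    for model in models:
        model_counts = Counter(ngrams(model.split(), n))
        total += sum(model_counts.values())
        overlap += sum(min(c, cand_counts[g]) for g, c in model_counts.items())
    return overlap / total if total else 0.0

models = ["pinochet was arrested in london", "the former dictator was arrested"]
print(rouge_n("pinochet arrested in london", models, n=1))  # 0.5
```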


ROUGE: experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements

Automatic and thus easy to apply

Important to consider confidence intervals when determining differences between systems: scores falling within the same interval are not significantly different. ROUGE scores place systems into large groups: it can be hard to definitively say one is better than another.

Sometimes results are unintuitive: multilingual scores as high as English scores; use in speech summarization shows no discrimination.

Good for training regardless of intervals: can see trends
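One way to obtain such confidence intervals (an illustrative sketch, not necessarily the resampling procedure used in the official evaluations) is to bootstrap a system's mean score over topics:

```python
import random

def bootstrap_ci(per_topic_scores, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean per-topic score."""
    rng = random.Random(seed)
    k = len(per_topic_scores)
    means = sorted(
        sum(rng.choice(per_topic_scores) for _ in range(k)) / k
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-topic ROUGE-2 scores for one system
scores = [0.11, 0.09, 0.14, 0.12, 0.10, 0.13, 0.08, 0.15]
print(bootstrap_ci(scores))  # roughly (0.10, 0.13) for these made-up scores
```

Two systems whose intervals overlap cannot be called significantly different on this evidence alone.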


Pyramids

Uses multiple human summaries; information is ranked by its importance; allows for multiple good summaries.

A pyramid is created from the human summaries; elements of the pyramid are content units; system summaries are scored by comparison with the pyramid.


Content units: better study of variation than sentences

Semantic units

Link different surface realizations with the same meaning

Emerge from the comparison of several texts


Content unit example

S1 Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile.

S2 Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government.

S3 Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London.


SCU: A cable car caught fire (Weight = 4)
A. The cause of the fire was unknown.
B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000.
C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people.
D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.


SCU: The cause of the fire is unknown (Weight = 1)
A. The cause of the fire was unknown.
B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000.
C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people.
D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.


Idealized representation

Tiers of differentially weighted SCUs

Top: few SCUs, high weight

Bottom: many SCUs, low weight

(Diagram: pyramid tiers labeled W=3, W=2, W=1 from top to bottom)
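A minimal sketch of pyramid scoring in this spirit (simplified; the SCU labels and weights below are invented for illustration): each SCU is weighted by the number of model summaries expressing it, and a peer summary that expresses k SCUs is scored against the best total weight achievable with any k SCUs from the pyramid.

```python
def pyramid_score(peer_scus, pyramid_weights):
    """Sum of weights of the SCUs the peer expresses, divided by the
    maximum weight achievable with the same number of SCUs."""
    observed = sum(pyramid_weights[scu] for scu in peer_scus)
    best_k = sorted(pyramid_weights.values(), reverse=True)[:len(peer_scus)]
    return observed / sum(best_k) if best_k else 0.0

# Invented pyramid: SCU -> weight (number of model summaries expressing it)
weights = {"cable_car_fire": 4, "kaprun_tunnel": 3, "about_170_dead": 3,
           "november_2000": 2, "cause_unknown": 1}
peer = ["cable_car_fire", "november_2000", "cause_unknown"]
print(pyramid_score(peer, weights))  # (4 + 2 + 1) / (4 + 3 + 3) = 0.7
```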


Comparison of Scoring Methods in DUC05

Analysis of scores for the 20 pyramid sets. Columbia prepared pyramids; participants scored systems against pyramids.

Comparisons between Pyramid (original, modified), responsiveness, and ROUGE-SU4.

The pyramid score is computed from multiple humans; responsiveness is just one human's judgment; ROUGE-SU4 is equivalent to ROUGE-2.


Creation of pyramids

Done at Columbia for each of 20 out of 50 sets

Primary annotator, secondary checker

Held round-table discussions of problematic constructions that occurred in this data set

Comma-separated lists: “Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medical plants without deforestation.”

General vs. specific: “Eastern Europe” vs. “Hungary, Poland, Lithuania, and Turkey”


Characteristics of the Responses

Proportion of SCUs of weight 1 is large: 44% (D324) to 81% (D695)

Mean SCU weight: 1.9

Agreement among human responders is quite low


(Chart: SCU weights, showing the number of SCUs at each weight)


Preview of Results

Manual metrics: large differences between humans and machines.

No single system is the clear winner, but a top group is identified by all metrics.

Significant differences: different predictions from manual and automatic metrics.

Correlations between metrics: some correlation, but one cannot be substituted for another. This is good.


Human performance vs. best system

Pyramid: B 0.5472, A 0.4969; best system 14: 0.2587
Modified: B 0.4814, A 0.4617; best system 10: 0.2052
Responsiveness: A 4.895, B 4.526; best system 4: 2.85
ROUGE-SU4: A 0.1722, B 0.1552; best system 15: 0.139

Best system ~50% of human performance on manual metrics
Best system ~80% of human performance on ROUGE


System rankings (system ID: score) under each metric, best to worst:

Pyramid (original)   Modified       Responsiveness   ROUGE-SU4
14: 0.2587           10: 0.2052     4: 2.85          15: 0.139
17: 0.2492           17: 0.1972     14: 2.8          4: 0.134
15: 0.2423           14: 0.1908     10: 2.65         17: 0.1346
10: 0.2379           7: 0.1852      15: 2.6          19: 0.1275
4: 0.2321            15: 0.1808     17: 2.55         11: 0.1259
7: 0.2297            4: 0.177       11: 2.5          10: 0.1278
16: 0.2265           16: 0.1722     28: 2.45         6: 0.1239
6: 0.2197            11: 0.1703     21: 2.45         7: 0.1213
32: 0.2145           6: 0.1671      6: 2.4           14: 0.1264
21: 0.2127           12: 0.1664     24: 2.4          25: 0.1188
12: 0.2126           19: 0.1636     19: 2.4          21: 0.1183
11: 0.2116           21: 0.1613     6: 2.4           16: 0.1218
26: 0.2106           32: 0.1601     27: 2.35         24: 0.118
19: 0.2072           26: 0.1464     12: 2.35         12: 0.116
28: 0.2048           3: 0.145       7: 2.3           3: 0.1198
13: 0.1983           28: 0.1427     25: 2.2          28: 0.1203
3: 0.1949            13: 0.1424     32: 2.15         27: 0.110
1: 0.1747            25: 0.1406     3: 2.1           13: 0.1097


Significant Differences

Manual metrics: few differences between systems.
Pyramid: 23 is worse; responsiveness: 23 and 31 are worse.
Both humans better than all systems.

Automatic (ROUGE-SU4): more differences between systems; one human indistinguishable from 5 systems.
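A sketch of one common way to test whether two systems' per-topic scores differ reliably (not necessarily the procedure used in DUC): a paired bootstrap over topics, with invented scores.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10000, seed=0):
    """Fraction of topic resamples in which system A outscores system B."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    wins = sum(
        sum(rng.choice(diffs) for _ in range(k)) > 0
        for _ in range(n_resamples)
    )
    return wins / n_resamples

# Hypothetical per-topic pyramid scores for two systems on the same topics
sys_a = [0.26, 0.22, 0.30, 0.25, 0.21, 0.28]
sys_b = [0.24, 0.23, 0.27, 0.22, 0.20, 0.27]
print(paired_bootstrap(sys_a, sys_b))  # values near 1.0 suggest a reliable difference
```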


Correlations: Pearson’s, 25 systems

           Pyr-mod   Resp-1   Resp-2   R-2     R-SU4
Pyr-orig   0.96      0.77     0.86     0.84    0.80
Pyr-mod              0.81     0.90     0.90    0.86
Resp-1                        0.83     0.92    0.92
Resp-2                                 0.88    0.87
R-2                                            0.98
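Correlations like these are computed over the 25 per-system scores; a minimal sketch with made-up numbers (scipy.stats.pearsonr would give the same result):

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r between two lists of per-system scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up system-level scores under two metrics
pyramid_scores = [0.26, 0.25, 0.24, 0.23, 0.20]
rouge_su4_scores = [0.139, 0.135, 0.134, 0.128, 0.116]
print(round(pearson(pyramid_scores, rouge_su4_scores), 2))
```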


It is questionable whether responsiveness could serve as a gold standard.


Pyramid and responsiveness


High correlation, but the metrics are not mutually substitutable


Pyramid and ROUGE


High correlation, but the metrics are not mutually substitutable


Correlations

Original and modified can substitute for each other

High correlation between manual and automatic, but automatic not yet a substitute

Similar patterns between pyramid and responsiveness


Nightmare

A scoring metric that is not stable is used to decide funding

Insignificant differences between systems determine funding


Is Task Evaluation Nightmare Free?

Impact of user interface issues: these can have more impact than the summary.

Controlling for a proper mix of subjects.

The number of subjects and the time needed to carry out the study are large.


Till Max said “Be still!” and tamed them with the magic trick


Of staring into their yellow eyes without blinking once. And they were frightened and called him the most wild thing of all


And made him king of all wild things


“And now,” cried Max, “Let the wild rumpus start!”


Are we having fun yet? Benefits of evaluation

Emergence of evaluation methods: ROUGE, Pyramids, Nuggetteer

Research into characteristics of metrics

Analyses of sub-sentential units

Paraphrase as a research issue


Available Data

DUC data sets: 4 years of summary/document set pairs.

Multi-document summarization training data not available beforehand.

4 years of scoring patterns; led to analysis of human summaries.

Pyramids: pyramids and peers for 40 topics (DUC04, DUC05); many more from Nenkova and Passonneau. Training data for paraphrase; training data for abstraction -> see systems moving away from pure sentence extraction.


Wrapping up


Lessons Learned

The evaluation environment is important: find a task with broad appeal; use an independent evaluator, or at least a committee.

Use multiple gold standards. Compare text at the content unit level. Evaluate the metrics.

Look at significant differences


Is Evaluation Worth It?

DUC: creation of a community; from ~15 participants in year 1 to 30 participants in year 5. No longer impacts funding.

Enables research into evaluation; at the start, there was no idea how to evaluate summaries.

But, results do not tell us everything


And he sailed back over a year, in and out of weeks and through a day


And into the night of his very own room where he found his supper waiting for him... and it was still warm.