Post on 19-Dec-2015
1
Evaluating Summary Content Selection
Pyramid Method: Work in Progress
Rebecca Passonneau
Ani Nenkova
2
OUTLINE
1. Motivation
2. Problems
3. DUC Evaluations
4. Pyramid Method: Current Status
5. Open Issues
6. Conclusions
3
EVALUATION GOALS
Define parameters of the problem
o What is summarization?
Compare systems
o Is the metric meaningful?
Track progress
o When does output improve?
Cost Effectiveness
o Can it be (partly) automated?
4
PICTURING CONTENT “OVERLAP”
Philippine Airlines (PAL) experienced a crisis in 1998. Unable to make payments on a $2.1 billion debt, it was faced by a pilot's strike in June and the region's currency problems which reduced passenger numbers and inflated costs. On September 23 PAL shut down after the ground crew union turned down a settlement which it accepted two . . .
Starting in May 1998, Philippine Airlines (PAL) laid off 5000 of its 13,000 workers. A 3-week pilots' strike in June and a currency crisis that reduced passenger numbers made payments on PAL's $2 billion debt impossible. President Estrada brokered an agreement to suspend collective bargaining for 10 years in exchange for 20% of PAL stock and union seats on its board. The large ground crew union initially voted no. After PAL shut down operations for 13 days starting Sept. 23rd, leaving much of the country without air service and foreign . . .
5
OBSTACLES
Humans select different content
Humans present same content differently
Lack clear standard of “good” summary
[Contrasts with translation: L1(C)L2(C)]
Need objective method to get at subjective notion of what a summary IS
6
PREVIOUS WORK: Pessimism
Human Judgments
Extraction
o Low agreement (Rath, 1961; Salton et al., 1997)
o Inconsistent over time (Rath, 1961; Lin & Hovy, 2002)
Abstraction
o Depends on individual’s orientation (Gerrig et al., 1991)
Automated Evaluation
Extraction (Pastra & Saggion, 2003 EACL)
o 3 humans; multiple “models”; inconclusive
Abstraction (Lin & Hovy, 2002 ACL)
o Accepts inconsistent judgments as target
o Difficult to extend
7
PREVIOUS WORK: Optimism
Good design methodology leads to better understanding of areas of agreement
High compression rate leads to high agreement (Jing et al., 1998)
Content variation offset by logarithmic growth in pool of distinct content units (Halteren & Teufel, 2003)
Content can be reliably annotated (Beck et al., 1991)
8
HOW TO GET AT “CONTENT” FROM ITS “EXPRESSION”
1. ADAPT BLEU MT EVALUATION
a) Collect multiple “model” summaries
b) Quantify ngram overlap
2. IDENTIFY ABSTRACT CONTENT UNITS
a) DUC
b) Reading Comprehension
3. A THIRD WAY
a) Content unit “level”
b) Multiple expressions of same content unit
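The first option above, quantifying ngram overlap against multiple “model” summaries, can be sketched as follows. This is a minimal illustration of clipped n-gram precision only; full BLEU also combines several n-gram orders and applies a brevity penalty, which are omitted here. The function names, the whitespace tokenization, and the two example sentences are assumptions made for illustration, not part of the slides.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(peer, models, n):
    """Fraction of peer n-grams that also occur in at least one model summary.

    Each peer n-gram is credited at most as often as it appears in the
    single model summary that contains it most often (BLEU-style clipping).
    """
    peer_counts = ngrams(peer.lower().split(), n)
    max_model_counts = Counter()
    for model in models:
        for gram, count in ngrams(model.lower().split(), n).items():
            max_model_counts[gram] = max(max_model_counts[gram], count)
    total = sum(peer_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, max_model_counts[gram])
                  for gram, count in peer_counts.items())
    return matched / total


# Hypothetical peer and model sentences, loosely based on the PAL example.
peer_text = "pal has a $2 billion debt"
model_texts = ["pal cannot repay a $2 billion debt"]

unigram_precision = clipped_precision(peer_text, model_texts, 1)
bigram_precision = clipped_precision(peer_text, model_texts, 2)
print(unigram_precision, bigram_precision)
```

The sketch only measures overlap of surface strings, so it rewards shared wording rather than shared content, which is the limitation the "third way" on this slide is meant to address.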
9
DUC: THE CURRENT APPROACH
Yearly evaluation of systems on new data sets
NIST evaluations performed by humans
Widely cited results
Does it work?
• Compare current systems
• Track individual system progress
• Track community progress from year to year
• Identify specific strengths/weaknesses
• Can it eventually be automated?
10
DUC SCORING METHOD
Datasets: human/machine summaries
Designate “model” human summary
(Automatically) identify content units in “model” summary
Split “peer” summaries into sentences
Human judges evaluate “peer” against model
11
COMPUTE DUC SCORES
1. For each EDU (elementary discourse unit) in the model:
a) Does the peer sentence express any part of it?
b) How much? (0, 20, 40, 60, 80, 100%)
2. Average EDU percent overlap scores
3. Resulting score ranges from 0 to 1
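The three steps above amount to a simple average, sketched below. The function name and the list of judgments are hypothetical; the only details taken from the slide are the six allowed overlap levels and the 0 to 1 range of the result.

```python
# Overlap levels a judge may assign to one model EDU, per the slide.
ALLOWED_LEVELS = {0, 20, 40, 60, 80, 100}


def duc_coverage_score(edu_overlap_percents):
    """Average the per-EDU overlap judgments and rescale to the 0..1 range."""
    if not edu_overlap_percents:
        raise ValueError("need at least one EDU judgment")
    for percent in edu_overlap_percents:
        if percent not in ALLOWED_LEVELS:
            raise ValueError(f"unexpected overlap level: {percent}")
    return sum(edu_overlap_percents) / (100 * len(edu_overlap_percents))


# Hypothetical judgments for a model summary with five EDUs.
judgments = [100, 40, 0, 20, 0]
score = duc_coverage_score(judgments)
print(score)
```

Because every EDU of the single model summary contributes equally to the average, the result depends heavily on which human summary was chosen as the model.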
12
DRAWBACKS TO DUC SCORES
• Very sensitive to choice of “model”
• All “model” units created equal
• Difficult to interpret scores
o Human summary scores as low as 0.1
o Scores vary for same summarizer
o Scores vary for same summary
• Systems cannot be differentiated
13
DUC SCATTERPLOT
[Figure: scatterplot titled “10 DUC Summary Evaluators”; x-axis: Summary ID (1 to 30); y-axis: Score (0 to 0.9); one series per summarizer, Summarizer A through Summarizer J]
14
FOUNDATION OF PYRAMID
A few CUs appear in many summaries
Humans can identify same/different CUs
Weight CUs differentially
15
MULTIPLE GOOD SUMMARIES
This pyramid predicts 6 different good summaries consisting of 4 SCUs:
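The pyramid figure itself is not reproduced in this transcript. As a hedged illustration of how one pyramid can license several equally good summaries, the sketch below assumes a hypothetical pyramid with 2 SCUs in the top tier (weight 4) and 4 SCUs in the next tier (weight 3), plus two lower-weight SCUs. These tier sizes and SCU names are assumptions, chosen because they yield 6 optimal summaries of 4 SCUs.

```python
from itertools import combinations

# Hypothetical pyramid: SCU name -> weight (number of model summaries expressing it).
pyramid = {"a": 4, "b": 4, "c": 3, "d": 3, "e": 3, "f": 3, "g": 2, "h": 1}


def optimal_summaries(pyramid, size):
    """Return every set of `size` SCUs whose total weight is the maximum possible."""
    candidates = list(combinations(pyramid, size))
    best = max(sum(pyramid[scu] for scu in combo) for combo in candidates)
    return [set(combo) for combo in candidates
            if sum(pyramid[scu] for scu in combo) == best]


optimal = optimal_summaries(pyramid, 4)
print(len(optimal))
for summary in optimal:
    print(sorted(summary))
```

Under these assumed tier sizes, every optimal summary contains both top-tier SCUs and any 2 of the 4 SCUs from the tier below, so the summaries differ in content but score the same.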
16
SCU ANNOTATION EXAMPLE
A.2 Unable to make payments on a $2 billion debt
H.2 made payments on PAL’s $2 billion debt impossible
I.1 With a rising $2.1 billion debt
J.3 PAL is buried under a $2.2 billion dollar debt it cannot repay
SCU1 W=4 PAL has a debt of over $2 billion
SCU2 W=3 PAL cannot make its payments
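The weights in this example follow from counting how many of the four summaries (A, H, I, J) express each SCU. The sketch below records the contributors as (summary, sentence) pairs read off the example; the assignment of sentences to SCU2 is inferred from the wording of the fragments, and the data structure and function name are assumptions for illustration.

```python
# SCU label -> list of (summary id, sentence number) contributors.
scu_contributors = {
    "PAL has a debt of over $2 billion": [("A", 2), ("H", 2), ("I", 1), ("J", 3)],
    # I.1 mentions the debt but not the inability to pay, so it is not listed here.
    "PAL cannot make its payments": [("A", 2), ("H", 2), ("J", 3)],
}


def scu_weight(contributors):
    """Weight of an SCU = number of distinct summaries that express it."""
    return len({summary for summary, _sentence in contributors})


weights = {label: scu_weight(members) for label, members in scu_contributors.items()}
for label, weight in weights.items():
    print(f"W={weight}  {label}")
```

Counting distinct summaries rather than contributing fragments keeps the weight bounded by the number of model summaries even when one summary expresses the same SCU in several places.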
17
PAL PYRAMID TIER: W=3 (N=4)
SCU1: PAL has $2.1 billion debt
H2 [PAL’s $2 billion debt]1
I1 [and with a rising $2.1 billion debt,]1
J3 [PAL is buried under a $2.2 billion dollar debt]1
SCU2: PAL enforced a shutdown
H5 [After PAL shut down operations]2
I1 [stopped all operations]2
J5 [by a]2 [shutdown]2
SCU3: PAL in crisis
H1 [Philippine Airlines]3
I1 [Philippines Airlines (PAL),]3 [devastated]3
J1 [The fate]3 [is uncertain.]3
18
PAL PYRAMID TIER: W=2 (N=8)
SCU5: PAL unable to repay debt
H2 [made payments on]5 [impossible.]5
J3 [it cannot repay]5
SCU6: PAL experienced pilots' strike
H2 [A]6 [pilots' strike]6
I1 [by pilot]6 [strikes]6
SCU7: this PAL crisis occurred in 1998
H1 [1998,]7
I1 [in 1998]7
. . .
19
ANNOTATION: KEEPING TRACK
H1 [Starting in May]23 [1998,]7 [Philippine Airlines]3
[laid off 5000 of its 13,000 workers.]24
H2 [A]6 [3-week]25 [pilots' strike]6 [in June]11 [and a
currency crisis]12 [that reduced passenger numbers]13
H3 [President Estrada brokered an agreement to suspend
collective bargaining for 10 years]17 [in exchange
for 20% of PAL stock and union seats on its board.]26
H4 [The large ground crew union initially voted no.]18
H5 [After PAL shut down operations]2 [for 13 days]4
[starting Sept. 23rd,]8 [leaving much of the country
without air service]27 [and foreign carriers flying
some domestic routes,]9 [61% voted yes.]19
. . .
20
RELIABILITY
Number of SCUs
o Two Annotators: 33 versus 37
o Consensus Annotation: 35
Count of Pairwise Agreements (PAs)
o SCU Label
o SCU Members
Comparison of Annotations to Consensus
o Recall/Precision not valid
o 65/69 PAs
o Most “disagreements” due to membership size
o Only 2 “conflicts”
21
ANOTHER CONSISTENCY TEST
Pyramid A H C J
Consensus .95 .89 .85 .76
Annotation1 .97 .87 .83 .82
Annotation2 .94 .87 .84 .74
22
PYRAMID SCORE PART I
1. For N summaries, score each “peer” against a pyramid with N-1 tiers
2. “Peer” annotation
a) Gives SCU “size”
b) Yields a residue of SCUs not in pyramid
3. Compute D (Observed distribution) where D=sum of weights of SCUs
EG: Summary A (D30042), size=20
D=(6x3) + (6x2) + (4x1) + (4x0) = 34
23
PYRAMID SCORE PART II
1. Compute Max = Ideal Sum of weights of SCUs, given the summary SCU size
2. Pyramid of H,I,J:
a) 9 SCUs in tier, w=3
b) 10 SCUs in tier, w=2
c) 12 SCUs in tier, w=1
3. Size=20, Max=(9x3) + (10x2) + (1x1)=48
4. P=D/Max; for Summary A, P=34/48=.71
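The two parts of the score can be reproduced with a short sketch. It uses the numbers given on the slides for Summary A (6 SCUs of weight 3, 6 of weight 2, 4 of weight 1, 4 not in the pyramid) and for the H, I, J pyramid (tiers of 9, 10, and 12 SCUs). The function names and dictionary layout are assumptions; Max is computed by filling the summary's SCU slots from the top tier downward.

```python
def observed_weight(peer_counts):
    """D: sum of weights of the SCUs found in the peer summary.

    peer_counts maps weight -> number of peer SCUs with that weight;
    weight 0 stands for SCUs that do not appear in the pyramid.
    """
    return sum(weight * count for weight, count in peer_counts.items())


def max_weight(tier_sizes, size):
    """Max: best total weight achievable with `size` SCUs drawn from the pyramid."""
    remaining = size
    total = 0
    for weight in sorted(tier_sizes, reverse=True):
        take = min(tier_sizes[weight], remaining)
        total += weight * take
        remaining -= take
        if remaining == 0:
            break
    # Any slots left once the pyramid is exhausted contribute weight 0.
    return total


pyramid_tiers = {3: 9, 2: 10, 1: 12}        # pyramid of H, I, J
summary_a = {3: 6, 2: 6, 1: 4, 0: 4}        # Summary A, 20 SCUs in total

size = sum(summary_a.values())
d = observed_weight(summary_a)
maximum = max_weight(pyramid_tiers, size)
pyramid_score = d / maximum
print(size, d, maximum, round(pyramid_score, 2))
```

Normalizing by Max rather than by the total pyramid weight means a short summary is compared only against the best summary of the same size.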
24
COMPARISON TO DUC SCORES: HUMAN SUMMARIES
Lockerbie  A     B     C     D
DUC        n.a.  .82   .54   .74
Pyramid    .71   .82   .71   .81
PAL        A     H     I     J
DUC        .30   n.a.  .30   .10
Pyramid    .76   .72   .60   .45
China      C     D     D     F
DUC        n.a.  .28   .27   .13
Pyramid    .52   .65   .73   .62
25
MACHINE SUMMARY EXAMPLE
African countries voted in June to ignore the U.N. flight ban which was imposed in 1992 to try and force Libya to hand over for trial two suspects wanted in the 1988 bombing of an American airliner over Lockerbie, Scotland. The reported jailing of the three officials comes as Gadhafi is under pressure to accept a plan to turn over for trial two other Libyans wanted for the 1988 bombing of Pan am flight 103 over Lockerbie, Scotland, that led to 270 deaths. The visit was Farrakhan's …
26
COMPARISON TO DUC SCORES: MACHINE SUMMARIES
SYSTEM DUC PYRAMID
Sys06* .30 .79
Sys13 .03 .24
Sys14 .25 .51
Sys16* .25 .26
Sys17* .03 .17
Sys18 .03 .20
Sys20 .10 .64
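One way to read this table is to look for systems that receive identical DUC scores and check whether the pyramid score separates them. The sketch below does only that grouping, using the seven rows above with the asterisks dropped from the system names; the function name is an assumption, and no statistical claim is intended.

```python
from collections import defaultdict

# system -> (DUC score, pyramid score), copied from the table above.
scores = {
    "Sys06": (0.30, 0.79),
    "Sys13": (0.03, 0.24),
    "Sys14": (0.25, 0.51),
    "Sys16": (0.25, 0.26),
    "Sys17": (0.03, 0.17),
    "Sys18": (0.03, 0.20),
    "Sys20": (0.10, 0.64),
}


def duc_ties(scores):
    """Group systems by DUC score and keep only groups with more than one system."""
    groups = defaultdict(list)
    for system, (duc, pyramid) in scores.items():
        groups[duc].append((system, pyramid))
    return {duc: members for duc, members in groups.items() if len(members) > 1}


ties = duc_ties(scores)
for duc, members in sorted(ties.items()):
    print(f"DUC {duc:.2f}: {members}")
```

In this table, Sys14 and Sys16 share a DUC score of .25 but differ in pyramid score (.51 versus .26), and the three systems at .03 also receive distinct pyramid scores.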
27
MACHINE SUMMARIES
System 6
PAL, Asia’s oldest airline, has been unable to make payments on dlrs 2.1 billion debt after being devasted by a pilot’s strike and by Asia’s currency crisis. PAL earlier accepted a preliminary investment offer from Cathay Pacific, Ailing Philippine Airlines and prospective investor Cathy Pacific Airways have clashed over . . .
28
MACHINE SUMMARIES
System 16
President Joseph Estrada on Saturday urged militant unionists at Philippine Airlines to accept a vote by workers approving a 10-year no-strike deal to revive the debt-laden airline. President Joseph Estrada said Saturday the financially troubled airlines will resume its international flights on Sunday by flying him to Singapore . . .
29
MACHINE SUMMARIES
System 17
Christmas is a sacred holiday in the Philippines, and nowhere is that more evident than at the headquarters of Philippine Airlines. But Ramos, who was intent on privatizing the economy, opened the industry to competition, licensing rivals like Air Philippines, Cebu Pacific, and Grand Air. PAL closed for nearly 2 weeks on Sep. 23 after . . .
30
OPEN ISSUES
Distribution of SCUs NOT an independent variable
o Ordering
o Knowledge
o Informational Goal
Can Pyramid Scoring be Automated?
31
SCU INTERDEPENDENCIES
1. SCU4 presupposes SCU1:
SCU1 (w=4): PAL has a debt > 2 billion
SCU4 (w=3): PAL cannot make its debt payments
2. SCU7, SCU8 depend on SCU2
SCU2 (w=4): PAL shutdown operations
SCU7 (w=3): shutdown began on 9/23
SCU8 (w=3): shutdown lasted 2 weeks
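The slides raise these interdependencies as an open issue and do not prescribe a treatment. As one possible way to make them explicit, the sketch below records the dependencies listed above and reports SCUs in a summary whose prerequisite SCU is missing. The data layout and the function name are assumptions.

```python
# SCU -> set of SCUs it presupposes or depends on, as listed on the slide.
dependencies = {
    "SCU4": {"SCU1"},   # cannot make debt payments presupposes having the debt
    "SCU7": {"SCU2"},   # shutdown start date depends on the shutdown
    "SCU8": {"SCU2"},   # shutdown duration depends on the shutdown
}


def unsupported_scus(summary_scus, dependencies):
    """Map each SCU in the summary to the prerequisite SCUs it is missing."""
    present = set(summary_scus)
    missing = {}
    for scu in present:
        absent = dependencies.get(scu, set()) - present
        if absent:
            missing[scu] = sorted(absent)
    return missing


# Hypothetical summary that states the payment problem without the debt itself.
example = unsupported_scus({"SCU2", "SCU4", "SCU7"}, dependencies)
print(example)
```

A check of this kind only flags the problem; how a dependent SCU should be weighted when its prerequisite is absent remains open.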
32
SCUs and DEPENDENCY/TAG GR
A3 [On September 23]7
[PAL shut down]2
[after the ground crew union turned down a
settlement]18
[which it accepted two weeks later.]19
SCU7
1 On IN 5 shut t0
2 September NNP 4 PAL t2
3 23 CD 4 PAL t2
33
“LARGE” CONSTITUENTS
1. PAL experienced a crisis in 1998.
2. Unable to make payments on a $2.1 billion debt,
3. it was faced by a pilot's strike in June
4. and the region's currency problems
5. which reduced passenger numbers and inflated costs.
6. On September 23 pal shut down
7. after the ground crew union turned down a settlement
8. which it accepted two weeks later.
9. PAL resumed domestic flights on October 7
10. and [resumed] international flights on October 26.
11. Resolution of the basic financial problems was elusive,
however,
12. and as of December 18 pal was still $2.2 billion in
debt
13. and [pal was] losing close to $1 million a day.
34
DOCSET TF*IDF
TERMS: $2, airline, billion, day, debt, pal (6 of 13 LCs)
1   1. Philippine Airlines (pal) experienced a crisis in 1998.   SCU3 w=3
3   2. Unable to make payments on a $2.1 billion debt,   SCU1 w=4
1   6. On September 23 pal shut down   SCU2 w=4 & SCU7 w=3
1   9. pal resumed domestic flights on October 7   SCU10 w=2
4   12. and as of December 18 pal was still $2.2 billion in debt   NO SCU
1   13. and losing close to $1 million a day.   SCU15 w=2
35
CONCLUSIONS
Define parameters of the problem
o What is summarization?
Compare systems and/or humans
o Is the metric meaningful?
Track progress
o When does output improve?
Cost Effectiveness
o Can it be (partly) automated?