
Text Specificity and Impact on Quality of News Summaries

Annie Louis & Ani Nenkova
University of Pennsylvania

June 24, 2011

Texts are a mix of general and specific sentences

Recently, we have developed a classifier that can distinguish general vs. specific sentences

The notion of specificity could be useful for a number of applications
    In this work, we consider automatic summarization
    Summaries cannot include all specific content because of the space constraint
    Understand the role of general/specific content in summaries and how it impacts quality

Specificity: amount of detail

Example general and specific sentences

Seismologists said the volcano had plenty of built-up magma and even more severe eruptions could come later. [overview]

The volcano's activity -- measured by seismometers detecting slight earthquakes in its molten rock plumbing system -- is increasing in a way that suggests a large eruption is imminent, Lipman said. [details]

Prior studies of general-specific content in summaries

Humans use generalization and specification of source sentences to create abstract sentences [Jing & McKeown (2000)]

One generation task is to fuse information from a key (general) sentence and a specific sentence on the same topic to create an abstract sentence [Wan et al. (2008)]

Subtitles of news broadcasts are often generalized compared to the original text [Marsi et al. (2010)]

Overview of our study

Quantitative analysis of specificity in inputs and summaries using a general/specific classifier

1. Human abstracts have much more general content than system extracts

2. Amount of specific content is related to content quality of system summaries
    More general ~ better

3. Preliminary study on properties of summary-worthy general sentences

Data: DUC 2002

Generic multidocument summarization task

59 input sets of 5 to 15 news documents

3 types of summaries, 200 words
    Manually assigned content and linguistic quality scores

Summary type          Number of summaries
1. Human abstracts    2 assessors * 59
2. Human extracts     2 assessors * 59
3. System extracts    9 systems * 59

General vs. specific sentence classifier: prior work

Sentence level

Features
1. Words
2. Named entities, numbers
3. Likelihood under language model
4. Word specificity
5. Adjectives/adverbs, length of phrases
6. Polar words
7. Sentence length

Training
    Binary: general or specific
    Logistic regression: can get probability for a class

[Louis & Nenkova (2011)]
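To illustrate how a logistic regression classifier turns sentence features into a probability for the specific class, here is a minimal sketch. It uses only three crude surface features (sentence length, tokens containing digits, and capitalized tokens as a rough stand-in for named entities). The weights and bias are made-up placeholders chosen for illustration, not the trained values from Louis & Nenkova (2011), and the function names are hypothetical.

```python
import math
import re

# Hypothetical weights, for illustration only. The real classifier is trained
# on annotated sentences and uses a much richer feature set.
WEIGHTS = {"length": 0.05, "numbers": 0.8, "capitalized": 0.4}
BIAS = -1.5


def extract_features(sentence):
    """Compute a few crude surface features for one sentence."""
    tokens = sentence.split()
    return {
        "length": float(len(tokens)),
        # Tokens that contain at least one digit.
        "numbers": float(sum(bool(re.search(r"\d", t)) for t in tokens)),
        # Capitalized tokens after the first one: a rough proxy for named entities.
        "capitalized": float(sum(t[0].isupper() for t in tokens[1:])),
    }


def specific_probability(sentence):
    """Logistic regression: sigmoid of the weighted feature sum."""
    feats = extract_features(sentence)
    z = BIAS + sum(WEIGHTS[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def classify(sentence, threshold=0.5):
    """Binary decision derived from the class probability."""
    return "specific" if specific_probability(sentence) >= threshold else "general"
```

Because the model outputs a probability rather than only a label, the same score can later serve as a graded confidence for each sentence.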

Classification performance

75% accurate
    Validated on human annotations
    On examples with high annotator agreement: 90%

The probability is indicative of annotator agreement on class
    Sentences with high agreement ~ high confidence predictions

Computing specificity for a text

Sentences in a summary are of varying length, so we compute a score on word level
    "Average specificity of words in the text"

[Figure: each sentence (S1, S2, S3) gets the classifier's confidence for being in the specific class (0.68, 0.23 and 0.81 in the example). Every token is assigned the confidence of its sentence, and the average score on tokens is the specificity score.]
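The word-level averaging described here amounts to a length-weighted mean of the sentence confidences. A minimal sketch follows; the function name is hypothetical, and the token counts in the example (5, 4 and 4) are assumed for illustration, while the three confidences are the ones shown on the slide.

```python
def text_specificity(sentences):
    """Token-weighted average of per-sentence confidences.

    `sentences` is a list of (num_tokens, confidence) pairs, where confidence
    is the classifier's probability that the sentence is specific. Each token
    inherits its sentence's confidence, so longer sentences count for more.
    """
    total_tokens = sum(n for n, _ in sentences)
    if total_tokens == 0:
        raise ValueError("text has no tokens")
    return sum(n * conf for n, conf in sentences) / total_tokens


# Assumed token counts, slide confidences.
example = [(5, 0.68), (4, 0.23), (4, 0.81)]
score = text_specificity(example)
```

With these assumed lengths the score is 7.56 / 13, roughly 0.58, slightly above the unweighted mean of the three confidences because the longest sentence is also fairly specific.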

Average specificity of different types of summaries

[Figure: specificity scale running from general (0.6) to specific (0.8). Human abstracts (0.62), inputs (0.65), human extracts (0.72), system extracts (0.74).]

1. More general content is preferred in abstracts

2. Simply the process of extraction makes summaries more specific

3. System summaries are overly specific

Is the difference related to summary quality?

Analysis of ‘system summaries’: specificity and quality

1. Content quality
    Importance of content included in the summary
    More general ~ better

2. Linguistic quality
    How well-written the summary is perceived to be
    More specific ~ better

3. Quality of general/specific summaries
    When a summary is intended to be general or specific

1. Specificity and content quality

Coverage score: manually judged at NIST
    Similarity to a human summary

Correlation with specificity: -0.169 (p-value 0.0006)

More specific ~ decreased content quality
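A correlation of this kind can be computed with the standard Pearson formula over paired (specificity, coverage) values, one pair per system summary. The sketch below is a plain-Python version; the five data points are invented purely to show the call and are not DUC 2002 values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented toy values: more specific summaries get lower coverage scores.
specificity = [0.62, 0.70, 0.74, 0.78, 0.81]
coverage = [0.40, 0.35, 0.36, 0.30, 0.28]
r = pearson(specificity, coverage)
```

A negative `r` corresponds to the pattern reported on the slide: as specificity goes up, the coverage score tends to go down.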

But the correlation is not very high

Specificity is related to realization of content
    Different from importance of the content

Content quality = content importance + appropriate specificity level

Content importance: ROUGE scores
    N-gram overlap of system summary and human summary
    Standard evaluation of automatic summaries

System summary quality: specificity as one of the predictors

Coverage score ~ ROUGE-2 (bigrams) + specificity
    Linear regression

Weights for predictors in the regression model

Predictor      Mean β    Significance (hypothesis β = 0)
(Intercept)    0.212     2.3e-11
ROUGE-2        1.299     < 2.0e-16
Specificity    -0.166    3.1e-05

Is the combination a better predictor than ROUGE alone?
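To make the regression concrete, the sketch below plugs the mean coefficients from the table into the linear model coverage = intercept + β1 * ROUGE-2 + β2 * specificity. The function name and the two input summaries are hypothetical; only the three coefficients come from the slide.

```python
# Mean coefficients reported in the table above.
INTERCEPT = 0.212
BETA_ROUGE2 = 1.299
BETA_SPECIFICITY = -0.166


def predicted_coverage(rouge2, specificity):
    """Linear model: coverage ~ ROUGE-2 + specificity."""
    return INTERCEPT + BETA_ROUGE2 * rouge2 + BETA_SPECIFICITY * specificity


# Two hypothetical system summaries with the same ROUGE-2 score
# but different specificity levels.
general_leaning = predicted_coverage(rouge2=0.10, specificity=0.62)
specific_leaning = predicted_coverage(rouge2=0.10, specificity=0.74)
```

Holding ROUGE-2 fixed, the more specific summary receives the lower predicted coverage, and the gap equals 0.166 times the difference in specificity.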

2. Specificity and linguistic quality

Used different data: TAC 2009

DUC 2002 only reported number of errors
    Were also specified as a range: 1-5 errors

TAC 2009 linguistic quality score
    Manually judged: scale 1-10
    Combines different aspects: coherence, referential clarity, grammaticality, redundancy

System summaries: what is the average specificity in different score categories?

Ling. score     No. summaries    Average specificity
Poor (1, 2)     202              0.71
Mediocre (5)    400              0.72
Best (9, 10)    79               0.77

More general ~ lower score!
General content is useful but needs proper context!

If a summary starts as follows:
“We are quite a ways from that, actually.” As ice and snow at the poles melt, …
    Specificity = low
    Linguistic quality = 1

3. Specificity and quality of general/specific summaries

DUC 2005: general-specific summary task
    Create general summaries for some inputs, specific summaries for others

How is specificity related to the scores of these summaries?

System summaries: correlation between specificity and content scores

Summary type    Pearson correlation
General         -0.03
Specific        0.18*

Content scores were measured using the pyramid method

Further hints that specificity alone is not predictive of summary quality
    Once a summary is general, the level of generality is no longer predictive of quality

Analysis of general sentences in human summaries

1. Generalization operation performed in human abstracts
    Frequency of operations, amount of deletions

2. How general sentences are used in human extracts
    Position, type of sentence

Data for analysing generalization operation

Aligned pairs of abstract and source sentences conveying the same content
    Traditional data used for compression experiments

Ziff-Davis corpus
    15964 sentence pairs used in Galley & McKeown, 2007
    Any number of deletions, up to 7 substitutions

Only 25% of abstract sentences are mapped
    But beneficial to observe the trends

Generalization operation in human abstracts

(S = specific, G = general; SG = specific source sentence mapped to a general abstract sentence)

Transition    No. pairs    % pairs
SS            6371         39.9
SG            5679         35.6
GG            3562         22.3
GS            352          2.2

One-third of all transformations are specific to general

Human abstracts involve a lot of generalization
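A transition table like this one can be built by labelling the source and abstract sentence of each aligned pair and counting the four combinations. The sketch below assumes each pair comes as two classifier probabilities for the specific class and that 0.5 is the decision threshold; both are assumptions, as are the function names. The last lines recompute the percentages from the reported counts as a consistency check.

```python
from collections import Counter


def label(p_specific, threshold=0.5):
    """Map a specific-class probability to 'S' or 'G' (threshold is assumed)."""
    return "S" if p_specific >= threshold else "G"


def transition_table(pairs, threshold=0.5):
    """Count source -> abstract transitions.

    `pairs` is an iterable of (source_prob, abstract_prob) tuples. The key
    'SG' means a specific source sentence paired with a general abstract one.
    Returns {transition: (count, percent)}.
    """
    counts = Counter(label(s, threshold) + label(a, threshold) for s, a in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sentence pairs given")
    return {
        t: (counts[t], round(100.0 * counts[t] / total, 1))
        for t in ("SS", "SG", "GG", "GS")
    }


# Consistency check on the counts reported on the slide.
reported = {"SS": 6371, "SG": 5679, "GG": 3562, "GS": 352}
total_pairs = sum(reported.values())
percentages = {t: round(100.0 * n / total_pairs, 1) for t, n in reported.items()}
```

The four reported counts sum to 15964, the size of the aligned corpus, and the recomputed percentages agree with the "% pairs" column.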

How do specific sentences get converted to general ones?

Transition    Orig. length    New/orig length (%)    Avg. deletions (words)
SG            33.5            40.8                   21.4
SS            33.4            56.6                   16.3
GG            21.5            60.8                   9.3
GS            22.7            66.0                   8.4

Choose long sentences and compress heavily!

A measure of generality would be useful to guide compression
    Currently only importance and grammaticality are used

Use of general sentences in human extracts

Details of Maxwell’s death were sketchy.
Folksy was an understatement.
“Long live democracy!”
Instead it sank like the Bismarck.

Example use of a general sentence in a summary

… With Tower’s qualifications for the job, the nominations should have sailed through with flying colors. [Specific] Instead it sank like the Bismarck. [General] …

Simple categorization

75 top general sentences according to classifier confidence

Type              Proportion
First sentence    6 (8%)
Last sentence     13 (17%)
Attributions      14 (18%)
Comparisons       4 (5%)

General sentences are used as topic/emphasis sentences

Conclusion

General sentences are useful content for summaries
    People use them in summaries for emphasis and topic
    They can improve the content quality
    Choosing good general sentences or generating them will be an interesting task

But linguistic quality should also be considered
    General sentences are difficult to understand out of context
    Content planning should consider the order of general content

Thank you

Histogram of specificity scores
