1
Automatic Text Summarization
By: Asef Poormasoomi, autumn 2009
2
Introduction
summary: brief but accurate representation of the contents of a document
3
Motivation
• Abstracts for scientific and other articles
• News summarization (mostly multi-document summarization)
• Classification of articles and other written data
• Web pages for search engines
• Web access from PDAs, cell phones
• Question answering and data gathering
4
Genres
• Extract vs. abstract
  • lists fragments of text vs. re-phrases content coherently
  • example: "He ate a banana, an orange and an apple" => "He ate fruit"
• Generic vs. query-oriented
  • provides the author's view vs. reflects the user's interest
  • example: question answering systems
• Personal vs. general
  • considers the reader's prior knowledge vs. assumes a general audience
• Single-document vs. multi-document source
  • based on one text vs. fuses together many texts
• Indicative vs. informative
  • used for quick categorization vs. content processing
5
Summarization in 3 steps (Lin and Hovy, 1997)
Content/Topic Identification
• goal: find/extract the most important material
• techniques: methods based on position, cue phrases, concept counting, word frequency
Conceptual/Topic Interpretation
• application: only for abstract summaries
• methods: merging or fusing related topics into more general ones, removing redundancies, etc.
• example:
  • He sat down, read the menu, ordered, ate and left => He visited the restaurant.
Summary Generation
• say it in your own words
• simple if extraction is performed
6
Methods
Statistical scoring methods (pseudo)
Higher semantic/syntactic structures
• Network (graph) based methods
• Other methods (rhetorical analysis, lexical chains, co-reference chains)
AI methods
7
Statistical scoring (Pseudo)
General method:
1. score each entity (sentence, word);
2. combine scores;
3. choose the best sentence(s)
Scoring techniques:
• Word frequencies throughout the text (Luhn 58)
• Position in the text (Edmundson 69, Lin & Hovy 97)
• Title method (Edmundson 69)
• Cue phrases in sentences (Edmundson 69)
• Bayesian classifier (Kupiec et al. 95)
8
Word frequencies (Luhn 58)
The very first work in automated summarization
Claim: words which are frequent in a document indicate the topic discussed
• frequent words indicate the topic
• clusters of frequent words indicate summarizing sentences
Stemming should be used; "stop words" (e.g. "the", "a", "for", "is") are ignored
9
Word frequencies (Luhn 58)
Calculate term frequency in the document: f(term)
Calculate inverse log-frequency in the corpus: if(term) = log( n(corpus) / n(term, corpus) )
Words with high f(term) · if(term) are indicative
The sentence with the highest sum of weights is chosen
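The scoring on this slide can be sketched in a few lines. A minimal sketch, assuming a toy corpus of whole documents; the helper names (`tokenize`, `score_sentences`) and the naive sentence splitter are illustrative, not from the original work:

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "for", "is", "in", "of", "and", "to"}

def tokenize(text):
    # Lowercase word tokens with stop words removed (no stemming here).
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def score_sentences(document, corpus_docs):
    """Score each sentence by the sum of f(term) * if(term) over its words,
    where if(term) = log( n(corpus) / n(term, corpus) )."""
    n_corpus = len(corpus_docs)
    doc_freq = Counter()
    for doc in corpus_docs:
        doc_freq.update(set(tokenize(doc)))
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    tf = Counter(tokenize(document))
    scored = []
    for sent in sentences:
        score = sum(
            tf[w] * math.log(n_corpus / doc_freq[w])
            for w in tokenize(sent) if doc_freq.get(w)
        )
        scored.append((score, sent))
    return sorted(scored, reverse=True)
```

Sentences whose words are frequent in the document but rare in the corpus rise to the top of the ranking.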
10
Position in the text (Edmundson 69, Lin & Hovy 97)
Claim: important sentences occur in specific positions
Position depends on the type (genre) of text
• the inverse of position in the document works well for news
Important information occurs in specific sections of the document (introduction/conclusion)
Assign a score to sentences according to location in the paragraph
Assign a score to paragraphs and sentences according to location in the entire text
11
Title method (Edmunson 69)
Claim: the title of a document indicates its content (Duh!)
Words in the title help find relevant content
Create a list of title words, remove "stop words"
Use those as keywords in order to find important sentences
12
Cue phrases method (Edmunson 69)
Claim: important sentences contain cue words / indicative phrases
• "The main aim of the present paper is to describe…" (IND)
• "The purpose of this article is to review…" (IND)
• "In this report, we outline…" (IND)
• "Our investigation has shown that…" (INF)
Some words are considered bonus, others stigma
• bonus: comparatives, superlatives, conclusive expressions, etc.
• stigma: negatives, pronouns, etc.
Implemented for French (Lehman '97)
• Paice implemented a dictionary of <cue, weight>
• Grammar for indicative expressions:
  In + skip(0) + this + skip(2) + paper + skip(0) + we + ...
Cue words can be learned (Teufel '98)
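A Paice-style cue pattern like the one above can be compiled into a regular expression. A sketch under the assumption that skip(n) allows up to n intervening words; the function names are illustrative:

```python
import re

def compile_cue(pattern):
    """Compile a cue pattern given as (word, skip) pairs, e.g.
    [("in", 0), ("this", 2), ("paper", 0), ("we", 0)],
    where skip(n) permits up to n intervening words."""
    parts = []
    for word, skip in pattern:
        parts.append(r"\b%s\b" % re.escape(word))
        # Allow up to `skip` extra words before the next pattern word.
        parts.append(r"(?:\s+\w+){0,%d}\s+" % skip if skip else r"\s+")
    # Drop the separator trailing the final word.
    return re.compile("".join(parts[:-1]), re.IGNORECASE)

CUE = compile_cue([("in", 0), ("this", 2), ("paper", 0), ("we", 0)])

def is_indicative(sentence):
    # True when the sentence matches the indicative-expression grammar.
    return bool(CUE.search(sentence))
```

For example, "In this short paper we present…" matches because "short" falls within the skip(2) allowance after "this".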
13
Feature combination (Edmundson ’69)
Linear combination of 4 features
• title, cue, keyword, position
• the weights are adjusted using training data with any minimization technique
The following results were obtained
• best system: cue + title + position
14
Bayesian Classifier (Kupiec et al. 95)
Uses a Bayesian classifier:

P(s ∈ S | F1, F2, ..., Fk) = P(F1, F2, ..., Fk | s ∈ S) · P(s ∈ S) / P(F1, F2, ..., Fk)

• Assuming statistical independence of the features:

P(s ∈ S | F1, F2, ..., Fk) = [ ∏ j=1..k P(Fj | s ∈ S) ] · P(s ∈ S) / [ ∏ j=1..k P(Fj) ]

• Higher-probability sentences are chosen to be in the summary
• Performance:
  • For 25% summaries, 84% precision
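The independence formula above is easy to evaluate in the log domain. A minimal sketch with hypothetical binary features and made-up probabilities (a real system would estimate them from a labelled training corpus):

```python
import math

def naive_bayes_score(features, p_feat_given_summary, p_feat, p_summary):
    """Log of Kupiec et al.'s independence formula:
    P(s in S | F1..Fk) = prod_j P(Fj | s in S) * P(s in S) / prod_j P(Fj).
    `features` lists which binary features fired for a sentence."""
    log_p = math.log(p_summary)
    for f in features:
        log_p += math.log(p_feat_given_summary[f]) - math.log(p_feat[f])
    return log_p

# Hypothetical probabilities, for illustration only.
p_s = 0.25
p_f_given_s = {"cue": 0.6, "position": 0.7, "title": 0.4}
p_f = {"cue": 0.2, "position": 0.3, "title": 0.1}

# A sentence with a cue phrase and a good position outranks one with neither.
a = naive_bayes_score(["cue", "position"], p_f_given_s, p_f, p_s)
b = naive_bayes_score([], p_f_given_s, p_f, p_s)
```

Sentences are then ranked by this score and the top ones extracted.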
15
Methods
Statistical scoring methods
• problems:
  • Synonymy: one concept can be expressed by different words
    • example: "cycle" and "bicycle" refer to the same kind of vehicle
  • Polysemy: one word can have several meanings
    • example: "cycle" could mean life cycle or bicycle
  • Phrases: a phrase may have a meaning different from the words in it
    • "an alleged murderer" is not a murderer (Lin and Hovy 1997)
Higher semantic/syntactic structures
• Network (graph) based methods
• Other methods (rhetorical analysis, lexical chains, co-reference chains)
AI methods
16
Higher semantic/syntactic structures
Claim: important sentences/paragraphs are the most highly connected entities in more or less elaborate semantic structures.
Classes of approaches• lexical similarity (WordNet, lexical chains);• word co-occurrences; • co-reference;• combinations of the above.
17
Lexical chain
• lexical cohesion (Halliday & Hasan)
  • reiteration
    • synonym
    • antonym
    • hypernym
  • collocation
    • co-occurrence
    • example: "He works as a teacher at a school" (teacher–school)
• Lexical chain:
  • a sequence of words which have lexical cohesion (reiteration/collocation)
18
Lexical chain
• Method for creating chains:
  • Select a set of candidate words from the text.
  • For each candidate word, find an appropriate chain, relying on a relatedness criterion among members of the chains and the candidate word.
  • If such a chain is found, insert the word in this chain and update it accordingly; else create a new chain.
• Scoring the chains:
  • synonym = 10, antonym = 7, hyponym = 4
  • strong chains must be selected
• Sentence selection for summary:
  • H1: select the first sentence that contains a member of a strong chain
    • example chain: AI = 2; Artificial Intelligence = 1; Field = 7; Technology = 1; Science = 1
  • H2: select the first sentence that contains a "representative" (frequent) member of the chain
  • H3: identify a text segment where the chain is highly dense (density is the proportion of words in the segment that belong to the chain)
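The chain-building steps above can be sketched greedily. A minimal sketch in which toy synonym sets stand in for a WordNet-style relatedness test; the names (`build_chains`, `SYNSETS`) are illustrative:

```python
def build_chains(words, related):
    """Greedy chain construction: for each candidate word, join the first
    chain containing a related word; otherwise start a new chain."""
    chains = []
    for w in words:
        for chain in chains:
            if any(related(w, member) for member in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

# Toy relatedness: hypothetical synonym sets instead of WordNet lookups.
SYNSETS = [{"machine", "device", "pump"}, {"anesthetic", "drug"}]

def related(a, b):
    return a == b or any(a in s and b in s for s in SYNSETS)

chains = build_chains(
    ["machine", "anesthetic", "device", "pump", "drug"], related)
```

Chains would then be scored with the synonym/antonym/hyponym weights above and the strong ones used for sentence selection.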
19
Lexical chain
• Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient.
20
Network based method (Salton et al. '97)
Vector Space Model
• each text unit is represented as a vector: D_i = (d_i1, ..., d_in)
Standard similarity metric: sim(D_i, D_j) = Σ_k d_ik · d_jk
Construct a graph of paragraphs or other entities; the strength of a link is the similarity metric
Use a threshold to decide upon similar paragraphs or entities (pruning of the graph)
Paragraph selection heuristics:
• bushy path
  select paragraphs with many connections to other paragraphs and present them in text order
• depth-first path
  select one paragraph with many connections; select a connected paragraph (in text order) which is also well connected; continue
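The similarity metric and the bushy-path heuristic combine into a short routine. A minimal sketch, assuming paragraphs are already term-weight vectors of equal length; the function names are illustrative:

```python
def similarity(d_i, d_j):
    """Inner-product similarity sim(D_i, D_j) = sum_k d_ik * d_jk."""
    return sum(a * b for a, b in zip(d_i, d_j))

def bushy_path(vectors, threshold, k):
    """Link paragraphs whose similarity exceeds `threshold`, then pick the
    k 'bushiest' paragraphs (most links) and return them in text order."""
    n = len(vectors)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(vectors[i], vectors[j]) > threshold:
                degree[i] += 1
                degree[j] += 1
    top = sorted(range(n), key=lambda i: degree[i], reverse=True)[:k]
    return sorted(top)  # present selected paragraphs in text order
```

The depth-first-path heuristic would instead walk from one well-connected paragraph to the next connected, well-connected one in text order.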
21
Text relation map
(figure: nodes A–F with link counts A=3, B=1, C=2, D=1, E=3, F=2; links are drawn between units with sim > thr and omitted when sim < thr)
22
User oriented text summarization
23
Motivation
Summaries which are generic in nature do not cater to the user's background and interests
Results show that each person has a different perspective on the same text
• Marcu (1997): found 71 percent agreement among 13 judges over 5 texts from Scientific American
• Rath (1961): found that extracts selected by four different human judges had only 25 percent overlap
• Salton (1997): found that the 20 most important paragraphs extracted by 2 subjects have only 46 percent overlap
24
User Feedback
Data click: when a user clicks on a document, the document is considered to be of more interest to the user than other, unclicked ones
Query history: the most widely used implicit user feedback at present
• example: http://www.google.com/psearch
Attention time: often referred to as display time or reading time
Other types of implicit user feedback: scrolling, annotation, bookmarking and printing behaviors
25
Summarization Using Data Click
Use the extra knowledge in clickthrough data to improve Web-page summarization
A collection of clickthrough data can be represented by a set of triples <u, q, p> (user, query, clicked page)
Typically, a user's query words reflect the true meaning of the target Web-page content
Problems:
• incomplete click problem
• noisy click data
26
Attention Time
MAIN IDEA: rely on the attention (reading) time that individual users spend on single words in a document
The prediction of user attention over every word in a document is based on the user's attention during previous reads
The algorithm tracks a user's attention times over individual words using a vision-based commodity eye-tracking mechanism:
• a simple web camera and an existing eye-tracking algorithm (the "Opengazer project")
• the error of the detected gaze location on the screen is between 1–2 cm, depending on which area of the screen the user is looking at (on a 19" monitor)
27
Attention Time
Anchoring gaze samples onto individual words:
• the detected gaze central point is positioned at (x, y) in screen space
• compute the central display point of each word, denoted (xi, yi)
• for each gaze sample detected by the eye-tracking module, assign fractions of the sample to words in the document
• the overall attention that a word receives is the sum of all the fractional gaze samples it is assigned in the above process
• during processing, remove the stop words
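The anchoring step can be sketched as follows. The slides do not give the exact weighting scheme for the fractional assignment, so inverse-distance weighting within a fixed radius is an assumption here, and all names are illustrative:

```python
import math

def assign_gaze(gaze_points, word_centers, radius=50.0):
    """Distribute each gaze sample (x, y) fractionally over nearby words
    with centers (xi, yi). Weighting is an assumed inverse-distance
    scheme within `radius` screen units, not the paper's exact formula."""
    attention = [0.0] * len(word_centers)
    for gx, gy in gaze_points:
        weights = []
        for i, (xi, yi) in enumerate(word_centers):
            d = math.hypot(gx - xi, gy - yi)
            if d < radius:
                weights.append((i, 1.0 / (d + 1.0)))
        if not weights:
            continue  # gaze sample fell on no word
        total = sum(w for _, w in weights)
        for i, w in weights:
            attention[i] += w / total  # fractions of one sample sum to 1
    return attention
```

Each gaze sample contributes exactly one unit of attention, split among the words near it.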
28
Attention Time
Predicting user attention for sentences:
• attention time prediction for a word is based on the semantic similarity between words
• for an arbitrary word w which is not among the observed words, calculate the similarity between w and every wi (i = 1, ..., n)
• select the k words which share the highest semantic similarity with w
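The k-nearest-words prediction above can be sketched briefly. The slides do not specify how the k neighbours are combined, so a similarity-weighted average is assumed; the toy similarity table stands in for a real semantic measure:

```python
def predict_attention(w, known, similarity, k=3):
    """Predict attention time for an unseen word `w` from the k known
    words most similar to it. `known` maps word -> observed attention
    time; the weighted-average combination is an assumption."""
    neighbours = sorted(known, key=lambda v: similarity(w, v), reverse=True)[:k]
    total = sum(similarity(w, v) for v in neighbours)
    if total == 0:
        return 0.0
    return sum(similarity(w, v) * known[v] for v in neighbours) / total

# Hypothetical pairwise similarities standing in for a semantic measure.
SIM = {frozenset({"car", "truck"}): 0.8, frozenset({"car", "apple"}): 0.1}

def sim(a, b):
    return SIM.get(frozenset({a, b}), 0.0)
```

Sentence-level attention would then aggregate the predicted word attentions.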
29
Attention Time
30
Attention Time
31
Other types of implicit user feedback
Extract the personal information of the user using information available on the web:
• put the person's full name into a search engine (the name is quoted with double quotation marks, such as "Albert Einstein")
• the top n documents are retrieved
• after stop-word removal and stemming, a unigram language model is learned on the extracted text content
User-specific sentence scoring:
• sentence score:
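The transcript does not show the sentence-score formula itself, so the sketch below is one plausible instantiation: a Laplace-smoothed unigram model over the user's retrieved text, scoring sentences by average log-probability. All names are illustrative:

```python
import math
from collections import Counter

def unigram_model(text, alpha=1.0):
    """Learn a Laplace-smoothed unigram model from user-specific text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    def p(word):
        return (counts.get(word, 0) + alpha) / (total + alpha * (vocab + 1))
    return p

def sentence_score(sentence, p):
    """Average log-probability of the sentence under the user model.
    (The exact scoring formula is not shown in the transcript.)"""
    words = sentence.lower().split()
    return sum(math.log(p(w)) for w in words) / len(words)
```

Sentences close to the user's background vocabulary score higher and are preferred for the personalized summary.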
32
Example
• Topic of summary generation: "Microsoft to open research lab in India"
• 8 articles published in different news sources form the news cluster
• User A is from the NLP domain and User B from the network security domain
Generic summary: The New Lab, Called Microsoft Research India,
Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States . In Line With Microsoft’s Research Strategy Worldwide , The Bangalore Lab Will Collaborate With And Fund Research At Key Educational Institutions In India, Such As The Indian Institutes Of Technology, Anandan Said . Although Microsoft Research Doesn’t Engage In Product Development Itself, Technologies Researchers Create Can Make Their Way Into The Products The Company
Other types of implicit user feedbacks
33
User A Specific summary :
The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India.Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States. Microsoft Will Collaborate With The Government Of India And The Indian Scientific Community To Conduct Research In Indic Language Computing Technologies, This Will Include Areas Such As Machine Translation Between Indian Languages And English, Search And Browsing And Character Recognition. In Line With Microsoft’s Research Strategy Worldwide,The Bangalore Lab
Other types of implicit user feedbacks
34
User B Specific summary :
The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan , Managing Director Of Microsoft Research India. The Newly Announced India Research Group Focuses On Cryptography, Security, Algorithms And Multimedia Security, Ramarathnam Venkatesan, A Leading Cryptographer At Microsoft Research In Redmond, Washington, In The US, Will Head The New Group. Microsoft Research India will conduct a four-week summer school featuring lectures by leading experts in the fields of cryptography, algorithms and security. The program is aimed at senior undergraduate students, graduate students and faculty
Other types of implicit user feedbacks
35
FarsiSumA Persian text summarizer
By: Nima Mazdak, Martin Hassel, Department of Linguistics, Stockholm University, 2004
36
FarsiSum
Tokenizer: sentence boundaries are found by searching for periods, exclamation marks, question marks, the Persian question mark (؟), <BR> (the HTML new line), ".", ",", "!", "?", "<", ">", ":", spaces, tabs and new lines
Sentence Scoring: text lines are put into a data structure for storing key/value pairs, called a text table
37
FarsiSum
Sentence Scoring:
• Word score = (word frequency) * (a keyword constant)
• Sentence score = Σ word scores (for all words in the current sentence)
• Average sentence length (ASL) = word count / line count
• Final sentence score = (ASL * sentence score) / (number of words in the current sentence)
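The FarsiSum formulas above fit in one small function. A minimal sketch; applying the keyword constant only to keywords is an assumption, and the function name and boost value are illustrative:

```python
from collections import Counter

def farsisum_score(sentences, keywords, keyword_boost=3.0):
    """FarsiSum-style scoring as given on the slide:
    word score = frequency * keyword constant (boost applied only to
    keywords here, which is an assumption); sentence score is the sum
    of word scores, normalised by sentence length and scaled by ASL."""
    freq = Counter(w for s in sentences for w in s.split())
    word_count = sum(len(s.split()) for s in sentences)
    asl = word_count / len(sentences)  # average sentence length
    scores = []
    for s in sentences:
        words = s.split()
        raw = sum(freq[w] * (keyword_boost if w in keywords else 1.0)
                  for w in words)
        scores.append(asl * raw / len(words))
    return scores
```

The length normalisation keeps long sentences from dominating purely by word count.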
38
FarsiSum
Notes on the current implementation:
• Word boundary ambiguity:
  • a full stop (.) marks a sentence boundary, but it may also appear in abbreviations or acronyms
  • compound words and light verb constructions may also appear with or without a space
• Ambiguity in morphology
• Word order: the canonical word order in Persian is SOV, but Persian is a free-word-order language
• Possessive construction
39
Thanks!