1
Automatic Text Summarization
By: Asef Poormasoomi, autumn 2009
2
Introduction
summary: brief but accurate representation of the contents of a document
3
Motivation
• Abstracts for scientific and other articles
• News summarization (mostly multi-document summarization)
• Classification of articles and other written data
• Web pages for search engines
• Web access from PDAs, cell phones
• Question answering and data gathering
4
Genres
• Extract vs. abstract
  • lists fragments of text vs. re-phrases content coherently
  • example: "He ate a banana, an orange and an apple" => "He ate fruit"
• Generic vs. query-oriented
  • provides the author's view vs. reflects the user's interest
  • example: question answering systems
• Personal vs. general
  • considers the reader's prior knowledge vs. assumes a general audience
• Single-document vs. multi-document source
  • based on one text vs. fuses together many texts
• Indicative vs. informative
  • used for quick categorization vs. content processing
5
Summarization in 3 steps (Lin and Hovy, 1997)
Content/Topic Identification
• goal: find/extract the most important material
• techniques: methods based on position, cue phrases, concept counting, word frequency
Conceptual/Topic Interpretation
• application: only for abstract summaries
• methods: merging or fusing related topics into more general ones, removing redundancies, etc.
• example:
  • He sat down, read the menu, ordered, ate and left => He visited the restaurant.
Summary Generation
• say it in your own words
• simple if extraction is performed
6
Methods
Statistical scoring methods (pseudo)
Higher semantic/syntactic structures
• Network (graph) based methods
• Other methods (rhetorical analysis, lexical chains, co-reference chains)
AI methods
7
Statistical scoring (Pseudo)
General method:
1. score each entity (sentence, word);
2. combine scores;
3. choose the best sentence(s)
Scoring techniques:
• Word frequencies throughout the text (Luhn 58)
• Position in the text (Edmundson 69, Lin & Hovy 97)
• Title method (Edmundson 69)
• Cue phrases in sentences (Edmundson 69)
• Bayesian classifier (Kupiec et al. 95)
8
Word frequencies (Luhn 58)
The very first work in automated summarization
Claim: words which are frequent in a document indicate the topic discussed
• frequent words indicate the topic
• clusters of frequent words indicate summarizing sentences
Stemming should be used; "stop words" (e.g. "the", "a", "for", "is") are ignored
9
Word frequencies (Luhn 58)
Calculate term frequency in the document: f(term)
Calculate inverse log-frequency in the corpus: if(term) = log( n(corpus) / n(term, corpus) )
Words with high f(term) · if(term) are indicative
The sentence with the highest sum of weights is chosen
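The scoring on this slide can be sketched in a few lines. A minimal sketch, assuming a toy corpus of whole documents; the helper names (`tokenize`, `score_sentences`) and the naive sentence splitter are illustrative, not from the original work:

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "for", "is", "in", "of", "and", "to"}

def tokenize(text):
    # Lowercase word tokens with stop words removed (no stemming here).
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def score_sentences(document, corpus_docs):
    """Score each sentence by the sum of f(term) * if(term) over its words,
    where if(term) = log( n(corpus) / n(term, corpus) )."""
    n_corpus = len(corpus_docs)
    doc_freq = Counter()
    for doc in corpus_docs:
        doc_freq.update(set(tokenize(doc)))
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    tf = Counter(tokenize(document))
    scored = []
    for sent in sentences:
        score = sum(
            tf[w] * math.log(n_corpus / doc_freq[w])
            for w in tokenize(sent) if doc_freq.get(w)
        )
        scored.append((score, sent))
    return sorted(scored, reverse=True)
```

Sentences whose words are frequent in the document but rare in the corpus rise to the top of the ranking.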
10
Position in the text (Edmundson 69, Lin & Hovy 97)
Claim: important sentences occur in specific positions
Position depends on the type (genre) of text
• the inverse of position in the document works well for news
Important information occurs in specific sections of the document (introduction/conclusion)
Assign a score to sentences according to location in the paragraph
Assign a score to paragraphs and sentences according to location in the entire text
11
Title method (Edmunson 69)
Claim: the title of a document indicates its content (Duh!)
Words in the title help find relevant content
Create a list of title words, remove "stop words"
Use those as keywords in order to find important sentences
12
Cue phrases method (Edmunson 69)
Claim: important sentences contain cue words / indicative phrases
• "The main aim of the present paper is to describe…" (IND)
• "The purpose of this article is to review…" (IND)
• "In this report, we outline…" (IND)
• "Our investigation has shown that…" (INF)
Some words are considered bonus, others stigma
• bonus: comparatives, superlatives, conclusive expressions, etc.
• stigma: negatives, pronouns, etc.
Implemented for French (Lehman '97)
• Paice implemented a dictionary of <cue, weight>
• Grammar for indicative expressions:
  In + skip(0) + this + skip(2) + paper + skip(0) + we + ...
Cue words can be learned (Teufel '98)
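A Paice-style cue pattern like the one above can be compiled into a regular expression. A sketch under the assumption that skip(n) allows up to n intervening words; the function names are illustrative:

```python
import re

def compile_cue(pattern):
    """Compile a cue pattern given as (word, skip) pairs, e.g.
    [("in", 0), ("this", 2), ("paper", 0), ("we", 0)],
    where skip(n) permits up to n intervening words."""
    parts = []
    for word, skip in pattern:
        parts.append(r"\b%s\b" % re.escape(word))
        # Allow up to `skip` extra words before the next pattern word.
        parts.append(r"(?:\s+\w+){0,%d}\s+" % skip if skip else r"\s+")
    # Drop the separator trailing the final word.
    return re.compile("".join(parts[:-1]), re.IGNORECASE)

CUE = compile_cue([("in", 0), ("this", 2), ("paper", 0), ("we", 0)])

def is_indicative(sentence):
    # True when the sentence matches the indicative-expression grammar.
    return bool(CUE.search(sentence))
```

For example, "In this short paper we present…" matches because "short" falls within the skip(2) allowance after "this".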
13
Feature combination (Edmundson ’69)
Linear combination of 4 features
• title, cue, keyword, position
• the weights are adjusted using training data with any minimization technique
The following results were obtained
• best system: cue + title + position
14
Bayesian Classifier (Kupiec et al. 95)
Uses a Bayesian classifier:

P(s ∈ S | F1, F2, ..., Fk) = P(F1, F2, ..., Fk | s ∈ S) · P(s ∈ S) / P(F1, F2, ..., Fk)

• Assuming statistical independence of the features:

P(s ∈ S | F1, F2, ..., Fk) = [ ∏ j=1..k P(Fj | s ∈ S) ] · P(s ∈ S) / [ ∏ j=1..k P(Fj) ]

• Higher-probability sentences are chosen to be in the summary
• Performance:
  • For 25% summaries, 84% precision
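The independence formula above is easy to evaluate in the log domain. A minimal sketch with hypothetical binary features and made-up probabilities (a real system would estimate them from a labelled training corpus):

```python
import math

def naive_bayes_score(features, p_feat_given_summary, p_feat, p_summary):
    """Log of Kupiec et al.'s independence formula:
    P(s in S | F1..Fk) = prod_j P(Fj | s in S) * P(s in S) / prod_j P(Fj).
    `features` lists which binary features fired for a sentence."""
    log_p = math.log(p_summary)
    for f in features:
        log_p += math.log(p_feat_given_summary[f]) - math.log(p_feat[f])
    return log_p

# Hypothetical probabilities, for illustration only.
p_s = 0.25
p_f_given_s = {"cue": 0.6, "position": 0.7, "title": 0.4}
p_f = {"cue": 0.2, "position": 0.3, "title": 0.1}

# A sentence with a cue phrase and a good position outranks one with neither.
a = naive_bayes_score(["cue", "position"], p_f_given_s, p_f, p_s)
b = naive_bayes_score([], p_f_given_s, p_f, p_s)
```

Sentences are then ranked by this score and the top ones extracted.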
15
Methods
Statistical scoring methods
• problems:
  • Synonymy: one concept can be expressed by different words
    • example: "cycle" and "bicycle" refer to the same kind of vehicle
  • Polysemy: one word can have several meanings
    • example: "cycle" could mean life cycle or bicycle
  • Phrases: a phrase may have a meaning different from the words in it
    • "an alleged murderer" is not a murderer (Lin and Hovy 1997)
Higher semantic/syntactic structures
• Network (graph) based methods
• Other methods (rhetorical analysis, lexical chains, co-reference chains)
AI methods
16
Higher semantic/syntactic structures
Claim: important sentences/paragraphs are the most highly connected entities in more or less elaborate semantic structures.
Classes of approaches• lexical similarity (WordNet, lexical chains);• word co-occurrences; • co-reference;• combinations of the above.
17
Lexical chain
• lexical cohesion (Halliday & Hasan)
  • reiteration
    • synonym
    • antonym
    • hypernym
  • collocation
    • co-occurrence
    • example: "He works as a teacher at a school" (teacher–school)
• Lexical chain:
  • a sequence of words which have lexical cohesion (reiteration/collocation)
18
Lexical chain
• Method for creating chains:
  • Select a set of candidate words from the text.
  • For each candidate word, find an appropriate chain, relying on a relatedness criterion among members of the chains and the candidate word.
  • If such a chain is found, insert the word in this chain and update it accordingly; else create a new chain.
• Scoring the chains:
  • synonym = 10, antonym = 7, hyponym = 4
  • strong chains must be selected
• Sentence selection for summary:
  • H1: select the first sentence that contains a member of a strong chain
    • example chain: AI = 2; Artificial Intelligence = 1; Field = 7; Technology = 1; Science = 1
  • H2: select the first sentence that contains a "representative" (frequent) member of the chain
  • H3: identify a text segment where the chain is highly dense (density is the proportion of words in the segment that belong to the chain)
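The chain-building steps above can be sketched greedily. A minimal sketch in which toy synonym sets stand in for a WordNet-style relatedness test; the names (`build_chains`, `SYNSETS`) are illustrative:

```python
def build_chains(words, related):
    """Greedy chain construction: for each candidate word, join the first
    chain containing a related word; otherwise start a new chain."""
    chains = []
    for w in words:
        for chain in chains:
            if any(related(w, member) for member in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

# Toy relatedness: hypothetical synonym sets instead of WordNet lookups.
SYNSETS = [{"machine", "device", "pump"}, {"anesthetic", "drug"}]

def related(a, b):
    return a == b or any(a in s and b in s for s in SYNSETS)

chains = build_chains(
    ["machine", "anesthetic", "device", "pump", "drug"], related)
```

Chains would then be scored with the synonym/antonym/hyponym weights above and the strong ones used for sentence selection.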
19
Lexical chain
• Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient.
20
Network based method (Salton et al. '97)
Vector Space Model
• each text unit is represented as a vector: D_i = (d_i1, ..., d_in)
Standard similarity metric: sim(D_i, D_j) = Σ_k d_ik · d_jk
Construct a graph of paragraphs or other entities; the strength of a link is the similarity metric
Use a threshold to decide upon similar paragraphs or entities (pruning of the graph)
Paragraph selection heuristics:
• bushy path
  select paragraphs with many connections to other paragraphs and present them in text order
• depth-first path
  select one paragraph with many connections; select a connected paragraph (in text order) which is also well connected; continue
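The similarity metric and the bushy-path heuristic combine into a short routine. A minimal sketch, assuming paragraphs are already term-weight vectors of equal length; the function names are illustrative:

```python
def similarity(d_i, d_j):
    """Inner-product similarity sim(D_i, D_j) = sum_k d_ik * d_jk."""
    return sum(a * b for a, b in zip(d_i, d_j))

def bushy_path(vectors, threshold, k):
    """Link paragraphs whose similarity exceeds `threshold`, then pick the
    k 'bushiest' paragraphs (most links) and return them in text order."""
    n = len(vectors)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(vectors[i], vectors[j]) > threshold:
                degree[i] += 1
                degree[j] += 1
    top = sorted(range(n), key=lambda i: degree[i], reverse=True)[:k]
    return sorted(top)  # present selected paragraphs in text order
```

The depth-first-path heuristic would instead walk from one well-connected paragraph to the next connected, well-connected one in text order.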
21
Text relation map
(figure: nodes A–F with link counts A=3, B=1, C=2, D=1, E=3, F=2; links are drawn between units with sim > thr and omitted when sim < thr)
22
User oriented text summarization
23
Motivation
Summaries which are generic in nature do not cater to the user's background and interests
Results show that each person has a different perspective on the same text
• Marcu (1997): found 71 percent agreement among 13 judges over 5 texts from Scientific American
• Rath (1961): found that extracts selected by four different human judges had only 25 percent overlap
• Salton (1997): found that the 20 most important paragraphs extracted by 2 subjects have only 46 percent overlap
24
User Feedback
Data click: when a user clicks on a document, the document is considered to be of more interest to the user than other, unclicked ones
Query history: the most widely used implicit user feedback at present
• example: http://www.google.com/psearch
Attention time: often referred to as display time or reading time
Other types of implicit user feedback: scrolling, annotation, bookmarking and printing behaviors
25
Summarization Using Data Click
Use the extra knowledge in clickthrough data to improve Web-page summarization
A collection of clickthrough data can be represented by a set of triples <u, q, p> (user, query, clicked page)
Typically, a user's query words reflect the true meaning of the target Web-page content
Problems:
• incomplete click problem
• noisy click data
26
Attention Time
MAIN IDEA: rely on the attention (reading) time that individual users spend on single words in a document
The prediction of user attention over every word in a document is based on the user's attention during previous reads
The algorithm tracks a user's attention times over individual words using a vision-based commodity eye-tracking mechanism:
• a simple web camera and an existing eye-tracking algorithm (the "Opengazer project")
• the error of the detected gaze location on the screen is between 1–2 cm, depending on which area of the screen the user is looking at (on a 19" monitor)
27
Attention Time
Anchoring gaze samples onto individual words:
• the detected gaze central point is positioned at (x, y) in screen space
• compute the central display point of each word, denoted (xi, yi)
• for each gaze sample detected by the eye-tracking module, assign fractions of the sample to words in the document
• the overall attention that a word receives is the sum of all the fractional gaze samples it is assigned in the above process
• during processing, remove the stop words
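The anchoring step can be sketched as follows. The slides do not give the exact weighting scheme for the fractional assignment, so inverse-distance weighting within a fixed radius is an assumption here, and all names are illustrative:

```python
import math

def assign_gaze(gaze_points, word_centers, radius=50.0):
    """Distribute each gaze sample (x, y) fractionally over nearby words
    with centers (xi, yi). Weighting is an assumed inverse-distance
    scheme within `radius` screen units, not the paper's exact formula."""
    attention = [0.0] * len(word_centers)
    for gx, gy in gaze_points:
        weights = []
        for i, (xi, yi) in enumerate(word_centers):
            d = math.hypot(gx - xi, gy - yi)
            if d < radius:
                weights.append((i, 1.0 / (d + 1.0)))
        if not weights:
            continue  # gaze sample fell on no word
        total = sum(w for _, w in weights)
        for i, w in weights:
            attention[i] += w / total  # fractions of one sample sum to 1
    return attention
```

Each gaze sample contributes exactly one unit of attention, split among the words near it.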
28
Attention Time
Predicting user attention for sentences:
• attention time prediction for a word is based on the semantic similarity between words
• for an arbitrary word w which is not among the observed words, calculate the similarity between w and every wi (i = 1, ..., n)
• select the k words which share the highest semantic similarity with w
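The k-nearest-words prediction above can be sketched briefly. The slides do not specify how the k neighbours are combined, so a similarity-weighted average is assumed; the toy similarity table stands in for a real semantic measure:

```python
def predict_attention(w, known, similarity, k=3):
    """Predict attention time for an unseen word `w` from the k known
    words most similar to it. `known` maps word -> observed attention
    time; the weighted-average combination is an assumption."""
    neighbours = sorted(known, key=lambda v: similarity(w, v), reverse=True)[:k]
    total = sum(similarity(w, v) for v in neighbours)
    if total == 0:
        return 0.0
    return sum(similarity(w, v) * known[v] for v in neighbours) / total

# Hypothetical pairwise similarities standing in for a semantic measure.
SIM = {frozenset({"car", "truck"}): 0.8, frozenset({"car", "apple"}): 0.1}

def sim(a, b):
    return SIM.get(frozenset({a, b}), 0.0)
```

Sentence-level attention would then aggregate the predicted word attentions.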
29
Attention Time
30
Attention Time
31
Other types of implicit user feedback
Extract the personal information of the user using information available on the web:
• put the person's full name into a search engine (the name is quoted with double quotation marks, such as "Albert Einstein")
• the top n documents are retrieved
• after stop-word removal and stemming, a unigram language model is learned on the extracted text content
User-specific sentence scoring:
• sentence score:
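The transcript does not show the sentence-score formula itself, so the sketch below is one plausible instantiation: a Laplace-smoothed unigram model over the user's retrieved text, scoring sentences by average log-probability. All names are illustrative:

```python
import math
from collections import Counter

def unigram_model(text, alpha=1.0):
    """Learn a Laplace-smoothed unigram model from user-specific text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    def p(word):
        return (counts.get(word, 0) + alpha) / (total + alpha * (vocab + 1))
    return p

def sentence_score(sentence, p):
    """Average log-probability of the sentence under the user model.
    (The exact scoring formula is not shown in the transcript.)"""
    words = sentence.lower().split()
    return sum(math.log(p(w)) for w in words) / len(words)
```

Sentences close to the user's background vocabulary score higher and are preferred for the personalized summary.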
32
Example
• Topic of summary generation: "Microsoft to open research lab in India"
• 8 articles published in different news sources form the news cluster
• User A is from the NLP domain and User B from the network security domain
Generic summary: The New Lab, Called Microsoft Research India,
Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India. Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States . In Line With Microsoft’s Research Strategy Worldwide , The Bangalore Lab Will Collaborate With And Fund Research At Key Educational Institutions In India, Such As The Indian Institutes Of Technology, Anandan Said . Although Microsoft Research Doesn’t Engage In Product Development Itself, Technologies Researchers Create Can Make Their Way Into The Products The Company
Other types of implicit user feedbacks
33
User A Specific summary :
The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan, Managing Director Of Microsoft Research India.Microsoft’s Mission India, Formally Inaugurated Jan. 12, 2005, Is Microsoft’s Third Basic Research Facility Established Outside The United States. Microsoft Will Collaborate With The Government Of India And The Indian Scientific Community To Conduct Research In Indic Language Computing Technologies, This Will Include Areas Such As Machine Translation Between Indian Languages And English, Search And Browsing And Character Recognition. In Line With Microsoft’s Research Strategy Worldwide,The Bangalore Lab
Other types of implicit user feedbacks
34
User B Specific summary :
The New Lab, Called Microsoft Research India, Goes Online In January, And Will Be Part Of A Network Of Five Research Labs That Microsoft Runs Worldwide, Said Padmanabhan Anandan , Managing Director Of Microsoft Research India. The Newly Announced India Research Group Focuses On Cryptography, Security, Algorithms And Multimedia Security, Ramarathnam Venkatesan, A Leading Cryptographer At Microsoft Research In Redmond, Washington, In The US, Will Head The New Group. Microsoft Research India will conduct a four-week summer school featuring lectures by leading experts in the fields of cryptography, algorithms and security. The program is aimed at senior undergraduate students, graduate students and faculty
Other types of implicit user feedbacks
35
FarsiSumA Persian text summarizer
By: Nima Mazdak, Martin Hassel, Department of Linguistics, Stockholm University, 2004
36
FarsiSum
Tokenizer: sentence boundaries are found by searching for periods, exclamation marks, question marks, the Persian question mark (؟), <BR> (the HTML new line), ".", ",", "!", "?", "<", ">", ":", spaces, tabs and new lines
Sentence Scoring: text lines are put into a data structure for storing key/value pairs, called a text table
37
FarsiSum
Sentence Scoring:
• Word score = (word frequency) * (a keyword constant)
• Sentence score = Σ word scores (for all words in the current sentence)
• Average sentence length (ASL) = word count / line count
• Final sentence score = (ASL * sentence score) / (number of words in the current sentence)
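The FarsiSum formulas above fit in one small function. A minimal sketch; applying the keyword constant only to keywords is an assumption, and the function name and boost value are illustrative:

```python
from collections import Counter

def farsisum_score(sentences, keywords, keyword_boost=3.0):
    """FarsiSum-style scoring as given on the slide:
    word score = frequency * keyword constant (boost applied only to
    keywords here, which is an assumption); sentence score is the sum
    of word scores, normalised by sentence length and scaled by ASL."""
    freq = Counter(w for s in sentences for w in s.split())
    word_count = sum(len(s.split()) for s in sentences)
    asl = word_count / len(sentences)  # average sentence length
    scores = []
    for s in sentences:
        words = s.split()
        raw = sum(freq[w] * (keyword_boost if w in keywords else 1.0)
                  for w in words)
        scores.append(asl * raw / len(words))
    return scores
```

The length normalisation keeps long sentences from dominating purely by word count.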
38
FarsiSum
Notes on the current implementation:
• Word boundary ambiguity:
  • a full stop (.) marks a sentence boundary, but it may also appear in abbreviations or acronyms
  • compound words and light verb constructions may also appear with or without a space
• Ambiguity in morphology
• Word order: the canonical word order in Persian is SOV, but Persian is a free-word-order language
• Possessive construction
39
Thanks!