Automatic Summarization Tutorial
TRANSCRIPT
Tutorial on Automatic Summarization
by Shilpa Subrahmanyam
Prepared as an assignment for CS410: Text Information Systems in Spring 2016
The World Today
• We live in an age in which a massive amount of content is available at our fingertips
– 500 million tweets are posted every day
– 2 million+ articles are posted daily
– The average article length at the New York Times is about 1,200 words.
• Automatic summarization can prove extremely useful in attempting to generate insights and themes from such large corpora of data.
What is Automatic Summarization?
• Definition: Employing an algorithm to distill a text corpus into a considerably smaller body of important ideas, sentences, phrases, etc.
• Key Challenges:
– Determining how to rank the importance of a sentence, word, or phrase
– Eliminating and capitalizing on redundancy
– Incorporating sentiment into the summary
– Ensuring the summary is readable
– Avoiding a search through an exponential solution space
Types of Automatic Summarization
• Extractive Summarization
– Selects a subset of existing words, phrases, or sentences in the original text in order to generate a summary.
• Abstractive Summarization
– Aims to create a summary that is closer to what a human might generate.
– Phrases in the summary don't necessarily need to have appeared in the original text.
• Keyword/Key-phrase extraction
Applications and Use Cases
• Summarizing tweets in order to determine the timeline of a sports game
• Product owners summarizing highly redundant product reviews (in order to surface popular opinions and key insights)
• Distilling a long article into a set of key points
Summarization of Small Data (i.e., Tweets, microblogs, etc.)
• These are summarization methods targeted at tweets, microblogs, and other length-limited data.
• A lot of recent work has been done in this area because the advent of Twitter and similar sites that favor short-form content has made large collections of such small data readily available.
Basic Approaches
• Topical Keyphrase Extraction from Twitter (Zhao et al.)
• Summarizing Microblogs Automatically (Sharifi et al.)
Topical Keyphrase Extraction from Twitter (Xin Zhao, Jiang, He, Song, Achananuparp, Lim, Li)
• The authors propose a context-sensitive topical PageRank method for keyword ranking.
• This is paired with a probabilistic scoring function that considers two factors when ranking key phrases:
– relevance
– interestingness
Key idea: Generate a list of topical key phrases that serves as a summarization of a corpus of tweets.
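The topic-biased PageRank idea can be sketched on a toy word co-occurrence graph. This is a minimal illustration, not the authors' implementation: the graph, the topic distribution, and the damping value below are all made-up assumptions.

```python
# Sketch of topic-biased PageRank: the random-jump term is weighted by
# P(word | topic) instead of being uniform, so topic-relevant words rank higher.

def topical_pagerank(cooccur, topic_weights, damping=0.85, iters=50):
    """Rank words with a PageRank whose teleport term is biased toward
    words the topic model considers relevant."""
    words = list(cooccur)
    total = sum(topic_weights.get(w, 0.0) for w in words) or 1.0
    rank = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        new = {}
        for w in words:
            # teleport term: proportional to P(word | topic)
            jump = (1 - damping) * topic_weights.get(w, 0.0) / total
            # votes from neighbours, each splitting its rank over its edges
            votes = sum(rank[v] / len(cooccur[v]) for v in cooccur[w] if cooccur[v])
            new[w] = jump + damping * votes
        rank = new
    return sorted(rank, key=rank.get, reverse=True)

# toy symmetric co-occurrence graph and an invented topic distribution
graph = {
    "goal": {"match", "team"},
    "match": {"goal", "team"},
    "team": {"goal", "match", "the"},
    "the": {"team"},
}
topic = {"goal": 0.5, "match": 0.3, "team": 0.2, "the": 0.0}
ranking = topical_pagerank(graph, topic)
```

The topic bias pushes the stopword-like node to the bottom of the ranking even though it is connected to the graph.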
Summarizing Microblogs Automatically (Sharifi, Hutton, Kalita)
• Start with a topic or phrase and collect tweets that are related to that topic or phrase.
• Isolate the longest sentence in each tweet that contains the topic phrase. This set of sentences is the input to the algorithm.
• Build a graph representing the common sequences of words (the common phrases) that occur before and after the topic phrase.
– The root node represents the topic phrase.
– Each word is represented by a node and a count that indicates how many times the word occurs within the set of input sequences. A phrase is represented in the graph by a sequence of nodes starting at the root.
Key idea: Take a trending phrase, collect a huge number of tweets containing that trending phrase, and automatically generate a summary of the tweets that were collected.
Summarizing Microblogs Automatically (Sharifi, Hutton, Kalita)
• Assign a weight to each node in order to prevent longer phrases from dominating the output.
– Words are given weights proportional to their counts.
• Construct a "partial summary" by searching the graph for the path with the largest total weight, considering all paths that begin at the root node and end at a non-root node.
– This path represents the most common phrase occurring either before or after the topic phrase.
• Run the algorithm once more, this time initializing the root node with the partial summary and rebuilding the graph. The most heavily weighted path in the new graph is the final summary produced by the algorithm.
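The graph search above can be sketched in a much-simplified form. This toy version only grows the summary to the right of the topic phrase and weights nodes by raw counts (the paper's weighting also adjusts for phrase length); the tweets are invented.

```python
# Simplified one-sided Phrase Reinforcement sketch: build a count-weighted
# trie of word sequences that follow the topic phrase, then return the
# heaviest root-to-node path as the summary continuation.

def best_following_phrase(sentences, topic, max_len=4):
    root = {}  # trie: word -> [count, children]
    t = topic.lower().split()
    for s in sentences:
        words = s.lower().split()
        for i in range(len(words) - len(t) + 1):
            if words[i:i + len(t)] == t:  # topic phrase found here
                node = root
                for w in words[i + len(t): i + len(t) + max_len]:
                    entry = node.setdefault(w, [0, {}])
                    entry[0] += 1
                    node = entry[1]
                break

    def heaviest(node):
        # best (total weight, path) among all downward paths
        best_w, best_p = 0, []
        for w, (count, children) in node.items():
            sub_w, sub_p = heaviest(children)
            if count + sub_w > best_w:
                best_w, best_p = count + sub_w, [w] + sub_p
        return best_w, best_p

    _, path = heaviest(root)
    return " ".join([topic] + path) if path else topic

tweets = ["apple unveils new iphone at event today",
          "apple unveils new iphone with better camera",
          "i think apple unveils nothing interesting"]
summary = best_following_phrase(tweets, "apple unveils")
```

The repeated continuation "new iphone" accumulates the most weight, so the heaviest path extends the topic phrase with the most common wording.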
Why are these approaches suboptimal?
• Topical Keyphrase Extraction from Twitter
– Key-phrase extraction can often produce noisy results (i.e., key phrases that are common but don't help to identify themes).
– Moreover, simply looking through a list of key phrases may often not be sufficient for summative purposes.
• More detail may be required for a sufficient grasp of the original text.
• Summarizing Microblogs Automatically
– This approach is predicated on specifically retrieving tweets that all pertain to the same trending phrase as input to the algorithm.
– It extracts only the longest sentence that contains the topic keyword(s) from each tweet as input to the graph algorithm.
• Equating sentence length with importance could prove fallacious. Furthermore, this could mean discarding valuable information before the algorithm even begins.
What are some more advanced approaches?
• Twitter Topic Summarization by Ranking Tweets Using Social Influence and Content Quality (YaJuan et al.)
• Summarizing Sporting Events Using Twitter (Nichols et al.)
• Sumblr: continuous summarization of evolving tweet streams (Shou et al.)
Twitter Topic Summarization by Ranking Tweets Using Social Influence and Content Quality (YaJuan, ZhuMin, FuRu, Ming, Heung-Yeung)
• This approach takes advantage of follower-followee relationships on Twitter -- the main manner in which the social influence of users is inferred. The quality of tweets is judged based on a few factors that are incorporated into the graph-based ranking algorithm:
– readability
– content richness
– a measure of the regularity of the written language
– how pointless the content is
• In order to curb redundancy within the final summary, the model selects tweets from the ranking results using a Maximal Marginal Relevance algorithm (Carbonell and Goldstein, 1998).
Key idea: The algorithm models and formulates the ranking of tweets in a unified mutual reinforcement graph.
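The Maximal Marginal Relevance selection step can be sketched as follows. The toy tweets, relevance scores, Jaccard similarity, and the lambda = 0.5 trade-off are illustrative assumptions, not values from the paper.

```python
# Greedy MMR (Carbonell & Goldstein, 1998): at each step pick the item that
# maximizes lam * relevance - (1 - lam) * similarity-to-already-selected.

def mmr_select(candidates, relevance, similarity, k=3, lam=0.5):
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(d):
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

def jaccard(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

# toy ranked tweets with invented relevance scores
ranked = {"great goal by messi": 0.9,
          "messi scores a great goal": 0.85,
          "the referee made a bad call": 0.6}
picked = mmr_select(ranked, ranked, jaccard, k=2)
```

Even though the second tweet has a higher relevance score than the third, its overlap with the first pick makes it redundant, so MMR prefers the dissimilar tweet.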
Twitter Topic Summarization by Ranking Tweets Using Social Influence and Content Quality (YaJuan, ZhuMin, FuRu, Ming, Heung-Yeung)
• The algorithm models the problem of tweet ranking in a unified mutual reinforcement graph.
– In this model, the social influence of users and a measure of the quality of the tweet content are both taken into consideration, in a mutually reinforcing manner.
Summarizing Sporting Events Using Twitter (Nichols, Mahmud, Drews)
• Takes advantage of the fact that over the course of a sports game, viewers tend to post Twitter updates expressing opinions about the different events that occur throughout the game.
• Aims to generate a natural summary of the event that incorporates temporal cues, such as spikes in the volume of status updates, in order to identify important moments throughout the course of the game.
• Implements a sentence-ranking method that extracts relevant sentences from the tweet corpus -- each presumably referring to an important moment in the game.
Key idea: Summarize sporting events from a live corpus of tweets.
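The temporal-cue idea -- spikes in status-update volume marking important moments -- can be sketched as follows. The moving-average baseline and the 2x threshold are illustrative assumptions, not parameters from the paper.

```python
# Flag time slots whose tweet volume jumps well above the recent baseline.

def find_spikes(counts, window=5, factor=2.0):
    """Return indices whose count is at least `factor` times the moving
    average of the preceding `window` slots."""
    spikes = []
    for i in range(window, len(counts)):
        baseline = sum(counts[i - window:i]) / window
        if baseline > 0 and counts[i] >= factor * baseline:
            spikes.append(i)
    return spikes

# invented per-minute tweet counts for a toy game
minute_counts = [10, 12, 11, 9, 10, 40, 12, 11, 10, 9, 35]
spikes = find_spikes(minute_counts)
```

The flagged minutes would then seed the sentence-ranking step, which picks representative tweets from each burst.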
Sumblr: continuous summarization of evolving tweet streams (Shou, Wang, Ke Chen, Gang Chen)
• Traditional automatic summarization methods for text documents primarily focus on static, small-scale data.
• Sumblr (SUMmarization By stream cLusteRing) aims to summarize tweet streams -- thereby providing a dynamic summarization framework.
Key idea: A timeline-based framework for topic summarization of tweets. The algorithm ranks and selects a diverse crop of important tweets within a number of different sub-topic groups; these tweets serve as the basis of the summary composed for each sub-topic.
Sumblr: continuous summarization of evolving tweet streams (Shou, Wang, Ke Chen, Gang Chen)
• During tweet stream clustering, it is necessary to maintain statistics for tweets to facilitate summary generation. For this reason, the authors introduce a representation called the "tweet cluster vector" (TCV).
• The Sumblr framework operates as follows:
– At the start of the stream, collect a small number of tweets and use a k-means clustering algorithm to create the initial clusters; the corresponding TCVs are initialized.
– Incrementally update the TCVs whenever a new tweet arrives. At various points in time, the algorithm has to decide whether to create a new centroid, add a tweet to an existing centroid, or merge/delete existing clusters.
– A high-level summarization step produces online and historical summaries.
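The incremental clustering step can be sketched with a heavily simplified stand-in for the TCV: just the sum of member term vectors plus a member count, which is enough to maintain a centroid online (real TCVs carry additional statistics such as timestamps). The similarity threshold and toy tweets are assumptions.

```python
import math
from collections import Counter

class TweetClusterVector:
    """Additive cluster statistics: vector sum and member count."""
    def __init__(self, vec):
        self.sum = Counter(vec)
        self.n = 1

    def centroid(self):
        return {w: c / self.n for w, c in self.sum.items()}

    def add(self, vec):
        self.sum.update(vec)
        self.n += 1

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def stream_cluster(tweets, threshold=0.3):
    """Assign each arriving tweet to its most similar cluster, or open
    a new cluster when nothing is similar enough."""
    clusters = []
    for tweet in tweets:
        vec = Counter(tweet.lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c.centroid())
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best.add(vec)
        else:
            clusters.append(TweetClusterVector(vec))
    return clusters

stream = ["messi scores goal",
          "what a goal by messi",
          "rain delays flights",
          "heavy rain delays many flights"]
clusters = stream_cluster(stream)
```

Because only sums and counts are stored, each arriving tweet is processed in time proportional to the number of clusters, which is what makes the approach stream-friendly.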
Comparison of Microblog Methods

Paper: Twitter Topic Summarization by Ranking Tweets Using Social Influence and Content Quality
Pros: Takes advantage of the social influence of authors when ranking tweets; takes readability, content richness, a measure of the regularity of the written language, and how pointless the content is into account.
Cons: The algorithm's structure could thwart the summarization of more niche topics that have sparse follower-followee adjacency matrices.

Paper: Summarizing Sporting Events Using Twitter
Pros: Incorporates temporal cues; deals with a live corpus.
Cons: Does not use valuable social metadata to rank tweets.

Paper: Topical Keyphrase Extraction from Twitter
Pros: Considers both relevance and interestingness.
Cons: Key-phrase extraction may not be as helpful as full-sentence summarization for some use cases -- especially for data that exhibits low topical phrase redundancy.

Paper: Sumblr: continuous summarization of evolving tweet streams
Pros: Provides streaming summarization (as well as historical summaries).
Cons: Implementation is more complicated.

Paper: Summarizing Microblogs Automatically
Pros: The graph algorithm's relevance calculation does not give long sentences an unfair advantage over shorter sentences with just as much important content.
Cons: Extracts only the longest sentence that contains the topic keyword(s) from each tweet as input to the graph algorithm.
Summarization of Larger Data
• The following are summarization methods that can also be applied to larger data: reviews, documents, news articles, and so forth.
Basic Approach
• Extraction based approach for text summarization using k-means clustering (Agrawal et al.)
Extraction based approach for text summarization using k-means clustering (Agrawal, Gupta)
• At a high level, the algorithm proposed by the authors of this paper is an unsupervised learning approach that can be broken down into the following steps:
– tokenizing the document
– computing a score for each sentence
– clustering the sentences using k-means
– extracting important sentences
– combining those sentences to form a summary
Key idea: Incorporates k-means clustering, TF-IDF, and tokenization in order to perform extractive text summarization.
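The tokenize/score/cluster/extract pipeline can be sketched end to end with a toy TF-IDF representation and a plain k-means loop. Everything here (the weighting, the cosine measure, the sentences) is an illustrative assumption rather than the authors' exact formulation.

```python
import math
import random
from collections import Counter

def tfidf_vectors(sentences):
    """TF-IDF vectors, treating each sentence as its own document."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in d)
    return [{w: tf * math.log(n / df[w]) for w, tf in d.items()} for d in docs]

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kmeans_summary(sentences, k=2, iters=20, seed=0):
    """Cluster sentence vectors with k-means, then extract the sentence
    closest to each centroid as the summary."""
    vecs = tfidf_vectors(sentences)
    centroids = [dict(v) for v in random.Random(seed).sample(vecs, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, v in enumerate(vecs):
            groups[max(range(k), key=lambda c: cosine(v, centroids[c]))].append(i)
        for c, g in enumerate(groups):
            if g:  # recompute centroid as the mean of member vectors
                merged = Counter()
                for i in g:
                    merged.update(vecs[i])
                centroids[c] = {w: val / len(g) for w, val in merged.items()}
    return [sentences[max(g, key=lambda i: cosine(vecs[i], centroids[c]))]
            for c, g in enumerate(groups) if g]

sents = ["the cat sat on the mat",
         "a cat chased the mouse",
         "stocks fell sharply today",
         "markets and stocks dropped"]
summary = kmeans_summary(sents, k=2)
```

Picking whichever sentence is nearest the centroid is exactly the naive extraction step the next slide criticizes: a sentence stuffed with high-IDF terms can win without being representative.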
What makes this approach suboptimal?
• Does not take advantage of redundancy to rank importance.
• The method for extracting important sentence(s) from each centroid is naïve and can be gamed.
More Advanced Approaches
• Product review summarization from a deeper perspective (Ly et al.)
• Mining and Summarizing Customer Reviews (Hu et al.)
• Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions (Ganesan et al.)
• Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions (Ganesan et al.)
Product review summarization from a deeper perspective (Ly, Sugiyama, Lin, Kan)
• The first step is Product Facet Identification.
– In order to identify candidate facets, we need to preprocess the input reviews.
• This involves part-of-speech tagging, stemming, applying syntactic rules, and stop-word removal.
– We then deploy the Stanford Dependency Parser in order to detect the role of each noun.
• We want to discard nouns that aren't subjects or objects.
• We then use association rule mining to identify frequent product facets.
Key idea: The algorithm automatically summarizes a massive collection of product reviews and generates a concise, non-redundant summary. Not only does this system extract review sentiments, but it also extracts the underlying justifications behind those sentiments.
Product review summarization from a deeper perspective (Ly, Sugiyama, Lin, Kan)
• The second step is summarization.
– For each of the facets mined in the previous step, we want to associate it with relevant opinion sentences that match the polarity expressed by the majority of the opinions in the reference text.
– We first restrict the algorithm to run only on opinionated sentences from the reviews.
• We then perform sentiment analysis on the sentences to assign a polarity score to each sentence (the sum of the polarities of the words in the sentence).
– We then calculate content-based pairwise similarities between all of the resulting opinion sentences and use these scores to cluster the sentences.
– The final task is to select the most representative sentence from each centroid for the final summary.
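The polarity-scoring step above -- a sentence's score as the sum of its word polarities -- can be sketched directly. The tiny lexicon is a made-up stand-in for a real sentiment resource.

```python
# Hypothetical word-polarity lexicon; real systems use a full sentiment resource.
LEXICON = {"great": 1.0, "love": 1.0, "good": 0.5,
           "bad": -0.5, "poor": -0.5, "terrible": -1.0, "broken": -1.0}

def sentence_polarity(sentence):
    """Sum the polarity of each word; unknown words contribute 0."""
    return sum(LEXICON.get(w, 0.0) for w in sentence.lower().split())
```

Note that a bag-of-words sum like this ignores negation ("not good" still scores +0.5), which is exactly the weakness called out in the comparison table later in this tutorial.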
Mining and Summarizing Customer Reviews (Hu, Liu)
• The paper focuses on the problem of producing feature-based summaries of customer reviews of products sold online. In this context, "features" refers to product attributes.
• Given a corpus of customer reviews for a given product, summarization is split into three subtasks:
– First, we must identify the product features that customers are speaking about.
– Second, for each feature, we have to identify sentences in the reviews that have positive or negative opinions.
– Last, we must produce a summary that aggregates all of this information.
Key idea: The algorithm assists merchants in extracting the main ideas and themes from hundreds, if not thousands, of customer reviews through product feature extraction and consideration of sentence sentiment.
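The three subtasks can be sketched with a toy example, assuming hypothetical word lists in place of the paper's POS tagging, association-rule mining, and WordNet-based opinion lexicon:

```python
from collections import Counter, defaultdict

# hypothetical word lists; the paper derives features via POS tagging and
# association rule mining, and opinion words via WordNet expansion
NOUNS = {"battery", "screen", "camera", "price"}
POSITIVE = {"great", "good", "amazing"}
NEGATIVE = {"bad", "poor", "short"}

def feature_summary(reviews, min_support=2):
    """Toy feature-based summary: treat frequently mentioned product nouns
    as features, then count positive and negative sentences per feature."""
    review_words = [set(r.lower().split()) for r in reviews]
    # subtask 1: frequent product nouns become features
    freq = Counter(w for ws in review_words for w in ws if w in NOUNS)
    features = {w for w, c in freq.items() if c >= min_support}
    # subtasks 2 and 3: tally opinionated sentences per feature
    summary = defaultdict(lambda: {"positive": 0, "negative": 0})
    for ws in review_words:
        for f in features & ws:
            if ws & POSITIVE:
                summary[f]["positive"] += 1
            if ws & NEGATIVE:
                summary[f]["negative"] += 1
    return dict(summary)

reviews = ["the battery is great",
           "battery life is short",
           "great camera",
           "the screen looks bad",
           "screen is good"]
summary = feature_summary(reviews)
```

The output is the feature-based summary structure the paper describes: per-feature counts of positive and negative opinion sentences, which a merchant can scan at a glance.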
Mining and Summarizing Customer Reviews (Hu, Liu)
[Figure: Summarization system architecture]
Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions (Ganesan, Zhai, Viega)
Key idea: A greedy approach that heuristically prunes the exponential solution space so that only promising candidates need to be considered. The ultimate goal is to generate a compact and informative summary using a set of micro-opinions.
• Micro-opinion: a 2-5 word phrase
• Formal problem set-up:
– Suppose we have a set of sentences Z = {z_i}, i ∈ [1, k], from an opinion document.
– The goal is to generate a micro-opinion summary M = {m_i}, where |m_i| ∈ [2, 5] and each m_i conveys a key opinion from Z.
– It is quite important to note that while we require each m_i to use words that occur at least once in Z, we do not require m_i to be an exact subsequence of any of the sentences in Z.
• Thus, this set-up is more of an abstractive summarization problem than an extractive one.
Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions (Ganesan, Zhai, Viega)
• Algorithm:
– Start with a set of high-frequency unigrams from the original corpus.
– Then merge these unigrams to generate higher-order bigrams, trigrams, and n-grams.
– At each merge step, make sure the candidate n-grams have reasonably high readability and representativeness scores.
– The candidate generation process stops when an attempt to grow an existing candidate leads to low readability or representativeness scores.
– The final step is to sort all the candidate n-grams by their objective function values (i.e., the sum of S_representativeness and S_readability) and generate a micro-opinion summary M by gradually adding the highest-scoring phrases until the accumulated summary length reaches the length threshold.
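The candidate-growing loop can be sketched in a toy form, using raw n-gram frequency as a stand-in for the paper's representativeness and readability scores; the corpus and thresholds are invented.

```python
from collections import Counter

def micropinion_summary(corpus, min_count=2, max_words=8):
    """Grow frequent unigrams into longer n-grams while they stay frequent,
    then greedily add the highest-frequency candidates up to a length budget."""
    tokens = [s.lower().split() for s in corpus]

    def count(ngram):
        n = len(ngram)
        return sum(1 for sent in tokens
                   for i in range(len(sent) - n + 1)
                   if tuple(sent[i:i + n]) == ngram)

    unigrams = Counter(w for sent in tokens for w in sent)
    candidates = {}
    frontier = [(w,) for w, c in unigrams.items() if c >= min_count]
    while frontier:
        grown = []
        for ng in frontier:
            extended = False
            for sent in tokens:  # try every adjacent word seen in the corpus
                for i in range(len(sent) - len(ng)):
                    if tuple(sent[i:i + len(ng)]) == ng:
                        bigger = ng + (sent[i + len(ng)],)
                        if bigger not in candidates and count(bigger) >= min_count:
                            candidates[bigger] = count(bigger)
                            grown.append(bigger)
                            extended = True
            if not extended and len(ng) >= 2:
                candidates[ng] = count(ng)  # growth stopped; keep the phrase
        frontier = grown

    summary, used = [], 0
    for ng, c in sorted(candidates.items(), key=lambda kv: -kv[1]):
        if used + len(ng) <= max_words:
            summary.append(" ".join(ng))
            used += len(ng)
    return summary

corpus = ["battery life is great",
          "battery life is too short",
          "great battery life",
          "screen is great"]
summary = micropinion_summary(corpus)
```

Note the overlapping phrases in the output: the real method additionally checks candidates for similarity against already-selected phrases, which this frequency-only sketch omits.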
Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions (Ganesan, Zhai, Han)
• Evaluation studies show that, compared to the baseline extractive method, Opinosis summaries are closer to human summaries.
• The high-level picture of the algorithm: generate an abstractive summary by repeatedly searching the Opinosis graph for sub-graphs that represent semantically valid, meaningful sentences with high redundancy scores.
• It is important that these sentences have high redundancy scores, because that means they are representative of a major opinion.
• The sentences represented by these sub-graphs can be combined to form an abstractive summary.
Key idea: A graph-based approach to automatic text summarization. The summarization framework generates concise abstractive summaries and capitalizes on the presence of large amounts of redundancy in the opinions.
Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions (Ganesan, Zhai, Han)
• Opinosis constructs a graph that represents the original text. The paper isolates three properties of this graph that are exploited in order to explore and score sub-paths through the graph. These sub-paths are what help to generate the candidate abstractive summaries.
– Properties:
• Redundancy Capture: extremely redundant textual occurrences are naturally captured by sub-graphs.
• Gapped Subsequence Capture: existing sentence structures create "lexical links" that facilitate the discovery of new sentences.
• Collapsible Structures: nodes that resemble hubs can potentially be collapsed.
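The graph construction can be sketched as a simple word-adjacency graph; the greedy heaviest-path search below is a simplification of the paper's scored sub-graph exploration, and the reviews are invented.

```python
from collections import defaultdict

def build_opinosis_graph(sentences):
    """Simplified Opinosis-style word graph: a directed edge for each
    word-to-next-word transition, weighted by how often it occurs.
    Highly redundant phrasings show up as heavy shared paths."""
    edges = defaultdict(int)
    for s in sentences:
        words = ["<s>"] + s.lower().split() + ["</s>"]
        for a, b in zip(words, words[1:]):
            edges[(a, b)] += 1
    return edges

def heaviest_sentence(edges, max_len=10):
    """Greedily follow the heaviest outgoing edge from <s> toward </s>."""
    out = defaultdict(list)
    for (a, b), w in edges.items():
        out[a].append((w, b))
    path, node = [], "<s>"
    for _ in range(max_len):
        if not out[node]:
            break
        _, node = max(out[node])
        if node == "</s>":
            break
        path.append(node)
    return " ".join(path)

reviews = ["the battery life is great",
           "the battery life is amazing",
           "the screen is great"]
graph = build_opinosis_graph(reviews)
```

The heaviest path stitches together the most redundant phrasing across reviews, which is how a sentence that no single strict extraction step would rank first can still emerge as the dominant opinion.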
Comparison of Methods for Larger Data

Paper: Mining and Summarizing Customer Reviews
Pros: Takes sentiment into consideration.
Cons: Algorithm is restricted to run only on opinionated sentences; this could discard potentially valuable text.

Paper: Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions
Pros: Aims to capitalize on existing redundancy and maximize readability; abstractive summarization aims to mimic human summarization; prunes unpromising candidates.
Cons: Does not take sentiment into consideration; essentially provides key phrases, which may not be optimal for all use cases.

Paper: Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions
Pros: Capitalizes on redundancy; emphasizes readability; abstractive summarization mimics human summarization.
Cons: Does not take sentiment into consideration.

Paper: Extraction based approach for text summarization using k-means clustering
Pros: Implementation is simple and straightforward.
Cons: Does not take sentiment into consideration; doesn't take advantage of redundancy to rank importance.

Paper: Product review summarization from a deeper perspective
Pros: Takes sentiment into consideration.
Cons: Sentiment handling needs to be more sophisticated in order to account for complex English phrase structure (e.g., the sentences "I am happy" and "I am not happy" should have an extremely high sentiment differential; this might not happen under this approach).
Future Strides
• Ideally, we want to push towards better abstractive summarization approaches.
– We want to emulate human summarization as closely as possible.
• Applications of deep learning to automatic summarization
• Highly visual automatic summarizations