
Tutorial on Automatic Summarization
by Shilpa Subrahmanyam

Prepared as an assignment for CS410: Text Information Systems in Spring 2016

The World Today

• We live in an age in which a massive amount of content is available at our fingertips
  – 500 million tweets are posted every day
  – More than 2 million articles are posted daily
  – The average article at the New York Times is about 1,200 words long
• Automatic summarization can prove extremely useful in generating insights and themes from such large corpora of data.

What is Automatic Summarization?

• Definition: Employing an algorithm to distill a text corpus into a considerably smaller body of important ideas, sentences, phrases, etc.
• Key challenges:
  – Determining how to rank the importance of a sentence, word, or phrase
  – Eliminating and capitalizing on redundancy
  – Incorporating sentiment into the summary
  – Ensuring the summary is readable
  – Avoiding a search through an exponential solution space

Types of Automatic Summarization

• Extractive summarization
  – Selects a subset of existing words, phrases, or sentences in the original text in order to generate a summary.
• Abstractive summarization
  – Aims to create a summary that is closer to what a human might generate.
  – Phrases in the summary don't necessarily need to have appeared in the original text.
• Keyword/key-phrase extraction

Applications and Use Cases

• Summarizing tweets in order to determine the timeline of a sports game
• Product owners summarizing highly redundant product reviews (in order to surface popular opinions and key insights)
• Distilling a long article into a set of key points

Summarization of Small Data (i.e. Tweets, Microblogs, etc.)

• These are summarization methods targeted at tweets, microblogs, and other content-limited data.
• A lot of work has been done in this area recently because large amounts of short, content-limited text have become readily available with the advent of Twitter and similar sites that favor smaller pieces of content.

Basic Approaches

• Topical Keyphrase Extraction from Twitter (Zhao, et al.)
• Summarizing Microblogs Automatically (Sharifi, et al.)

Topical Keyphrase Extraction from Twitter (Xin Zhao, Jiang, He, Song, Achananuparp, Lim, Li)

• The method proposed by the authors is a context-sensitive topical PageRank method for keyword ranking (see the sketch below).
• This is paired with a probabilistic scoring function that considers two factors when ranking key phrases:
  – relevance
  – interestingness

Key idea: Generate a list of topical key phrases that serve as a summarization of a corpus of tweets.
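To make the ranking step concrete, here is a minimal sketch of a generic topic-biased PageRank over a word co-occurrence graph, in the spirit of the approach described above. The graph construction, damping factor, and topic-preference teleport vector are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def topical_pagerank(words, cooccur, topic_pref, damping=0.85, iters=50):
    """Rank words with a topic-biased PageRank (illustrative sketch).

    words      : list of vocabulary terms
    cooccur    : dict mapping (word_i, word_j) -> co-occurrence count
    topic_pref : dict mapping word -> P(word | topic), used as the teleport vector
    """
    idx = {w: i for i, w in enumerate(words)}
    n = len(words)

    # Build a row-stochastic transition matrix from co-occurrence counts.
    M = np.zeros((n, n))
    for (wi, wj), c in cooccur.items():
        M[idx[wi], idx[wj]] = c
    row_sums = M.sum(axis=1, keepdims=True)
    M = np.divide(M, row_sums, out=np.full_like(M, 1.0 / n), where=row_sums > 0)

    # Topic-specific teleport distribution (uniform fallback).
    t = np.array([topic_pref.get(w, 0.0) for w in words])
    t = t / t.sum() if t.sum() > 0 else np.full(n, 1.0 / n)

    # Power iteration: r = damping * M^T r + (1 - damping) * t
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = damping * M.T @ r + (1 - damping) * t
    return sorted(zip(words, r), key=lambda p: -p[1])
```

Words that co-occur with other highly ranked words and that are preferred by the topic end up at the top of the list; the probabilistic relevance/interestingness scoring of multi-word phrases is a separate step not shown here.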

Summarizing Microblogs Automatically (Sharifi, Hutton, Kalita)

• Start with a topic or phrase and collect tweets that are related to that topic or phrase.
• Isolate the longest sentence in each tweet that contains the topic phrase. This set of sentences is the input to the algorithm.
• Build a graph representing the common sequences of words (the common phrases) that occur before and after the key topic phrase.
  – The root node represents the topic phrase.
  – Each word is represented by a node with a count that indicates how many times the word occurs within the set of input sequences. A phrase is represented in the graph by a sequence of nodes starting with the root.

Key idea: Take a trending phrase, collect a huge number of tweets containing that trending phrase, and provide an automatically generated summary of the tweets that were collected.

Summarizing Microblogs Automatically (Sharifi, Hutton, Kalita), continued

• Assign each node a weight; this prevents longer phrases from dominating the output.
  – Words are given weights proportional to their counts.
• Construct a "partial summary" by searching the graph for the path with the largest total weight (considering all paths that begin at the root node and end at a non-root node).
  – This path represents the most common phrase occurring either before or after the topic phrase.
• Run the algorithm once more, this time initializing the root node with the partial summary and rebuilding the graph. The most heavily weighted path in the new graph is the final summary produced by the algorithm. (A simplified sketch follows.)
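Below is a heavily simplified sketch of the phrase-graph idea, assuming whitespace tokenization, growth in one direction only (words after the topic phrase), and a plain sum of node counts as the path weight. The weight normalization described above and the second pass are omitted, so this illustrates the technique rather than reproducing the authors' algorithm.

```python
from collections import defaultdict

def best_phrase_after(sentences, topic, max_len=10):
    """Return the most heavily weighted word sequence following `topic`.

    sentences : lowercase sentences that each contain the topic phrase
    topic     : the topic phrase (acts as the root of the graph)
    """
    topic_words = topic.split()
    counts = defaultdict(int)   # node = tuple of words occurring after the topic phrase
    for s in sentences:
        words = s.split()
        for i in range(len(words) - len(topic_words) + 1):
            if words[i:i + len(topic_words)] == topic_words:
                tail = words[i + len(topic_words): i + len(topic_words) + max_len]
                # Every prefix of the tail is a path starting at the root.
                for j in range(1, len(tail) + 1):
                    counts[tuple(tail[:j])] += 1
                break

    # Path weight = sum of node counts along the path (no length normalization here).
    def path_weight(path):
        return sum(counts[path[:j]] for j in range(1, len(path) + 1))

    best = max(counts, key=path_weight, default=())
    return " ".join(best), path_weight(best)
```

Given a set of tweets about a trending phrase, the function returns the most common continuation of that phrase together with its weight; the full method repeats the search with the partial summary as the new root.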

Why are these approaches suboptimal?

• Topical Keyphrase Extraction from Twitter
  – Key phrase extraction can often produce noisy results (i.e. key phrases that are common but don't help to identify themes).
  – Moreover, it is often not sufficient for summative purposes to simply look through a list of key phrases.
    • More detail may be required for a sufficient grasp of the original text.
• Summarizing Microblogs Automatically
  – This approach is predicated on specifically retrieving tweets that all pertain to the same trending phrase as the input to the algorithm.
  – It extracts only the longest sentence containing the topic keyword(s) from each tweet as the input to the graph algorithm.
    • Equating sentence length with importance could prove fallacious. Furthermore, this could mean discarding valuable information before the algorithm even begins.

What are some more advanced approaches?

• Twitter Topic Summarization by Ranking Tweets Using Social Influence and Content Quality (YaJuan, et al.)
• Summarizing Sporting Events Using Twitter (Nichols, et al.)
• Sumblr: Continuous Summarization of Evolving Tweet Streams (Shou, et al.)

Twitter Topic Summarization by Ranking Tweets Using Social Influence and Content Quality (YaJuan, ZhuMin, FuRu, Ming, Heung-Yeung)

• This approach takes advantage of follower-followee relationships on Twitter, which are the main signal from which the social influence of users is inferred. The quality of tweets is judged based on a few factors that are incorporated into the graph-based ranking algorithm:
  – readability
  – content richness
  – a measure of the regularity of the written language
  – how pointless the content is
• In order to curb redundancy within the final summary, the model selects tweets from the ranking results using a Maximal Marginal Relevance (MMR) algorithm (Carbonell and Goldstein, 1998), sketched below.

Key idea: The algorithm models and formulates the ranking of tweets in a unified mutual reinforcement graph.
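MMR itself is a standard, well-documented selection procedure; the sketch below is a generic implementation in which the relevance and similarity functions are placeholders supplied by the caller, not the scoring functions used in this particular paper.

```python
def mmr_select(candidates, relevance, similarity, k=5, lam=0.7):
    """Maximal Marginal Relevance selection (Carbonell & Goldstein, 1998).

    candidates : list of tweets (or sentences)
    relevance  : func(tweet) -> relevance/quality score from the ranking step
    similarity : func(tweet_a, tweet_b) -> similarity in [0, 1]
    lam        : trade-off between relevance and novelty
    Returns up to k tweets that are relevant but mutually non-redundant.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(c):
            # Penalize candidates that resemble anything already selected.
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance(c) - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Setting `lam` closer to 1 favors the raw ranking scores, while lower values push harder against redundancy in the final summary.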

Twitter Topic Summarization by Ranking Tweets Using Social Influence and Content Quality (YaJuan, ZhuMin, FuRu, Ming, Heung-Yeung), continued

• The algorithm models the problem of tweet ranking in a unified mutual reinforcement graph.
  – In this model, the social influence of users and a measure of the quality of the tweet content are both taken into consideration, each reinforcing the other.

Summarizing Sporting Events Using Twitter (Nichols, Mahmud, Drews)

• Takes advantage of the fact that, throughout the course of a sports game, viewers tend to post Twitter updates expressing opinions about the different events that occur during the game.
• Aims to generate a natural summary of the event that incorporates temporal cues, such as spikes in the volume of status updates, in order to identify important moments throughout the course of the game (a sketch of this spike-detection idea follows).
• Implements a sentence ranking method that extracts relevant sentences from the tweet corpus, each presumably referring to an important moment in the game.

Key idea: Summarize sporting events from a live corpus of tweets.
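As an illustration of the temporal-cue idea, here is a minimal sketch that flags time buckets whose tweet volume rises well above a recent moving average. The bucket size, window, and threshold factor are illustrative assumptions, not parameters from the paper.

```python
from collections import Counter

def detect_spikes(tweet_timestamps, bucket_seconds=60, window=10, factor=3.0):
    """Return bucket start times whose tweet volume exceeds `factor` times
    the mean volume of the preceding `window` buckets.

    tweet_timestamps : iterable of UNIX timestamps (seconds)
    """
    # Count tweets per fixed-width time bucket.
    volume = Counter((int(t) // bucket_seconds) * bucket_seconds
                     for t in tweet_timestamps)
    buckets = sorted(volume)
    spikes = []
    for i, b in enumerate(buckets):
        prev = buckets[max(0, i - window):i]
        if not prev:
            continue
        baseline = sum(volume[p] for p in prev) / len(prev)
        if volume[b] > factor * baseline:
            spikes.append(b)
    return spikes
```

Each flagged bucket would then be treated as a candidate "important moment" whose tweets feed the sentence ranking step.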

Sumblr: Continuous Summarization of Evolving Tweet Streams (Shou, Wang, Ke Chen, Gang Chen)

• Traditional automatic summarization methods for text documents primarily focus on static and small-scale data.
• Sumblr (SUMmarization By stream cLusteRing) aims to summarize tweet streams, thereby providing a dynamic summarization framework.

Key idea: A timeline-based framework for topic summarization of tweets. The algorithm ranks and selects a diverse crop of important tweets within a number of different sub-topic groups. These tweets serve as the basis of the summary composed for each sub-topic.

Sumblr: Continuous Summarization of Evolving Tweet Streams (Shou, Wang, Ke Chen, Gang Chen), continued

• During tweet stream clustering, it is necessary to maintain statistics for tweets in order to facilitate summary generation. For this reason, the authors introduce a representation called the "tweet cluster vector" (TCV).
• The Sumblr framework operates as follows (see the sketch below):
  – At the start of the stream, collect a small number of tweets and use a k-means clustering algorithm to create the initial clusters. The corresponding TCVs are initialized.
  – Incrementally update the TCVs whenever a new tweet arrives. At various points in time, the algorithm has to decide whether to create a new cluster, add the tweet to an existing cluster, or merge/delete existing clusters.
  – A high-level summarization step produces online and historical summaries.
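Below is a minimal sketch of the incremental-update idea under simplifying assumptions: each cluster keeps only a centroid plus a count (a drastically reduced stand-in for the paper's TCV statistics), and a fixed distance threshold decides between updating an existing cluster and creating a new one. It illustrates the control flow, not Sumblr's actual data structure.

```python
import numpy as np

class StreamCluster:
    """Toy stand-in for a tweet cluster vector (TCV): centroid + member count."""
    def __init__(self, vec):
        self.centroid = np.asarray(vec, dtype=float)
        self.count = 1

    def add(self, vec):
        # Incrementally move the centroid toward the new tweet vector.
        self.count += 1
        self.centroid += (np.asarray(vec, dtype=float) - self.centroid) / self.count

def update_clusters(clusters, tweet_vec, new_cluster_threshold=1.0):
    """Assign an incoming tweet vector to the nearest cluster, or start a new one."""
    tweet_vec = np.asarray(tweet_vec, dtype=float)
    if clusters:
        dists = [np.linalg.norm(c.centroid - tweet_vec) for c in clusters]
        best = int(np.argmin(dists))
        if dists[best] <= new_cluster_threshold:
            clusters[best].add(tweet_vec)
            return clusters
    clusters.append(StreamCluster(tweet_vec))
    return clusters
```

In a full system, `update_clusters` would be called once per arriving tweet embedding; the merge/delete decisions and the historical snapshots that support timeline summaries are omitted here.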

Comparison of Microblog Methods

• Twitter Topic Summarization by Ranking Tweets Using Social Influence and Content Quality
  – Pros: Takes advantage of the social influence of authors when ranking tweets; takes readability, content richness, a measure of the regularity of written language, and how pointless the content is into account.
  – Cons: The algorithm structure could thwart the summarization of more niche topics that have sparse follower-followee adjacency matrices.
• Summarizing Sporting Events Using Twitter
  – Pros: Incorporates temporal cues; deals with a live corpus.
  – Cons: Does not use valuable social metadata to rank tweets.
• Topical Keyphrase Extraction from Twitter
  – Pros: Considers both relevance and interestingness.
  – Cons: Key phrase extraction may not be as helpful as full-sentence summarization for some use cases, especially for data that exhibits low topical phrase redundancy.
• Sumblr: continuous summarization of evolving tweet streams
  – Pros: Provides streaming summarization (as well as historical summaries).
  – Cons: Implementation is more complicated.
• Summarizing Microblogs Automatically
  – Pros: The graph algorithm's relevance calculation does not let long sentences have an unfair advantage over shorter sentences with just as much important content.
  – Cons: Extracts only the longest sentence that contains the topic keyword(s) from each tweet as the input to the graph algorithm.

Summarization of Larger Data

• The following are summarization methods that can also be applied to larger data, including reviews, documents, news articles, and so forth.

Basic Approach

• Extraction based approach for text summarization using k-means clustering (Agrawal, et al.)

Extraction based approach for text summarization using k-means clustering (Agrawal, Gupta)

• At a high level, the algorithm proposed by the authors is an unsupervised learning approach that can be broken down into the following steps (sketched below):
  – tokenizing the document
  – computing a score for each sentence
  – clustering the sentences using k-means
  – extracting important sentences
  – combining those sentences to form a summary

Key idea: Incorporates k-means clustering, TF-IDF, and tokenization in order to perform extractive text summarization.
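A minimal sketch of this pipeline using scikit-learn follows. Treating each sentence as a "document" for TF-IDF, clustering the sentence vectors with k-means, and taking the sentence nearest each centroid are simplifying assumptions for illustration, not the authors' exact scoring scheme.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def kmeans_extractive_summary(sentences, k=3):
    """Cluster sentences with TF-IDF + k-means and pick one sentence per cluster."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(sentences)

    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    summary = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        # Choose the member sentence closest to the cluster centroid.
        dists = np.linalg.norm(X[members].toarray() - km.cluster_centers_[c], axis=1)
        summary.append(sentences[members[np.argmin(dists)]])
    # Preserve the original sentence order in the final summary.
    return sorted(summary, key=sentences.index)
```

Each cluster roughly corresponds to one theme in the document, so selecting one representative sentence per cluster yields a short extractive summary that covers the main themes.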

What makes this approach suboptimal?

• It does not take advantage of redundancy to rank importance.
• The method for extracting the important sentence(s) from each centroid is naïve and can be gamed.

More Advanced Approaches

• Product review summarization from a deeper perspective (Ly, et al.)
• Mining and Summarizing Customer Reviews (Hu, et al.)
• Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions (Ganesan, et al.)
• Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions (Ganesan, et al.)

Product review summarization from a deeper perspective (Ly, Sugiyama, Lin, Kan)

• The first step is product facet identification.
  – In order to identify candidate facets, we need to preprocess the input reviews.
    • This involves part-of-speech tagging, stemming, applying syntactic rules, and stop-word removal.
  – We then deploy the Stanford Dependency Parser in order to detect the role of each noun.
    • We discard nouns that aren't subjects or objects.
    • We then use association rule mining to identify frequent product facets.

Key idea: The algorithm automatically summarizes a massive collection of product reviews and generates a concise, non-redundant summary. Not only does this system extract review sentiments, but it also extracts the underlying justifications behind those sentiments.

Product review summarization from a deeper perspective (Ly, Sugiyama, Lin, Kan), continued

• The second step is summarization.
  – Each facet mined in the previous step is associated with relevant opinion sentences that match the polarity expressed by the majority of the opinions in the reference text.
  – We first restrict the algorithm to run only on opinionated sentences from the reviews.
    • We then perform sentiment analysis on these sentences to assign a polarity score to each one (the sum of the polarity of each word in the sentence), as in the sketch below.
  – We then calculate content-based pairwise similarities between all of the resulting opinion sentences and use these scores to cluster the sentences.
  – The final task is to select the most representative sentence from each centroid for the final summary.
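As a concrete illustration of the sentence-level polarity score mentioned above (the sum of word polarities), here is a minimal sketch. The tiny hand-made lexicon and the simple negation handling are illustrative assumptions, not the lexicon or scoring rules used in the paper.

```python
# Toy polarity lexicon; a real system would use a full sentiment lexicon.
LEXICON = {"great": 1.0, "good": 0.5, "love": 1.0,
           "bad": -0.5, "terrible": -1.0, "broken": -1.0}

def sentence_polarity(sentence):
    """Score a sentence as the sum of its word polarities,
    flipping the sign of a word preceded by a simple negation."""
    words = sentence.lower().split()
    score = 0.0
    for i, w in enumerate(words):
        polarity = LEXICON.get(w.strip(".,!?"), 0.0)
        if i > 0 and words[i - 1] in {"not", "never", "no"}:
            polarity = -polarity
        score += polarity
    return score

print(sentence_polarity("The battery life is great"))      # positive score
print(sentence_polarity("The screen is not good at all"))  # negative score
```

This naive bag-of-words scoring is exactly what makes complex phrase structure hard to handle, a limitation noted in the comparison table later in this tutorial.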

Mining and Summarizing Customer Reviews (Hu, Minqing, Liu)

• The paper focuses on the problem of feature-based summaries of customer reviews of products sold online. In this context, "features" refers to product attributes.
• Given a customer review corpus that pertains to a given product, summarization is split into three subtasks:
  – First, we must identify the product features that customers are speaking about.
  – Second, for each feature, we have to identify the sentences in the reviews that contain positive or negative opinions.
  – Last, we must produce a summary that aggregates all of this information (a sketch of this aggregation step follows).

Key idea: The algorithm assists merchants in extracting the main ideas and themes from hundreds, if not thousands, of customer reviews through product feature extraction and consideration of sentence sentiment.
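To make the aggregation subtask concrete, here is a minimal sketch that groups opinion sentences by product feature and polarity, which is the general shape of a feature-based summary. The input format and the example data are assumptions for illustration, not the paper's pipeline.

```python
from collections import defaultdict

def feature_based_summary(opinions):
    """Aggregate (feature, polarity, sentence) triples into a feature-based summary.

    opinions : iterable of (feature, polarity, sentence),
               where polarity is "positive" or "negative".
    Returns {feature: {"positive": [sentences], "negative": [sentences]}}.
    """
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, polarity, sentence in opinions:
        summary[feature][polarity].append(sentence)
    return dict(summary)

opinions = [
    ("battery", "positive", "The battery easily lasts two days."),
    ("battery", "negative", "Battery drains fast when gaming."),
    ("screen",  "positive", "The screen is bright and sharp."),
]
for feature, groups in feature_based_summary(opinions).items():
    print(f"{feature}: {len(groups['positive'])} positive, {len(groups['negative'])} negative")
```

The output mirrors the kind of per-feature pro/con breakdown a merchant would read instead of thousands of raw reviews.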

Mining and Summarizing Customer Reviews (Hu, Minqing, Liu), continued

[Figure: Summarization system architecture]

Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions (Ganesan, Zhai, Viega)

Key idea: A greedy approach that heuristically prunes the exponential solution space so that only promising candidates need to be considered. The ultimate goal is to generate a compact and informative summary from a set of micro-opinions.

• Micro-opinion: a 2-5 word phrase
• Formal problem set-up:
  – Suppose we have a set of sentences Z = {zᵢ}, i ∈ [1, k], from an opinion document.
  – The goal is to generate a micro-opinion summary M = {mᵢ}, i ∈ [1, k], where |mᵢ| ∈ [2, 5] and each mᵢ conveys a key opinion from Z.
  – It is important to note that while each mᵢ must use words that occur at least once in Z, mᵢ is not required to be an exact subsequence of any sentence in Z.
    • Thus, this set-up is more of an abstractive summarization problem than an extractive summarization problem.

Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions (Ganesan, Zhai, Viega), continued

• Algorithm (a simplified sketch follows):
  – Start with a set of high-frequency unigrams from the original corpus.
  – Then merge these unigrams to generate higher-order bigrams, trigrams, and n-grams.
  – At each merge step, make sure that the candidate n-grams have reasonably high readability and representativeness scores.
  – The candidate generation process stops when an attempt to grow an existing candidate leads to low readability or representativeness scores.
  – The final step is to sort all candidate n-grams by their objective function values (i.e., the sum of S_representativeness and S_readability) and generate a micro-opinion summary M by gradually adding the highest-scoring phrases to the summary until the accumulated summary length reaches the length threshold.
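Below is a heavily simplified sketch of this greedy candidate-growing loop. The scoring here (a bigram-frequency check as a stand-in for the representativeness and readability tests) is a placeholder for illustration, not the paper's actual S_representativeness and S_readability measures, and the final sorting/length-budget step is omitted.

```python
from collections import Counter

def micro_opinion_candidates(sentences, min_unigram_freq=3, min_bigram_freq=2, max_len=5):
    """Greedily grow frequent unigrams into 2-5 word candidate phrases.

    A candidate (w1 ... wn) is extended with a word w only if the bigram
    (wn, w) is frequent enough in the corpus; this frequency check is a toy
    stand-in for the paper's representativeness and readability scores.
    """
    tokens = [s.lower().split() for s in sentences]
    unigrams = Counter(w for t in tokens for w in t)
    bigrams = Counter((t[i], t[i + 1]) for t in tokens for i in range(len(t) - 1))

    # Seed the search with high-frequency unigrams.
    frontier = [(w,) for w, c in unigrams.items() if c >= min_unigram_freq]
    finished = []
    while frontier:
        phrase = frontier.pop()
        grown = False
        if len(phrase) < max_len:
            for (a, b), c in bigrams.items():
                if a == phrase[-1] and c >= min_bigram_freq:
                    frontier.append(phrase + (b,))
                    grown = True
        # Keep a candidate once it can no longer be grown (2-5 words only).
        if not grown and 2 <= len(phrase) <= max_len:
            finished.append(" ".join(phrase))
    return finished
```

The candidates returned here would then be scored and sorted, with the highest-scoring phrases added to the summary until the length threshold is reached, as described in the final bullet above.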

Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions (Ganesan, Zhai, Han)

• The evaluation studies show that, compared to the baseline extractive method, Opinosis summaries are closer to human summaries.
• The high-level picture of the algorithm: generate an abstractive summary by repeatedly searching the Opinosis graph for sub-graphs that represent semantically valid, meaningful sentences with high redundancy scores.
• High redundancy scores are important because they indicate that a sentence is representative of a major opinion.
• The sentences represented by these sub-graphs can be combined to form an abstractive summary.

Key idea: A graph-based approach to automatic text summarization. The summarization framework generates concise abstractive summaries and capitalizes on the presence of large amounts of redundancy in the opinions.

Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions (Ganesan, Zhai, Han), continued

• Opinosis constructs a graph that represents the original text. The paper isolates three properties of this graph that are exploited in order to explore and score various sub-paths through the graph; these sub-paths are what generate the candidate abstractive summaries (a sketch of the graph construction follows).
  – Properties:
    • Redundancy capture: highly redundant textual occurrences are naturally captured by sub-graphs.
    • Gapped subsequence capture: existing sentence structures create "lexical links"; these links then facilitate the discovery of new sentences.
    • Collapsible structures: nodes that resemble hubs can potentially be collapsed.
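Here is a minimal sketch of the kind of word graph that underlies these properties. Using plain words as node keys (rather than word/part-of-speech pairs) and storing only (sentence id, position) references is a simplification for illustration; it is not the Opinosis graph definition itself.

```python
from collections import defaultdict

def build_word_graph(sentences):
    """Build a word adjacency graph where each node records the positions
    (sentence index, word index) at which that word occurs; redundant
    phrasings naturally share nodes and edges."""
    positions = defaultdict(list)   # word -> [(sentence_id, word_index), ...]
    edges = defaultdict(set)        # word -> {words that follow it}
    for sid, sent in enumerate(sentences):
        words = sent.lower().split()
        for i, w in enumerate(words):
            positions[w].append((sid, i))
            if i + 1 < len(words):
                edges[w].add(words[i + 1])
    return positions, edges

positions, edges = build_word_graph([
    "the battery life is very good",
    "the battery life is amazing",
])
# "battery" and "life" occur in both sentences, so paths through those
# nodes have high redundancy and are good summary candidates.
print(len(positions["battery"]))  # 2
```

Paths whose nodes carry many positional references correspond to phrasings repeated across many reviews, which is what the redundancy-capture property exploits.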

Comparison of Methods for Larger Data

• Mining and Summarizing Customer Reviews
  – Pros: Takes sentiment into consideration.
  – Cons: The algorithm is restricted to run only on opinionated sentences, which could discard potentially valuable text.
• Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions
  – Pros: Aims to capitalize on existing redundancy and maximize readability; abstractive summarization aims to mimic human summarization; prunes unpromising candidates.
  – Cons: Does not take sentiment into consideration; essentially provides key phrases, which may not be optimal for all use cases.
• Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions
  – Pros: Capitalizes on redundancy; emphasizes readability; abstractive summarization mimics human summarization.
  – Cons: Does not take sentiment into consideration.
• Extraction based approach for text summarization using k-means clustering
  – Pros: Implementation is simple and straightforward.
  – Cons: Does not take sentiment into consideration; doesn't take advantage of redundancy to rank importance.
• Product review summarization from a deeper perspective
  – Pros: Takes sentiment into consideration.
  – Cons: Sentiment handling would need to be more sophisticated to account for complex English phrase structure (e.g. the sentences "I am happy" and "I am not happy" should have an extremely high sentiment differential, which might not happen under this approach).

Future Strides

• Ideally, we want to push toward better abstractive summarization approaches.
  – We want to emulate human summarization as closely as possible.
• Applications of deep learning to automatic summarization
• Highly visual automatic summarizations


Thanks for your time!