
Single Document Summarization Based on Local Topic Identification and Word Frequency

Zhi Teng, Ye Liu, Fuji Ren and Seiji Tsuchiya
Faculty of Engineering, The University of Tokushima
2-1 Minamijosanjima, Tokushima, Japan 770-8506
{teng, liuye, ren, tsuchiya}@is.tokushima-u.ac.jp

Fuji Ren
School of Information Engineering, Beijing University of Posts and Telecommunications
Beijing, China 100876

Abstract

In this paper, an approach to single document summarization based on local topic identification and word frequency is proposed. In recent years, there has been increased interest in automatic summarization. Physical features are often used and have been applied successfully in this field, but they have disadvantages with respect to non-redundancy, structure and coherence. We therefore introduce a logical structure feature that has been applied successfully in multi-document summarization (MDS), and we design a system to accomplish this task. After sentence similarity is calculated, the document is clustered into local topics, which are then ordered by score. Sentences are then selected from all local topics by computing word frequency. With the proposed method, the information redundancy within each local topic and among local topics is reduced, and the information coverage ratio and the structure of the summary are improved.

1. Introduction

Text summarization is a product of the electronic document explosion, and can be seen as the condensation of a document collection. Text summarization allows a user to get a sense of the content of a full text, or to know its information content, without reading all of its sentences. Data reduction increases scale by allowing users to find relevant full-text sources more quickly and to assimilate only the essential information from many texts with reduced effort [1]. A generic summarization system is divided into three modules: text preprocessing, the summarization algorithm, and postprocessing, as shown in Figure 1.

Figure 1. Summarization System Architecture

2. Related Work

We mainly investigate previous text summarization work related to local topic and frequency methodologies because our work largely involves these two approaches.

Word frequency: summarization using larger units of text has also been researched. Most recently, the SumBasic algorithm uses term frequency as part of a context-sensitive approach to identifying important sentences while reducing information redundancy [2]. The use of frequency as a feature in locating important areas of a text has been proven useful in the literature [3]. This is most likely due to reiteration, where authors state important information in several different ways in order to reinforce main points [4].

Local topic identification: Qin [5] carried out a multi-document summarization experiment based on local topic identification and extraction, in which the similarity of sentences is measured by analysis of dependency and semanteme.

Local topics are found by sentence clustering; the centroid sentence is extracted from each local topic and ordered to generate the summary. Zhang [6] proposed a method for representing and summarizing documents by integrating subtopic partition with graph representation. The method starts from the assumption that capturing the sub-topic structure of the document collection is essential for summarization.

Research on word frequency has mostly focused on finding good physical features, paying little attention to structure and coherence. Research on local topic identification, on the other hand, usually focuses on improving non-redundancy and coherence through structural features. We therefore carried out an experiment that takes both physical features and the logical structure feature into consideration (see Section 3 for details).

3. Summarization Methods

3.1. Local Topic Identification

Text summarization is a data reduction process. Because a summary is concise, accurate and explicit, it is becoming more and more important. A document can be regarded as being composed of several sub-documents: although they all surround the same overall topic, the contents described in each of them emphasize different aspects. A document is thus composed of different side information, and each piece of side information is a local topic.

In terms of logical structure, a single document can also be considered as a set of local topics: a semantic paragraph consists of similar sentences, and each semantic paragraph is called a local topic. This approach can deal with redundancy and improve the structure and coherence of summaries for large documents.

In this task, after the similarity of sentences is measured by the Vector Space Model (VSM) method, local topics are found by sentence clustering. The VSM has been one of the most successful models in various performance evaluation studies [9], and most existing search engines and information retrieval systems are based on this model.

In this model, both training data and testing data are represented as vectors of index terms. Suppose there are a total of t index terms in an entire collection; then a given document D and query Q can be represented as follows:

D = (W_{d1}, W_{d2}, \dots, W_{dt})
Q = (W_{q1}, W_{q2}, \dots, W_{qt})

where W_{di} and W_{qi} (i = 1, ..., t) are the term weights assigned to different terms for the document D and the query Q, respectively. The similarity between a query and a document can be calculated by the widely used cosine measure [10]:

Sim(Q, D) = \frac{\sum_{i=1}^{t} W_{qi} \times W_{di}}{\sqrt{\sum_{i=1}^{t} (W_{qi})^2 \times \sum_{i=1}^{t} (W_{di})^2}}    (1)

Documents are then ordered by decreasing values of this measure.

The art of designing ranking functions depends on the design of term weighting strategies that assign the weights W_{di} to terms in a document. Different term weighting strategies influence the similarity computed according to Formula 1.

Here we define a threshold value named η. After the similarity measurement, sentences are grouped into the same local topic if their similarity value is larger than η. The optimal value of η, 0.59, was determined empirically using the similarity values of 2000 pairs of sentences.
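
The following is a minimal Python sketch of this clustering step, under stated assumptions: the paper only fixes the cosine measure of Formula 1 and the threshold η = 0.59, so the tokenizer, the raw term-frequency weighting, the greedy grouping strategy, and the helper names (build_vector, cosine_sim, cluster_local_topics) are our own illustrative choices, not the authors' implementation.

    from collections import Counter
    from math import sqrt

    ETA = 0.59  # similarity threshold reported in the paper

    def build_vector(tokens):
        """Represent a sentence as a term-frequency vector (VSM); weighting scheme is an assumption."""
        return Counter(tokens)

    def cosine_sim(v1, v2):
        """Cosine similarity between two sparse term vectors (Formula 1)."""
        common = set(v1) & set(v2)
        num = sum(v1[t] * v2[t] for t in common)
        den = sqrt(sum(w * w for w in v1.values())) * sqrt(sum(w * w for w in v2.values()))
        return num / den if den else 0.0

    def cluster_local_topics(sentences, tokenize):
        """Greedily group sentences into local topics: a sentence joins the first
        topic whose first member is more similar than ETA (a simplification of
        the paper's unspecified sentence clustering)."""
        topics = []  # each topic is a list of (sentence, vector) pairs
        for sent in sentences:
            vec = build_vector(tokenize(sent))
            for topic in topics:
                if cosine_sim(vec, topic[0][1]) > ETA:
                    topic.append((sent, vec))
                    break
            else:
                topics.append([(sent, vec)])
        return topics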

3.2. Local Topic Score

After local topic identification, we score and order the local topics, then repeatedly pick the best sentence from the best local topic until the desired summary length is reached. The core of the system is SumBasic [2]. SumBasic is a generic algorithm that does not incorporate other features or information; we therefore improve it with a sentence location feature and the SumFocus method.

3.2.1 SumFocus

SumFocus, proposed by Vanderwende et al. [7], is a recent approach in multi-document summarization that captures the information conveyed by the topic description by computing the word probabilities of the topic description. In our experimental corpus, every text consists of a news topic and its content. The weight for each word is computed as a linear combination of the unigram probabilities derived from the topic description, with back-off smoothing assigning a very small probability to words not appearing in the topic, and the unigram probabilities from the document, in the following manner:

WordWeight = (1 - \lambda) \times DocWeight + \lambda \times TopicWeight    (2)

The optimal value of λ, 0.9, was determined empirically using the DUC2005 corpus, manually optimizing on ROUGE-2 scores.
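
A minimal Python sketch of Formula 2 follows, assuming maximum-likelihood unigram estimates; λ = 0.9 is the value reported above, while the back-off constant for words absent from the topic and the function names are illustrative assumptions.

    from collections import Counter

    LAMBDA = 0.9          # reported optimal value
    BACKOFF_PROB = 1e-6   # assumed small probability for words not in the topic

    def unigram_probs(tokens):
        """Maximum-likelihood unigram probabilities of a token list."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def word_weights(doc_tokens, topic_tokens):
        """WordWeight = (1 - lambda) * DocWeight + lambda * TopicWeight (Formula 2)."""
        doc_p = unigram_probs(doc_tokens)
        topic_p = unigram_probs(topic_tokens)
        return {
            w: (1 - LAMBDA) * doc_p[w] + LAMBDA * topic_p.get(w, BACKOFF_PROB)
            for w in doc_p
        }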


3.2.2 Sentence Location Feature

Besides frequently occurring words, sentence location is also an important feature. According to statistics on Chinese news, the probability that the first sentence of a document is selected for the summary is about 85%, while the probability that the last sentence is selected is about 7% [8]. We therefore adjust the weight of sentences in these special locations using the following formula:

Weight(L_j) = \left(1 + \frac{(P - 0.75m)^2}{m^2}\right) \times ST_j    (3)

For each sentence L_j in a local topic, P is the serial number of sentence L_j in the document, m is the number of sentences in the document, and ST_j is the similarity value between sentence L_j and the title of the document.
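
A minimal sketch of Formula 3, assuming the title similarity ST_j is computed with the cosine measure from the earlier sketch; the function name is an illustrative assumption.

    def location_weight(position, num_sentences, title_similarity):
        """Weight(L_j) = (1 + (P - 0.75*m)^2 / m^2) * ST_j  (Formula 3).

        position         -- serial number P of the sentence in the document
        num_sentences    -- m, total number of sentences in the document
        title_similarity -- ST_j, similarity between the sentence and the title
        """
        p, m = position, num_sentences
        return (1 + (p - 0.75 * m) ** 2 / m ** 2) * title_similarity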

3.3. Picking Sentences and Generating the Summary

The improved SumBasic approach proceeds as follows:

Step 1. Compute the probability distribution p(w_i) over the words w_i appearing in the document, for every i, using Formula 2.

Step 2. For each sentence S_j in each local topic, compute Weight(S_j) and Weight(L_j), then calculate the score of each sentence and of each local topic using the following formula:

Score(LocalTopic) = \sum_j \left(\alpha\, Weight(S_j) + \beta\, Weight(L_j)\right)    (4)

Here Weight(S_j) = \sum_{w_i \in S_j} \frac{p(w_i)}{|\{w_i \mid w_i \in S_j\}|}. Through experiments with different values of \alpha and \beta, we found that the best result is obtained with \alpha = 0.9 and \beta = 0.1.

Step 3. Pick the best scoring sentence that contains the highest probability word from the best scoring local topic.

Step 4. Delete this local topic and all sentences in this local topic.

Step 5. For each word w_i in the sentence chosen at Step 3, update its probability:

p_{new}(w_i) = p_{old}(w_i) \times p_{old}(w_i)    (5)

Step 6. If the desired summary length has not been reached, go back to Step 2.

Step 7. Use the sentences picked in Step 3 to generate the summary.

Step 2 ensures that the highest-probability word is included in the summary, Step 3 gives the summarizer sensitivity to context, and Step 5 provides a natural way to deal with redundancy.
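
Below is a minimal Python sketch of the selection loop (Steps 1-7), assuming the word weights p come from the Formula 2 sketch and the location weights from the Formula 3 sketch; the data layout of a local topic (a list of dicts with 'text', 'tokens' and 'location_weight') and the word-count stopping criterion are illustrative assumptions, not the authors' exact implementation.

    ALPHA, BETA = 0.9, 0.1  # reported optimal mixing coefficients

    def sentence_weight(tokens, p):
        """Weight(S_j): average weight of the distinct words in the sentence."""
        words = set(tokens)
        return sum(p.get(w, 0.0) for w in words) / max(len(words), 1)

    def summarize(local_topics, p, max_words):
        """p is a mutable dict of word weights (Formula 2), updated by Step 5."""
        def topic_score(topic):                                   # Formula 4
            return sum(ALPHA * sentence_weight(s['tokens'], p)
                       + BETA * s['location_weight'] for s in topic)

        topics = [list(t) for t in local_topics]                  # work on a copy
        summary, length = [], 0
        while topics and length < max_words:                      # Step 6
            best_topic = max(topics, key=topic_score)             # Step 2
            best_sent = max(best_topic,
                            key=lambda s: sentence_weight(s['tokens'], p))  # Step 3
            summary.append(best_sent['text'])
            length += len(best_sent['tokens'])
            topics.remove(best_topic)                             # Step 4
            for w in set(best_sent['tokens']):                    # Step 5 (Formula 5)
                if w in p:
                    p[w] = p[w] * p[w]
        return summary                                            # Step 7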

4. Evaluation

4.1. Corpus

A corpus of 45 Chinese news texts was selected from a daily news database. The database contains approximately 266 texts, all downloaded from news websites. Every news text includes a news topic and its content. Of the 266 news texts, 45 were deliberately selected as follows:

Data 1: 15 texts with no more than 10 sentences;

Data 2: 15 texts with between 10 and 20 sentences;

Data 3: 15 texts with more than 20 sentences.

4.2. Vpercent and Hpercent

To evaluate the proposed approach, we used two evaluation criteria, Vpercent and Hpercent [8]. Vpercent is the coverage ratio of valid words (excluding stopwords). Hpercent is the coverage ratio of high-frequency words (valid words that appear more than twice). The measures are defined as follows:

Vpercent = \frac{SVWords}{TVWord} \times 100\%    (6)

Hpercent = \frac{SHWords}{THWord} \times 100\%    (7)

TVWord: the number of valid words in the original texts

THWord: the number of high-frequency words in the original texts

SVWords: the number of valid words in the summary

SHWords: the number of high-frequency words in the summary
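
A minimal sketch of Formulas 6 and 7 follows; counting distinct word types (rather than token occurrences) and the stopword-set interface are our assumptions about how the coverage ratios are computed.

    from collections import Counter

    def coverage_ratios(original_tokens, summary_tokens, stopwords):
        """Vpercent (Formula 6) and Hpercent (Formula 7).

        Valid words exclude stopwords; high-frequency words are valid words
        appearing more than twice in the original text. Ratios are taken over
        distinct word types (an assumption about the paper's counting)."""
        counts = Counter(w for w in original_tokens if w not in stopwords)
        valid = set(counts)
        high_freq = {w for w, c in counts.items() if c > 2}
        summary_types = set(summary_tokens)
        vpercent = 100.0 * len(valid & summary_types) / len(valid) if valid else 0.0
        hpercent = 100.0 * len(high_freq & summary_types) / len(high_freq) if high_freq else 0.0
        return vpercent, hpercent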

4.3. NIST

The linguistic quality questions are designed to assess how readable and fluent the summaries are; they measure qualities of the summary that do not involve comparison with a model summary or a DUC topic. Five NIST linguistic quality questions are relevant to automatic summaries: grammaticality, non-redundancy, referential clarity, focus, and structure & coherence.1

1 http://www-nlpir.nist.gov/projects/duc/duc2007/quality-questions.txt


Table 1. The Vpercent results based on Top-N and our system

            TOP-N      Our System
Data 1      32.98%     38.98%
Data 2      29.53%     39.84%
Data 3      25.87%     38.43%
Average     29.46%     39.08%

Table 2. The Hpercent results based on Top-N and our system

            TOP-N      Our System
Data 1      14.55%     28.17%
Data 2      15.72%     29.18%
Data 3      19.34%     28.12%
Average     16.54%     28.49%

NIST evaluates these questions by eliciting human judgments on a five-point scale from "1" to "5", where "5" indicates that the summary is good with respect to the quality under question, "1" indicates that the summary is bad with respect to that quality, and "2" to "4" show the gradations in between.

5. Results

The objective of the experiment is to test the feasibility and validity of the proposed approach. For this experiment we deliberately selected 45 Chinese news texts from a daily news database. The summary length is limited to at most 20% of the words of the original text.

To compare the results of our approach with existing methods, the Top-N method was used to build reference summaries on the same texts. In the Top-N method, the first sentence of each paragraph of the document is taken in turn until the required number of sentences is reached; if the required number is not reached, the process is repeated on the second sentence of each paragraph.
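
A minimal sketch of the Top-N baseline as described above; representing the document as a list of paragraphs, each a list of sentences, is an illustrative assumption.

    def top_n_summary(paragraphs, n_sentences):
        """Take the first sentence of each paragraph in turn, then the second
        sentence of each paragraph, and so on, until n_sentences are collected."""
        summary, depth = [], 0
        while len(summary) < n_sentences:
            added = False
            for para in paragraphs:
                if depth < len(para):
                    summary.append(para[depth])
                    added = True
                    if len(summary) == n_sentences:
                        return summary
            if not added:   # no more sentences available at any depth
                break
            depth += 1
        return summary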

5.1. Evaluation by Vpercent and Hpercent

To compute Vpercent and Hpercent, we wrote a program that evaluates the results automatically. Tables 1 and 2 show the coverage ratios of valid words and high-frequency words for Top-N and our approach.

5.2. Evaluation by NIST

To obtain the NIST evaluation results, we invited three examinees to assess the quality of the summaries. Before the assessment, the examinees knew nothing about the system.

Table 3. The NIST evaluation results

Non-redundancy
            Our System   TOP-N
Data 1      4.42         3.55
Data 2      4.20         3.18
Data 3      4.12         3.30
Average     4.26         3.34

Structure and Coherence
            Our System   TOP-N
Data 1      3.93         3.12
Data 2      3.53         2.60
Data 3      3.30         2.23
Average     3.59         2.65

Table 3 shows the results of the NIST evaluation; the result for each test data set is the average of the three examinees' scores.

5.3. Discussion

The Vpercent and Hpercent results show that the information coverage ratio of our approach is 39.08%, higher than the 20% compression ratio and 9.62 percentage points higher than the Top-N method. The coverage ratio of high-frequency words is 28.49%, also higher than the 20% compression ratio and 11.95 percentage points higher than the Top-N method. The NIST evaluation results show that our non-redundancy and structure & coherence scores are higher than those of Top-N. We attribute the improved non-redundancy and structure & coherence to local topic identification, and the improved coverage of valid and high-frequency words to word frequency. The results show that better performance is achieved when the two methods are combined in practice.

6. Conclusions

In this paper, we described a new approach to single document summarization based on local topic identification and word frequency, and assessed its contribution to single document summaries. The Vpercent, Hpercent and NIST evaluation results show that our approach contributes to non-redundancy and achieves a better information coverage ratio.

In the future, we plan to improve the system with sentence simplification and lexical expansion.

The goal of sentence simplification is to produce summaries with as much content as possible that satisfies the user, given a fixed limit on length. We view sentence simplification as a means of creating more space within which to capture important content.


We believe lexical expansion can be applied at the point where we choose the "best words" with which to determine local topics and select sentences.

7. Acknowledgments

This research has been partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (B), 19300029.

References

[1] Lawrence H. Reeve, Hyoil Han and Ari D. Brooks. 2007. The use of domain-specific concepts in biomedical text summarization. Information Processing and Management, 43 (2007) 1765-1776.

[2] Nenkova, A. and Vanderwende, L. 2005. The impact of frequency on summarization. Technical Report MSR-TR-2005-101. Redmond, Washington: Microsoft Research.

[3] Hovy, E. and Lin, C. 1999. Automated text summarization in SUMMARIST. In I. Mani & M. T. Maybury (Eds.), Advances in automatic text summarization. Cambridge, MA: MIT Press.

[4] Sparck Jones, K. 1999. Automatic summarizing: Factors and directions. In I. Mani & M. T. Maybury (Eds.), Advances in automatic text summarization (pp. 2-12). Cambridge, MA: MIT Press.

[5] Qin Bing, Liu Ting and Li Sheng. 2005. MDS based on sub-topic. NCIRCS-2005, Beijing.

[6] Jin Zhang, Hongbo Xu, Xiaolei Wang, Huawei Shen and Yiling Zeng. 2007. ICT CAS at DUC 2007. DUC 2007, April 26-27, 2007, New York, USA.

[7] Lucy Vanderwende, Hisami Suzuki, Chris Brockett and Ani Nenkova. 2007. Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing and Management, 43 (2007) 1606-1618.

[8] Qin Bing, Liu Ting, Chen Shang-Lin and Li Sheng. 2006. Sentences Optimum Selection for Multi-document Summarization. Journal of Computer Research and Development, Vol. 43, pp. 1129-1134.

[9] Harman, D. K. 1995. Overview of the fourth text retrieval conference (TREC-4). In D. K. Harman (Ed.), Proceedings of the fourth text retrieval conference. NIST Special Publication 500-236 (pp. 1-24).

[10] Salton, G. and Buckley, C. 1987. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523.
