hercules dalianis page 1 automatic text summarization hercules dalianis nada-kth royal institute of...

32
Hercules Dalianis page 1 Automatic text Automatic text summarization summarization Hercules Dalianis NADA-KTH Royal Institute of Technology 100 44 Stockholm ph: +46-8-790 91 05 mobile: +46 70 568 13 59 email: [email protected]

Post on 21-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

Hercules Dalianis page 1

Automatic text summarizationAutomatic text summarization

Hercules DalianisNADA-KTH

Royal Institute of Technology100 44 Stockholm

ph: +46-8-790 91 05mobile: +46 70 568 13 59email: [email protected]

Hercules Dalianis page 2

Overview of talkOverview of talk

• Background

• Other summarizers

• Technique

• Future improvements

• Applications

• Evaluation

Hercules Dalianis page 3

Automatic text summarizationAutomatic text summarization• Automatic text summarization is the method

where a computer summarizes a text. • A text is given to the computer and it returns

an non-redundant shorter text- An extract from a longer original text.

• The technique has it’s roots in the 60’s.• With the Internet and the WWW it has been

an awakening interest in summarization techniques.

Hercules Dalianis page 4

Summarization toolsSummarization toolshttp://www.nada.kth.se/~hercules/HDbookmarks.htm

http://www.ics.mq.edu.au/~swan/summarization/projects_full.htm

• Microsoft Word 97, 98 and Word 2000 have a summarizer for documents.

• Intelligent Miner for Text -Summarization tool IBM

• Inxight (XEROX) • Datahammer (Glucose Development

Corporation)

Hercules Dalianis page 5

• Corporum Summarizer- Cognit AS (Norway)• Pertinence (France)• Copernic Summarizer• MuST Prototype• Automated Text Summarization

(SUMMARIST) • Columbia Newsblaster

http://www1.cs.columbia.edu/nlp/newsblaster

• OracleContext • Autonomy

Hercules Dalianis page 6

What is Automatic summarization What is Automatic summarization good for?good for?

• News paper setting and printing– Sydsvenska Dagbladet, Bergens Tidene

• Summarize Scientific texts – Danmarks Elektroniske Forskningsbibliotek

• Telephone systems– Read summarized news synthetically

Hercules Dalianis page 7

• Search engines – to summarize documents for hitlist c.f. Google,

SiteSeeker.

• NewsAgent - Business Intelligence– TDT Topic Detection Tracking and

Columbia Newsblaster

Hercules Dalianis page 8

Hercules Dalianis page 9

Hercules Dalianis page 10

Hercules Dalianis page 11

Summarization approachesSummarization approaches• Extraction vs. Abstraction• Generic vs. Query based• Indicative vs. Informative• Restricted vs. Unrestricted domain• Background information vs. New information (TDT)• Single-document vs. Multiple-document• Monolingual vs. Multilingual• Textual vs. Multimedia

Hercules Dalianis page 12

Text summarizationText summarization

• Extraction is much easier than abstraction• Abstraction needs understanding and rewriting

Hercules Dalianis page 13

TechiquesTechiques

• Find what the text is about

• Then decide what so say

• Then decide how to say it• Text summarization (extraction) uses statistic,

linguistic and heuristic methods

Hercules Dalianis page 14

TechiquesTechiques

• A text is divided into sentences• Sentence positions (News/Reports) • Title words• Bold text, Numerical values, Citations• Named Entities (Frequence based)• Keyword frequency and extraction (nouns,

adverbs, adjectives) • Use morphological information-lemma

Hercules Dalianis page 15

Key word lexiconKey word lexicon

• Key words in news domain• Also called "open class word lexicon”• Key words can be noun, adjectives or adverbs

Hercules Dalianis page 16

• Word which are present in all other sentences.

• User adaptation– Use user keywords - Obtain slanted summaries

• Combination function of all rankings with different weights gives the rank of each sentence.

• Generate all high ranking sentences

• Voilá the summary !

Hercules Dalianis page 17

SweSumSweSum• The first text summarizer for Swedish• Summarizes Swedish news paper text in

HTML/text format on the WWW.• Uses a Swedish key word lexicon that

contains 40 000 words and their possible 700 000 inflections.

• During the text summarization are 5-10 key words produced which describes or categorizes the text -

• Key words - A miniature summary.

Hercules Dalianis page 18

The Swedish keyword lexiconThe Swedish keyword lexicon700 000 words 40 000 words

Inflected version Lemma

statsminister statsminister

statsministern statsminister

statsministerns statsminister

statsministrarna statsminister

statsministrarnas statsminister

.. ...

regeringen regeringen

regeringens regeringen

regeringarna regeringen

regeringarnas regeringen

... ....

Hercules Dalianis page 19

Hercules Dalianis page 20

SweSumSweSum

• SweSum is available to summarize news texts on Swedish, Danish, Norwegian, English, Spanish, French, German and in Farsi (Iranian).

Hercules Dalianis page 21

Hercules Dalianis page 22

• Textsammanfattningsbildspel

Hercules Dalianis page 23

Hercules Dalianis page 24

ProblemsProblems

• Pronoun and other anafora referenser• Kalle sprang. Han sprang fort.• Pronoun resolution

• Clauses can be too long or too short• Clause reductions- and clause combination rules• Aggregation

Hercules Dalianis page 25

SweSumSweSum without without PRMPRM

Analysera mera!

Regi: Harold Ramis

Medv: Robert De Niro, Billy Crystal, Lisa Kudrow

Längd: 1 tim, 45 min

Ett av många skäl att glädjas åt Analysera mera är att Robert De Niro här verkligen utövar skådespelarkonst igen. Han accelererar emotionellt från 0 till 100 på ingen tid alls, för att sedan kattmjukt bromsa in och parkera, lugnt och behärskat. Och han är tämligen oemotståndlig. Här har han åstadkommit ännu en intelligent komedi för alla oss vänner av intelligens och komedi, gärna i kombination.

SvD 99-10-08

Hercules Dalianis page 26

SweSum with PRMSweSum with PRM

Analysera mera!

Regi: Harold Ramis

Medv: Robert De Niro, Billy Crystal, Lisa Kudrow

Längd: 1 tim, 45 min

Ett av många skäl att glädjas åt Analysera mera är att Robert De Niro här verkligen utövar skådespelarkonst igen. Robert accelererar emotionellt från 0 till 100 på ingen tid alls, för att sedan kattmjukt bromsa in och parkera, lugnt och behärskat. Och Robert är tämligen oemotståndlig. Här har Harold åstadkommit ännu en intelligent komedi för alla oss vänner av intelligens och komedi, gärna i kombination.

SvD 99-10-08

Hercules Dalianis page 27

EvaluationEvaluation

• We found that if one summarizes the text to 30 percent of original length one will obtain around 70-80 percent accuracy on 3-4 pages news articles.

• .. but query based evaluations are based on subjective opinions

• These evaluation need large human effort• Small overlap of opinions• We need man-made extracts to compare the machine

made extracts automatically

Hercules Dalianis page 28

• There are some man-made extracts for English news texts.

• We had to create such extract for Swedish news text.

• We created KTH Extract Corpus- Corpus created manually once by voting

• Then one can compare the texts from SweSum and KTH Extract Corpus manually or soon automatically

Hercules Dalianis page 29

KTH extract corpusKTH extract corpus

• http://www.nada.kth.se/iplab/hlt/kthxc/showsumstats.php

and

• http://www.nada.kth.se/iplab/hlt/kthxc/

• Visa celltexten

Hercules Dalianis page 30

• http://www.nada.kth.se/iplab/hlt/kthxc/showsumstats.php?cutoff=30&fileid=svenska-%3Etest-%3Etext001.htm

Hercules Dalianis page 31

Future improvements of SweSumFuture improvements of SweSum

• Tagging instead of static lexicons

• Clause level summarization

• Improved Named Entity recognition

• Improved Pronominal Resolution

• Lexical chains using SIMPLE and/or EuroWordNet

• Automatic evaluation method

Hercules Dalianis page 32

DemonstratorsDemonstrators• SweSum – Standard version

http://swesum.nada.kth.se/index-eng.html

• SweSum – Experimental NE versionhttp://www.nada.kth.se/~xmartin/swesum_lab/index-eng.html

(SweSum uses a Perl-CGI script, there is also a standalone version for plain text/html)