Summarizing Encyclopedic Term Descriptions on the Web
from COLING 2004
Atsushi Fujii and Tetsuya Ishikawa
Graduate School of Library, Information and Media Studies, University of Tsukuba
Motivation
Existing encyclopedias often lack new terms and new definitions for existing terms
The Web contains an enormous volume of up-to-date information and is a source for obtaining new term descriptions
Using existing search engines for this purpose has many problems
Problems with search engines
They often retrieve extraneous pages that do not describe the submitted term
A user has to identify the page fragments that describe the term
Descriptions in multiple pages are independent of one another
Word senses are not distinguished for ambiguous terms
They propose a summarization method that produces a concise and condensed term description from multiple paragraphs
In this paper, they focus on Japanese technical terms in the computer domain
Overview of CYCLONE
Summarization Method
Given a set of paragraph-style descriptions for a single term in a specific domain, their summarization method produces a concise text describing the term from different viewpoints
12 viewpoints in the computer domain: definition, abbreviation, exemplification, purpose, synonym, reference, product, advantage, drawback, history, component, function
Four steps
Identification: Recognize the language unit associated with a viewpoint
Classification: Merge units with the same viewpoint into a single group
Selection: Determine one or more representative units for each group
Presentation: Produce a summary in a format
Identification
A sentence is often associated with multiple viewpoints
e.g. XML is an abbreviation for eXtensible Markup Language, and is a markup language
Segment Japanese sentences into simple sentences; zero pronoun detection and anaphora resolution can be used
XML is an abbreviation for eXtensible Markup Language (abbreviation viewpoint)
XML is a markup language (definition viewpoint)
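The XML example can be sketched in a few lines of Python. This is only a toy illustration on English text: the real step works on Japanese and relies on zero pronoun detection and anaphora resolution, which are not reproduced here. The function name `split_simple_sentences` and the rule "split on `, and` and restore the term as the omitted subject" are assumptions made for this sketch.

```python
import re


def split_simple_sentences(sentence: str, term: str) -> list[str]:
    """Toy segmentation: split coordinated clauses and restore the omitted subject."""
    clauses = re.split(r",\s*and\s+", sentence.strip().rstrip("."))
    simple = []
    for clause in clauses:
        # Crude stand-in for zero pronoun detection (an assumption):
        # a clause that does not start with the term gets the term as subject.
        if not clause.startswith(term):
            clause = f"{term} {clause}"
        simple.append(clause)
    return simple


parts = split_simple_sentences(
    "XML is an abbreviation for eXtensible Markup Language, and is a markup language.",
    "XML",
)
print(parts)
```

Each resulting simple sentence can then be associated with a single viewpoint.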
Classification
12 viewpoints
36 linguistic patterns are used to describe terms from a specific viewpoint
A simple sentence that matches patterns for multiple viewpoints is classified into each matching viewpoint group
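A minimal sketch of pattern-based classification follows. The slides do not list the 36 Japanese patterns, so the three English regular expressions in `PATTERNS` are hypothetical stand-ins, and `classify_by_patterns` is a name chosen for this example. A sentence is added to every group whose pattern it matches; sentences that match nothing are returned separately.

```python
import re

# Hypothetical English patterns (assumption); the actual system uses
# 36 Japanese linguistic patterns that are not shown in the slides.
PATTERNS = {
    "abbreviation": [re.compile(r"\bis an abbreviation (for|of)\b", re.I)],
    "definition": [re.compile(r"\bis an? (?!abbreviation\b)\w+", re.I)],
    "purpose": [re.compile(r"\bis used (for|to)\b", re.I)],
}


def classify_by_patterns(sentences):
    """Return (groups, unmatched): viewpoint -> sentences, plus leftovers."""
    groups, unmatched = {}, []
    for sentence in sentences:
        hits = [
            viewpoint
            for viewpoint, patterns in PATTERNS.items()
            if any(p.search(sentence) for p in patterns)
        ]
        if not hits:
            unmatched.append(sentence)
        for viewpoint in hits:
            groups.setdefault(viewpoint, []).append(sentence)
    return groups, unmatched


groups, unmatched = classify_by_patterns([
    "XML is an abbreviation for eXtensible Markup Language",
    "XML is a markup language",
    "XML is used to exchange data",
    "Many parsers support XML",
])
print(groups)
print(unmatched)
```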
Classification (cont)
What about sentences that do not match any pattern?
Classify each remaining sentence into the group to which its most similar sentence belongs
Compute the similarity between an unclassified sentence and each of the classified sentences (Dice coefficient)
“miscellaneous” group
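The Dice coefficient between two word sets X and Y is 2|X ∩ Y| / (|X| + |Y|). The sketch below applies it to whitespace-separated, lowercased tokens, which is an assumption (Japanese text would first need word segmentation). The `threshold` parameter that sends weakly matching sentences to the "miscellaneous" group, and its value, are also assumptions made for illustration.

```python
def dice(a: str, b: str) -> float:
    """Dice coefficient over the sets of lowercased whitespace tokens."""
    x, y = set(a.lower().split()), set(b.lower().split())
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def assign_group(sentence, groups, threshold=0.3):
    """Assign a sentence to the group of its most similar classified sentence.

    threshold is a hypothetical cut-off: below it, the sentence goes to
    the "miscellaneous" group.
    """
    best_group, best_score = "miscellaneous", 0.0
    for viewpoint, members in groups.items():
        for member in members:
            score = dice(sentence, member)
            if score > best_score:
                best_group, best_score = viewpoint, score
    return best_group if best_score >= threshold else "miscellaneous"


groups = {
    "definition": ["XML is a markup language"],
    "purpose": ["XML is used to exchange data"],
}
print(assign_group("XML is a language for documents", groups))
print(assign_group("Parsers exist everywhere", groups))
```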
Example
Selection
The number of sentences selected from each group depends on the desired size of the resultant summary
Compute a score for each sentence and select the sentences with the highest scores in each group
Number of common words included (W): sentences including frequent words are preferred
Rank order in CYCLONE (R)
Number of characters included (C): short sentences are preferred
Normalize each factor and compute the final score as a weighted average of the three factors above (W > R > C)
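The scoring can be sketched as follows. The slides give only the ordering W > R > C, so the weights `(0.5, 0.3, 0.2)`, the min-max normalization, and the way W is counted (words that also occur in another sentence of the same group) are all assumptions of this example, as are the function names. The group is assumed to be non-empty.

```python
from collections import Counter


def normalize(values, higher_is_better=True):
    """Min-max normalize to [0, 1]; flip when smaller raw values are better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def select_representatives(sentences, ranks, n=1, weights=(0.5, 0.3, 0.2)):
    """Pick the n best sentences of one group by a weighted average of W, R, C."""
    token_sets = [set(s.lower().split()) for s in sentences]
    freq = Counter(word for tokens in token_sets for word in tokens)
    # W: number of words shared with at least one other sentence in the group.
    w = normalize([sum(1 for word in tokens if freq[word] > 1) for tokens in token_sets])
    # R: rank in the CYCLONE result; a smaller rank number is better.
    r = normalize(ranks, higher_is_better=False)
    # C: number of characters; shorter sentences are preferred.
    c = normalize([len(s) for s in sentences], higher_is_better=False)
    w_weight, r_weight, c_weight = weights  # hypothetical values with W > R > C
    scores = [
        w_weight * wi + r_weight * ri + c_weight * ci
        for wi, ri, ci in zip(w, r, c)
    ]
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in order[:n]]


group = [
    "XML is a markup language",
    "XML is a markup language for structured documents on the Web",
    "A language",
]
print(select_representatives(group, ranks=[1, 2, 3], n=1))
```

Raising `n` corresponds to allowing a longer summary, since more sentences are kept per group.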
Selection (cont)
For the miscellaneous group, they select the sentence that is most dissimilar to the representative sentences selected from the regular groups
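One way to read "most dissimilar" is to pick the miscellaneous sentence whose highest similarity to any representative is lowest. That reading, the reuse of the Dice coefficient over whitespace tokens, and the name `pick_miscellaneous` are assumptions made for this sketch.

```python
def dice(a: str, b: str) -> float:
    """Dice coefficient over the sets of lowercased whitespace tokens."""
    x, y = set(a.lower().split()), set(b.lower().split())
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def pick_miscellaneous(misc_sentences, representatives):
    """Return the miscellaneous sentence least similar to all representatives."""
    def closeness(sentence):
        # Highest similarity to any already selected representative.
        return max((dice(sentence, rep) for rep in representatives), default=0.0)

    return min(misc_sentences, key=closeness)


representatives = ["XML is a markup language"]
misc = ["XML is a language", "Parsers are freely available"]
print(pick_miscellaneous(misc, representatives))
```

Choosing the least similar sentence favors content that the regular viewpoint groups do not already cover.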
Presentation
Presentation (cont)
Top 50 paragraphs for the term “XML”
Only one sentence was selected from each group
Each viewpoint label or sentence is hyperlinked to the associated group or the source paragraph
Evaluation
Summarization evaluation can be classified into intrinsic and extrinsic approaches
Intrinsic: the quality of a text, informativeness
Extrinsic: whether a summary improves the efficiency of a specific task
Evaluation (cont)
15 Japanese terms are used as test inputs
To calculate the coverage, for each of the 15 terms, two students annotated each simple sentence in the top 50 paragraphs of the CYCLONE results with one or more viewpoints
They defined 28 viewpoints, including the original 12
Compression ratio and coverage were calculated over the top 50 paragraphs
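The slides do not spell out the two formulas. The sketch below assumes the usual definitions: compression ratio as the number of characters in the summary divided by the number of characters in the top 50 paragraphs, and coverage as the fraction of viewpoints annotated in those paragraphs that also appear in the summary. The function names and the sample values are illustrative only.

```python
def compression_ratio(summary: str, source_paragraphs: list[str]) -> float:
    """Characters in the summary divided by characters in the source (assumed definition)."""
    return len(summary) / sum(len(p) for p in source_paragraphs)


def coverage(summary_viewpoints, source_viewpoints) -> float:
    """Fraction of source viewpoints that also appear in the summary (assumed definition)."""
    source = set(source_viewpoints)
    return len(set(summary_viewpoints) & source) / len(source)


# Made-up inputs, not data from the paper.
ratio = compression_ratio("abcd", ["abcdefgh", "ijklmnop"])
cov = coverage({"definition", "purpose"}, {"definition", "purpose", "history", "product"})
print(ratio, cov)
```

Under these definitions a lower compression ratio and a higher coverage are both better.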
Results
#Reps: the number of representative sentences selected from each viewpoint group
#Chars: the number of characters in a summary
They select five sentences from the miscellaneous group
VBS: viewpoint-based summarization method
Lead: systematically extracts the top N characters from the CYCLONE results
Conclusion
To compile encyclopedic term descriptions from the Web, they introduced a summarization method
They identify simple sentences, classify them into viewpoint groups, select representative sentences from each group, and present them
VBS achieved a good compression ratio, and its coverage was better than that of the baseline
Future work includes generating a coherent text and performing an extrinsic evaluation