Download - Extracting Keyphrases from Books using Language Modeling Approaches Rohini U AOL India R&D, Bangalore India Bangalore [email protected]

Extracting Keyphrases Extracting Keyphrases from Books using from Books using

Language Modeling Language Modeling ApproachesApproaches

Rohini U Rohini U AOL India R&D,AOL India R&D, Bangalore IndiaBangalore India

[email protected]@corp.aol.com

Vamshi AmbatiVamshi AmbatiLanguage Technologies InstituteLanguage Technologies Institute

Carnegie Mellon UniversityCarnegie Mellon UniversityPittsburgh, USAPittsburgh, USA

[email protected]@cs.cmu.edu

AgendaAgenda

Keyphrase Extraction Keyphrase Extraction Value addition to Digital Libraries Value addition to Digital Libraries Methods of Keyphrase ExtractionMethods of Keyphrase ExtractionRelated WorkRelated WorkOur SolutionOur Solution

What are Keyphrases?What are Keyphrases?

KeyphrasesKeyphrases(Give example)(Give example)

Where used?Where used?Cataloguing in Libraries for IR purposesCataloguing in Libraries for IR purposesQuick Summarization of documentsQuick Summarization of documents

Why important to ULIB?Why important to ULIB?

Vast growth in digital contentVast growth in digital contentMore than a Million books!More than a Million books!Short Meta data description – useful Short Meta data description – useful

to user while readingto user while readingFor further processing of books like For further processing of books like

summarization, IR etcsummarization, IR etc

How do we extract KPs?How do we extract KPs?

Manual entryManual entryReliable, high quality outcomeReliable, high quality outcomeBut, time-consuming, expensiveBut, time-consuming, expensive

AutomaticAutomaticFast extraction but less reliableFast extraction but less reliableNo expense at allNo expense at all

Automatic techniques for KPEAutomatic techniques for KPE

Rule based methodsRule based methodsHeuristics (paragraph beginning, headline Heuristics (paragraph beginning, headline

etc)etc)Krulwich &Burkey etcKrulwich &Burkey etcUsing Linguistic toolsUsing Linguistic tools

Statistical techniquesStatistical techniquesTerm counts and weighting based MethodsTerm counts and weighting based MethodsLearn model from training dataLearn model from training data

Turney et. al[5], KEA[6] , KSpotter[3] etcTurney et. al[5], KEA[6] , KSpotter[3] etc

Requirements for a KPE for Requirements for a KPE for ULIBULIB

Automatic Identification of Automatic Identification of Keyphrases from chapters of booksKeyphrases from chapters of books

Language independentLanguage independentEasily adaptable for different Easily adaptable for different

domainsdomainsNo training data to learn fromNo training data to learn from

Most books in ULIB do not have Most books in ULIB do not have keywords as part of the metadatakeywords as part of the metadata

Solution OutlineSolution Outline

Language Modeling basedLanguage Modeling based Given n-gramsGiven n-grams

Measure Measure Informativeness, PhrasenessInformativeness, PhrasenessScore n-grams based on the above Score n-grams based on the above

measuresmeasuresPick top Pick top K K phrases as Keyphrasesphrases as Keyphrases

Extracting Keyphrases from Extracting Keyphrases from BooksBooksCleaning

& Initialization

Candidate Keyphrases

Extraction

Scoring

Pruning

Extracted Keyphrases

Text

Extracting Keyphrases from Extracting Keyphrases from BooksBooks

Topics are also used to construct user profiles via explicit specication of interests or automatic analysis of Web pages visited


Cleaning &Initialization


Extraction

Scoring

Pruning

topics construct user profiles explicit specification interests automatic analysis web pages visited


Topics are also used to construct user proles via explicit specication of interests or automatic analysis of Web pages visited


Cleaning & Initialization


Extraction

Scoring

Pruning


{topics construct user, construct user profiles,user profiles explicit,

profiles explicit specification,explicit specification interests,

specification interests automatic,automatic analysis web,

analysis web pages,web pages visited }






Extraction

Scoring

Pruning


profiles explicit specication : 0.0281explicit specication interests : 0.0281specication interests automatic : 0.0272user proles explicit : 0.0260construct user proles : 0.0260interests automatic analysis : 0.0255topics construct user : 0.0243automatic analysis web : 0.0227web pages visited : 0.0226analysis web pages : 0.0217

ScoringScoring

PhrasenessPhraseness Measures degree to which a given n-gram can Measures degree to which a given n-gram can

be considered a phrasebe considered a phrase Based on Co-occurrence of wordsBased on Co-occurrence of words Example..Example..

InformativenessInformativeness Measures how informative a given n-gram isMeasures how informative a given n-gram is There is a, a lot ofThere is a, a lot of etc etc Comparing co occurrence on a general corpus Comparing co occurrence on a general corpus

Vs given text(book)Vs given text(book) Total ScoreTotal Score

Phraseness-Score + Informativeness-ScorePhraseness-Score + Informativeness-Score

Scoring - PhrasenessScoring - Phraseness

Computed by measuring distance Computed by measuring distance between unigram model and N-gram between unigram model and N-gram modelmodel

Point wise KL-divergence (Takashi et. Point wise KL-divergence (Takashi et. al 2004)al 2004) δδw (p||q) = p(w)log(p(w)/q(w))

Phraseness measurePhraseness measure δδw (LMfg

N|| LMfg1)

Scoring - InformativenessScoring - Informativeness

Computed by measuring distance Computed by measuring distance between n-gram model from given between n-gram model from given data and n-gram model from general data and n-gram model from general datadata

Point wise KL-divergence (Takashi et. Point wise KL-divergence (Takashi et. al 2004)al 2004) δδw (p||q) = p(w)log(p(w)/q(w))

Informativeness measureInformativeness measure δδw (LMfg

1|| LMbg1)






Extraction

Scoring

Pruning


profiles explicit specication : 0.0281explicit specication interests : 0.0281specication interests automatic : 0.0272user proles explicit : 0.0260construct user proles : 0.0260interests automatic analysis : 0.0255topics construct user : 0.0243automatic analysis web : 0.0227web pages visited : 0.0226analysis web pages : 0.0217






Extraction

Scoring

Pruning


proles explicit specication explicit specication interests specication interests automatic user proles explicit construct user proles interests automatic analysis topics construct user automatic analysis web web pages visited analysis web pages

Conclusions and Future WorkConclusions and Future Work

Discussed benefits of Keyphrases in Discussed benefits of Keyphrases in ULIB contextULIB context

Demonstrated the building of a KPE Demonstrated the building of a KPE that works for booksthat works for books

Robust evaluationRobust evaluationBuilding a test set from books in ULIB for Building a test set from books in ULIB for

generic robust evaluation of KPE tools generic robust evaluation of KPE tools Are chapters really independent in a Are chapters really independent in a

bookbookRevisit the assumption Revisit the assumption

Thank youThank you

ReferencesReferences1. Fred J. Damerau. Generating and evaluating domain-oriented multi-word terms from texts.

Information Processing and Management, 29(4):433-447, 1993.

2. S.T Dumais, J Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th international conference on information and knowledge management, page 148-155. ACM Press, 1998.

3. Min Song, Il-Yeol Song, and Xiaohua Hu. Kpspotter: a exible information gain-based keyphrase extraction system. In WIDM '03: Proceedings of the 5th ACM international workshop on Web information and data management, pages 50-53, New York, NY, USA, 2003. ACM Press.

4. Takashi Tomokiyo and Mathew Hurst. A language modeling approach to keyphrase extraction. In Proceedings of the ACL 2003 workshop on Multiword expressions, pages 33{40, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

5. P.D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303-336, 2006.

6. I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin, and C.G Nevill-Manning. Kea: Practical automatic keyphrase extraction. In E. A. Fox and N. Rowe, editors, Proceedings of digital libraries 99: The fourth ACM conference on digital libraries, pages 254-255. ACM Press, 1999.

7. Mikio Yamamoto and Kenneth W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1-30, 2001

Download - Extracting Keyphrases from Books using Language Modeling Approaches Rohini U AOL India R&D, Bangalore India Bangalore [email protected]

Top Related