Extracting Keyphrases Extracting Keyphrases from Books using from Books using
Language Modeling Language Modeling ApproachesApproaches
Rohini U Rohini U AOL India R&D,AOL India R&D, Bangalore IndiaBangalore India
[email protected]@corp.aol.com
Vamshi AmbatiVamshi AmbatiLanguage Technologies InstituteLanguage Technologies Institute
Carnegie Mellon UniversityCarnegie Mellon UniversityPittsburgh, USAPittsburgh, USA
[email protected]@cs.cmu.edu
AgendaAgenda
Keyphrase Extraction Keyphrase Extraction Value addition to Digital Libraries Value addition to Digital Libraries Methods of Keyphrase ExtractionMethods of Keyphrase ExtractionRelated WorkRelated WorkOur SolutionOur Solution
What are Keyphrases?What are Keyphrases?
KeyphrasesKeyphrases(Give example)(Give example)
Where used?Where used?Cataloguing in Libraries for IR purposesCataloguing in Libraries for IR purposesQuick Summarization of documentsQuick Summarization of documents
Why important to ULIB?Why important to ULIB?
Vast growth in digital contentVast growth in digital contentMore than a Million books!More than a Million books!Short Meta data description – useful Short Meta data description – useful
to user while readingto user while readingFor further processing of books like For further processing of books like
summarization, IR etcsummarization, IR etc
How do we extract KPs?How do we extract KPs?
Manual entryManual entryReliable, high quality outcomeReliable, high quality outcomeBut, time-consuming, expensiveBut, time-consuming, expensive
AutomaticAutomaticFast extraction but less reliableFast extraction but less reliableNo expense at allNo expense at all
Automatic techniques for KPEAutomatic techniques for KPE
Rule based methodsRule based methodsHeuristics (paragraph beginning, headline Heuristics (paragraph beginning, headline
etc)etc)Krulwich &Burkey etcKrulwich &Burkey etcUsing Linguistic toolsUsing Linguistic tools
Statistical techniquesStatistical techniquesTerm counts and weighting based MethodsTerm counts and weighting based MethodsLearn model from training dataLearn model from training data
Turney et. al[5], KEA[6] , KSpotter[3] etcTurney et. al[5], KEA[6] , KSpotter[3] etc
Requirements for a KPE for Requirements for a KPE for ULIBULIB
Automatic Identification of Automatic Identification of Keyphrases from chapters of booksKeyphrases from chapters of books
Language independentLanguage independentEasily adaptable for different Easily adaptable for different
domainsdomainsNo training data to learn fromNo training data to learn from
Most books in ULIB do not have Most books in ULIB do not have keywords as part of the metadatakeywords as part of the metadata
Solution OutlineSolution Outline
Language Modeling basedLanguage Modeling based Given n-gramsGiven n-grams
Measure Measure Informativeness, PhrasenessInformativeness, PhrasenessScore n-grams based on the above Score n-grams based on the above
measuresmeasuresPick top Pick top K K phrases as Keyphrasesphrases as Keyphrases
Extracting Keyphrases from Extracting Keyphrases from BooksBooksCleaning
& Initialization
Candidate Keyphrases
Extraction
Scoring
Pruning
Extracted Keyphrases
Text
Extracting Keyphrases from Extracting Keyphrases from BooksBooks
Topics are also used to construct user profiles via explicit specication of interests or automatic analysis of Web pages visited
Extracted Keyphrases
Cleaning &Initialization
Candidate Keyphrases
Extraction
Scoring
Pruning
topics construct user profiles explicit specification interests automatic analysis web pages visited
Extracting Keyphrases from Extracting Keyphrases from BooksBooks
Topics are also used to construct user proles via explicit specication of interests or automatic analysis of Web pages visited
Extracted Keyphrases
Cleaning & Initialization
Candidate Keyphrases
Extraction
Scoring
Pruning
topics construct user profiles explicit specification interests automatic analysis web pages visited
{topics construct user, construct user profiles,user profiles explicit,
profiles explicit specification,explicit specification interests,
specification interests automatic,automatic analysis web,
analysis web pages,web pages visited }
Extracting Keyphrases from Extracting Keyphrases from BooksBooks
Topics are also used to construct user proles via explicit specication of interests or automatic analysis of Web pages visited
Extracted Keyphrases
Cleaning & Initialization
Candidate Keyphrases
Extraction
Scoring
Pruning
topics construct user profiles explicit specification interests automatic analysis web pages visited
profiles explicit specication : 0.0281explicit specication interests : 0.0281specication interests automatic : 0.0272user proles explicit : 0.0260construct user proles : 0.0260interests automatic analysis : 0.0255topics construct user : 0.0243automatic analysis web : 0.0227web pages visited : 0.0226analysis web pages : 0.0217
ScoringScoring
PhrasenessPhraseness Measures degree to which a given n-gram can Measures degree to which a given n-gram can
be considered a phrasebe considered a phrase Based on Co-occurrence of wordsBased on Co-occurrence of words Example..Example..
InformativenessInformativeness Measures how informative a given n-gram isMeasures how informative a given n-gram is There is a, a lot ofThere is a, a lot of etc etc Comparing co occurrence on a general corpus Comparing co occurrence on a general corpus
Vs given text(book)Vs given text(book) Total ScoreTotal Score
Phraseness-Score + Informativeness-ScorePhraseness-Score + Informativeness-Score
Scoring - PhrasenessScoring - Phraseness
Computed by measuring distance Computed by measuring distance between unigram model and N-gram between unigram model and N-gram modelmodel
Point wise KL-divergence (Takashi et. Point wise KL-divergence (Takashi et. al 2004)al 2004) δδw (p||q) = p(w)log(p(w)/q(w))
Phraseness measurePhraseness measure δδw (LMfg
N|| LMfg1)
Scoring - InformativenessScoring - Informativeness
Computed by measuring distance Computed by measuring distance between n-gram model from given between n-gram model from given data and n-gram model from general data and n-gram model from general datadata
Point wise KL-divergence (Takashi et. Point wise KL-divergence (Takashi et. al 2004)al 2004) δδw (p||q) = p(w)log(p(w)/q(w))
Informativeness measureInformativeness measure δδw (LMfg
1|| LMbg1)
Extracting Keyphrases from Extracting Keyphrases from BooksBooks
Topics are also used to construct user proles via explicit specication of interests or automatic analysis of Web pages visited
Extracted Keyphrases
Cleaning & Initialization
Candidate Keyphrases
Extraction
Scoring
Pruning
topics construct user profiles explicit specification interests automatic analysis web pages visited
profiles explicit specication : 0.0281explicit specication interests : 0.0281specication interests automatic : 0.0272user proles explicit : 0.0260construct user proles : 0.0260interests automatic analysis : 0.0255topics construct user : 0.0243automatic analysis web : 0.0227web pages visited : 0.0226analysis web pages : 0.0217
Extracting Keyphrases from Extracting Keyphrases from BooksBooks
Topics are also used to construct user proles via explicit specication of interests or automatic analysis of Web pages visited
Extracted Keyphrases
Cleaning & Initialization
Candidate Keyphrases
Extraction
Scoring
Pruning
topics construct user profiles explicit specification interests automatic analysis web pages visited
proles explicit specication explicit specication interests specication interests automatic user proles explicit construct user proles interests automatic analysis topics construct user automatic analysis web web pages visited analysis web pages
Conclusions and Future WorkConclusions and Future Work
Discussed benefits of Keyphrases in Discussed benefits of Keyphrases in ULIB contextULIB context
Demonstrated the building of a KPE Demonstrated the building of a KPE that works for booksthat works for books
Robust evaluationRobust evaluationBuilding a test set from books in ULIB for Building a test set from books in ULIB for
generic robust evaluation of KPE tools generic robust evaluation of KPE tools Are chapters really independent in a Are chapters really independent in a
bookbookRevisit the assumption Revisit the assumption
ReferencesReferences1. Fred J. Damerau. Generating and evaluating domain-oriented multi-word terms from texts.
Information Processing and Management, 29(4):433-447, 1993.
2. S.T Dumais, J Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th international conference on information and knowledge management, page 148-155. ACM Press, 1998.
3. Min Song, Il-Yeol Song, and Xiaohua Hu. Kpspotter: a exible information gain-based keyphrase extraction system. In WIDM '03: Proceedings of the 5th ACM international workshop on Web information and data management, pages 50-53, New York, NY, USA, 2003. ACM Press.
4. Takashi Tomokiyo and Mathew Hurst. A language modeling approach to keyphrase extraction. In Proceedings of the ACL 2003 workshop on Multiword expressions, pages 33{40, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
5. P.D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303-336, 2006.
6. I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin, and C.G Nevill-Manning. Kea: Practical automatic keyphrase extraction. In E. A. Fox and N. Rowe, editors, Proceedings of digital libraries 99: The fourth ACM conference on digital libraries, pages 254-255. ACM Press, 1999.
7. Mikio Yamamoto and Kenneth W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1-30, 2001