mianwei zhou, kevin chen-chuan chang university of illinois at urbana-champaign entity-centric...

Upload: kaylyn-foley

Post on 28-Mar-2015

TRANSCRIPT

  • Slide 1

Mianwei Zhou, Kevin Chen-Chuan Chang
University of Illinois at Urbana-Champaign
Entity-Centric Document Filtering: Boosting Feature Mapping through Meta-Features

  • Slide 2

Much of the Information Sought on the Web Nowadays is about Entities.
The Web: A Huge Entity Database
"We love George!!"
"OMG! iPad Air is coming out~~"
"How to improve our product quality?"
TREC-KBA Task: How to help Wikipedia editors enrich Wikipedia?

  • Slide 3

Proposal: Entity-Centric Document Filtering System

  • Slide 4

Entity-Centric Document Filtering System: Automatically Identify Relevant Documents for Entities
Diagram labels: Billions of news, blogs, forums, tweets...; Interested Entities; entity-centric document filtering system; Relevant Documents; Irrelevant Documents

  • Slide 5

INPUT: The Entity Name Alone is Usually Insufficient.

  • Slide 6

INPUT: Use Identification Page to Characterize the Target Entity.
Entity Identification Pages:
1. Resolve the ambiguity problem.
2. Provide more information about the entity.

  • Slide 7

OUTPUT: Relevant/Irrelevant Documents for Target Entities.
Bill Gates
Relevant: "Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday..."
Irrelevant: "Steve Jobs story is completely different from Bill Gates..."
Michael Jordan (NBA Player)
Relevant: "Michael Jordan is considered by many the best basketball player in NBA history"
Irrelevant: "Michael Jordan is a leading researcher in machine learning and AI."

  • Slide 8

Problem: Entity-Centric Learning to Filter

  • Slide 9

Problem: Entity-Centric Learning to Filter
Training Phase: Wiki Page, Relevant / Irrelevant documents (shown for two entities)
Entity-centric Document Filter
Testing Phase: Wiki Page, documents with unknown labels (?)

  • Slide 10

How to Predict Document Relevance for an Entity Characterized by an Identification Page?
Traditional IR models such as BM25 and language models do not work:
1. They are designed for short queries.
2. Entity pages contain many noisy keywords.

  • Slide 11

Our Idea: Check whether the document mentions the most basic information about the entity.
Example keywords: Microsoft, Windows, Seattle, Philanthropist

  • Slide 12

Challenge: Learning Across Entities.

  • Slide 13

For an Entity with Labeled Documents, Learning its Important Keywords is Simple.
Relevant Document: "Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday..."
Irrelevant Document: "Steve Jobs story is completely different from Bill Gates..."
[Formula not preserved in the transcript: relevance of document d for entity e]

  • Slide 14

However, Such Keyword Importance is Not Adaptable to Other Entities.
Training Entities (with Labeled Documents): Microsoft, Windows, Seattle, Philanthropist
New Entities (without Labeled Documents): NBA, Chicago Bulls, MVP, UNC
Keyword Importance Transfer

  • Slide 15

Insight: Meta-feature Based Keyword Mapping

  • Slide 16

Keyword: Microsoft
Keyword: Chicago Bulls
Both of them...
1. are mentioned a lot in their Wiki pages.
2. are organizations.
3. appear in the info-box.
...
Similar Importance

  • Slide 17

Meta-Feature -- Features of Features: Properties that are related to keyword importance
General Meta-Features: IDF, IsNoun, InEntity,...
ID-Page-Related Meta-Features:
Wiki Page: InInfobox, InOpenPara,...
Amazon Page: InSpec, InReview,...

  • Slide 18

Solution: Boosting Mapping Model

  • Slide 19

Clustering-based Keyword Mapping
Training Phase keywords: Microsoft, Harvard, Cascade, Hollywood, NKU, CFR...; the, is, this, a, here, as, the...
Testing Phase keywords: NBA, UNC, Bobcats, Wiki, the, must, there...

  • Slide 20

Document Relevance based on Keyword Clusters
Diagram labels: Keyword Clusters; Keyword Importance

  • Slide 21

Traditional Clustering Algorithm Might Fail
Example keywords: the, WA, for, October, programmer, consistently, MS, Oscar, actor, is, Occupation, Hollywood, screenwriter
1. Irrelevant meta-features might lead to useless clusters.
2. There are different possible ways of clustering. Which one is better?
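The cluster-based scoring idea from Slides 16 to 20 can be illustrated with a small Python sketch. It is not the authors' model: the transcript does not preserve the relevance formula, so the additive scoring form, the cluster definitions, the importance values, and every meta-feature value below are assumptions made up for illustration. The point it shows is only that importance is attached to a cluster defined over meta-features, so the same importance applies to keywords of an entity that was never seen in training.

```python
from typing import Callable, Dict, List, Set, Tuple

MetaFeatures = Dict[str, float]

# Hypothetical meta-feature values for keywords on two ID pages (made up).
BILL_GATES_PAGE: Dict[str, MetaFeatures] = {
    "microsoft": {"idf": 3.1, "in_infobox": 1.0},
    "philanthropist": {"idf": 5.4, "in_infobox": 1.0},
    "the": {"idf": 0.1, "in_infobox": 0.0},
}
MICHAEL_JORDAN_PAGE: Dict[str, MetaFeatures] = {
    "nba": {"idf": 3.5, "in_infobox": 1.0},
    "mvp": {"idf": 4.0, "in_infobox": 1.0},
    "the": {"idf": 0.1, "in_infobox": 0.0},
}

# Each cluster: (label, membership test over meta-features, importance).
# The importance values are assumed here; in the talk they are learned.
CLUSTERS: List[Tuple[str, Callable[[MetaFeatures], bool], float]] = [
    ("infobox keywords", lambda m: m["in_infobox"] == 1.0, 2.0),
    ("common words", lambda m: m["idf"] < 1.0, 0.0),
]


def relevance(id_page: Dict[str, MetaFeatures], doc_tokens: Set[str]) -> float:
    """Sum cluster importance over ID-page keywords that the document mentions."""
    score = 0.0
    for _label, in_cluster, importance in CLUSTERS:
        matched = [k for k, m in id_page.items() if in_cluster(m) and k in doc_tokens]
        score += importance * len(matched)
    return score


gates_score = relevance(BILL_GATES_PAGE, {"microsoft", "talk", "the"})
jordan_nba_score = relevance(MICHAEL_JORDAN_PAGE, {"nba", "mvp", "the"})
jordan_ml_score = relevance(MICHAEL_JORDAN_PAGE, {"researcher", "learning", "the"})
```

With these assumed values, the document about the machine learning researcher scores zero for the NBA player entity because it only shares the common word "the" with the ID page.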
  • Slide 22

BoostMapping: Boosting Effective Clusters
Keywords: Microsoft, Harvard, Cascade, Hollywood, NKU, CFR...; the, is, this, a, here, as, the
Document Labels
Objective of Clustering: Boosting the Prediction Accuracy of Relevance
Only Useful Clusters are Generated.

  • Slide 23

BoostMapping: 1. Initialization: Uniform Document Importance

  • Slide 24

BoostMapping: 2. Enumerate Conditions to Generate the Most Predictive Cluster.
Achieve the Highest Prediction Accuracy

  • Slide 25

BoostMapping: 3. Update the Document Distribution

  • Slide 26

BoostMapping: 4. Generate the Next Cluster Under the Current Document Distribution

  • Slide 27

BoostMapping: 5. Repeat the Process Until the Prediction Accuracy Converges
Update the document distribution again.

  • Slide 28

Experiment

  • Slide 29

Three Datasets
TREC-KBA: 29 person entities, 52,238 documents. Wikipedia pages as ID pages.
Product: 39 product entities, 2,398 documents. Amazon pages as ID pages.
MilQuery (from the Million Query Track): 143 general entities, 8,208 documents. Wikipedia pages as ID pages. Examples: Hostage Rescue, Kodak, Dinosaur.

  • Slide 30

Performance Comparison with Baselines
QueryByName: Use entity names as queries.
QBD-TFIDF: Use TFIDF to select important keywords as queries.
VectorSim: Measure relevance based on query-document similarity.
LinearMapping: Keyword mapping based on a linear function.

  • Slide 31

Thanks! Q&A
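The five BoostMapping steps on Slides 23 to 27 can be sketched as a generic AdaBoost-style loop. This is a hedged illustration, not the authors' implementation: the slides do not give the update formulas, so the standard AdaBoost weight update is swapped in here, and the meta-features (`in_infobox`, `is_noun`), their values, the toy documents, and the helper names `fires`, `boost_mapping`, and `predict` are all hypothetical. A "cluster" is modeled as a single condition on one meta-feature, and it fires on a document when the document mentions at least one ID-page keyword satisfying that condition.

```python
import math

# Hypothetical meta-features for keywords on each entity's ID page (made up).
ID_PAGES = {
    "Bill Gates": {
        "microsoft": {"in_infobox": True, "is_noun": True},
        "seattle": {"in_infobox": True, "is_noun": True},
        "story": {"in_infobox": False, "is_noun": True},
        "the": {"in_infobox": False, "is_noun": False},
    },
    "Michael Jordan": {
        "nba": {"in_infobox": True, "is_noun": True},
        "player": {"in_infobox": False, "is_noun": True},
        "the": {"in_infobox": False, "is_noun": False},
    },
}

# Candidate cluster conditions: (meta-feature name, required value).
CONDITIONS = [(n, v) for n in ("in_infobox", "is_noun") for v in (True, False)]


def fires(condition, entity, doc_tokens):
    """Return +1 if the document mentions an ID-page keyword in the cluster, else -1."""
    name, value = condition
    for keyword, meta in ID_PAGES[entity].items():
        if keyword in doc_tokens and meta[name] == value:
            return 1
    return -1


def boost_mapping(examples, rounds=3):
    n = len(examples)
    weights = [1.0 / n] * n  # Step 1: uniform document importance.
    model = []
    for _ in range(rounds):  # Step 5: repeat (here for a fixed number of rounds).
        def weighted_error(cond):
            return sum(w for w, (e, d, y) in zip(weights, examples)
                       if fires(cond, e, d) != y)

        # Steps 2 and 4: enumerate conditions, keep the most predictive cluster
        # under the current document distribution.
        best = min(CONDITIONS, key=weighted_error)
        err = weighted_error(best)
        if err >= 0.5:  # No cluster is better than chance; stop early.
            break
        err = max(err, 1e-9)  # Guard against division by zero.
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((best, alpha))
        # Step 3: raise the weight of misclassified documents, then renormalize.
        weights = [w * math.exp(-alpha * y * fires(best, e, d))
                   for w, (e, d, y) in zip(weights, examples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, entity, doc_tokens):
    score = sum(alpha * fires(cond, entity, doc_tokens) for cond, alpha in model)
    return 1 if score > 0 else -1


# Toy labeled documents for one training entity (+1 relevant, -1 irrelevant).
TRAIN = [
    ("Bill Gates", {"microsoft", "talk", "the"}, 1),
    ("Bill Gates", {"story", "the", "jobs"}, -1),
    ("Bill Gates", {"seattle", "the"}, 1),
    ("Bill Gates", {"story", "the"}, -1),
    ("Bill Gates", {"gates", "the"}, 1),
]

model = boost_mapping(TRAIN)
nba_label = predict(model, "Michael Jordan", {"nba", "player", "the"})
ml_label = predict(model, "Michael Jordan", {"researcher", "learning", "the"})
```

On this toy data the first cluster selected is the condition `in_infobox == True`, and the model trained only on Bill Gates documents labels the NBA document as relevant and the machine learning document as irrelevant for Michael Jordan. Because the clusters are defined over meta-features rather than over the keywords themselves, they carry over to an entity with no labeled documents.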