knowledge based text representation for information...

70
© 2016 Chenyan Xiong Knowledge Based Text Representation for Information Retrieval Chenyan Xiong [email protected] Language Technologies Institute Carnegie Mellon University 1

Upload: others

Post on 05-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Knowledge Based Text Representationfor Information Retrieval

Chenyan Xiong

[email protected]

Language Technologies Institute

Carnegie Mellon University

1

Page 2: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Information Retrieval: Representation and Ranking Models

Document Ranking, the core task of information retrieval

2

Query

Carnegie Mellon

Computer

……………

……………

……………

……………

……………

…………….

Andrew

Carnegie

……………

……………

……………

……………

……………

…………….

CMU

………………

………………

………………

………………

………………

………………

……………….

Documents

{101:1, 233:1}

{101:10, 234:5,456:3…}

{151:5, 233:5,601:3…}

{456:30, 562:7, 233:1…}

𝒒

𝒅𝟏

Representation

𝒅𝟐

𝒅𝟑

Ranking Model

Carnegie Mellon

d1: -3.56

d2: -5.12

d3: -13.2

Page 3: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Information Retrieval:State-of-the-art

3

Representation Ranking Model

Unsupervised

Language Model VSM

BM25 DPH COOR

Learning to Rank

Pointwise: LR, SVM…

Pairwise: RankSVM…..

Listwise: SVMMAP, ListMLE…

Page 4: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Information Retrieval:State-of-the-art

4

Representation Ranking Model

Unsupervised

Language Model VSM

BM25 DPH COOR

Learning to Rank

Pointwise: LR, SVM…

Pairwise: RankSVM…..

Listwise: SVMMAP, ListMLE…

Bag-of-Words

• Represent text by its

individual words

• State-of-the-art

• The backbone of many text

related tasks

Page 5: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Text Understanding of Human beings

5

Page 6: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Text Understanding of Human beings

6

Page 7: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Text Understanding of Human beings

7

My heart is in the “work”Good at CS

Located In

Won a lot Tony and

Turing Award

Page 8: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Graph based storage of human knowledge:

Locate In

Name

“Carnegie Mellon

University”

Type

University

Description

Carnegie Mellon University is a

private research university in

Pittsburgh, Pennsylvania.Name

“Pittsburgh”

Type

City

“Steel City”,

“Home of the Steelers”Alias

Human Knowledge in Knowledge Base

8

Large Scale:

• DBpedia

• Freebase

• NELL

Semi-structured

Carefully Curated

External to IR systems

Page 9: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Our Goal: Knowledge Based Text Representation (ReMAKE)

9

Carnegie MellonQuery

Documents

From bag-of-words

• Vocabulary

Mismatch

• Ambiguity

• Synonymy

• Shallow Text

understanding

To Knowledge Based Representations

Carnegie MellonQuery

Documents

Inference

Inference

• More Evidences

• High Quality

Domain Knowledge

• Structured

Representation

• Deeper Text

Understanding

Count

Count

{101:10, 234:5,456:3…}

{101:1, 233:1}

Exact Match Intelligent Match

Page 10: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Outline

Query Representation

• Query Expansion with Freebase [Term-based] (ICTIR 2015)

• EsdRank [Entity-based] (CIKM 2015)

• Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short)

Document Representation

• Bag-of-entities [Entity-based] (Under review)

Proposed Research

• Hierarchal Learning to Rank for Multiple Representations

• Co-Learning: Entity Finding and Document Ranking

• Entity Graph Representation

10

Page 11: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

QUERYREPRESENTATION

11

Page 12: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Query Representation with Knowledge Base

Entity plays a crucial role in query understanding

12

• >70% web (Bing) queries contain entities

[Guo et al. 2009]

• >50% web (Yahoo!) queries are searching

for entities [Pound et al. 2010]

• ~100% head and medium (AOL) queries

contains Wikipedia entities

• Ambiguity frequently comes from entities

[TREC Web Track]

Obama Family Tree

Query

Barack Obama

Family Tree

Michael Obama

DNAUSA

Kenya

Page 13: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Three Ways to Find Related Entities

Query Annotation

• Entities that appear in the query are useful

13

Related

Entities

Page 14: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Three Ways to Find Related Entities

Entity Search

• Entities that are retrieved by the query are useful

14

Related

Entities

Page 15: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Three Ways to Find Related Entities

Document Annotations

• Entities that frequently appear in the top retrieved documents are useful (Pseudo Relevance Feedback).

15

Related

Entities

Page 16: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Examples of Selected Entities

Query: ESPN

Entity

Search

Document

Annotation

ESPN ESPN

Sports radio ABC Sports

ESPN Radio Bill Simmons

Sports Center Sports Center

ESPN

International

NBA

16

Query: Yahoo

Entity

Search

Document

Annotation

Yahoo Yahoo

Yahoo News Microsoft

Yahoo Music Jerry Yang

Yahoo Mail Associated Press

Yahoo

Messenger

YUI Library

Page 17: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Outline

Query Representation

• Query Expansion with Freebase [Term-based] (ICTIR 2015)

• EsdRank [Entity-based] (CIKM 2015)

• Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short)

Document Representation

• Bag-of-entities [Entity-based] (Under review)

Proposed Research

• Hierarchal Learning to Rank for Multiple Representations

• Co-Learning: Entity Finding and Document Ranking

• Entity Graph Representation

17

Page 18: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Query Expansion with FreebaseTerm Based Query Representation from Knowledge Base

18Xiong and Callan [ICTIR 2015]

Page 19: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Big Picture

Expansion with Freebase in two steps:

1. Related entity selection

2. Expansion term selection

Two methods in each step:

• Select related entities by:

– Entity search

– Entity annotation in top retrieved docs

• Select expansion terms by:

– PRF in related entities’ descriptions

– Category similarity with query

19

Page 20: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Term Selection Method 1:PRF in Entities’ Descriptions

Entities’ descriptions as Pseudo Relevance Feedback documents

• PRF score of a term 𝑡𝑖 from entities’ descriptions:

𝑠(𝑡𝑖) =

𝑒𝑘∈𝐸

𝑡𝑓𝑖,𝑘 × 𝑖𝑑𝑓𝑖 × 𝑠(𝑒𝑘)

20

Selected entities Tf and idf in

entity description

Score from entity

selection

Page 21: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Term Selection Method 2:Category Similarity

Select terms by category similarities with query

21

Trained on entity

descriptions in each category

Distribution on

Freebase’s first level

categories (77 total)

Negative JS divergence as

term’s expansion score

Page 22: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Unsupervised Expansion Methods:Summary

Entity Selection × Term Selection = Four unsupervised methods

22

Entity Search Document

Annotation

Pseudo Relevance

Feedback

FbSearchPRF FbFaccPRF

Category Similarity FbSearchCat FbFaccCat

Two term selection methods

Two entity selection methods

Page 23: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Evaluation: Unsupervised Expansion

ClueWeb09 + TREC Queries

23

All our query expansion with

Freebase methods outperform all

baselines, on all metrics.

Combination of our methods provide further improvements.

More details in the paper.

Page 24: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Outline

Query Representation

• Query Expansion with Freebase [Term-based] (ICTIR 2015)

• EsdRank [Entity-based] (CIKM 2015)

• Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short)

Document Representation

• Bag-of-entities [Entity-based] (Under review)

Proposed Research

• Hierarchal Learning to Rank for Multiple Representations

• Co-Learning: Entity Finding and Document Ranking

• Entity Graph Representation

24

Page 25: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

EsdRank: Connecting Query and Document Through External Semi-structured Data

Entity Based Query Representation from Knowledge Base

25Xiong and Callan [CIKM 2015]

Page 26: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

EsdRank: A General Framework to Use External Semi-Structured Data for Ranking

26

Entities as bridges

between query and

documents

Incorporate more information

from knowledge bases as features

Page 27: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

EsdRank: A General Framework to Use External Semi-Structured Data for Ranking

27

A novel ranking model: Latent-ListMLE

• Learns how to represent query and how to rank documents

together

Latent

Space

Observed Observed

Page 28: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Latent-ListMLE

Latent-ListMLE: Connects q and docs through a latent layer

• For each rank position i:

1. Sample object 𝑜𝑖 based on query: 𝑝 𝑜𝑖 𝑞)

2. Sample document 𝑑𝑖 from unselected documents 𝑆𝑖: 𝑝(𝑑𝑖|𝑜𝑖, 𝑆𝑖)

28

The probability/importance

of latent objects to the query

The set of docs

not ranked yet

ListMLE-like ranking generation,

but from latent objects

Page 29: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

𝑝 𝑜𝑗 𝑞 =exp(𝜃𝑇𝑣𝑗)

σ𝑘=1𝑚 exp(𝜃𝑇𝑣𝑗)

𝑝 𝑑𝑖 𝑜𝑗 , 𝑆𝑖 =exp(𝑤𝑇𝑢𝑖𝑗)

σ𝑘=𝑖𝑛 exp(𝑤𝑇𝑢𝑘𝑗)

Latent-ListMLE

The likelihood of a given rank list 𝑫 :

𝑝 𝐷 𝑞;𝑤, 𝜃 =ෑ

𝑖=1

𝑛

𝑗=1

𝑚

𝑝 𝑑𝑖 𝑜𝑗 , 𝑆𝑖 𝑝(𝑜𝑗|𝑞)

• where:

Learning by maximizing the likelihood of best rank with EM

29

For each document For each latent object

Parameters

Features

Page 30: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Relevant Entities:Three Ways to Find Related Entities

Candidate entities in the latent layer are selected from:

30

Query Annotation (AnnQ) Entity Search (Osrch) Document Annotation (AnnD)

Page 31: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Features

Query-Entity Features

• Textual similarities

– BM25, language model, etc.

• Scores produced by related objects selectors

– Annotation score

• Ontology overlap

– In external data’s ontology

• Frequency of object in corpus annotation

– “IDF” effect

Entity-Document Features

• Textual Similarities

– BM25, language model, etc.

• Connection in the knowledge graph

– Document reachable in 1-2 hops, etc.

• Ontology Overlap

– In external data’s ontology

• Document quality

– Spam score, URL length…

31

Page 32: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Experiment Datasets

Two different scenarios

32

Scenario 1 Scenario 2

Corpus Web Page• ClueWeb09-B

• ClueWeb12-B13

Medical Abstracts• OSHUMED

Query From Web Users (Bing’s)• TREC Web Track 2009-2013

From Experts • OHSUMED

External

Data

Modern Knowledge Base• Freebase

Controlled Vocabulary• Medical Subject

Headings (MeSH)

Page 33: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Web Search Results: ClueWeb09-B

33

Unsupervised Retrieval

Supervised + Freebase

Baselines:

Learning to Rank

Knowledge base helps!• EsdRank-AnnQ (Query Annotation)

outperforms all baselines, on all metrics,

with large margins.

Similar results on ClueWeb12-B13

Page 34: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Medical Search Results: OSHUMED

34

Unsupervised Retrieval

Baselines:

Learning to Rank

Consistent results on OSHUMED

• Same method

• Different corpus (smaller and cleaner)

• Different external data (MeSH)

Page 35: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Outline

Query Representation

• Query Expansion with Freebase [Term-based] (ICTIR 2015)

• EsdRank [Entity-based] (CIKM 2015)

• Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short)

Document Representation

• Bag-of-entities [Entity-based] (Under review)

Proposed Research

• Hierarchal Learning to Rank for Multiple Representations

• Co-Learning: Entity Finding and Document Ranking

• Entity Graph Representation

35

Page 36: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Learning to Rank Related Entities

A Preliminary Study about Finding Better Entities

[SIGIR 2016 Short] Work with Jing Chen. 36

Page 37: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Learning to Rank Related Entities

Improve entity search to find better entities

Query Annotation Entity Search Document Annotation

37

This work

Page 38: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Introduce LeToR to Entity Search

Name

Alias

Attributes

(All Other Attributes)

Related Entity

(Connected Entities’ Names)

Category

(Name of Types)

Virtual Document for An Entity

Query

+ RankSVM

Coor-Ascent

LeToR Models

…….

+

Ranking Features

LM SDM

BM25 Cosine

Boolean …….

38Nikita Zhiltsov et al. SIGIR 2015

Page 39: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Evaluation Results

Experiments on DBpedia search [Balog 2013]

• Four query types (Keywords and Questions), 485 queries

New state-of-the-arts in Entity Search

39

Prior best: Fielded SDM

Page 40: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Summary: Query Representation

Query Representation

• Query Expansion with Freebase [Term-based] (ICTIR 2015)

– Enrich query with knowledge base terms

– State-of-the-art in unsupervised ranking

• EsdRank [Entity-based] (CIKM 2015)

– Learning to build entity representation and to rank together

– State-of-the-art in supervised ranking

• Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short)

– Start using supervisions in finding better entities

– A preliminary work; state-of-the-art in entity search

40

Page 41: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Outline

Query Representation

• Query Expansion with Freebase [Term-based] (ICTIR 2015)

• EsdRank [Entity-based] (CIKM 2015)

• Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short)

Document Representation

• Bag-of-entities [Entity-based] (Under review)

Proposed Research

• Hierarchal Learning to Rank for Multiple Representations

• Co-Learning: Entity Finding and Document Ranking

• Entity Graph Representation

41

Page 42: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

DOCUMENT REPRESENTATION

42

Page 43: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Bag-of-Entities Representation for RankingMoving to the Entity Space

43Short paper under review

Page 44: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Classic Controlled Vocabulary Based IR

Controlled

Vocabularies

Bag-of-Words

Information Unit Predefined Term Word

Structure Ontology Flat

Text Understanding Domain Expertise Shallow

Cost Manual Efforts Automatic

Coverage Small Full

Usage Specific Domain General Domain

+

+

+

-

-

-

-

44

+ -

+

+ -

Page 45: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

From Controlled Vocabulary to Knowledge Base

Controlled

Vocabularies

Bag-of-Words

Information Unit Predefined Term Word

Structure Ontology Flat

Text Understanding Domain Expertise Shallow

Cost Manual Efforts Automatic

Coverage Small Full

Usage Specific Domain General Domain

+

+

+

-

-

-

-

45

+ -

+

+ -

Automatic

Entity Linking

Large Scale

Modern KBs

Our Approaches Work for

General Domain Queries

Knowledge Base:

Page 46: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

From Controlled Vocabulary to Knowledge Base

Controlled

Vocabularies

Bag-of-Words

Information Unit Predefined Term Word

Structure Ontology Flat

Text Understanding Domain Expertise Shallow

Cost Manual Efforts Automatic

Coverage Small Full

Usage Specific Domain General Domain

+

+

+

-

-

-

-

46

+ -

+

+ -

Automatic

Entity Linking

Large Scale

Modern KBs

Our Approaches Work for

General Domain Queries

Knowledge Base:

It is time to revisit the classic controlled vocabularies

based approach, with modern knowledge bases.

Page 47: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Bag-of-Entities Representation

Represent a document by its entity annotations

Document

Bag-of-Entities

Entity

Linking

Three Linking Systems:• Google FACC1, High Precision

• CMNS, High Recall

• TagMe, Balanced

Two Ranking Models• Coordinate Match (COOR)

• Entity Frequency (EF)

47

Page 48: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Evaluation: Entity Linking Coverage

Coverage on ClueWeb09 Queries and Documents

• The annotation is no long sparse

48

Number of Entities

Per Q/Doc

Average Number of

Entities Per Term

Fraction of Q/Doc

has no annotation

Results on ClueWeb12 are similar.

Page 49: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Evaluation: Entity Linking

Entity linking’s Accuracy on Queries

• Ground truth: manual labels [Dalton 2014] and [Liu 2015]

49

50% ~ 60% Accuracy, Good Enough?

Page 50: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Evaluation: BOE Ranking ClueWeb09

Ranking Performance of BOE, ClueWeb09

50

Standard BOW

Ranking (Indri)

BOE Works Well with

Simple Ranking Models

Results on ClueWeb12 are similar.

Page 51: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Summary: Current Progress

Query Representation

• Effective term based and entity based representation from knowledge bases

• A preliminary study about finding better entities

Documents Representation

• BOE work shows the potential of entity based document representation

Conclusion

• Knowledge bases have the power in search

• IR is ready to move from the word space to the knowledge space

51

Page 52: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Summary: Challenges

Query Representation

• The quality of related entities

– The bottleneck of all our systems

Document Representation

• The uncertainties from entity representations

– Brings new noises to ranking systems

• The heterogeneity of the entity space

– Ranking intuitions of words may not be suitable for entities

Knowledge base is much richer than individual terms/entities

• Connections and relationships between entities

• How to use them?

52

Page 53: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

PROPOSED RESEARCH

53

Page 54: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Outline: Proposed Research

Proposed Research

• Hierarchal Learning to Rank for Multiple Representations

– Handle the uncertainties and incorporate more evidence from entities

• Co-Learning: Entity Finding and Document Ranking

– Find better relevant entities

• Entity Graph Representation

– Utilize entity connections

54

Page 55: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Hierarchical Ranking with Multiple Representations

Two Layer Attention Based Ranking Model

• To incorporate more evidence, and to handle BOE’s uncertainty

• Learning to rank and combine multiple representations jointly

55

𝒇(𝒒, 𝒅)

Entity

Features

Word

Features

BOE

Uncertainty

Features

BOW

Uncertainty

Features

1st Layer:

Individual

Ranking Models

LeToR with

More Evidence

from Entities

Classic

LeToR

2nd Layer:

Attend the more

confident one

Final

Ranking

Page 56: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Co-Learning: Ranking and Entity Finding

Multi-task learning for entity finding and document ranking

• Finding relevant entities towards best ranking performance

• Seek supervision from both entity side and document side

56

Query Carnegie Mellon

Documents Entities

CMU

Computer

Pittsburgh

Tartans

LeToR

with

document

labels

LeToR

with

entity

labels

Propagate

information

round with

document-

entity

connections

Page 57: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Entity Graph Representation

57

From Individual Entities To Connected Entity Graph

Carnegie Mellon

University (CMU); is a

private research

university in Pittsburgh,

Pennsylvania. Founded

in 1900 by Andrew

Carnegie as the

Carnegie Technical

Schools…….

Carnegie Mellon

University

CMUUniversity

Pittsburgh

Andrew

Carnegie

Carnegie

Technical

Schools

Alias Is Located In

Founded ByUse to be

Entities

Connections

Terms

For Better Ranking

Promote Central Entities:

Carnegie Mellon

University

Structural Ranking

Query:

CMU Founder

Document:

CMU

Andrew Carnegie

Founded By

Alignment

Research

Page 58: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Entity Graph Representation

58

Entity Connections

• Closed Form

Relationship

• Open Information

Extraction

• Relationship Language

Model

Inference Methods

• CRF Topic Modeling

• Learn from Key

Phrases

• Learn from Manual

Labels

Ranking Methods

• BOE with Better

Node Weights

• Graph Ranking

Models

• Graph Embedding

Page 59: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Research Schedule

59

Query Representation: Document Representation:

Entity Graph Representation

• Node Based Inference

• Explore Different Connections

• Ranking Models

Co-Learning: Entity Finding

and Document RankingHierarchical Learning to Rank

with Multiple Representations

Dissertation Writing

& Job Hunting

08/2016

11/2016

03/2017

01/2018

06/2018 Planed Graduation Date

Page 60: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Conclusion

60

Carnegie MellonQuery Documents

Query Representation: Document Representation:

[ICTIR 2015] Term Based

• State-of-the-art in Unsupervised Ranking

[CIKM 2015] Entity Based

• State-of-the-art in Supervised Ranking

[SIGIR 2016] Find Related Entities

• State-of-the-art in Entity Search

[Under Review] Bag-of-Entities

• Boolean Retrieval works well already

Entity Graph RepresentationCo-Learning: Entity Finding

and Document Ranking

Hierarchical Learning to Rank

with Multiple Representations

March Towards more Intelligent, Semantic

and Structured Information Retrieval

Page 61: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

QUESTION & ANSWERING

61

Page 62: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

EsdRank: More Details

62

Page 63: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Latent-ListMLE: Learning

Learn parameters by Maximize the likelihood of best rank 𝑫∗:

𝑤∗, 𝜃∗ = argmax𝑤,𝜃 log 𝑝(𝐷∗|𝑞, 𝑤, 𝜃)

= argmax𝑤,𝜃

𝑖=1

𝑛

log

𝑗=1

𝑚

𝑝 𝑑𝑖 𝑜𝑗 , 𝑆𝑖 𝑝(𝑜𝑗|𝑞)

EM method is used to solve this problem:

• E step: find the posterior of latent objects

– “Propagate document level training data to latent layer”

• M step: maximize the expected complete likelihood

– “Fit the parameters on current belief”

• Iterate until converge

63

Page 64: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Latent-ListMLE: LearningE-Step

E step, find the posterior of latent variable given current 𝒘,𝜽:

𝜋 𝑜𝑗 𝑑𝑖 , 𝑞 = 𝑝 𝑜𝑗 𝑑𝑖 , 𝑞; 𝜃𝑜𝑙𝑑, 𝑤𝑜𝑙𝑑 =1

𝑍𝑝 𝑑𝑖 𝑜𝑗 , 𝑆𝑖; 𝑤𝑜𝑙𝑑 𝑝(𝑜𝑗|𝑞; 𝜃𝑜𝑙𝑑)

64

Normalizer Defined in the

generative process

Posterior of 𝑜𝑗

Page 65: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Latent-ListMLE: LearningM-Step

M step, find 𝒘∗, 𝜽∗ that maximizes:

𝐸𝜋(𝑂|𝐷∗,𝑞) log 𝑝(𝐷∗, 𝑂|, 𝑞; 𝑤, 𝜃)

=𝑖=1

𝑛

𝑗=1

𝑚

𝜋 𝑜𝑗 𝑑𝑖 , 𝑞 log 𝑝 𝑑𝑖 𝑜𝑗 , 𝑆𝑖 𝑝(𝑜𝑗|𝑞)

65

Consider 𝑜𝑗 as observed with

posterior probability 𝜋(𝑜𝑗|𝑑𝑖 , 𝑞)

Summations only outside log

Page 66: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Web Search Results: ClueWeb12-B13

66

Unsupervised Retrieval

Supervised + Freebase

Baselines:

Learning to Rank

Work well on ClueWeb12-B13

• AnnQ outperforms all baselines, too

• Fewer training queries, less improvements

• We have a more complex model

Page 67: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Entity Search: More Details

67

Page 68: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Prior State-of-the-Art

Fielded Documents + SDM

AliasName

TypesLocation

Contains

Name

Alias

Attributes

(All Other Attributes)

Related Entity

(Connected Entities’ Names)

Category

(Name of Types)

Virtual Document for the EntityRDF of an Entity

Nikita Zhiltsov et al. SIGIR 2015 68

Page 69: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

BOE: More Details

69

Page 70: Knowledge Based Text Representation for Information Retrievalcx/papers/chenyan_proposal_slides.pdf · •Learning to Find Entity [Related Entity Finding] (SIGIR 2016 Short) Document

© 2016 Chenyan Xiong

Error Analysis

Defining Factor: Uncertainties from Automatic Entity Linking

70

When the query annotation is correct, huge improvements

When the query annotation is wrong, may hurt