sigir 2008 singapore jonathan elsas, jaime arguello, jamie callan & jaime carbonell lti/scs/cmu...


SIGIR 2008

Singapore

Jonathan Elsas, Jaime Arguello, Jamie Callan & Jaime Carbonell

LTI/SCS/CMU

Retrieval and Feedback Models for Blog Feed Search

Outline

• The task
– Overview of Blogs & Blog Search
– Challenges in Blog Search

• Our approach
– Retrieval Models
– Query Expansion Models

• Conclusion

Background

What is a Blog?

What is a Feed?

<?xml …?>
<feed>
  <entry>
    <author>Peter …</author>
    <title>Good, Evil…</title>
    <content>I’ve said…</content>
  </entry>
  <entry>
    <author>Peter …</author>
    <title>Agreeing…</title>
    <content>Some peo…</content>
  </entry>
  …
</feed>
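A minimal sketch of how the entries of a feed like the one above could be read, using Python's standard `xml.etree.ElementTree`. The sample feed text is hypothetical (the slide's content is elided), and real RSS/Atom feeds carry namespaces and more fields than shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified feed used only for illustration.
FEED_XML = """<feed>
  <entry>
    <author>Peter</author>
    <title>First post</title>
    <content>Some text about the first topic.</content>
  </entry>
  <entry>
    <author>Peter</author>
    <title>Second post</title>
    <content>More text about another topic.</content>
  </entry>
</feed>"""


def parse_entries(feed_xml):
    """Return one dict per <entry> with its author, title and content text."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.iter("entry"):
        entries.append({
            "author": entry.findtext("author", default=""),
            "title": entry.findtext("title", default=""),
            "content": entry.findtext("content", default=""),
        })
    return entries


entries = parse_entries(FEED_XML)
for e in entries:
    print(e["title"])
```

Each returned dict corresponds to one post of the blog, which is the unit the entry-level retrieval model later in the deck treats as a document.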

Blog-Feed Correspondence

Blog ↔ Feed

Post ↔ Entry

HTML ↔ XML

Why are Blogs important?

Technorati is currently tracking:
> 112.8 Million Blogs
> 175,000 new Blogs per day
> 1.6 Million posts per day

[http://www.technorati.com/about/]

The Task

Feed Search at TREC

Ranking Blogs/Feeds (collections of posts) in response to a user’s query, [X]

“A relevant feed should have a principle and recurring interest in X”

— TREC 2007 Blog Track

(a.k.a. Blog Distillation)

Feed Search at TREC

[Gardening]

[Apple iPod]

[Violence in Sudan]

[Gun Control]

[Food]

[Wine]

• Represent ongoing information needs

• Frequently very general

Challenges in Feed Search

1. A feed is a collection of documents
– How does relevance at the entry level correspond to relevance at the feed level?

[Figure: a feed drawn as a sequence of entries over time]

Challenges in Feed Search

2. Even a topical feed is topically diverse
– Can we favor entries close to the central topic of the feed?

[Figure: entries of a "Space Exploration" feed plotted by topic over time: NASA, China’s plans for the moon, shuttle launch, Mars rover, Boeing, My dog]

Challenges in Feed Search

3. Feeds are noisy
– Spam blogs, spam & off-topic comments

Challenges in Feed Search

4. General & Ongoing Information Needs

[Mac]: … post regularly about new products, features, or application software of Apple Mac computers.

[Music]: … describing songs, biographies of musicians, musical styles and their influences of music on people are discussed.

[Food]: … describing experiences eating cuisines, culinary delights, recipes, nutrition plans.

[Wine]: … such as tastings, reviews, food matching or pairing, and oenophile news and events.

Our Approach

Challenges → Our Approach

Feeds: topically diverse, noisy collections → Retrieval Models

Information Needs: general & ongoing → Feedback Models

Retrieval Models

• Challenge: ranking topically diverse collections

• Representation: feed vs. entry

• Model topical relationship between entries

Large Document (Feed) Model

[Figure: each feed's XML file, with all of its entries, is indexed as a single document in the Feed Document Collection; the query Q is run against this collection to produce Ranked Feeds]

Rank by Indri’s standard retrieval model [Metzler and Croft, 2004; 2005]

Large Document (Feed) Model

Advantages:

• A straightforward application of existing retrieval techniques

Potential Pitfalls:

• Large entries dominate a feed’s language model

• Ignores relationship among entries
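A minimal sketch of the large document idea: concatenate each feed's entries into one pseudo-document and rank feeds by query likelihood. It uses plain Dirichlet-smoothed unigram query likelihood as a stand-in; this is a simplification and not Indri's actual implementation. The function names, the whitespace tokenizer, the toy feeds and the `mu` value are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption, not the deck's pipeline).
    return text.lower().split()


def large_document_scores(query, feeds, mu=2500.0):
    """feeds: dict feed_id -> list of entry texts. Returns feed_id -> log P(Q|F)."""
    # One term-frequency table per feed, built from all of its entries together.
    feed_tf = {
        fid: Counter(t for entry in entries for t in tokenize(entry))
        for fid, entries in feeds.items()
    }
    coll_tf = Counter()
    for tf in feed_tf.values():
        coll_tf.update(tf)
    coll_len = sum(coll_tf.values())

    scores = {}
    for fid, tf in feed_tf.items():
        length = sum(tf.values())
        score = 0.0
        for term in tokenize(query):
            if coll_tf[term] == 0:
                continue  # term unseen in the collection: same effect on every feed
            p_coll = coll_tf[term] / coll_len
            score += math.log((tf[term] + mu * p_coll) / (length + mu))
        scores[fid] = score
    return scores


feeds = {
    "space": ["nasa launches shuttle", "mars rover lands", "my dog"],
    "cooking": ["pasta recipe", "wine and cheese"],
}
scores = large_document_scores("mars rover", feeds, mu=10.0)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because all entries are pooled before counting, a single very long entry contributes most of the term counts, which is the first pitfall listed above.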

[Figure: a feed made up of entries of very different lengths]

Small Document (Entry) Model

[Figure: every entry is indexed as its own document (document = entry) in the Entry Document Collection; the query Q produces Ranked Entries, and a rank aggregation function turns them into Ranked Feeds]

Small Document (Entry) Model

Rank by a combination of:

• Query Likelihood

• Entry Centrality

• Feed Prior: favors longer feeds

ReDDE Federated Search Algorithm [Si & Callan, 2003]
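The ranking formula itself did not survive in this transcript. The sketch below assumes the three components combine as P(F) · Σ_E P(Q|E) · P(E|F), that is, feed prior times the centrality-weighted sum of entry query likelihoods. That combination, the function names, and the `log(1 + n)` form of the length prior are assumptions for illustration only.

```python
import math


def uniform_centrality(n_entries):
    # Every entry is treated as equally central to its feed.
    return [1.0 / n_entries] * n_entries


def log_length_prior(n_entries):
    # Assumed form of a prior that favors longer feeds; the +1 is our own
    # choice so that a one-entry feed does not get a zero prior.
    return math.log(1 + n_entries)


def small_document_score(entry_likelihoods, centralities, feed_prior=1.0):
    """Assumed aggregation: P(F) * sum over entries of P(Q|E) * P(E|F)."""
    if len(entry_likelihoods) != len(centralities):
        raise ValueError("one centrality value is needed per entry")
    return feed_prior * sum(
        ql * c for ql, c in zip(entry_likelihoods, centralities)
    )


# Hypothetical per-entry query likelihoods P(Q|E) for two feeds.
feed_a = [0.02, 0.03, 0.01]
feed_b = [0.05]

score_a = small_document_score(feed_a, uniform_centrality(len(feed_a)))
score_b = small_document_score(feed_b, uniform_centrality(len(feed_b)))
score_a_log = small_document_score(
    feed_a, uniform_centrality(len(feed_a)), log_length_prior(len(feed_a))
)
print(score_a, score_b, score_a_log)
```

With uniform centrality the sum becomes the mean entry likelihood, so entry length no longer decides how much an entry contributes to its feed's score.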

Entry Centrality

Uniform :

Geometric Mean :
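The centrality formulas were images and are missing here. As an illustration only, the sketch below assumes a geometric-mean style score: each entry is scored by the length-normalized likelihood of its own terms under a feed language model, and the scores are normalized over the feed. The feed model (an average of the entry models) and all names are assumptions, not the authors' definition.

```python
import math
from collections import Counter


def entry_model(text):
    """Maximum-likelihood unigram model P(t|E) of one entry."""
    tf = Counter(text.lower().split())
    n = sum(tf.values())
    return {t: c / n for t, c in tf.items()}


def centrality(entries):
    """Return assumed P(E|F) for each entry text of one feed."""
    models = [entry_model(e) for e in entries]
    # Assumed feed model: the average of the entry models, so that a long
    # entry does not dominate the feed's term distribution.
    feed = Counter()
    for m in models:
        for t, p in m.items():
            feed[t] += p / len(models)
    # phi(E, F) = prod_t P(t|F) ** P(t|E), computed in log space.
    phi = [
        math.exp(sum(p * math.log(feed[t]) for t, p in m.items()))
        for m in models
    ]
    z = sum(phi)
    return [x / z for x in phi]


entries = [
    "nasa shuttle launch",
    "nasa mars rover",
    "shuttle launch nasa mars",
    "my dog",
]
probs = centrality(entries)
print(probs)
```

In this toy feed the off-topic "my dog" entry shares no terms with the others and ends up with the lowest centrality, which is the behavior the "favor entries close to the central topic" question asks for.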


Small Document (Entry) Model

Advantages:

• Controls for differing entry length

• Models topical relationship among entries

Disadvantages:

• Centrality computation is slow(er)

Not only improves speed, but also performance

Retrieval Model Results

• 45 Queries from the TREC 2007 Blog Distillation Task

• BLOG06 test collection, XML feeds only

• 5-Fold Cross Validation for all retrieval model smoothing parameters
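A minimal sketch of 5-fold cross validation over queries for picking a smoothing parameter. The `evaluate` callback is hypothetical: it stands for "run retrieval with this parameter on these queries and return the evaluation score". The candidate values and the toy scoring function are placeholders, not values from the experiments.

```python
def make_folds(query_ids, k=5):
    """Split the query ids into k disjoint folds (round-robin)."""
    return [query_ids[i::k] for i in range(k)]


def cross_validate(query_ids, candidate_params, evaluate, k=5):
    """Pick the best parameter on each training split, score it on the held-out fold.

    evaluate(param, queries) -> score is a hypothetical callback.
    Returns the mean held-out score across folds.
    """
    folds = make_folds(query_ids, k)
    test_scores = []
    for i, test in enumerate(folds):
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        best = max(candidate_params, key=lambda p: evaluate(p, train))
        test_scores.append(evaluate(best, test))
    return sum(test_scores) / k


def toy_evaluate(mu, queries):
    # Placeholder score that peaks at mu = 1000, for demonstration only.
    return 1.0 / (1.0 + abs(mu - 1000) / 1000.0)


queries = list(range(45))  # 45 queries, as in the setup above
folds = make_folds(queries)
cv_score = cross_validate(queries, [100, 1000, 2500], toy_evaluate)
print([len(f) for f in folds], cv_score)
```

Tuning on four folds and scoring on the fifth keeps the reported number from being fitted to the same queries it is measured on.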

Retrieval Model Results

[Figure: bar chart of Mean Average Precision (axis 0.245 to 0.325) comparing the Large Document (Feed) Model with the Small Document (Entry) Models, labeled by Uniform and Log(Feed Length) settings. Bar values as extracted, in order: 0.29, 0.277, 0.29, 0.298, 0.315. A further label reads "Log Prior MAP 0.188", and one setting is marked n/a.]
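The results in this deck are reported as Mean Average Precision. A minimal sketch of that metric follows; the feed ids and relevance judgments are made up for the example.

```python
def average_precision(ranked, relevant):
    """Average of precision values at the rank of each relevant item."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs, qrels):
    """runs: query -> ranked list; qrels: query -> set of relevant items."""
    return sum(
        average_precision(runs[q], qrels[q]) for q in qrels
    ) / len(qrels)


# Hypothetical rankings and judgments for two queries.
runs = {"q1": ["f1", "f2", "f3", "f4"], "q2": ["f5", "f6"]}
qrels = {"q1": {"f1", "f3"}, "q2": {"f6"}}

ap_q1 = average_precision(runs["q1"], qrels["q1"])
map_score = mean_average_precision(runs, qrels)
print(ap_q1, map_score)
```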

Feedback Models

• Challenge: Noisy collection with general & ongoing information needs

• Use a cleaner external collection for query expansion (Wikipedia)

• With an expansion technique designed to identify multiple query facets

Query Expansion (PRF)

[Q] → BLOG06 Collection → Related Terms from top K documents → [Q + Terms]

[Lavrenko & Croft, 2001]
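A minimal sketch of pseudo-relevance feedback in the spirit of a relevance model: each term is weighted by its probability in a top-ranked document, scaled by how well that document matched the query. This is a simplified illustration, not the exact estimator used in the experiments. The documents and their query likelihoods are made up, and the likelihoods are assumed to be positive.

```python
from collections import Counter


def relevance_model_terms(top_docs, n_terms=5):
    """top_docs: list of (text, query_likelihood) for the top K documents.

    Returns the n_terms highest-weighted (term, weight) pairs, where
    weight(w) = sum over documents of P(w|D) * normalized P(Q|D).
    """
    total = sum(ql for _, ql in top_docs)
    weights = Counter()
    for text, ql in top_docs:
        tf = Counter(text.lower().split())
        length = sum(tf.values())
        for term, count in tf.items():
            weights[term] += (count / length) * (ql / total)
    return weights.most_common(n_terms)


# Hypothetical top-3 documents for the query [Photography].
top_docs = [
    ("digital photography camera lens", 0.6),
    ("photography film camera", 0.3),
    ("free photography download", 0.1),
]
terms = relevance_model_terms(top_docs, n_terms=2)
print(terms)
```

The expansion terms can only be as clean as the top K documents, so a noisy collection feeds its noise straight into the expanded query.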

Query Expansion Example

Query: [Photography]

Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography

PRF: photography, nude, erotic, art, girl, free, teen, fashion, women

Feedback Model Results

[Figure: bar chart of Mean Average Precision (axis 0.2 to 0.36) for BLOG LD and BLOG SD, comparing None and PRF]

Query Expansion (Wikipedia PRF)

[Q] → Wikipedia → Related Terms from top K documents → [Q + Terms] → BLOG06 Collection

[Lavrenko & Croft, 2001]

[Diaz & Metzler, 2006]

Query Expansion Example

Query: [Photography]

Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography

PRF: photography, nude, erotic, art, girl, free, teen, fashion, women

Wikipedia PRF: photography, director, special, film, art, camera, music, cinematographer, photographic

Feedback Model Results

[Figure: bar chart of Mean Average Precision (axis 0.2 to 0.36) for BLOG LD and BLOG SD, comparing None, PRF and Wiki. PRF]

Query Expansion (Wikipedia Link)

[Q] → Wikipedia → Related Terms from link structure → [Q + Terms] → BLOG06 Collection

Wikipedia Link-Based Query Expansion

Wikipedia Link-Based Expansion

• Run the query Q against Wikipedia

• Relevance Set, Top R = 100

• Working Set, Top W = 1000

• Extract anchor text from the Working Set that links to the Relevance Set.

Wikipedia Link-Based Expansion

Relevance Set, Top R = 500

Working Set, Top W = 1000

Extract anchor text from the Working Set that links to the Relevance Set.

Combines relevance and popularity

Relevance: An anchor phrase that links to a highly ranked article gets a high score

Popularity: An anchor phrase that links many times to mid-ranked articles also gets a high score
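A minimal sketch of the anchor-text scoring described above. The exact scoring formula is not given in this transcript, so the sketch assumes each link from a Working Set article to a Relevance Set article adds `R - rank + 1` to its anchor phrase, where `rank` is the target's position in the ranking. This rewards both highly ranked targets and frequently used anchors. The data structures, names and toy link graph are assumptions.

```python
from collections import Counter


def link_expansion_terms(ranked_articles, links, R=100, W=1000, n_terms=10):
    """ranked_articles: article titles in rank order for the query.

    links: dict source title -> list of (anchor_text, target title).
    Returns the n_terms best (anchor phrase, score) pairs.
    """
    relevance_rank = {
        title: rank
        for rank, title in enumerate(ranked_articles[:R], start=1)
    }
    scores = Counter()
    for source in ranked_articles[:W]:
        for anchor, target in links.get(source, []):
            rank = relevance_rank.get(target)
            if rank is not None:
                # Assumed weighting: higher-ranked targets count more, and
                # repeated anchors accumulate (popularity).
                scores[anchor.lower()] += R - rank + 1
    return scores.most_common(n_terms)


# Hypothetical ranking and link graph.
ranked = ["Photography", "Digital photography", "Camera", "Film", "Music"]
links = {
    "Photography": [("digital photography", "Digital photography")],
    "Camera": [
        ("digital photography", "Digital photography"),
        ("photography", "Photography"),
    ],
    "Film": [("photography", "Photography"), ("music", "Music")],
    "Music": [("photography", "Photography")],  # outside the Working Set below
}
terms = link_expansion_terms(ranked, links, R=2, W=4)
print(terms)
```

Because the expansion units are whole anchor phrases, multi-word terms such as "digital photography" survive intact instead of being split into single words.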

Query Expansion Example

Query: [Photography]

Wikipedia Link-Based: photography, photographer, digital photography, photographic, depth of field, feature photography, film, photographic film, photojournalism

PRF: photography, nude, erotic, art, girl, free, teen, fashion, women

Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography

Feedback Model Results

[Figure: bar chart of Mean Average Precision (axis 0.2 to 0.4) for BLOG LD and BLOG SD, comparing None, PRF, Wiki. PRF and Wiki. Link]

Conclusion

• Feed Search Challenges:

– Feeds are topically diverse, noisy collections

– Ranked against ongoing & general information needs

• Novel Retrieval Models:

– Ranking collections, sensitive to topical relationship among entries

• Novel Feedback Models:

– Discover multiple query facets & robust to collection noise

Thank You!

Student Travel Grant funding from: ACM SIGIR, Amit Singhal, Microsoft Research

Entry Centrality GM Derivation

where

Entry Generation Likelihood:

|E|

Query Expansion Examples

Query: [Music]

Wikipedia Expansion: Music, Folk music, Electronic music, Folk, Music video, World music, Ambient, Electronic, Country music

PRF: Music, Country, Download, Free, MP3, Mp3andmore, Lyric, Listen, Song

Query Expansion Examples

Query: [Scottish Independence]

Wikipedia Expansion: scotland, scottish parliament, scottish, scottish national party, wars of scottish independence, scottish independence, william wallace, glasgow, scottish socialist party

PRF: scotland, independence, party, convention, politics, snp, national, people, scot

Query Expansion Examples

Query: [Machine Learning]

Wikipedia Expansion: machine learning, learning, artificial intelligence, turing machine, machine gun, neural network, support vector machine, supervised learning, artificial neural network

PRF: learn, machine, credit, card, karaoke, journal, sex, model, sew

Query Generality Characteristics

• Query Length:

– BLOG: 1.9 words

– TB04: 3.2 words

– TB05: 3.0 words

• ODP Depth

– BLOG: 4.7 levels

– TB04: 5.2 levels

– TB05: 5.3 levels

Relevance Set Cohesiveness

[Figure: Wikipedia, with the Relevance Set (Top R = 100) highlighted]

$$\text{Cohesiveness} = \frac{|L_{in}|}{|L_{in} \cup L_{out}|}$$
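A minimal sketch of the cohesiveness ratio. The slide does not define the two link sets, so this assumes L_in is the set of links from Relevance Set articles to other Relevance Set articles and L_out is the set of links from Relevance Set articles to articles outside it. Both definitions, the function name and the toy graph are assumptions.

```python
def cohesiveness(relevance_set, links):
    """links: dict source title -> set of target titles.

    Returns |L_in| / |L_in ∪ L_out| under the assumed definitions:
    L_in  = links that start and end inside the relevance set,
    L_out = links that start inside it and end outside it.
    """
    members = set(relevance_set)
    l_in, l_out = set(), set()
    for source in members:
        for target in links.get(source, ()):
            if target in members:
                l_in.add((source, target))
            else:
                l_out.add((source, target))
    union = l_in | l_out
    return len(l_in) / len(union) if union else 0.0


# Hypothetical relevance set and link graph.
relevance_set = ["A", "B", "C"]
links = {"A": {"B", "C", "X"}, "B": {"A"}, "C": {"Y"}}
score = cohesiveness(relevance_set, links)
print(score)
```

Under these definitions a value near 1 means the top-ranked articles mostly link to each other, and a value near 0 means their links mostly point elsewhere.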

Relevance Set Cohesiveness

Is it the Queries?

[Figure: relevance set cohesiveness for Feed Search Queries vs. TB Adhoc Queries]

But none of these measures predicts whether Wikipedia expansion helps…
