using blog properties to improve retrieval

23

Upload: yan

Post on 17-Jan-2016

50 views

Category:

Documents


0 download

DESCRIPTION

Using Blog Properties to Improve Retrieval. Gilad Mishne ISLA, University of Amsterdam ICWSM’2007. Introduction. the TREC Blog Track (2006) TREC featured, for the first time, a track dedicated to blog retrieval opinion retrieval task - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Using Blog Properties to Improve Retrieval
Page 2: Using Blog Properties to Improve Retrieval

the TREC Blog Track (2006)› TREC featured, for the first time, a track

dedicated to blog retrieval› opinion retrieval task

participants were requested to locate blog posts expressing an opinion about a topic in a large collection of posts

Page 3: Using Blog Properties to Improve Retrieval

approach to opinion retrieval task› identified three aspects involved in locating

opinionated blog posts topical relevance

the degree to which a post deals with the given topic similar to relevance as defined for ad-hoc retrieval

tasks opinion expression

the degree to which it contains subjective information about a topic

post quality estimation of the quality of a blog post query-independent detection of spam in blogs

spam blog post → low-quality one

Page 4: Using Blog Properties to Improve Retrieval

using a wide range of techniques each technique resulted in a separate relevance

score for each blog post› standard information retrieval approaches

resulted in a ranking of posts by their topical relevance to a query

› sentiment analysis used to rank all posts by the amount of sentiment

contained in them› Spam filtering

used to rank all posts by their estimated spam level final ranking of a blog post was obtained by

combining the partial scores assigned to it by the different approaches using a linear combination

one of the top performers at TREC

Page 5: Using Blog Properties to Improve Retrieval

three techniques use properties which are specific to the blogspace› one from each of the high-level aspects› based on a straightforward, inexpensive

approach each of these techniques improve over a

baseline combined with other techniques we use,

they improve also over state-of-the-art

Page 6: Using Blog Properties to Improve Retrieval

1. uses the timelined nature of blogs to identify periods of increased possible relevance

2. relates the amount of comments in a blog posts and the likelihood of an opinion being present in the post

3. uses query-dependent spam filtering to reduce noise in the collection

Page 7: Using Blog Properties to Improve Retrieval

relevant documents are found mostly in the few days following the event

Page 8: Using Blog Properties to Improve Retrieval

[Li and Croft] assigning higher relevance to recent documents› testing the temporal distribution of query

terms› using a decay function from detected peaks in

the aggregated distribution propose a simpler, more intuitive method

› assume that highly-ranked documents (according to topical similarity) are relevant

› 在這些 highly-ranked documents 中 , 用它們的date distribution 來估計所有 relevant doc. 的 date distribution 當作 doc. 相關度的 prior prob.

Page 9: Using Blog Properties to Improve Retrieval

more noisy than that of the actual relevant posts

but the peak around the time of the event is preserved

Page 10: Using Blog Properties to Improve Retrieval

assume that blog posts published near the time of an event are more likely to be relevant to it› regardless of their topical similarity

as with content-based blind relevance feedback› allows to identify relevance also for documents

which do not contain the query words this temporal prior likelihood is then

combined with its topical retrieval relevance using a linear combination to rerank

Page 11: Using Blog Properties to Improve Retrieval

the presence of comments in a blog post indicates that some discussion is taking place regarding the contents of the post

as blogs comments are currently, to a large degree, not syndicated› extracting them is a laborious task› involving parsing the HTML contents and

identifying the comments in it Goal: estimate dispute by the presence of

comments› limit to identifying the number of comments

Page 12: Using Blog Properties to Improve Retrieval

In the collection used at TREC, both the HTML content and the syndicated (RSS or Atom) content of each post are provided separately

兩種版本長度的差 , 可以看出 HTML 版本多出的資訊量 ( 包含 layout, content 兩種 information)› one of the types of content present in HTML but

not in the syndication is the comments and trackbacks of the post

› the amount of non-comment overhead involved is fixed over multiple posts in the same blog

Page 13: Using Blog Properties to Improve Retrieval

the discussion level calculate for a given post p’

› pXML : the syndicated content of a post p from blog B› pHTML : corresponding HTML content

1 : posts which have the same amount of comments as other posts in the same blog

<1 : attract less discussion >1 : posts with an above-average volume of

responses query-independent

reflecting the relative volume of comments of the post

Page 14: Using Blog Properties to Improve Retrieval

compares with the manually extracted number of comments of a few posts in two different blogs

Page 15: Using Blog Properties to Improve Retrieval

Spam blogs account for as much as 20% of all blogs› significantly degrade the performance of

blog search and analytics implement a spam filter for blogs

based on support vector machine classification of the contents of the blogs as well as other features → use it to rerank retrieval results

improvement : MAP 9%

Page 16: Using Blog Properties to Improve Retrieval

the contribution of the spam filter to different queries varied substantially› Out of a total of 50 queries, spam filtering

increased retrieval accuracy for 39 queries and decreased it for 10

commercially-oriented

sports, politics,…

Page 17: Using Blog Properties to Improve Retrieval

假設可以預測一個 query 是 commercially-oriented 的程度 , 可能可以提升 average performance› high commercial queries → 增加 spam filter 的重要性› non-commercial queries → 減少

measure the likelihood of observing the query terms in a commercial context› simply count the number of documents in which the

query terms co-occur with a term highly correlated with commercial activity use the word “shop”

query commercial-intent value› 包含 query term 和” shop” 之 doc.# / 包含 query term

之 doc.#

Page 18: Using Blog Properties to Improve Retrieval

query commercial intent value› used as a query-specific weight for spam

reranking instead of using a fixed, query-independent weight

Page 19: Using Blog Properties to Improve Retrieval

Experimental settings› based on the data used at the TREC 2006 Blog

Track : TREC Blog06 corpus› crawl of more than 100,000 syndicated feeds › over a period of 11 weeks› 3.2 million posts› total size of 148GB

including both HTML and syndicated content 50 queries were used at the track, with

63,103 blog posts being manually assessed for relevance

Page 20: Using Blog Properties to Improve Retrieval

baseline› use the language modeling-based retrieval

approach proposed by Hiemstra› MAP score : 0.1797

median MAP score at TREC 2006 : 0.1371

Page 21: Using Blog Properties to Improve Retrieval

All results are statistically significant (p < 0.05) using the t-testAll results are statistically significant (p < 0.05) using the t-test

indicating the approaches are orthogonal, and improve the ranking of different types of documentsindicating the approaches are orthogonal, and improve the ranking of different types of documents

combination shows substantial improvementscombination shows substantial improvements

the best-performing system at TREC 2006 obtained a MAP score of 0.2052

Page 22: Using Blog Properties to Improve Retrieval

three techniques for improving opinion retrieval in blogs› each based on a property of the blogspace and

each implemented with a simple, inexpensive mechanism

correspond to different aspects of opinionated retrieval› improve topical retrieval

incorporated temporal information into the retrieval model via a blind relevance feedback approach

› improve identification of opinionated content› some queries require more rigorous spam filtering

assign varying levels of spam detection according to a query’s commercial context

Page 23: Using Blog Properties to Improve Retrieval

Combined with other, more traditional techniques for the task of opinion retrieval, these approaches substantially improve over a baseline and over state-of-the-art

future work› incorporate additional properties of the

blogspace into the retrieval model› sentiment analysis methods with language

properties typical to blogs