Center for E-Business Technology, Seoul National University

Seoul, Korea

BrowseRank: Letting Web Users Vote for Page Importance

Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li

SIGIR 2008

2009. 04.10.

Summarized & presented by Babar Tareen, IDS Lab., Seoul National University

Copyright 2008 by CEBT

Introduction

Page importance is a key factor for web search

Currently, page importance is measured using the link graph

HITS

PageRank

If many important pages link to a page then the page is also likely to be important


PageRank Drawbacks

Link graph is not reliable

Links can easily be created and deleted on the web

Can easily be manipulated by web spammers using link farms

PageRank does not consider the length of time a web surfer spends on a web page


BrowseRank

Utilizes the user browsing graph

Generated from user behavior data

Behavior data can be recorded by Internet browsers at web clients and collected at web servers

Behavior data includes

– URL

– Time

– Method of visiting (URL input or hyperlink click)


BrowseRank (2)

More visits to a page and longer time spent on it indicate that the page is important

Uses a continuous-time Markov process as the model on the user browsing graph

Markov process is a process in which the likelihood of a given future state, at any given moment, depends only on its present state, and not on any past states

[Figure: timeline showing past, present, and future states]
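Stated in symbols, with X_s denoting the page visited at time s, the Markov property can be written as below. This is the standard textbook formulation, added for illustration under that notational assumption; it is not quoted from the slides.

```latex
\[
  \Pr\bigl(X_{s+t} = j \mid X_s = i,\; X_u = x_u,\ 0 \le u < s\bigr)
  = \Pr\bigl(X_{s+t} = j \mid X_s = i\bigr), \qquad s, t \ge 0 .
\]
```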


Originality

Proposes the use of the user browsing graph for computing page importance

Proposes the use of a continuous-time Markov process to model a random walk on the user browsing graph


User Behavior Data

When a user surfs the web, the user can

Input a URL directly

Click on a hyperlink

Behavior data can be stored as triples

<URL, TIME, TYPE>


User Behavior Data (2)

Session Segmentation

Time Rule: If the time of the current record is 30 minutes or more after that of the previous record, the current record starts a new session

Type Rule: If the type of the record is ‘INPUT’, the record starts a new session

URL Pair construction

Within each session, the URLs of adjacent records are put together as pairs

Each pair indicates that the user moved from the first page to the second page
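The two segmentation rules and the pair construction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Record` class, the function names, the `"CLICK"` type label, the use of seconds as the time unit, and the choice to treat a gap of exactly 30 minutes as a new session are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

SESSION_GAP_SECONDS = 30 * 60  # time-rule threshold (30 minutes)


@dataclass
class Record:
    url: str
    time: float  # seconds; the unit is an assumption
    type: str    # "INPUT" or "CLICK" ("CLICK" is a hypothetical label)


def segment_sessions(records: List[Record]) -> List[List[Record]]:
    """Split one user's time-ordered records into sessions."""
    sessions: List[List[Record]] = []
    for rec in records:
        if (not sessions
                or rec.type == "INPUT"  # type rule
                or rec.time - sessions[-1][-1].time >= SESSION_GAP_SECONDS):  # time rule
            sessions.append([rec])
        else:
            sessions[-1].append(rec)
    return sessions


def url_pairs(session: List[Record]) -> List[Tuple[str, str]]:
    """Pair the URLs of adjacent records: (from_page, to_page)."""
    return [(a.url, b.url) for a, b in zip(session, session[1:])]
```

A session with a single record yields no pairs, so it contributes no transition to the graph.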


User Behavior Data (3)

Reset probability estimation

For sessions segmented by the type rule, the first URL was input directly by the user

Reset probabilities are assigned to those URLs

Staying time extraction

For each URL pair, use the time difference between the second and the first record as the staying time on the first page

For the last page in a session, use either a random time [time rule] or the time difference to the first record of the next session [type rule]
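The two extraction steps above might look like the sketch below. Several details are assumptions made only for illustration: records are `(url, time, type)` tuples for a single user, reset probabilities are taken as normalized counts of session-opening INPUT URLs, the "random time" is sampled from the staying times already observed inside sessions, and the final session (which has no successor) is treated like the time-rule case.

```python
import random
from collections import Counter, defaultdict

SESSION_GAP_SECONDS = 30 * 60  # same threshold as the time rule


def reset_probabilities(sessions):
    """Normalized frequency of URLs that open a session via URL input."""
    counts = Counter(s[0][0] for s in sessions if s[0][2] == "INPUT")
    total = sum(counts.values())
    return {url: c / total for url, c in counts.items()} if total else {}


def staying_times(sessions, rng=None):
    """Map each URL to its list of observed staying times."""
    rng = rng or random.Random(0)
    observed = defaultdict(list)
    # Pages followed by another page in the same session.
    for s in sessions:
        for (url, t, _), (_, t_next, _) in zip(s, s[1:]):
            observed[url].append(t_next - t)
    pool = [d for times in observed.values() for d in times]
    # Last page of each session.
    for i, s in enumerate(sessions):
        url, t, _ = s[-1]
        nxt = sessions[i + 1] if i + 1 < len(sessions) else None
        by_type_rule = (nxt is not None and nxt[0][2] == "INPUT"
                        and nxt[0][1] - t < SESSION_GAP_SECONDS)
        if by_type_rule:
            observed[url].append(nxt[0][1] - t)      # gap to next session
        elif pool:
            observed[url].append(rng.choice(pool))   # sampled staying time
    return dict(observed)
```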


User Browsing Graph

Vertex: Represents a URL

Metadata: Reset Probabilities, Staying Time

Directed Edge: Represents Transition between pages

Edge Weight: Number of transitions

[Figure: example user browsing graph; vertices are URLs and edge labels are transition counts]
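One possible in-memory layout for such a graph is sketched below. The `Vertex` class and `build_browsing_graph` function are hypothetical names; the sketch only assumes that URL pairs, reset probabilities, and staying times have already been extracted.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Vertex:
    reset_probability: float = 0.0
    staying_times: list = field(default_factory=list)


def build_browsing_graph(url_pairs, reset_probs, staying_times):
    """Return (vertices, edges); edges maps (src, dst) to a transition count."""
    edges = Counter(url_pairs)  # edge weight = number of transitions
    vertices = {}
    for src, dst in edges:
        vertices.setdefault(src, Vertex())
        vertices.setdefault(dst, Vertex())
    for url, p in reset_probs.items():
        vertices.setdefault(url, Vertex()).reset_probability = p
    for url, times in staying_times.items():
        vertices.setdefault(url, Vertex()).staying_times.extend(times)
    return vertices, edges
```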


Model

Continuous-time time-homogeneous Markov Process model

Assumptions

Independence of users and sessions

Markov Property

Time-homogeneity
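For reference, time-homogeneity is commonly written as below, with X_s denoting the page visited at time s (a notational assumption here): the probability of moving from page i to page j depends only on the elapsed time t, not on the starting time s. This is the standard formulation, not a quotation from the slides.

```latex
\[
  \Pr\bigl(X_{s+t} = j \mid X_s = i\bigr) = \Pr\bigl(X_t = j \mid X_0 = i\bigr)
  \qquad \text{for all } s \ge 0 .
\]
```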


Continuous-time Markov Model

$X_s$ represents the page which the surfer is visiting at time s, s > 0

The process $X = \{X_s, s \ge 0\}$ is a continuous-time time-homogeneous Markov process

$P_{ij}(t)$ denotes the transition probability from page i to page j for time interval t

The stationary probability distribution $\pi$ is unique and independent of t

Computing the matrix $P(t)$, $t \ge 0$, is difficult because it is hard to get information for all time intervals

The algorithm is based on the transition rate matrix $Q = P'(0)$
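A common way to obtain the stationary distribution of a continuous-time Markov chain without computing P(t) goes through the embedded (jump) chain: find the jump chain's stationary distribution, weight each page by its mean staying time, and renormalize. The sketch below illustrates that standard technique only. It is not claimed to be the paper's exact procedure, and the function names, the smoothing factor `alpha=0.85`, and the use of reset probabilities for smoothing are assumptions.

```python
def embedded_chain(counts, reset, alpha=0.85):
    """Row-stochastic jump-chain matrix from transition counts.

    Observed transition frequencies are mixed with the reset
    probabilities; pages with no observed out-transitions fall back
    to the reset distribution.
    """
    n = len(counts)
    matrix = []
    for i in range(n):
        total = sum(counts[i])
        row = []
        for j in range(n):
            observed = counts[i][j] / total if total else reset[j]
            row.append(alpha * observed + (1 - alpha) * reset[j])
        matrix.append(row)
    return matrix


def stationary(matrix, tol=1e-12, max_iter=10_000):
    """Power iteration for the stationary distribution of the jump chain."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


def ctmc_stationary(counts, reset, mean_stay, alpha=0.85):
    """Weight jump-chain probabilities by mean staying time and normalize."""
    jump_pi = stationary(embedded_chain(counts, reset, alpha))
    weighted = [p * t for p, t in zip(jump_pi, mean_stay)]
    z = sum(weighted)
    return [w / z for w in weighted]
```

In terms of Q, the staying time on page i is exponentially distributed with rate -q_ii, so its mean is -1/q_ii; that mean is the role `mean_stay[i]` plays here. With two pages visited equally often but mean stays of 1 and 3 time units, the second page receives three quarters of the stationary mass.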


Algorithm


Experiments

Website-Level BrowseRank

Finding important websites and demoting spam sites

Page-Level BrowseRank

Improving relevance ranking

Dataset

3 billion records

950 million unique URLs

Website Level Graph

– 5.6 million vertices

– 53 million edges

– 40 million websites


Top-20 Websites


Spam fighting

2714 websites labeled spam by human experts


Page Level Testing

Adopted 3 measures to evaluate performance

MAP

Precision (P@n)

Normalized Discounted Cumulative Gain (NDCG@n)
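For readers unfamiliar with these measures, a minimal sketch of how they are commonly computed follows. The gain `2^grade - 1` and the `log2(rank + 1)` discount are one widespread NDCG variant and are an assumption here, as is normalizing average precision by the number of relevant results found in the ranked list; the slides do not specify the exact definitions used.

```python
import math


def precision_at_n(relevant, n):
    """relevant: 0/1 relevance flags in ranked order."""
    return sum(relevant[:n]) / n


def average_precision(relevant):
    """Mean of precision values at the ranks of relevant results."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    """MAP: average precision averaged over queries."""
    return sum(average_precision(r) for r in runs) / len(runs)


def dcg_at_n(grades, n):
    """Discounted cumulative gain over the top n graded results."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades[:n], start=1))


def ndcg_at_n(grades, n):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_n(sorted(grades, reverse=True), n)
    return dcg_at_n(grades, n) / ideal if ideal else 0.0
```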


Results (1)


Results (2)


Technical Issues

User behavior data tends to be sparse

User behavior data can lead to reliable importance calculation for the head web pages, but not for the tail web pages

Time homogeneity assumption is mainly for technical convenience

Content information and metadata were not used in BrowseRank


Discussion

A better approach to finding page importance

The paper already highlights its own technical issues

Spammers can alter BrowseRank by sending fake user behavior data. This would also be easy, since behavior data is collected from the client.
