Center for E-Business Technology, Seoul National University
Seoul, Korea
BrowseRank: letting the web users vote for page importance
Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li
SIGIR 2008
2009.04.10
Summarized & presented by Babar Tareen, IDS Lab., Seoul National University
Copyright 2008 by CEBT
Introduction
Page importance is a key factor for web search
Currently page importance is measured by using the link graph
HITS
PageRank
If many important pages link to a page then the page is also likely to be important
PageRank Drawbacks
Link graph is not reliable
Links can easily be created and deleted on the web
Can easily be manipulated by web spammers using link farms
PageRank does not consider the length of time a web surfer spends on a web page
BrowseRank
Utilize user browsing graph
Generated from user behavior data
Behavior data can be recorded by Internet browsers at web clients and collected at web servers
Behavior data includes
– URL
– Time
– Method of visiting (URL input or hyperlink click)
BrowseRank (2)
More visits to a page and longer time spent on it indicate that the page is important
Uses a continuous-time Markov process as the model on the user browsing graph
Markov process is a process in which the likelihood of a given future state, at any given moment, depends only on its present state, and not on any past states
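Written as a formula, with $X_s$ denoting the page visited at time $s$ (the notation used on the model slides) and $x_u$ standing for an arbitrary past trajectory (a symbol assumed here for illustration), the Markov property reads:

```latex
P(X_{t+s} = j \mid X_t = i,\ X_u = x_u \text{ for } 0 \le u < t) = P(X_{t+s} = j \mid X_t = i)
```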
[Figure: Past, Present, Future]
Originality
Propose the use of browsing graph for computing page importance
Propose the use of continuous-time Markov process to model a random walk on the user browsing graph
User Behavior Data
When surfing the web, a user can
Input a URL
Click on a hyperlink
Behavior data can be stored as triples
<URL, TIME, TYPE>
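A minimal sketch of how such a triple might be represented in code. The class name `BehaviorRecord`, the `"CLICK"` label, and the example URLs are assumptions made for illustration; the slides only name the three fields and the `INPUT` type.

```python
from datetime import datetime
from typing import NamedTuple


class BehaviorRecord(NamedTuple):
    """One <URL, TIME, TYPE> triple from a browsing log."""
    url: str
    time: datetime
    type: str  # "INPUT" (URL typed by the user) or "CLICK" (hyperlink click, assumed label)


records = [
    BehaviorRecord("http://a.example/", datetime(2009, 4, 10, 9, 0, 0), "INPUT"),
    BehaviorRecord("http://a.example/news", datetime(2009, 4, 10, 9, 2, 30), "CLICK"),
]

# Seconds between the two records of the same user.
gap_seconds = (records[1].time - records[0].time).total_seconds()
```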
User Behavior Data (2)
Session Segmentation
Time Rule: If the current record is more than 30 minutes later than the previous record, the current record starts a new session
Type Rule: If the type of the record is ‘INPUT’, the record starts a new session
URL Pair construction
Within each session, the URLs of each pair of adjacent records form a URL pair
A URL pair indicates that the user transits from the first page to the second page
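The two segmentation rules and the URL pair construction can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: records are plain `(url, time, type)` tuples for a single user sorted by time, the function names are hypothetical, the 30-minute comparison is taken as strict, and a log that starts with a non-`INPUT` record is treated as opening a time-rule session.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # threshold from the time rule


def segment_sessions(records):
    """Split one user's time-ordered (url, time, type) records into sessions.

    Returns a list of (rule, records) pairs, where rule is the rule that
    opened the session: "type" for an INPUT record, "time" otherwise.
    """
    sessions = []
    prev_time = None
    for url, time, rtype in records:
        if rtype == "INPUT":
            sessions.append(("type", []))  # type rule
        elif prev_time is None or time - prev_time > SESSION_GAP:
            sessions.append(("time", []))  # time rule (strict ">" is an assumption)
        sessions[-1][1].append((url, time, rtype))
        prev_time = time
    return sessions


def url_pairs(session_records):
    """Pair the URLs of adjacent records within one session."""
    urls = [url for url, _, _ in session_records]
    return list(zip(urls, urls[1:]))


start = datetime(2009, 4, 10, 9, 0)
log = [
    ("a", start, "INPUT"),
    ("b", start + timedelta(minutes=5), "CLICK"),
    ("c", start + timedelta(minutes=60), "CLICK"),  # 55-minute gap: new session
    ("d", start + timedelta(minutes=61), "INPUT"),  # typed URL: new session
]
sessions = segment_sessions(log)
pairs = [url_pairs(recs) for _, recs in sessions]
```

Keeping the opening rule next to each session makes it possible to tell later which session starts came from typed URLs.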
User Behavior Data (3)
Reset probability estimation
For sessions segmented by the type rule, the first URL is input by the user
Reset probabilities are assigned to those URLs
Staying time extraction
For each URL pair, use the time difference between the second and first page as the staying time on the first page
For the last page in a session, either use a random time [time rule] or the time difference to the first page of the next session [type rule]
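A rough sketch of both estimation steps, under several assumptions that are not stated on the slides: sessions are `(rule, records)` pairs in time order for one user, where `rule` names the rule that opened the session; times are plain seconds; reset probabilities are taken as normalized first-URL frequencies; the "random time" is drawn from the staying times already observed; and the final page of the log is handled like a time-rule boundary. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def reset_probabilities(sessions):
    """Normalized frequency with which each URL opens a type-rule session."""
    firsts = Counter(recs[0][0] for rule, recs in sessions if rule == "type")
    total = sum(firsts.values())
    return {url: n / total for url, n in firsts.items()}


def staying_times(sessions, rng=random):
    """Map each URL to a list of staying times in seconds."""
    observed = defaultdict(list)
    to_sample = []  # last pages that need a randomly drawn staying time
    for idx, (_, recs) in enumerate(sessions):
        # Inside a session: time difference between adjacent records.
        for (url, t, _), (_, t_next, _) in zip(recs, recs[1:]):
            observed[url].append(t_next - t)
        last_url, last_t, _ = recs[-1]
        nxt = sessions[idx + 1] if idx + 1 < len(sessions) else None
        if nxt is not None and nxt[0] == "type":
            # Boundary made by the type rule: gap to the next session's first page.
            observed[last_url].append(nxt[1][0][1] - last_t)
        else:
            # Boundary made by the time rule, or end of the log (assumption).
            to_sample.append(last_url)
    pool = [t for times in observed.values() for t in times]
    for url in to_sample:
        if pool:
            observed[url].append(rng.choice(pool))
    return dict(observed)


sessions = [
    ("type", [("a", 0, "INPUT"), ("b", 60, "CLICK")]),
    ("type", [("a", 100, "INPUT")]),
    ("time", [("c", 5000, "CLICK")]),
]
reset = reset_probabilities(sessions)
stay = staying_times(sessions)
```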
User Browsing Graph
Vertex: Represents a URL
Metadata: Reset Probabilities, Staying Time
Directed Edge: Represents a transition between pages
Edge Weight: Number of transitions
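As an illustration, the vertices and weighted edges could be accumulated from sessions like this. The function name and the toy sessions (lists of URLs) are assumptions; vertex metadata is left out to keep the sketch short.

```python
from collections import Counter


def build_browsing_graph(sessions):
    """Return (vertices, edges), where edges maps (src, dst) to a transition count."""
    vertices = set()
    edges = Counter()
    for urls in sessions:
        vertices.update(urls)
        # Each pair of adjacent URLs in a session is one observed transition.
        edges.update(zip(urls, urls[1:]))
    return vertices, edges


sessions = [["a", "b", "c"], ["a", "b"], ["c", "a"]]
vertices, edges = build_browsing_graph(sessions)
```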
[Figure: example user browsing graph with numeric edge weights]
Model
Continuous-time time-homogeneous Markov Process model
Assumptions
Independence of users and sessions
Markov Property
Time-homogeneity
Continuous-time Markov Model
Continuous-time time-homogeneous Markov process $X = \{X_s, s \ge 0\}$
$X_s$ represents the page which the surfer is visiting at time $s$
$P_{ij}(t)$ denotes the transition probability from page $i$ to page $j$ for time interval $t$, giving the transition probability matrix $P(t)$, $t \ge 0$
Stationary probability distribution $\pi$ is unique and independent of $t$
Computing the matrix $P(t)$ is difficult because it is hard to get information for all time intervals
Algorithm is based on the transition rate matrix $Q = P'(0)$
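A minimal sketch of how a ranking could be computed from a browsing graph, not necessarily the authors' exact algorithm. It relies on a standard result for continuous-time Markov processes: the stationary probability of a page is proportional to the stationary probability of the embedded (jump) chain multiplied by the mean staying time on that page, where the mean staying time is $-1/q_{ii}$ for the diagonal entry $q_{ii}$ of the rate matrix $Q = P'(0)$. The function names, the damping factor `alpha = 0.85`, and the toy data are hypothetical.

```python
def stationary_embedded(vertices, edges, reset, alpha=0.85, iters=100):
    """Power iteration on the embedded (jump) chain with reset probabilities.

    edges maps (src, dst) to a transition count; reset must sum to 1.
    """
    out_total = {v: 0 for v in vertices}
    for (src, _), w in edges.items():
        out_total[src] += w
    pi = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iters):
        nxt = {v: 0.0 for v in vertices}
        for (src, dst), w in edges.items():
            nxt[dst] += alpha * pi[src] * w / out_total[src]
        # Mass that resets (1 - alpha) plus mass on pages without out-edges.
        leaked = 1.0 - sum(nxt.values())
        for v in vertices:
            nxt[v] += leaked * reset.get(v, 0.0)
        pi = nxt
    return pi


def browserank(vertices, edges, reset, mean_stay, alpha=0.85):
    """Weight the embedded-chain distribution by mean staying time, then normalize."""
    pi_tilde = stationary_embedded(vertices, edges, reset, alpha)
    weighted = {v: pi_tilde[v] * mean_stay[v] for v in vertices}
    total = sum(weighted.values())
    return {v: s / total for v, s in weighted.items()}


vertices = {"a", "b"}
edges = {("a", "b"): 1, ("b", "a"): 1}
reset = {"a": 0.5, "b": 0.5}
mean_stay = {"a": 10.0, "b": 30.0}  # seconds
scores = browserank(vertices, edges, reset, mean_stay)
```

In this toy graph both pages are visited equally often, so the page with the longer staying time ends up with the higher score.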
Experiments
Website-Level BrowseRank
Finding important websites and demoting spam sites
Page-Level BrowseRank
Improving relevance ranking
Dataset
3 billion records
950 million unique URLs
Website Level Graph
– 5.6 million vertices
– 53 million edges
– 40 million websites
Page Level Testing
Adopted 3 measures to evaluate performance
Mean Average Precision (MAP)
Precision (P@n)
Normalized Discounted Cumulative Gain (NDCG@n)
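For reference, the three measures can be sketched as below, using common textbook definitions rather than the exact evaluation setup of the experiments. Assumptions: `rels` is a list of 0/1 relevance labels in ranked order, `gains` is a list of graded labels in ranked order, average precision is normalized by the number of relevant results in the list, and NDCG uses the `2**g - 1` gain with a `log2` discount. MAP is the mean of the average precision over all queries.

```python
import math


def precision_at_n(rels, n):
    """Fraction of the top-n results that are relevant."""
    return sum(rels[:n]) / n


def average_precision(rels):
    """Mean of P@k over the ranks k at which a relevant result appears."""
    hits = [precision_at_n(rels, k) for k, r in enumerate(rels, start=1) if r]
    return sum(hits) / len(hits) if hits else 0.0


def mean_average_precision(rels_per_query):
    """MAP: average precision averaged over queries."""
    return sum(average_precision(r) for r in rels_per_query) / len(rels_per_query)


def ndcg_at_n(gains, n):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(i + 1)
                   for i, g in enumerate(gs[:n], start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


p2 = precision_at_n([1, 0, 1, 0], 2)
ap = average_precision([1, 0, 1, 0])
ndcg = ndcg_at_n([0, 1], 2)
```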
Technical Issues
User behavior data tends to be sparse
User behavior data can lead to reliable importance calculation for the head web pages, but not for the tail web pages
Time homogeneity assumption is mainly for technical convenience
Content information and metadata were not used in BrowseRank