ranking web pages
DESCRIPTION
TRANSCRIPT
Web Search: Ranking Web Pages
Statistics
• In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web accessible documents.
• Up to 09/2005 Google indexes 8,200,000,000 web pages.
Search Engines
• A search engine is a system which collects, organizes & presents a way to select Web documents based on certain words, phrases, or patterns within documents– Model the Web as a full-text DB
– Index a portion of the Web docs– Search Web documents using user-specified
words/patterns in a text
Search Engines
• Two categories of search engine– general-purpose search engine, e.g. Yahoo!,
AltaVista and Google– special-purpose search engines (or Internet
Portals), e.g. LinuxStart (www.linuxstart.com)
Search Engines
• Two main components of a search engine:– web crawler (spider), which collects massive Web
pages.– large database, which stores and indexes collected
Web pages.
• Ranking has to be performed without accessing the text, just the index
• Ranking algorithms: all information is “top secret;” it is almost impossible to measure recall as the number of relevant pages can be quite large for simple queries
Search Engine
• Information Retrieval (IR) is a key to search engine or Web Search.
• Most commonly – used models: – Boolean Model – Vector Space Model (VSM) – Probability Model – their variations
Search Engine• The Google is a representative general-purpose search
engine (successful, takes many techniques such as use of both web page link structure and text content, which fully takes advantage of the web page features).
• The PageRank in Google is defined as follow: Assume page A has pages P1...Pn which point to it. The
parameter d is a damping factor which can be set between 0 and 1. Also C(Pi) is defined as the number of links going out of page Pi. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(P1)/C(P1) + ... + PR(Pn)/C(Pn))
Usually the parameter d is set to 0.85. PageRank or PR(A) can be calculated using a simple iterative algorithm.
• Other features: anchor text processing, location information management and various data structures, which fully make use of the features of the web.
Ranking result pages
• Based on content– Based on content– Number of occurrences of the search terms
• Based on link structure– Backlink count– PageRank
• And more. (http://www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/)
Problems with content-based ranking
• Many pages containing search terms may be of poor quality or irrelevant– Example: a page with just a line “search engine”.
• Many high-quality or relevant pages do not even contain the search terms– Example: Google homepage
• Page containing more occurrences of the search terms are ranked higher; spamming is easy– Example: a page with line “search engine” Repeated
many times
Based on link structure
• Hyperlinks among web pages provide new web search opportunities.
• Our focus in this lecture– PageRank– HITS
Backlink
• A backlink of a page p is a link that points to p• A page with more backlinks is ranked higher• Intuition: Each backlink is a “vote” for the page’s
importance
• Based on local link structure; still easy to spam– Create lots of pages that point to a particular page
PageRank
• Page et al., "The PageRank Citation Ranking: Brining Order to the Web.", 1998
• Main idea: Pages pointed by high-ranking pages are ranked higher
• Definition is recursive by design
• Based on global link structure; hard to spam
PageRank
• Web can be viewed as a huge directed graph G(V, E)– where V is the set of web pages (vertices) and E is the set of
hyperlinks (directed edges).
• Each page may have a number of outgoing edges (forward links) and a number of incoming links (backlinks).
• Each backlink of a page represents a citation to the page.
• PageRank is a measure of global web page importance based on the backlinks of web pages.
Basic ideas• A page is likely to be important if it is linked by
many pages.• A page is likely to be important If it is linked to by
important pages– even though it may not necessarily be linked by a large
number of pages.
• The importance of a page is divided evenly and propagated to the pages pointed to by it.
8
4
4
PageRank Definition
• N(p): number of outgoing links from page p• B(p): set of pages that point to p• PageRank(p) = Σq∈B(p) (PageRank(q) / N(q))
• Intuition– Each page q evenly distributes its importance to all
pages that q points to– Each page p gets a boost of its importance from each
page that points to p
Computing PageRank
• Create a stochastic matrix M for the link structure– Each page i corresponds to row i and column i– If page j has n outgoing links, then the M(i, j) = 1 Ú n if page j
• Solve by setting all PageRank.s to 1 initially, and thenapplying M repeatedly until the values converge
)PageRank(p
...
)PageRank(p
)PageRank(p
)PageRank(p
...
)PageRank(p
)PageRank(p
m
2
1
m
2
1
M
Convergence
• If the ranks converge, i.e., there is a rank vector R such that R = M R, R is the eigenvector of matrix M with eigenvalue being 1.
• Convergence is guaranteed only if– M is aperiodic (the Web graph is not a big
cycle). This is practically guaranteed for Web.
– M is irreducible (the Web graph is strongly connected). This is usually not true.
Example
p1
p2 p3
r1r3r2
0.5 0 0.50 0 0.5 0.5 1 0=
r1r3r2
2.1
6.0
2.1
,...,
0625.1
6875.0
25.1
,
1
75.0
25.1
,
5.1
5.0
1
,
1
1
1
2
3
1
r
r
r
The sum of all PageRank’s is the total number of pages
Random surfer model
• A random surfer– Starts with a random page– Randomly selects a link on the page to visit
next– Never uses the .back. button
• PageRank(p) measures the probability that a– random surfer visits page p
Rank leak
p1
p2 p3
r1r3r2
0.5 0 0.50 0 0.5 0.5 0 0=
r1r3r2
0
0
0
,...,
3125.0
1875.0
5.0
,
375.0
25.0
625.0
,
5.0
25.0
75.0
,
5.0
5.0
1
,
1
1
1
2
3
1
r
r
r
• Pages with no outgoing links• Called a rank leak because eventually all importance will leak. out of the Web
Rank sink
• A page or a group of pages is a rank sink if they can receive rank propagation from its parents but cannot propagate rank to other pages.
• Causes the loss of total ranks.
• Eventually accumulate all importance of the Web
An example of rank sink
p1
p2 p3
r1r3r2
0.5 0 0.50 1 0.5 0.5 0 0=
r1r3r2
0
3
0
,...,
3125.0
1875.2
5.0
,
375.0
2
625.0
,
5.0
75.1
75.0
,
5.0
5.1
1
,
1
1
1
2
3
1
r
r
r
Solution
• d: damping factor• PageRank(p) =d · Σq∈B(p)
(PageRank(q) /N(q)) + (1 -d )• Intuition in the random surfer model
– A surfer occasionally gets bored and jump to a random page on the Web instead of following a random link
• Another intuition– Make the graph fully connected
An example
p1
p2 p3
11/5
11/21
11/7
,...,
2
3
1
r
r
r
r1r3r2
0.5 0 0.50 1 0.5 0.5 0 0
=
r1r3r2
+0.20.20.2
the sum of all PageRank’s is still the total number of pages.