ranking web pages

24
Web Search: Ranking Web Pages

Upload: elliando-dias

Post on 28-Jan-2015

112 views

Category:

Technology


2 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Ranking Web Pages

Web Search: Ranking Web Pages

Page 2: Ranking Web Pages

Statistics

• In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web accessible documents.

• Up to 09/2005 Google indexes 8,200,000,000 web pages.

Page 3: Ranking Web Pages

Search Engines

• A search engine is a system which collects, organizes & presents a way to select Web documents based on certain words, phrases, or patterns within documents– Model the Web as a full-text DB

– Index a portion of the Web docs– Search Web documents using user-specified

words/patterns in a text

Page 4: Ranking Web Pages

Search Engines

• Two categories of search engine– general-purpose search engine, e.g. Yahoo!,

AltaVista and Google– special-purpose search engines (or Internet

Portals), e.g. LinuxStart (www.linuxstart.com)

Page 5: Ranking Web Pages

Search Engines

• Two main components of a search engine:– web crawler (spider), which collects massive Web

pages.– large database, which stores and indexes collected

Web pages.

• Ranking has to be performed without accessing the text, just the index

• Ranking algorithms: all information is “top secret;” it is almost impossible to measure recall as the number of relevant pages can be quite large for simple queries

Page 6: Ranking Web Pages

Search Engine

• Information Retrieval (IR) is a key to search engine or Web Search.

• Most commonly – used models: – Boolean Model – Vector Space Model (VSM) – Probability Model – their variations

Page 7: Ranking Web Pages

Search Engine• The Google is a representative general-purpose search

engine (successful, takes many techniques such as use of both web page link structure and text content, which fully takes advantage of the web page features).

• The PageRank in Google is defined as follow: Assume page A has pages P1...Pn which point to it. The

parameter d is a damping factor which can be set between 0 and 1. Also C(Pi) is defined as the number of links going out of page Pi. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(P1)/C(P1) + ... + PR(Pn)/C(Pn))

Usually the parameter d is set to 0.85. PageRank or PR(A) can be calculated using a simple iterative algorithm.

• Other features: anchor text processing, location information management and various data structures, which fully make use of the features of the web.

Page 8: Ranking Web Pages

Ranking result pages

• Based on content– Based on content– Number of occurrences of the search terms

• Based on link structure– Backlink count– PageRank

• And more. (http://www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/)

Page 9: Ranking Web Pages

Problems with content-based ranking

• Many pages containing search terms may be of poor quality or irrelevant– Example: a page with just a line “search engine”.

• Many high-quality or relevant pages do not even contain the search terms– Example: Google homepage

• Page containing more occurrences of the search terms are ranked higher; spamming is easy– Example: a page with line “search engine” Repeated

many times

Page 10: Ranking Web Pages

Based on link structure

• Hyperlinks among web pages provide new web search opportunities.

• Our focus in this lecture– PageRank– HITS

Page 11: Ranking Web Pages

Backlink

• A backlink of a page p is a link that points to p• A page with more backlinks is ranked higher• Intuition: Each backlink is a “vote” for the page’s

importance

• Based on local link structure; still easy to spam– Create lots of pages that point to a particular page

Page 12: Ranking Web Pages

PageRank

• Page et al., "The PageRank Citation Ranking: Brining Order to the Web.", 1998

• Main idea: Pages pointed by high-ranking pages are ranked higher

• Definition is recursive by design

• Based on global link structure; hard to spam

Page 13: Ranking Web Pages

PageRank

• Web can be viewed as a huge directed graph G(V, E)– where V is the set of web pages (vertices) and E is the set of

hyperlinks (directed edges).

• Each page may have a number of outgoing edges (forward links) and a number of incoming links (backlinks).

• Each backlink of a page represents a citation to the page.

• PageRank is a measure of global web page importance based on the backlinks of web pages.

Page 14: Ranking Web Pages

Basic ideas• A page is likely to be important if it is linked by

many pages.• A page is likely to be important If it is linked to by

important pages– even though it may not necessarily be linked by a large

number of pages.

• The importance of a page is divided evenly and propagated to the pages pointed to by it.

8

4

4

Page 15: Ranking Web Pages

PageRank Definition

• N(p): number of outgoing links from page p• B(p): set of pages that point to p• PageRank(p) = Σq∈B(p) (PageRank(q) / N(q))

• Intuition– Each page q evenly distributes its importance to all

pages that q points to– Each page p gets a boost of its importance from each

page that points to p

Page 16: Ranking Web Pages

Computing PageRank

• Create a stochastic matrix M for the link structure– Each page i corresponds to row i and column i– If page j has n outgoing links, then the M(i, j) = 1 Ú n if page j

• Solve by setting all PageRank.s to 1 initially, and thenapplying M repeatedly until the values converge

)PageRank(p

...

)PageRank(p

)PageRank(p

)PageRank(p

...

)PageRank(p

)PageRank(p

m

2

1

m

2

1

M

Page 17: Ranking Web Pages

Convergence

• If the ranks converge, i.e., there is a rank vector R such that R = M R, R is the eigenvector of matrix M with eigenvalue being 1.

• Convergence is guaranteed only if– M is aperiodic (the Web graph is not a big

cycle). This is practically guaranteed for Web.

– M is irreducible (the Web graph is strongly connected). This is usually not true.

Page 18: Ranking Web Pages

Example

p1

p2 p3

r1r3r2

0.5 0 0.50 0 0.5 0.5 1 0=

r1r3r2

2.1

6.0

2.1

,...,

0625.1

6875.0

25.1

,

1

75.0

25.1

,

5.1

5.0

1

,

1

1

1

2

3

1

r

r

r

The sum of all PageRank’s is the total number of pages

Page 19: Ranking Web Pages

Random surfer model

• A random surfer– Starts with a random page– Randomly selects a link on the page to visit

next– Never uses the .back. button

• PageRank(p) measures the probability that a– random surfer visits page p

Page 20: Ranking Web Pages

Rank leak

p1

p2 p3

r1r3r2

0.5 0 0.50 0 0.5 0.5 0 0=

r1r3r2

0

0

0

,...,

3125.0

1875.0

5.0

,

375.0

25.0

625.0

,

5.0

25.0

75.0

,

5.0

5.0

1

,

1

1

1

2

3

1

r

r

r

• Pages with no outgoing links• Called a rank leak because eventually all importance will leak. out of the Web

Page 21: Ranking Web Pages

Rank sink

• A page or a group of pages is a rank sink if they can receive rank propagation from its parents but cannot propagate rank to other pages.

• Causes the loss of total ranks.

• Eventually accumulate all importance of the Web

Page 22: Ranking Web Pages

An example of rank sink

p1

p2 p3

r1r3r2

0.5 0 0.50 1 0.5 0.5 0 0=

r1r3r2

0

3

0

,...,

3125.0

1875.2

5.0

,

375.0

2

625.0

,

5.0

75.1

75.0

,

5.0

5.1

1

,

1

1

1

2

3

1

r

r

r

Page 23: Ranking Web Pages

Solution

• d: damping factor• PageRank(p) =d · Σq∈B(p)

(PageRank(q) /N(q)) + (1 -d )• Intuition in the random surfer model

– A surfer occasionally gets bored and jump to a random page on the Web instead of following a random link

• Another intuition– Make the graph fully connected

Page 24: Ranking Web Pages

An example

p1

p2 p3

11/5

11/21

11/7

,...,

2

3

1

r

r

r

r1r3r2

0.5 0 0.50 1 0.5 0.5 0 0

=

r1r3r2

+0.20.20.2

the sum of all PageRank’s is still the total number of pages.