
The PageRank Citation Ranking: Bringing Order to the Web

Larry Page, Sergey Brin, Rajeev Motwani, Terry Winograd

January 29, 1998

Speaker: AMAN BAKSHI University of Southern California

“The Initiative's focus is to dramatically advance the means to collect, store, and organize information in digital forms, and make it available for searching, retrieval, and processing via communication networks -- all in user-friendly ways.”

--- quote from the DLII website

Behind the Wheels: Google Search

• When you search for a keyword (or keywords), you are not searching the live web.

• Instead, you are searching Google's index of the web.

• The index is built by spiders, which traverse hundreds of thousands of pages on the web. The pages that match a query are then ordered by PageRank so that the top ones are displayed first.

Introduction and Motivation

• The WWW is very large and heterogeneous.

• Web pages are extremely diverse in terms of content, quality and structure.

• This makes information retrieval on the WWW challenging.

• Most web pages link to other web pages as well.

• So, take advantage of the link structure of the Web to produce a ranking of every web page, known as PageRank.

The Mechanics

A Google bot visits your site periodically to assess two things:
1. The authority of your site
2. The relevance of your site

For relevance, it looks at the following:
1. On-page factors: it searches for keywords on your page, so put them in the title, head or body, and keep the content fresh.
2. Off-page factors: who is linking to your site. The value of these links is not linear; it is logarithmic.

Relevance is important. For example, a site called, say, "Baby Food" pointing to "Fish Fly" makes no sense. So have links pointing to you from relevant sites that are ranked high.

Theory and Analogy Behind

o We can relate it directly to the way a painter paints on a canvas. To get a specific color, the painter mixes different colors. The amount and intensity of each color in the mix ultimately governs the color of the final mixture, NOT the number of colors!

o Say a certain backlink came from Yahoo! and another came from an obscure home page.

o Think of the importance of the Yahoo! page as opposed to the importance of the obscure home page.

Backlinks (inedges): links that point to a certain page.

Forward links (outedges): links that emanate from that page.

We can never know whether we have found all the backlinks of a page, but once we have downloaded the page we know all of its forward links.
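The asymmetry between forward links and backlinks can be sketched in a few lines of Python. This is only an illustration: the page names and the `invert_links` helper are hypothetical, and the backlinks it finds are limited to the pages that happen to be in the crawl.

```python
from collections import defaultdict

def invert_links(forward_links):
    """Build a backlink map from a forward-link map of crawled pages."""
    backlinks = defaultdict(set)
    for page, targets in forward_links.items():
        backlinks[page]  # touch the key so every crawled page gets an entry
        for target in targets:
            backlinks[target].add(page)
    return dict(backlinks)

# Hypothetical three-page crawl: each page lists its forward links.
forward_links = {
    "yahoo": {"home", "news"},
    "home": {"news"},
    "news": {"yahoo"},
}
backlinks = invert_links(forward_links)
```

Forward links are read directly off each downloaded page, while the backlink sets can only grow as more pages are crawled.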

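The Formula

For any web page u, let F_u be the set of pages that u points to (its forward links), let B_u be the set of pages that point to u (its backlinks), and let N_u = |F_u| be the number of forward links from u.

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

R(u) = rank of page u; c = normalization constant.

Note: c < 1 to cover for pages with no outgoing links.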
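As a small worked example, the sketch below evaluates the simplified ranking sum, in which page u receives R(v)/N_v from every page v that links to it, scaled by c. The three-page graph, the candidate ranks and the helper name `rank_of` are assumptions made for illustration. Because this toy graph has no pages without outgoing links, c = 1 is used here.

```python
def rank_of(u, ranks, backlinks, out_degree, c=1.0):
    """Evaluate c * sum(R(v) / N_v) over the backlinks v of page u."""
    return c * sum(ranks[v] / out_degree[v] for v in backlinks[u])

# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
backlinks = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}   # B_u
out_degree = {"a": 2, "b": 1, "c": 1}                   # N_u
ranks = {"a": 0.4, "b": 0.2, "c": 0.4}                  # candidate ranks R

# Apply the formula once to every page.
updated = {u: rank_of(u, ranks, backlinks, out_degree) for u in ranks}
```

For this graph the candidate ranks reproduce themselves, so they are a consistent assignment: page c has two backlinks but ends up with the same rank as page a, whose single backlink carries all of c's rank.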

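Representation

A is a square matrix whose rows and columns correspond to web pages u and v, with A_{u,v} = 1/N_u if there is a link from u to v and A_{u,v} = 0 otherwise. Treating R as a vector over web pages, the formula becomes R = c A^T R.

A^T = [matrix shown on the slide; not preserved in this transcript]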

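Computing PageRank given a Directed Graph

The transition matrix A = [matrix shown on the slide; not preserved in this transcript]

We get the eigenvalue λ = 1.

Calculating the eigenvector [worked calculation shown on the slide; not preserved in this transcript]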
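The slide's matrices are not available, so the following is a minimal sketch on an assumed three-page graph, not the example from the talk. It builds a link matrix with entry 1/N_u for each link from u to v and uses power iteration (one common way to find the eigenvector for eigenvalue 1) on its transpose. The function names are hypothetical.

```python
def build_matrix(pages, forward_links):
    """A[i][j] = 1/N_i if page i links to page j, else 0."""
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    A = [[0.0] * n for _ in range(n)]
    for u, targets in forward_links.items():
        for v in targets:
            A[index[u]][index[v]] = 1.0 / len(targets)
    return A

def transpose_times(A, r):
    """Return the product of A transposed with the vector r."""
    n = len(r)
    return [sum(A[v][u] * r[v] for v in range(n)) for u in range(n)]

def power_iteration(A, iterations=200):
    """Repeatedly apply A^T, renormalizing so the ranks sum to 1."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = transpose_times(A, r)
        total = sum(r)
        r = [x / total for x in r]
    return r

# Assumed graph: a -> b, a -> c, b -> c, c -> a.
pages = ["a", "b", "c"]
forward_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
A = build_matrix(pages, forward_links)
R = power_iteration(A)
```

On this graph the iteration settles on ranks close to 0.4, 0.2 and 0.4 for a, b and c, and applying A^T to R leaves it unchanged, which is what an eigenvector with eigenvalue 1 means.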

Problems

Problem 1: Dangling Links

Dangling links are links that point to a page with no outgoing links, or to a page that has not been downloaded yet.

Problem: it is not clear how their weight should be distributed.

Solution: they are removed from the system until all the PageRanks are calculated. Afterwards, they are added back in without affecting things significantly.
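One possible way to carry out the removal step is sketched below. It assumes the crawl is stored as a dictionary of forward-link sets. The helper name `remove_dangling` and the example pages are hypothetical. The loop repeats because removing one dangling page can leave another page without outgoing links.

```python
def remove_dangling(forward_links):
    """Drop links to undownloaded pages and pages with no outgoing links."""
    graph = {page: set(targets) for page, targets in forward_links.items()}
    removed = []
    while True:
        known = set(graph)
        for page in graph:
            graph[page] &= known          # forget links to pages we do not have
        dangling = [page for page, targets in graph.items() if not targets]
        if not dangling:
            return graph, removed
        for page in dangling:             # removing these may expose new ones
            del graph[page]
            removed.append(page)

# Hypothetical crawl: "x" was never downloaded, "d" has no outgoing links.
crawl = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "x"},
    "d": set(),
}
core, removed = remove_dangling(crawl)
```

PageRank would then be computed on `core`, and the pages in `removed` added back afterwards.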

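Problems (contd.)

Problem 2: Rank Sink

Problem: some pages form a loop that has no links out to other pages. During iteration the loop accumulates rank but never distributes it (a rank sink).

Solution: Random Surfer Model. The surfer periodically jumps to a random page chosen according to some distribution E (the rank source).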

PageRank Expression

Let E(u) be some vector over the Web pages that corresponds to a source of rank. Then, the PageRank of a set of Web pages is an assignment, R', to the Web pages which satisfies

$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c E(u)$$

such that c is maximized and ||R'||_1 = 1 (where ||R'||_1 denotes the L1 norm of R').

R'(u): PageRank of document u
N_v: number of outlinks from document v
R'(v): PageRank of document v, which links to u
c: normalization factor
E(u): vector over web pages that the surfer randomly jumps to
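The following sketch iterates an expression of this form on a small graph that contains a rank sink. It is not the authors' implementation. The graph, the uniform source with total weight 0.15, the tolerance and the function name `pagerank_with_source` are all assumptions. Each round adds the source E(u) to the rank received over backlinks and then rescales so the ranks sum to 1, with the scale factor playing the role of c.

```python
def pagerank_with_source(forward_links, source, tol=1e-12, max_iter=1000):
    """Iterate R(u) = c * (sum of R(v)/N_v over backlinks + E(u)), with sum(R) = 1."""
    pages = list(forward_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    c = 1.0
    for _ in range(max_iter):
        raw = dict(source)                      # start from the rank source E(u)
        for v, targets in forward_links.items():
            for u in targets:
                raw[u] += rank[v] / len(targets)
        c = 1.0 / sum(raw.values())             # normalization factor
        new_rank = {p: c * raw[p] for p in pages}
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank, c

# Assumed graph: d -> a -> b, and b <-> c form a loop with no way out.
forward_links = {"a": ["b"], "b": ["c"], "c": ["b"], "d": ["a"]}
source = {p: 0.15 / len(forward_links) for p in forward_links}
rank, c = pagerank_with_source(forward_links, source)
```

Without the source term, all rank in this graph would drain into the b and c loop. With it, every page keeps a positive rank while the loop still ranks highest.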

Searching with PageRank

• Two search engines:
  – Title-based search engine
  – Full text search engine

• Title-based search engine
  – Searches only the “Titles”
  – Finds all the web pages whose titles contain all the query words
  – Sorts the results by PageRank
  – Very simple and cheap to implement
  – Title match ensures high precision, and PageRank ensures high quality

• Full text search engine
  – Called Google
  – Examines all the words in every stored document and also performs PageRank (Rank Merging)
  – More precise but more complicated
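The title-based engine described above is simple enough to sketch directly. The page identifiers, titles and PageRank values below are made up for illustration, and `title_search` is a hypothetical helper rather than code from the paper.

```python
def title_search(query, titles, pagerank):
    """Return pages whose title contains every query word, best PageRank first."""
    words = query.lower().split()
    hits = [
        url for url, title in titles.items()
        if all(word in title.lower().split() for word in words)
    ]
    return sorted(hits, key=lambda url: pagerank[url], reverse=True)

# Made-up index of page titles and precomputed PageRank values.
titles = {
    "page1": "Stanford University Libraries",
    "page2": "Stanford University Home Page",
    "page3": "A Fan Page About Stanford",
}
pagerank = {"page1": 0.3, "page2": 0.6, "page3": 0.1}

results = title_search("stanford university", titles, pagerank)
```

Here `results` lists page2 before page1, and page3 is left out because its title lacks one of the query words.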

Adaptive Measures for Computation of PageRank

This paper presents two contributions:

First, it shows that most pages in the web converge to their true PageRank quickly, while relatively few pages take much longer to converge. Further, slow-converging pages generally have high PageRank, and those pages that converge quickly generally have low PageRank.

Second, the authors develop two algorithms, called Adaptive PageRank and Modified Adaptive PageRank, that exploit this observation to speed up the computation of PageRank by 18% and 28%, respectively.

Observations

Unwanted Uses of PageRank

• bmw.de was banned from Google in early 2006 due to its doorway pages
  ~ a doorway page is a page stuffed full of keywords that the site feels a need to be optimized for
  blog: http://blog.outer-court.com/archive/2006-02-04-n60.html

• “Google Bomb” http://searchengineland.com/070125-230048.php
  – Create lots of links to one certain destination
  – Label all of them with the same remarkable terms
  – Query Google for those terms
  – You will get the linked page

Applications

Estimating Web Traffic
On analyzing the statistics, it was found that some sites have very high usage but low PageRank, e.g. links to pirated software.

PageRank as Backlink Predictor
The goal is to crawl the pages in as close to the optimal order as possible, i.e. in the order of their rank according to an evaluation function. PageRank is a better predictor than citation counting.

User Navigation: The PageRank Proxy
The user receives some information about a link before they click on it. This proxy can help users decide which links are more likely to be interesting.

“If an SEO creates deceptive or misleading content on your behalf, such as doorway pages or ’throwaway’ domains, your site could be removed entirely from Google’s index.” ---- unknown at Google

PageRank is ONLY for the page. There is no such thing as a domain rank.