TRANSCRIPT
The PageRank Citation Ranking: Bringing Order to the Web
Larry Page, Sergey Brin, Rajeev Motwani, Terry Winograd
January 29, 1998
Speaker: AMAN BAKSHI University of Southern California
“The Initiative's focus is to dramatically advance the means to collect, store, and organize information in digital forms, and make it available for searching, retrieval, and processing via communication networks -- all in user-friendly ways”
--- quote from the DLII website
2
Behind the Wheel: Google Search
• When you search for a keyword, you do not search the web itself.
• Instead, you search Google's index of the web.
• The index is built by spiders that traverse hundreds of thousands of pages on the web. PageRank is then used to decide which matching results are displayed at the top.
3
Introduction and Motivation
• The WWW is very large and heterogeneous.
• Web pages are extremely diverse in terms of content, quality, and structure.
• This makes information retrieval on the WWW challenging.
• Most web pages link to other web pages as well.
• So, take advantage of the link structure of the Web to produce a ranking of every web page, known as PageRank.
4
The Mechanics
A Googlebot visits periodically to check two things:
1. The authority of your site
2. The relevance of your site
For relevance, it looks at the following:
1. On-page factors: it searches for keywords on your page, so have them in the title, head, or body, and keep the content fresh.
2. Off-page factors: who is linking to your site. The value is not linear; it is logarithmic.
Relevance is important. For example, a site about, say, baby food pointing to a site about fly fishing makes no sense. So aim for links pointing from a site that is ranked high.
5
Theory and Analogy Behind
o We can relate it directly to the way a painter paints on a canvas. To get a specific color, the painter mixes different colors. The amount and intensity of each color mixed ultimately governs the color of the final mixture, NOT the number of colors!
o Say a certain backlink came from Yahoo! and another came from an obscure home page.
o Think of the importance of the Yahoo! page as opposed to the importance of the obscure home page.
Backlinks (inedges): links that point to a certain page.
Forward links (outedges): links that emanate from that page.
We can never know all the backlinks of a page, but we know all of its forward links.
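The asymmetry between forward links and backlinks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-page crawl and the helper name `backlinks_from_forward` are hypothetical. Forward links are read directly off each downloaded page, while backlinks can only be assembled from the pages crawled so far, so they are complete only relative to the crawl.

```python
from collections import defaultdict

def backlinks_from_forward(forward_links):
    """Invert a map {page: [pages it links to]} into {page: [pages linking to it]}."""
    back = defaultdict(set)
    for page, targets in forward_links.items():
        back[page]  # ensure every crawled page has an entry, even with no backlinks
        for target in targets:
            back[target].add(page)
    return {page: sorted(sources) for page, sources in back.items()}

# Hypothetical crawl: each key is a downloaded page, each value its forward links.
forward_links = {
    "yahoo": ["home", "news"],
    "home": ["yahoo"],
    "news": ["yahoo", "home"],
}
backlinks = backlinks_from_forward(forward_links)
print(backlinks)
```

Any page outside the crawl that links to "home" would be missing from `backlinks["home"]`, which is the sense in which the backlink set is never known to be complete.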
6
The Formula
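For any web page u, let F_u be the set of pages u points to (its forward links) and B_u the set of pages that point to u (its backlinks). Let N_u = |F_u| be the number of links from u. Then
$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
R(u) = rank of page u; c = normalization constant
Note: c < 1 to cover for pages with no outgoing links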
7
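Representation
A is a square matrix whose rows and columns correspond to web pages u and v, with A_{u,v} = 1/N_u if there is a link from u to v and A_{u,v} = 0 otherwise.
A^T = (matrix shown on the slide, not preserved here)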
8
Computing PageRank given a Directed Graph
The transition matrix A = (matrix shown on the slide, not preserved here)
We get the eigenvalue λ = 1
Calculating the eigenvector
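As a minimal sketch of the eigenvector computation, the Python below runs power iteration on a hypothetical three-page graph (A links to B and C, B links to C, C links to A). The graph, the function name `pagerank_power_iteration`, and the iteration count are assumptions for illustration and are not taken from the slides. Every page in this graph has an outgoing link, so no rank is lost and the vector settles on the eigenvector for eigenvalue 1.

```python
def pagerank_power_iteration(links, iterations=100):
    """Repeatedly apply R <- A^T R, where A[u][v] = 1/N_u if u links to v."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start from a uniform vector
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for u in pages:
            share = rank[u] / len(links[u])  # u splits its rank evenly over N_u links
            for v in links[u]:
                new_rank[v] += share
        rank = new_rank
    return rank

# Hypothetical directed graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank_power_iteration(links)
print(rank)  # approximately {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

Solving R = A^T R by hand for this graph gives R(A) = R(C) = 0.4 and R(B) = 0.2, which is what the iteration approaches.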
9
Problems
Problem 1: Dangling Links
Dangling links are links that point to a page with no outgoing links, or to a page that has not been downloaded yet.
Problem: it is not clear how their weight should be distributed.
Solution: they are removed from the system until all the PageRanks are calculated. Afterwards, they are added back in without affecting things significantly.
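The removal step can be sketched as follows. This is an illustration under assumptions, not the authors' code: the helper name `remove_dangling` and the sample graph are hypothetical, and the sketch repeats the removal until no dangling links remain, since removing one dangling page can leave another page without outgoing links.

```python
def remove_dangling(links):
    """Drop pages with no outgoing links (and links to them) until none remain."""
    links = {page: set(targets) for page, targets in links.items()}
    while True:
        known = set(links)
        # Links to pages that were never downloaded count as dangling too.
        for page in links:
            links[page] &= known
        dangling = {page for page, targets in links.items() if not targets}
        if not dangling:
            return links
        for page in dangling:
            del links[page]

# Hypothetical crawl: C has no outgoing links, and E was never downloaded.
crawl = {"A": ["B", "C"], "B": ["A", "D"], "C": [], "D": ["E"]}
core = remove_dangling(crawl)
print(core)  # only A and B remain, linking to each other
```

PageRank would then be computed on `core`, with the removed pages added back afterwards.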
10
Problems (contd.)
Problem 2: Rank Sink
Problem: some pages form a loop that accumulates rank but never distributes it to other pages (a rank sink).
Solution: Random Surfer Model. Jump to a random page based on some distribution E (the rank source).
11
PageRank Expression
Let E(u) be some vector over the Web pages that corresponds to a source of rank. Then, the PageRank of a set of Web pages is an assignment, R', to the Web pages which satisfies
$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$
such that c is maximized and ||R'||_1 = 1 (where ||R'||_1 denotes the L1 norm of R').
R'(u): PageRank of document u
R'(v): PageRank of document v that links to u
N_v: number of outlinks from document v
c: normalization factor
E(u): vector over web pages that the surfer randomly jumps to
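A minimal sketch of this expression in Python, under stated assumptions: E is taken to be uniform with an L1 norm of 0.15, the graph is a hypothetical one in which B and C form a loop fed by A, and the function name `pagerank_with_source` is illustrative. Each pass adds E(u) to the rank flowing in over backlinks and then rescales so that the L1 norm is 1; the rescaling plays the role of the factor c.

```python
def pagerank_with_source(links, source_total=0.15, iterations=100):
    """Iterate R'(u) = c * (sum over backlinks of R'(v)/N_v + E(u)), with ||R'||_1 = 1."""
    pages = sorted(links)
    # Assumption: E is uniform over all pages, with L1 norm `source_total`.
    e = {p: source_total / len(pages) for p in pages}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = dict(e)  # every page first receives its share of the rank source
        for v in pages:
            share = rank[v] / len(links[v])  # assumes each page has an outgoing link
            for u in links[v]:
                new_rank[u] += share
        total = sum(new_rank.values())  # dividing by this total applies the factor c
        rank = {p: r / total for p, r in new_rank.items()}
    return rank

# Hypothetical rank sink: B and C only link to each other, and A links into the loop.
links = {"A": ["B"], "B": ["C"], "C": ["B"]}
rank = pagerank_with_source(links)
print(rank)
```

Without the source term, all rank in this graph would drain into the B and C loop and A would end at zero. With it, A keeps a small positive rank while B and C still rank higher.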
12
Searching with Page Rank
• Two search engines:
  – Title-based search engine
  – Full-text search engine
• Title-based search engine
  – Searches only the titles
  – Finds all the web pages whose titles contain all the query words
  – Sorts the results by PageRank
  – Very simple and cheap to implement
  – Title match ensures high precision, and PageRank ensures high quality
• Full-text search engine
  – Called Google
  – Examines all the words in every stored document and also performs PageRank (rank merging)
  – More precise but more complicated
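The title-based engine described above is simple enough to sketch directly. The code below is an illustration only: the function name `title_search`, the example URLs, titles, and PageRank values are all hypothetical, and matching is done on lowercased whitespace-separated words.

```python
def title_search(query, titles, pagerank):
    """Return pages whose title contains every query word, highest PageRank first."""
    words = query.lower().split()
    matches = [
        url for url, title in titles.items()
        if all(word in title.lower().split() for word in words)
    ]
    return sorted(matches, key=lambda url: pagerank[url], reverse=True)

# Hypothetical index of page titles and precomputed PageRank values.
titles = {
    "a.example": "Stanford University",
    "b.example": "University of Southern California",
    "c.example": "Stanford Daily",
}
pagerank = {"a.example": 0.5, "b.example": 0.3, "c.example": 0.2}

results = title_search("university", titles, pagerank)
print(results)  # ['a.example', 'b.example']
```

The title match does the filtering and PageRank does the ordering, which is why this variant is cheap to implement.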
13
Adaptive Measures for Computation of PageRank
This paper presents two contributions:
First, it shows that most pages in the web converge to their true PageRank quickly, while relatively few pages take much longer to converge. Further, slow-converging pages generally have high PageRank, and pages that converge quickly generally have low PageRank.
Second, the authors develop two algorithms, called Adaptive PageRank and Modified Adaptive PageRank, that exploit this observation to speed up the computation of PageRank by 18% and 28%, respectively.
14
Unwanted Uses of PageRank
Observations
• bmw.de was banned from Google in early 2006 due to its doorway page
  ~ a doorway page is a page stuffed full of the keywords that the site feels a need to be optimized for
  blog: http://blog.outer-court.com/archive/2006-02-04-n60.html
• “Google Bomb” http://searchengineland.com/070125-230048.php
  – create lots of links to one certain destination
  – label all of them with the same remarkable terms
  – query Google for those terms
  – you will get the linked page
15
Applications
Estimating Web Traffic
On analyzing the statistics, it was found that some sites have very high usage but low PageRank, e.g., links to pirated software.
PageRank as Backlink Predictor
The goal is to crawl the pages in as close to the optimal order as possible, i.e., in the order of their rank according to an evaluation function. PageRank is a better predictor than citation counting.
User Navigation: The PageRank Proxy
The user receives some information about the link before they click on it. This proxy can help users decide which links are more likely to be interesting.
“If an SEO creates deceptive or misleading content on your behalf, such as doorway pages or ‘throwaway’ domains, your site could be removed entirely from Google's index.” ---- unknown at Google
PageRank applies ONLY to an individual page. There is no such thing as a domain rank.