The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan 1000669605 Instructor: Dr. Gautam Das

Upload: abraham-franklin

Post on 25-Dec-2015


Page 1

The PageRank Citation Ranking: Bringing Order to the Web

Presented by Aishwarya Rengamannan

1000669605

Instructor: Dr. Gautam Das

Page 2

Technology Overview

Page 3

Motivation

• The WWW is huge and heterogeneous
• Web pages proliferate free of quality control
• Commercial interest to manipulate ranking
• The ‘quality’ of a web page is subjective to the users.

Problem: Necessity to approximate the overall relative ‘importance’ of web pages.

Solution: Take advantage of the link structure of the web.

Page 4

Link structure of the Web

• Forward Links (Outedges): The outgoing links from a webpage. C is a forward link of both A and B.

• Back Links (Inedges): The incoming links to a webpage. A and B are back links of C.
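
As a minimal sketch of these two definitions, the Python below derives forward links and back links from an edge list. The pages A, B and C mirror the slide's example; the function name `link_maps` is an illustrative assumption, not something from the slides.

```python
from collections import defaultdict

def link_maps(edges):
    """Return (forward, back) dicts mapping each page to a sorted list of pages."""
    forward = defaultdict(set)
    back = defaultdict(set)
    for src, dst in edges:
        forward[src].add(dst)   # dst is a forward link (outedge) of src
        back[dst].add(src)      # src is a back link (inedge) of dst
    pages = {page for edge in edges for page in edge}
    return ({p: sorted(forward[p]) for p in pages},
            {p: sorted(back[p]) for p in pages})

# Slide example: A and B both link to C.
forward, back = link_maps([("A", "C"), ("B", "C")])
```

Here `forward["A"]` holds C, while `back["C"]` holds A and B.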

Page 5

Related Work

• Academic paper citations
• Link-based analysis
• Clustering methods that take link structure into account
• Modeling the web as Hubs and Authorities

Page 6

Ranking Intuition

• The quantity of the backlinks to a webpage makes it important.

• The quality of the back linked pages increases the ranking.

“A page has high rank if the sum of the ranks of its backlinks is high.”

How about having a backlink from www.yahoo.com?

Page 7

Naïve PageRank Calculation

• u & v --> Webpages
• Bu --> Backlinks of u (the pages that point to u)
• Nv --> Number of forward links from v
• R --> Ranks of the webpages
• c < 1 --> Used for normalization
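
In the notation above, the simplified ranking from the PageRank paper is usually written as follows, reading N_v as the number of forward links leaving v:

```latex
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}
```

Each page v splits its rank evenly across its N_v forward links, and u collects one share from every page in B_u.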

Page 8

Matrix Representation

‘A’ is a square adjacency matrix with
• Rows and columns corresponding to web pages (u & v)
• Au,v = 1/Nu if there is an edge from u to v
• Au,v = 0 if there is no edge.
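
A minimal sketch of how such a matrix could be built, assuming pages are numbered 0 to n-1 and the edge list contains no duplicates. The function name `build_matrix` and the three-page graph are hypothetical.

```python
def build_matrix(n, edges):
    """Return the n x n matrix with A[u][v] = 1/N_u for each edge u -> v."""
    out_degree = [0] * n
    for u, _ in edges:
        out_degree[u] += 1          # N_u: number of forward links from u
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1.0 / out_degree[u]
    return A

# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = build_matrix(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
```

Every row that belongs to a page with at least one forward link sums to 1, because the page's rank is split evenly over its outedges.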

Page 9

Matrices Revisited

Eigenvalues and Eigenvectors:
• Matrix A (n×n)
• λ is an eigenvalue of A if there exists a non-zero vector v such that Av = λv
• The vector v is called an eigenvector of A corresponding to λ.
• We can rewrite Av = λv as (A − λI)v = 0, where I is the identity matrix (n×n).

Page 10

Matrices Revisited (Contd…)

How do we solve for the eigenvalues and eigenvectors?
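
One standard route, sketched here on a hypothetical 2×2 matrix that is not taken from the slides: a non-zero v can exist only when A − λI is singular, so solve det(A − λI) = 0 for λ, then solve (A − λI)v = 0 for each root.

```latex
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\det(A - \lambda I) = (2 - \lambda)^2 - 1 = 0
\;\Rightarrow\; \lambda_1 = 3,\ \lambda_2 = 1

\lambda_1 = 3:\ (A - 3I)v = 0 \;\Rightarrow\; v_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
\lambda_2 = 1:\ (A - I)v = 0 \;\Rightarrow\; v_2 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
```

Checking the first pair: A v_1 = (3, 3) = 3 v_1, so λ_1 = 3 is the dominant eigenvalue in this example.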

Page 11

Page 12

Sample Calculation

[Figure: example graph with four nodes labeled 1, 2, 3 and 4]

Page 13

Matrix Representation (contd…)

• A --> square matrix over web pages
• R --> vector over web pages
• To find: the eigenvector corresponding to the dominant (maximum) eigenvalue.
– Could be computed by iterating repeatedly until it converges to the dominant eigenvalue/eigenvector pair

Matrix notation gives R = c A R

c : eigenvalue
R : eigenvector of A

R =

Normalized R =
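
The repeated iteration mentioned above can be sketched as a plain power iteration. This is an illustration under stated assumptions, not the presenter's calculation: the three-page matrix is hypothetical, and because A was defined row-wise (Au,v = 1/Nu for an edge from u to v), the update that matches the backlink sum multiplies by the transpose of A.

```python
def power_iteration(A, iterations=100):
    """Approximate the dominant eigenvector of A^T, normalized to sum to 1."""
    n = len(A)
    R = [1.0 / n] * n                      # uniform starting ranks
    for _ in range(iterations):
        # (A^T R)[u] = sum over v of A[v][u] * R[v]
        new_R = [sum(A[v][u] * R[v] for v in range(n)) for u in range(n)]
        total = sum(new_R)
        R = [x / total for x in new_R]     # L1 normalization plays the role of c
    return R

# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
R = power_iteration(A)
```

For this graph the normalized ranks settle near 0.4, 0.2 and 0.4 for pages 0, 1 and 2.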

Page 14

Problem with Naïve PageRank

Rank Sink:
• Two web pages that point to each other but to no other page, plus a third page that points to one of them.

• The loop will accumulate rank but never distribute it (since no outedges leave the loop).
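
A small sketch of the rank sink, assuming the naive update with no normalization step (the total rank stays constant in this graph anyway). The page names A, B, C and the helper `naive_step` are hypothetical.

```python
def naive_step(ranks, links):
    """One naive update: every page splits its rank evenly over its forward links."""
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in links.items():
        for target in targets:
            new_ranks[target] += ranks[page] / len(targets)
    return new_ranks

# A and B point only to each other; C points into the loop.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = {page: 1.0 / 3 for page in links}
for _ in range(10):
    ranks = naive_step(ranks, links)
```

C has no backlinks, so its rank drops to 0 after the first step, and all of the rank ends up circulating between A and B.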

Page 15

Solution – Extended version of PageRank

Introducing Rank Source:

E(u): a vector over the web pages that corresponds to a source of rank.
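
With the rank source added, the definition in the PageRank paper takes the following form, using the same symbols as the naive calculation:

```latex
R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u),
\qquad \text{with } c \text{ maximized and } \lVert R' \rVert_1 = 1
```

In matrix notation this reads R' = c(AR' + E). Every page now receives some rank from E, so rank that flows into a sink is replenished elsewhere.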

Page 16

Random Surfer Model

• Random Surfer – Clicks on successive links at random.

• The factor ‘E’ can be viewed as modeling this behavior.

• The “surfer” periodically gets bored and jumps to a random page chosen based on the distribution in E.
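
The random surfer can be simulated directly. The sketch below is only an illustration: it assumes a uniform E, a jump probability of 0.15, and a hypothetical four-page graph, none of which are specified on the slide.

```python
import random

def random_surfer(links, steps=50_000, jump_prob=0.15, seed=0):
    """Estimate visit frequencies of a surfer who follows links or jumps at random."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = pages[0]
    for _ in range(steps):
        visits[current] += 1
        # Bored, or stuck on a page with no forward links: jump uniformly (uniform E).
        if rng.random() < jump_prob or not links[current]:
            current = rng.choice(pages)
        else:
            current = rng.choice(links[current])
    return {page: count / steps for page, count in visits.items()}

# Hypothetical graph: A, B and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
freq = random_surfer(links)
```

The visit frequencies approximate the ranks: C, which has three backlinks, is visited far more often than B or D, which are reached only by random jumps.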

Page 17

PageRank Computation

- initialize vector over web pages
Loop:
- new ranks: sum of normalized backlink ranks
- compute normalizing factor
- add escape term
- control parameter
While not converged (stop when converged)
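
The loop above can be sketched in Python as follows. This is a hedged reading of the listed steps, not the presenter's code: the factor `c = 0.85` applied to the link term, the threshold `epsilon`, the uniform E, and the three-page graph are all assumptions made for the illustration.

```python
def pagerank(links, E, c=0.85, epsilon=1e-10):
    """Iterate the rank update until the L1 change drops below epsilon.

    links: dict page -> list of forward links (every page must be a key).
    E:     dict page -> rank source weight (assumed to sum to 1).
    """
    pages = list(links)
    R = {p: 1.0 / len(pages) for p in pages}       # initialize vector over web pages
    while True:
        new_R = {p: 0.0 for p in pages}
        for v, targets in links.items():           # new ranks: sum of normalized backlink ranks
            for u in targets:
                new_R[u] += c * R[v] / len(targets)
        d = sum(R.values()) - sum(new_R.values())  # normalizing factor: rank lost this round
        for p in pages:                            # add escape term
            new_R[p] += d * E[p]
        delta = sum(abs(new_R[p] - R[p]) for p in pages)  # control parameter
        R = new_R
        if delta <= epsilon:                       # stop when converged
            return R

# Rank-sink graph: A and B point to each other, C points to A.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
E = {p: 1.0 / len(links) for p in links}
R = pagerank(links, E)
```

Unlike the naive update, C keeps a small non-zero rank here because the escape term feeds rank back to every page.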

Page 18

Another Problem?

Dangling links:
– Links that point to a page with no outgoing links
– It is not clear where their weight should be distributed

Solution: Remove them from the system until all other PageRanks have been calculated!
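
A minimal sketch of the removal step, assuming every link target also appears as a key in the graph. Repeating the removal until no dangling page remains is a choice made in this sketch, since dropping one page can leave another without forward links; the name `remove_dangling` and the graph are hypothetical.

```python
def remove_dangling(links):
    """Repeatedly drop pages with no forward links, and the links pointing to them."""
    links = {page: set(targets) for page, targets in links.items()}
    while True:
        dangling = {page for page, targets in links.items() if not targets}
        if not dangling:
            return {page: sorted(targets) for page, targets in links.items()}
        links = {page: targets - dangling
                 for page, targets in links.items() if page not in dangling}

# Hypothetical graph: D has no forward links; removing it leaves C dangling too.
links = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": []}
core = remove_dangling(links)
```

Only A and B survive in `core`; ranks would be computed on that core first and the removed pages added back afterwards.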

Page 19

Implementation

• Web crawler keeps a database of URLs so that it can discover all URLs on the web

• To implement PageRank, the web crawler builds an index of the URLs as it crawls

Problems???

• Infinitely large sites
• Incorrect/Broken HTML
• Sites are down
• Web is always changing

Page 20

PageRank Implementation

• Convert each URL into a unique integer ID
• Sort the link structure by ID
• Remove dangling links
• Make an initial assignment of ranks and iterate until convergence
• Add the dangling links back
• Iterate the process again to assign weights to all dangling links
• The link database, A, is normally kept in RAM
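
The first two steps can be sketched as below. The function name `index_links` and the example URLs are hypothetical; IDs are assigned in order of first appearance, which is one possible scheme among many.

```python
def index_links(url_edges):
    """Assign each URL a unique integer ID and return the links sorted by ID."""
    url_to_id = {}

    def get_id(url):
        if url not in url_to_id:
            url_to_id[url] = len(url_to_id)   # next unused integer
        return url_to_id[url]

    id_edges = sorted((get_id(src), get_id(dst)) for src, dst in url_edges)
    return url_to_id, id_edges

# Hypothetical URLs, for illustration only.
url_to_id, id_edges = index_links([
    ("http://b.example/", "http://a.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://b.example/", "http://c.example/"),
])
```

Sorting by source ID groups each page's forward links together, so the link structure can be read in a single linear pass during each iteration.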

Page 21

Convergence Properties

• Interpret the web as an expander-like graph.
– A graph is an expander if every subset of nodes S has a neighborhood that is larger than some factor α times |S|

• Verification: a graph is an expander if the largest eigenvalue is sufficiently larger than the second-largest eigenvalue
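
The eigenvalue gap matters for convergence because of a general property of power iteration, stated here as background rather than as a result from the slides: when the dominant eigenvalue λ_1 is unique, the error after k iterations shrinks roughly like the k-th power of the ratio of the two largest eigenvalue magnitudes.

```latex
\lVert R_k - R^{*} \rVert = O\!\left( \left| \frac{\lambda_2}{\lambda_1} \right|^{k} \right)
```

Here R_k is the rank vector after k iterations and R^* is the limit. The smaller the ratio |λ_2 / λ_1|, the fewer iterations are needed.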

Page 22

Applications of Page Rank

• Search, browsing and traffic estimation.
• Help users decide if a site is trustworthy.
• Estimate web traffic.
• Spam detection and prevention.
• Predict citation counts.

Page 23

• http://www.techpavan.com/2008/11/20/backend-google-search/

• http://www.math.hmc.edu/calculus/tutorials/eigenstuff/

• http://williamcotton.com/pagerank-explained-with-javascript