Ranking the Web Frontier

Nadav Eiron, Kevin S. McCurley, John A. Tomlin. IBM Almaden Research Center. WWW’04. CSE 450 Web Mining, presented by Zaihan Yang.




Introduction & Contribution
• Propose algorithmic innovations for the basic PageRank paradigm.
• Problem of the Web frontier (dangling nodes)
  • Distinguish different types of dangling nodes
  • Propose four techniques for penalty pages
• Problem of computing PageRank and rank manipulation
  • Explore the Web's hierarchical structure
  • HostRank & DirRank algorithms

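PageRank
• Backlinks, random surfer, recursive computation
Ideal Model (stated in two equivalent forms; the equations are not preserved in this transcript)
• The web graph should be strongly connected.
• A should be stochastic (and the underlying chain irreducible and aperiodic).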

PageRank: Improved Model
• Add a link from each page to every page and give each link a small transition probability controlled by a parameter α.
• Random jump (teleportation)
• Virtual node n+1
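The improved model can be illustrated with a minimal power-iteration sketch. It is not the authors' implementation: the function name `pagerank`, the parameter `jump_prob` (the probability of a random jump, which may or may not correspond directly to the slide's α), and the choice to spread the rank of dangling pages uniformly are all assumptions made for illustration.

```python
def pagerank(out_links, jump_prob=0.15, iterations=100):
    """Power iteration for PageRank with uniform teleportation.

    out_links maps each page to the list of pages it links to.
    jump_prob is the (assumed) probability of a random jump at each step.
    """
    # Collect every page that appears as a source or as a target.
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Assumption: rank sitting on pages without known outlinks
        # is redistributed uniformly, like a random jump.
        dangling_mass = sum(rank[u] for u in nodes if not out_links.get(u))
        base = jump_prob / n + (1.0 - jump_prob) * dangling_mass / n
        new_rank = {u: base for u in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            for v in targets:
                new_rank[v] += (1.0 - jump_prob) * rank[u] / len(targets)
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": []}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

Each iteration preserves the total rank, so the scores remain a probability distribution over pages.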

Variations / Issues
• Parameter α
• Random jump: uniform distribution
• Dangling nodes

Dangling Nodes
• Dangling nodes: nodes that either have no outlinks or for which no outlinks are known.
• How do pages become dangling nodes?
  • Crawlers might not have crawled them.
  • Dynamic pages.
  • Protected by robots.txt.
  • Genuinely have no outlinks: PS, PDF.
  • Meta tag indicating not to follow links.
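As a small illustration of the definition, the sketch below finds dangling nodes in a crawl represented as an adjacency dictionary. The representation and the name `dangling_nodes` are assumptions for this example; a page that appears only as a link target (never crawled) is treated the same as a page crawled with an empty outlink list.

```python
def dangling_nodes(out_links):
    """Return pages with no outlinks or with no known outlinks."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    # A page is dangling if it has no entry or an empty outlink list.
    return {u for u in nodes if not out_links.get(u)}


if __name__ == "__main__":
    crawl = {
        "a": ["b", "paper.pdf", "uncrawled"],
        "b": ["a"],
        "paper.pdf": [],  # crawled, genuinely has no outlinks
    }
    print(sorted(dangling_nodes(crawl)))
```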

Handling Dangling Nodes
• Remove them, then add them back.
• Random jump
• Reduced eigen-system
• Power iteration
• A single step
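One way to read "remove, then add back in a single step" is sketched below: ranks are first computed on the non-dangling pages by any method, and the dangling pages then receive scores from one propagation step. The function name `single_step_to_dangling` is hypothetical, and whether a damping factor is applied in this step is not stated in the slides, so none is used here.

```python
def single_step_to_dangling(core_rank, out_links, dangling):
    """Give each dangling page the rank pushed to it in one step.

    core_rank: ranks already computed for the non-dangling pages.
    out_links: page -> list of outlinks (dangling targets included).
    dangling:  set of dangling pages to score.
    """
    scores = {d: 0.0 for d in dangling}
    for page, targets in out_links.items():
        if page not in core_rank or not targets:
            continue
        # Assumption: a page splits its rank evenly over all its outlinks.
        share = core_rank[page] / len(targets)
        for target in targets:
            if target in scores:
                scores[target] += share
    return scores


if __name__ == "__main__":
    links = {"a": ["b", "x"], "b": ["a"]}
    print(single_step_to_dangling({"a": 0.6, "b": 0.4}, links, {"x"}))
```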

Penalty Pages and Link Rot
• Penalty pages: pages that are dangling and produce a 403 or 404 HTTP code.
• Link rot: links that used to work but are now broken.
• (Penalty link, dangling link)

Effects of Dangling Nodes on Ranking
• Whether to teleport to dangling nodes:
  • Yes: 3 has the highest rank score.
  • No: [0.31746, 0.31746, 0.365079], 0.269841. Less than 1 and 2.
• The number of dangling links:
  • 1 link: [0.198684, 0.283124, 0.283124, 0.235068]
  • 4 links: [0.195954, 0.229266, 0.279234, 0.29554]

Push-back algorithm
If a page has a link to a penalty page, its rank is reduced by a fraction, and the excess rank is returned to the pages that pushed rank to it in the previous iteration.

Retain (1-i), distribute iij to its backlinks.

Self-Loop algorithm
Augment each page with a self-loop link to itself. With a

i probability follow this link.

bi is the number of outlinks from i to penalty pages.

gi is the number of outlinks from i to non-penalty pages.

1- becomes Some variations.

Jump-weighting algorithm
• Instead of redistributing rank evenly, bias the redistribution so that penalized pages receive less rank.
• A straightforward method: weight the link from the virtual node to an unpenalized node in C (the strongly connected node set) by …, and to a penalized node by gi/(gi+bi).
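The jump weights can be sketched as follows, using bi and gi as defined for the self-loop algorithm. Two things are assumptions here, not taken from the slides: a page counts as "penalized" when it has at least one outlink to a penalty page, and the weight for an unpenalized page (missing from the transcript) is exposed as the hypothetical parameter `unpenalized_weight` with a placeholder default of 1.0. The final normalization into a jump distribution is also an illustrative choice.

```python
def jump_weights(out_links, penalty_pages, unpenalized_weight=1.0):
    """Biased jump distribution from the virtual node.

    out_links:      page -> list of outlinks.
    penalty_pages:  set of pages known to return 403/404.
    unpenalized_weight: placeholder value, an assumption of this sketch.
    """
    weights = {}
    for page, targets in out_links.items():
        b = sum(1 for t in targets if t in penalty_pages)  # bi
        g = len(targets) - b                               # gi
        if b == 0:
            weights[page] = unpenalized_weight
        else:
            weights[page] = g / (g + b)
    total = sum(weights.values())
    if total == 0:
        return weights
    # Normalize so the weights form a probability distribution.
    return {page: w / total for page, w in weights.items()}


if __name__ == "__main__":
    links = {"a": ["b", "dead"], "b": ["a"]}
    print(jump_weights(links, {"dead"}))
```

In this example page "a" has one good and one penalty outlink, so it receives half the raw weight of page "b" before normalization.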

BHITS algorithm
• Random walk in both forward and backward directions.
• Forward step: the same as ordinary PageRank.
• Backward step:
  • Non-dangling nodes: self-loop.
  • Dangling nodes:
    • Non-penalty nodes: forward score to the virtual node.
    • Penalty nodes: divide score by the number of inlinks; propagate score equally among backward links.
• A penalty page traverses to a random seed node.

Matrix representation

HostRank algorithm
• Web hierarchical structure
  • 62.4% of links are internal to a site.
  • 82% of outlinks are to the top level of sites.
  • Jump not uniformly, but to portal or top-level pages.
• Consider all pages on a site as a single body.
• Assign them all a rank based on the collective value of information on that site.
• Each site is represented by one node in the graph.
• The web graph becomes smaller, and computation becomes cheaper.
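Collapsing pages into one node per site can be sketched with the standard library's `urllib.parse.urlsplit`. The function name `host_graph` is hypothetical, and dropping links that stay within one host is an assumption of this sketch; the slides do not say how internal links are treated.

```python
from urllib.parse import urlsplit


def host_graph(page_links):
    """Collapse (source_url, target_url) pairs into a host-level graph."""
    edges = {}
    for src, dst in page_links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if src_host is None or dst_host is None:
            continue  # skip URLs without a host part
        if src_host == dst_host:
            continue  # assumption: internal links are ignored
        edges.setdefault(src_host, set()).add(dst_host)
    return edges


if __name__ == "__main__":
    links = [
        ("http://a.example/x.html", "http://a.example/y.html"),
        ("http://a.example/x.html", "http://b.example/"),
        ("http://b.example/p", "http://a.example/x.html"),
    ]
    print(host_graph(links))
```

The resulting host graph can then be ranked with any PageRank routine, with every page on a host inheriting the host's score.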

DirRank algorithm
• HostRank: too coarse a level of granularity, and a heavy-tailed distribution.
• DirRank graph
  • Nodes: groups of URLs with prefixes up to the last “/” or “?” (virtual directories).
  • Edges: if there is a link from a URL in the source virtual directory to a URL in the destination virtual directory.
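The virtual-directory grouping can be sketched directly from the node and edge definitions. The names `virtual_directory` and `dir_graph` are hypothetical, and the sketch assumes URLs that carry a path; for a bare URL such as "http://example.com" the last "/" falls inside the scheme separator, a case this example does not handle.

```python
def virtual_directory(url):
    """Prefix of the URL up to and including the last '/' or '?'."""
    cut = max(url.rfind("/"), url.rfind("?"))
    return url if cut < 0 else url[:cut + 1]


def dir_graph(page_links):
    """Build edges between virtual directories from (source, target) URL pairs."""
    edges = {}
    for src, dst in page_links:
        edges.setdefault(virtual_directory(src), set()).add(virtual_directory(dst))
    return edges


if __name__ == "__main__":
    print(virtual_directory("http://example.com/a/b.html"))
    print(virtual_directory("http://example.com/cgi/search?q=web"))
    print(dir_graph([("http://example.com/a/b.html", "http://example.org/c/d.html")]))
```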

Experimental Results
• Setup: crawl at IBM Almaden
  • More than 1 billion pages; 37 billion links; 4.75 billion URLs.
• Results: reduced computation
  • DirRank: 114 million nodes / 15 billion edges
  • HostRank: 19.7 million hosts (nodes) / 1.1 billion edges
• Results: enhanced resistance to link manipulation
  • 11/20 in 100 million pages vs. 14/100 hostnames
  • Virtual node probability: 0.82 vs. 0.17

Conclusions
• PageRank with uniform teleportation is easily subject to link manipulation.
• The HostRank and DirRank algorithms are both cheaper to compute and less subject to link manipulation.
• The four proposed techniques for penalty pages can reduce bias and improve ranking performance.
• For the future, the hope is to place the problem of web page ranking on a firmer scientific foundation, beyond the trade or economic domains.