Page 1: Ranking the Web

Ranking the Web

Gianna M. Del Corso, Antonio Gullí

Dipartimento Informatica, Pisa; IIT-CNR, Pisa

Page 2: Ranking the Web

Fun 04 G.M. Del Corso, A. Gulli

Overview

Web Statistics
Some Web Ranking Algorithms
Zooming on PageRank
    Personalization
    Fast PageRank
Fun Results and Web Comparison
Online demo

Page 3: Ranking the Web

Web Statistics

Page 4: Ranking the Web

Web Statistics

January 2004: 151 million active Internet users in the U.S.; 76% used a search engine (SE) at least once a month. Time spent searching: ~40 minutes.

[Nielsen//NetRatings]

Page 5: Ranking the Web

Share Of Searches: February 2004

February 2004, 1.5 million web surfers (1 million in the US)

[comScore Media Metrix]

Note (AG): the data is gathered by monitoring the web activities of 1.5 million English speakers worldwide (1 million in the United States) via proxy metering.

Page 6: Ranking the Web

Search Referrals

March 2004, sample of over 25 million visits

[WebSideStory]

Note (AG): WebSideStory examined a sample of over 25 million visits and found that Google had the top share of search referrals, 40.9 percent. It was followed by Yahoo at 27.4 percent, then MSN at 19.6 percent.

Note (AG): Time Evolution

Note (AG): Dropping Google

Page 8: Ranking the Web

A Cash Cow Business

Jupiter Media Metrix estimates that paid advertising will reach as much as $4 billion by 2005

Business growth rate expected to increase by 20% over the next five years

Page 9: Ranking the Web

Google’s numbers

[Google’s IPO SEC Filing]

IPO To Happen, Files For Public Offering

$2,718,281,828

For those not blessed with a PhD and a job at Google, 2.718281828… is Euler's number e.

Page 10: Ranking the Web

Web Ranking

Page 11: Ranking the Web

Web Ranking

The author of p gives a vote to q

p → q

Page 12: Ranking the Web

HITS

Eigenvector computation can be used, by:

$$h = W a, \qquad a = W^{T} h \quad\Longrightarrow\quad a = W^{T} W a, \qquad h = W W^{T} h$$

Where
a: Vector of Authorities’ scores
h: Vector of Hubs’ scores
W: Adjacency matrix in which $w_{i,j} = 1$ if i points to j
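The two update rules can be iterated directly. The following is a minimal sketch, assuming a dense 0/1 adjacency matrix and Euclidean renormalization after every step (the normalization and the tiny 4-page graph are illustrative assumptions, not taken from the slides).

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def hits(w, iterations=50):
    """Alternate a = W^T h and h = W a, renormalizing after every step."""
    n = len(w)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        # Authority of j: sum of the hub scores of the pages i that point to j.
        a = normalize([sum(w[i][j] * h[i] for i in range(n)) for j in range(n)])
        # Hub score of i: sum of the authority scores of the pages j it points to.
        h = normalize([sum(w[i][j] * a[j] for j in range(n)) for i in range(n)])
    return a, h

# Hypothetical graph: 0 -> 2, 1 -> 2, 1 -> 3.
W = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(W)
print("authority:", authority)
print("hub:", hub)
```

In this toy graph page 2 ends up as the strongest authority and page 1 as the strongest hub, since page 1 points to both linked pages.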

Page 13: Ranking the Web

HITS

[Figure: authority and hubness weights; series: Authority, Hubness; axis ticks 1 to 15]

Page 14: Ranking the Web

SALSA

Two separate random walks:
Hub walk
Authority walk

Page 15: Ranking the Web

HITS vs SALSA

$$H = W_r W_c^{T}, \qquad A = W_c^{T} W_r$$

W is the adjacency matrix of G
$W_r$ is W with each entry divided by the sum of the entries in its row
$W_c$ is W with each entry divided by the sum of the entries in its column

Stationary distributions proportional to the number of in-links (authority walk) and out-links (hub walk)!!
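The claim about the stationary distributions can be checked numerically. This is a small sketch under assumptions: the 4-page graph is hypothetical, rows or columns of W that sum to zero are simply left at zero, and the check only verifies that the degree-proportional vectors are fixed points of the two chains.

```python
def salsa_matrices(w):
    """Build the hub chain H = W_r W_c^T and the authority chain A = W_c^T W_r."""
    n = len(w)
    out_deg = [sum(row) for row in w]
    in_deg = [sum(w[i][j] for i in range(n)) for j in range(n)]
    # W_r: rows of W divided by their sums; W_c: columns of W divided by their sums.
    w_r = [[w[i][j] / out_deg[i] if out_deg[i] else 0.0 for j in range(n)] for i in range(n)]
    w_c = [[w[i][j] / in_deg[j] if in_deg[j] else 0.0 for j in range(n)] for i in range(n)]
    hub_chain = [[sum(w_r[i][k] * w_c[j][k] for k in range(n)) for j in range(n)]
                 for i in range(n)]
    auth_chain = [[sum(w_c[k][i] * w_r[k][j] for k in range(n)) for j in range(n)]
                  for i in range(n)]
    return hub_chain, auth_chain, in_deg, out_deg

def step(dist, chain):
    """One step of the random walk: row vector times transition matrix."""
    n = len(chain)
    return [sum(dist[i] * chain[i][j] for i in range(n)) for j in range(n)]

# Hypothetical graph: 0 -> 2, 1 -> 2, 1 -> 3, 3 -> 0.
W = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
H, A, in_deg, out_deg = salsa_matrices(W)
pi_auth = [d / sum(in_deg) for d in in_deg]    # proportional to in-links
pi_hub = [d / sum(out_deg) for d in out_deg]   # proportional to out-links
print(step(pi_auth, A), pi_auth)
print(step(pi_hub, H), pi_hub)
```

Each printed pair should agree, which is why SALSA scores can be read off the in-degrees and out-degrees without any iteration when the graph forms a single connected component.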

Page 16: Ranking the Web

Google’s PageRank

Page 17: Ranking the Web

Google’s PageRank

“Random Surfer Model”: the rank of a page equals the probability of sitting on that page

$$r(i) = \sum_{j \in B(i)} \frac{r(j)}{N(j)}$$

Where
B(i): set of pages inlinking to i
N(j): number of outgoing links from j
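As an illustration of the random surfer recurrence, the sketch below repeatedly pushes each page's rank along its outgoing links. It assumes a hypothetical 3-page graph that is strongly connected, aperiodic and free of dangling pages, so the plain iteration converges without any damping.

```python
def simple_pagerank(out_links, iterations=100):
    """Iterate r(i) = sum over j in B(i) of r(j) / N(j), starting from a uniform rank."""
    pages = sorted(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for j, targets in out_links.items():
            # Page j splits its current rank evenly among its N(j) outgoing links.
            for i in targets:
                new_rank[i] += rank[j] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = {0: [1, 2], 1: [2], 2: [0]}
ranks = simple_pagerank(links)
print(ranks)
```

Solving the three equations by hand gives r(0) = r(2) = 0.4 and r(1) = 0.2, which is what the iteration settles on.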

Page 18: Ranking the Web

Google’s PageRank

P, Web Graph Matrix:

$$P_{i,j} = \begin{cases} \dfrac{1}{\mathrm{outdeg}(i)} & \text{if } i \to j \\ 0 & \text{otherwise} \end{cases}$$

Dangling nodes, i.e. Web pages with no outlinks

Cyclic paths: the surfer gets bored and jumps to another place

v is a personalization vector, α is the damping factor
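One common way to combine these ingredients, assumed here because the slide's own formula did not survive extraction, is $\tilde{P} = \alpha\,(P + d\,v^{T}) + (1-\alpha)\,e\,v^{T}$, where $d$ marks the dangling pages and $e$ is the all-ones vector (both symbols are introduced only for this sketch). The code builds that matrix for a hypothetical 3-page graph and shows how a biased v changes it.

```python
def google_matrix(out_links, n, alpha=0.85, v=None):
    """Row-stochastic matrix alpha * (P + d v^T) + (1 - alpha) * e v^T (assumed form)."""
    if v is None:
        v = [1.0 / n] * n              # uniform jump: no personalization
    matrix = []
    for i in range(n):
        targets = out_links.get(i, [])
        if targets:
            row = [0.0] * n
            for j in targets:
                row[j] += 1.0 / len(targets)   # P[i][j] = 1 / outdeg(i)
        else:
            row = list(v)                      # dangling page: jump according to v
        matrix.append([alpha * row[j] + (1 - alpha) * v[j] for j in range(n)])
    return matrix

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, and page 2 is dangling.
links = {0: [1, 2], 1: [2], 2: []}
uniform = google_matrix(links, 3)
biased = google_matrix(links, 3, v=[0.8, 0.1, 0.1])   # surfer prefers page 0
for row in uniform:
    print(row)
```

Every row sums to 1, so the bored surfer always lands somewhere, and the row of the dangling page reduces to v itself.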

Page 19: Ranking the Web

Personalized PageRank

Biased Rank

[Figure; labels: a, b]

[Haveliwala, 02]

Page 20: Ranking the Web

Eurekster

Create and join SearchGroups to focus your search by area of interest

Page 21: Ranking the Web

Fast PageRank

Page 22: Ranking the Web

PageRank

Standard Algorithm for computing PR: Power Method applied to the PageRank matrix built from P, v and α

Takes several days due to the size of the Web Graph
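Because the Web graph is huge, the matrix is normally never formed explicitly. The sketch below is one possible matrix-free Power Method, assuming the usual handling of dangling pages (their rank is redistributed according to v) and an L1 stopping test; the function name, tolerance and toy graph are illustrative choices.

```python
def pagerank_power(out_links, n, alpha=0.85, v=None, tol=1e-10, max_iter=1000):
    """Power Method that only walks the link lists, never a dense n x n matrix."""
    if v is None:
        v = [1.0 / n] * n
    x = list(v)
    for iteration in range(1, max_iter + 1):
        new_x = [0.0] * n
        dangling_mass = 0.0
        for j in range(n):
            targets = out_links.get(j, [])
            if targets:
                share = x[j] / len(targets)
                for i in targets:
                    new_x[i] += share
            else:
                dangling_mass += x[j]      # rank sitting on pages with no outlinks
        new_x = [alpha * (new_x[i] + dangling_mass * v[i]) + (1 - alpha) * v[i]
                 for i in range(n)]
        change = sum(abs(new_x[i] - x[i]) for i in range(n))
        x = new_x
        if change < tol:
            break
    return x, iteration

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, and page 2 is dangling.
ranks, iterations = pagerank_power({0: [1, 2], 1: [2], 2: []}, 3)
cycle_ranks, _ = pagerank_power({0: [1], 1: [2], 2: [0]}, 3)
print(ranks, iterations)
```

Each iteration costs one pass over the links, which is why the number of iterations needed is what the acceleration techniques try to reduce.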

Page 23: Ranking the Web

Why do we need a fast link-based rank?

“…The link structure of the Web is significantly more dynamic than the contents on the Web. Every week, about 25% new links are created. After a year, about 80% of the links on the Web are replaced with new ones. This result indicates that search engines need to update link-based ranking metrics very often…”

[ Cho et al., 04 ]

Page 24: Ranking the Web

Accelerating PageRank

Web Graph Compression to fit in internal memory [Boldi et al., 04]

Efficient External memory implementation [Haveliwala, 99; Chen et al., 02]

Mathematical approaches

Combination of the above strategies

Page 25: Ranking the Web

Accelerating PageRank

Adaptive Power method:

C = set of pages converged, N = set of pages not yet converged

Run PM on $A_N$ (the rows of A for the pages in N), where

$$A = \begin{pmatrix} A_N \\ A_C \end{pmatrix},$$

detecting converged components. In the paper, many other adaptive strategies!!

Slow-converging pages have high PageRank

[ Kamvar et al., 03 ]

SpeedUp: 22% time reduction, Precision: $10^{-3}$

DataSet: 280,000 nodes
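The adaptive idea can be sketched as follows: pages whose value has stopped changing are frozen and no longer recomputed. This is a simplified illustration, not the authors' implementation; it assumes a uniform jump vector and a per-page relative-change test, and the `adaptive` flag exists only so the result can be compared with the plain Power Method.

```python
def pagerank(out_links, n, alpha=0.85, tol=1e-8, adaptive=True, max_iter=1000):
    """Power Method that optionally stops updating pages once they have converged."""
    in_links = [[] for _ in range(n)]
    for j, targets in out_links.items():
        for i in targets:
            in_links[i].append(j)
    out_deg = [len(out_links.get(j, [])) for j in range(n)]
    x = [1.0 / n] * n
    active = set(range(n))                 # the set N of pages not yet converged
    for _ in range(max_iter):
        if not active:
            break
        dangling_mass = sum(x[j] for j in range(n) if out_deg[j] == 0)
        new_values = {}
        for i in active:                   # only the rows of A belonging to N
            inflow = sum(x[j] / out_deg[j] for j in in_links[i])
            new_values[i] = alpha * (inflow + dangling_mass / n) + (1 - alpha) / n
        converged = {i for i, value in new_values.items()
                     if abs(value - x[i]) <= tol * value}
        x = [new_values.get(i, x[i]) for i in range(n)]
        if adaptive:
            active -= converged            # move converged pages from N to C
        elif len(converged) == n:
            break
    return x

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 2 -> 3, 3 -> 0.
links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
adaptive_ranks = pagerank(links, 4, adaptive=True)
plain_ranks = pagerank(links, 4, adaptive=False)
print(adaptive_ranks)
print(plain_ranks)
```

Freezing a page trades a small loss of precision for fewer row updates per iteration, which matches the modest speedup at moderate precision reported on the slide.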

Page 26: Ranking the Web

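Accelerating PageRank

Extrapolation strategies:

$$x^{(0)} = u_1 + a_2 u_2 + \dots + a_m u_m$$

$$x^{(n)} = A x^{(n-1)} = A^{n} x^{(0)} = u_1 + a_2 \lambda_2^{n} u_2 + \dots + a_m \lambda_m^{n} u_m$$

where $u_i$ are the eigenvectors of A and $\lambda_i$ the corresponding eigenvalues ($\lambda_1 = 1$)

[ Kamvar et al., 03 ]

Periodically subtract off estimates of non-principal eigenvectors from $x^{(k)}$ … Much improved over PM as α → 1

SpeedUp: 69% time reduction, Precision: $10^{-3}$

DataSet: 80 million nodes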
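A simple member of this family is Aitken extrapolation, which assumes the iterate is a combination of only the first two eigenvectors and removes the second one using three successive iterates. The sketch below applies it component by component to a hypothetical 2 x 2 column-stochastic matrix, where the two-eigenvector assumption holds exactly; on a real Web matrix it would only be an approximation.

```python
def power_step(a, x):
    """One Power Method step x_next = A x, with A column-stochastic."""
    n = len(x)
    return [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]

def aitken(x0, x1, x2):
    """Component-wise Aitken extrapolation from three successive iterates."""
    estimate = []
    for p0, p1, p2 in zip(x0, x1, x2):
        denominator = p2 - 2 * p1 + p0
        if abs(denominator) < 1e-15:
            estimate.append(p2)            # component already converged
        else:
            estimate.append(p0 - (p1 - p0) ** 2 / denominator)
    return estimate

# Hypothetical matrix with eigenvalues 1 and 0.4; its principal eigenvector is (5/6, 1/6).
A = [
    [0.9, 0.5],
    [0.1, 0.5],
]
x0 = [0.5, 0.5]
x1 = power_step(A, x0)
x2 = power_step(A, x1)
extrapolated = aitken(x0, x1, x2)
print("after two plain steps:", x2)
print("after extrapolation:  ", extrapolated)
```

Two plain steps are still about 0.05 away from the limit, while the extrapolated vector matches it to rounding error, because here the subtracted estimate of the second eigenvector is exact.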

Page 27: Ranking the Web

Accelerating PageRank

Block Structure

Reordering web pages according to a lexicographical order
Compute “local Rank”
Create a new starting vector

[Figure: block structure, with blocks labeled Stanford and Berkeley]

[ Kamvar et al., 03 ]

SpeedUp: 75% time reduction, Precision: $10^{-3}$

DataSet: 70 million nodes
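The reordering step can be illustrated with a short sketch. It assumes that pages are sorted by host name with the domain components reversed and then by path, so that pages of the same host become contiguous; the URLs and the exact sort key are hypothetical, and the local-rank computation itself is not shown.

```python
from itertools import groupby
from urllib.parse import urlsplit

def sort_key(url):
    """Key (reversed host, path), e.g. www.stanford.edu/a -> ('edu.stanford.www', '/a')."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    return (reversed_host, parts.path)

def blocks_by_host(urls):
    """Sort URLs lexicographically and group contiguous pages of the same host."""
    ordered = sorted(urls, key=sort_key)
    return [(host, list(group))
            for host, group in groupby(ordered, key=lambda u: sort_key(u)[0])]

# Hypothetical pages.
urls = [
    "http://www.stanford.edu/a",
    "http://www.berkeley.edu/x",
    "http://cs.stanford.edu/b",
    "http://www.berkeley.edu/y",
]
blocks = blocks_by_host(urls)
for host, pages in blocks:
    print(host, pages)
```

After this ordering, links between pages of the same host fall into diagonal blocks of the matrix, which is where a local rank could be computed independently for each block.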

Page 28: Ranking the Web

Accelerating PageRank

Sparse Linear Permutation

Viewing PR as a linear system problem
Transforming it into a sparse formulation
Exploiting reducibility via permutations
Comparing different scalar and block solvers

[ Del Corso et al., 04 ]

SpeedUp: 89% time reduction, Precision: $10^{-7}$

DataSet: 24M nodes
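To make the linear-system view concrete: for a graph without dangling pages, the PageRank vector solves $(I - \alpha P^{T})\,x = (1-\alpha)\,v$. The sketch below solves that system with Gauss-Seidel sweeps, one example of a scalar solver. It assumes a uniform v, no dangling pages and no self-links, and it does not reproduce the permutation strategies of the cited report.

```python
def pagerank_gauss_seidel(out_links, n, alpha=0.85, tol=1e-12, max_sweeps=1000):
    """Gauss-Seidel on (I - alpha P^T) x = (1 - alpha) v, touching only existing links."""
    in_links = [[] for _ in range(n)]
    for j, targets in out_links.items():
        for i in targets:
            in_links[i].append(j)
    out_deg = [len(out_links[j]) for j in range(n)]
    x = [1.0 / n] * n
    for _ in range(max_sweeps):
        largest_change = 0.0
        for i in range(n):
            # Row i of the system, using the values already updated in this sweep.
            value = (1 - alpha) / n + alpha * sum(x[j] / out_deg[j] for j in in_links[i])
            largest_change = max(largest_change, abs(value - x[i]))
            x[i] = value
        if largest_change < tol:
            break
    return x

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
solution = pagerank_gauss_seidel({0: [1, 2], 1: [2], 2: [0]}, 3)
print(solution)
```

Unlike the Power Method, each sweep reuses freshly updated entries immediately, which typically reduces the number of passes over the links.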

Page 29: Ranking the Web

“Rich Get Richer” phenomenon

“.. From our experimental data, we could observe that the top 20% of the pages with the highest number of incoming links obtained 70% of the new links after 7 months, while the bottom 60% of the pages obtained virtually no new incoming links during that period…”

[ Cho et al., 04 ]

Page 30: Ranking the Web

Web Spamming

Page 31: Ranking the Web

Spamming PageRank

[Garcia-Molina et al., 04]

An Optimal Link Structure

$$r_{total} = r_{static} + r_{in} - r_{out} - r_{dangling}$$

Note (AG): where $r_{static}$ is the score gained from the static score distribution (random jump); $r_{in}$ is the score flowing into the pages through the incoming links from external pages; $r_{out}$ is the score leaving the pages through their outgoing links to external pages; and $r_{dangling}$ (written $r_{sink}$ in the source) is the score loss due to sink pages within the group (i.e., pages without outgoing links).

Spam Farm (SF), rules of thumb:
Use all available own pages in the SF, ↑ $r_{static}$
Accumulate the maximum number of inlinks to the SF, ↑ $r_{in}$
Suppress links pointing outside the SF, $r_{out} = 0$
Avoid dangling nodes within the SF: every page (including the target page t) has some outlinks
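The effect of these rules can be observed on a toy example. The sketch below is an illustration under assumptions: five hypothetical honest pages form a cycle, the farm has one target page t and a variable number of boosting pages that link only to t, and no external page links into the farm (so $r_{in} = 0$ here). It compares a target that links back to its boosters with a target left dangling.

```python
def pagerank(out_links, n, alpha=0.85, iterations=200):
    """Plain Power Method with uniform jump; dangling rank is spread uniformly."""
    x = [1.0 / n] * n
    for _ in range(iterations):
        new_x = [(1 - alpha) / n] * n
        for j in range(n):
            targets = out_links.get(j, [])
            if targets:
                for i in targets:
                    new_x[i] += alpha * x[j] / len(targets)
            else:
                for i in range(n):
                    new_x[i] += alpha * x[j] / n
        x = new_x
    return x

def target_score(boosters, honest=5, link_back=True):
    """Score of the target page t in a farm with the given number of boosting pages."""
    target = honest
    links = {i: [(i + 1) % honest] for i in range(honest)}   # honest pages: a cycle
    booster_ids = list(range(honest + 1, honest + 1 + boosters))
    for b in booster_ids:
        links[b] = [target]                # every own page votes for t
    # Rule 4: t links back into the farm instead of being a dangling page.
    links[target] = booster_ids if link_back else []
    return pagerank(links, honest + 1 + boosters)[target]

for k in (1, 5, 20):
    print(k, target_score(k), target_score(k, link_back=False))
```

In this setup the target's score grows with the number of boosting pages, and for the same farm size it is higher when t links back to the farm than when t is left without outlinks.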
Page 32: Ranking the Web

Spamming PageRank

Setting up sophisticated link structures within a spam farm does not improve the ranking of the target page.

Page 33: Ranking the Web

Spamming HITS

Easy to spam:
Create a new page p pointing to many authority pages (e.g., Yahoo, Google, etc.)
p becomes a good hub page
… On p, add a link to your home page
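The recipe can be replayed on a toy graph. In the sketch below all page numbers are hypothetical: pages 0 to 2 play the authorities, pages 3 and 4 are honest hubs, page 5 is the spam page p and page 6 is the home page being promoted. HITS scores are computed with the usual alternating updates and Euclidean normalization.

```python
import math

def hits(out_links, n, iterations=50):
    """HITS on an adjacency list; returns (authority, hub) score lists."""
    hub = [1.0] * n
    authority = [1.0] * n
    for _ in range(iterations):
        authority = [0.0] * n
        for i, targets in out_links.items():
            for j in targets:
                authority[j] += hub[i]
        hub = [sum(authority[j] for j in out_links.get(i, [])) for i in range(n)]
        a_norm = math.sqrt(sum(s * s for s in authority))
        h_norm = math.sqrt(sum(s * s for s in hub))
        authority = [s / a_norm for s in authority]
        hub = [s / h_norm for s in hub]
    return authority, hub

honest = {3: [0, 1], 4: [1, 2]}
spammed = dict(honest)
spammed[5] = [0, 1, 2, 6]      # p points to every authority and to the home page

authority_before, _ = hits(honest, 7)
authority_after, hub_after = hits(spammed, 7)
print("home page authority before:", authority_before[6])
print("home page authority after: ", authority_after[6])
print("hub scores after:", hub_after)
```

Because p points to every authority, it receives the largest hub score, and that hub score is passed on to the home page as authority.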

Page 34: Ranking the Web

Fun Results (aka “Google Bombing”)

Page 35: Ranking the Web

Fun Search Results and Demo

Page 36: Ranking the Web

Fun Results (aka “Google Bombing”)

Some recent (as of 2004) and popular examples:

weapons of mass destruction - hoax, IE error look-alike saying “weapons of mass destruction cannot be found”
great president - biography of George W. Bush
litigious bastards - homepage of the SCO Group
Buffone, Facce da culo, Discorsi Folli - Silvio Berlusconi
out of touch executives - Google’s own corporate info page
Waffle - John Kerry’s site (blog spamming campaign)

[ wikipedia ]

Page 37: Ranking the Web

Will Google still dominate search in 2005?

Every three years, a new search engine takes the lead and has its 15 minutes of fame.

A timeline is at http://www.investors.com/

Open Source alternative [ Nutch ]

Page 38: Ranking the Web

Page 39: Ranking the Web

Comparing Ranks (Online Demo)

Page 40: Ranking the Web

Bibliography

K. Bharat, M. Henzinger: Improved Algorithms for Topic Distillation in a Hyperlinked Environment, SIGIR Conference, 1998

P. Boldi, S. Vigna: The WebGraph framework I: Compression techniques. To appear in Proc. of the Thirteenth International World-Wide Web Conference.

S. Brin, L. Page: The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems vol. 30 num 1-7, 1998

S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW Conference, 1998

M. Bianchini, M. Gori, F. Scarselli: Inside PageRank, Technical report DII 1/03, Department of Information Engineering, University of Siena, 2001.

Y.Y. Chen, Q. Gan, T. Suel: I/O-Efficient Techniques for Computing Pagerank, Proceedings of the eleventh international conference on Information and knowledge management

J. Cho, S. Roy: Impact of Web Search Engines on Page Popularity, in Proceedings of the World-Wide Web Conference (WWW), May 2004.

G.M. Del Corso, A. Gulli, F. Romani: Fast PageRank Computation Via a Sparse Linear System, IIT-CNR TechReport 2004

C.P.C Lee, G.H. Golub, S.A. Zenios: A Fast two stage algorithm for computing PageRank, Stanford Tech-Report 2004

Page 41: Ranking the Web

Bibliography

R. Lempel, S. Moran: SALSA: The Stochastic Approach for Link-Structure Analysis, ACM Transactions on Information Systems Vol. 19 No.2, 2001

T. H. Haveliwala: Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search, IEEE Trans. on Knowledge and Data Eng, 2003

T. H. Haveliwala, S. D. Kamvar, G. Jeh: An Analytical Comparison of Approaches to Personalizing PageRank, Preprint, June 2003

S.D. Kamvar, T.H. Haveliwala, C.D. Manning, G.H. Golub: Extrapolation Methods for Accelerating PageRank Computations, WWW Conf., 2003

S.D. Kamvar, T.H. Haveliwala, C.D. Manning, G.H. Golub: Exploiting the Block Structure of the Web for Computing PageRank, Stanford Tech.Rep, 2003

S.D. Kamvar, T. H. Haveliwala, G. H. Golub: Adaptive Methods for the Computation of PageRank, Linear Algebra and its Applications, Special Issue on the Numerical Solution of Markov Chains, Nov. 2003.

J. Kleinberg: Authoritative Sources in a Hyperlinked Environment, Journal of the ACM Vol.46 No.5, 1999

A. Ntoulas, J. Cho, C. Olston: What's New on the Web? The Evolution of the Web from a Search Engine Perspective, World-Wide Web Conference, May 2004.

Z. Gyöngyi, H. Garcia-Molina: Web Spam Taxonomy, Technical Report, Stanford University, 2004

Page 42: Ranking the Web

Page 43: Ranking the Web

Broder’s Altavista

Patented May 2003

A, Attractor Matrix: sites externally endorsed
N, Non Attractor Matrix: sites deemed to be avoided
Use a linear combination of A, N and other matrices

Suggests also using non-principal eigenvectors

Page 44: Ranking the Web

Accelerating PageRank

Two-stage algorithm

The Markov Chain associated with P is lumpable

Combine D nodes into a block. P1 is the transition matrix
Compute the stationary distribution of P1
Combine ND nodes into a block. P2 is the transition matrix
Compute the stationary distribution of P2
Concatenate the results

D are the dangling nodes, ND the non dangling nodes

[ Lee et al., 04 ]

SpeedUp: 80% time reduction, Precision: $10^{-9}$

DataSet: 451,000 nodes
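The lumpability claim behind the first stage can be checked on a small example. The sketch below is an illustration, not the algorithm of Lee et al.: it assumes a uniform jump vector, so every dangling page has the same transition row, merges all dangling pages into one block and compares the stationary distribution of the smaller chain with that of the full chain. The second stage is not implemented.

```python
def google_matrix(out_links, n, alpha=0.85):
    """Row-stochastic PageRank chain with uniform jump; dangling rows are uniform."""
    v = 1.0 / n
    matrix = []
    for i in range(n):
        targets = out_links.get(i, [])
        if targets:
            row = [alpha * targets.count(j) / len(targets) + (1 - alpha) * v
                   for j in range(n)]
        else:
            row = [v] * n
        matrix.append(row)
    return matrix

def stationary(chain, iterations=300):
    """Stationary distribution by repeated left multiplication."""
    n = len(chain)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        dist = [sum(dist[i] * chain[i][j] for i in range(n)) for j in range(n)]
    return dist

def lump_dangling(chain, nondangling, dangling):
    """Keep every ND page as its own state and merge all D pages into one block."""
    blocks = [[i] for i in nondangling] + [dangling]
    # Lumpability: any page of a block gives the same block-to-block probabilities.
    representatives = [block[0] for block in blocks]
    return [[sum(chain[r][j] for j in block) for block in blocks]
            for r in representatives]

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 3, 2 -> 4; pages 3 and 4 are dangling.
links = {0: [1, 2], 1: [2], 2: [3, 4], 3: [], 4: []}
full = stationary(google_matrix(links, 5))
lumped = stationary(lump_dangling(google_matrix(links, 5), [0, 1, 2], [3, 4]))
print("full chain:  ", full)
print("lumped chain:", lumped)
```

The first three entries agree between the two chains and the last entry of the lumped chain equals the combined rank of the dangling pages, so the non-dangling ranks can be obtained from a chain that is smaller by the number of dangling pages minus one.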

Page 45: Ranking the Web

Finally…the perfect search engine?

Sergey Brin: “It would be the mind of God. Larry says it would know exactly what you want and give you back exactly what you need.”

Chakrabarti: “The web grew exponentially from almost zero to 800 million pages between 1991 and 1999. In comparison, it took 3.5 million years for the human brain to grow linearly from 400 to 1400 cubic centimeters. How do we work with the web without getting overwhelmed? We look for relevance and quality. Can we design programs to recognize these properties?”