Ranking the Web
Gianna M. Del Corso Antonio Gullí
Dipartimento Informatica, Pisa; IIT-CNR, Pisa
Fun 04 G.M. Del Corso, A. Gulli
Overview
Web Statistics
Some Web Ranking Algorithms
Zooming on PageRank
    Personalization
    Fast PageRank
Fun Results and Web Comparison
Online demo
Web Statistics
January 2004: 151 million active users in the U.S.; 76% used a search engine (SE) at least once a month. Time spent searching: about 40 minutes.
[Nielsen//NetRatings]
Share Of Searches: February 2004
February 2004, 1.5 million US web surfers
[comScore Media Metrix]
Search Referrals
March 2004, 25 million Web pages
[WebSideStory]
[google-watch.org]
A Cash Cow Business
Jupiter Media Metrix estimates that paid advertising will reach as much as $4 billion by 2005
Business growth rate: an increase of 20% over the next five years
Google’s numbers
[Google’s IPO SEC Filing]
IPO To Happen, Files For Public Offering
$2,718,281,828
For those not blessed with a PhD and a job at Google, that is Euler's number…
Web Ranking
The author of p gives a vote to q
p → q
Hits
Eigenvector computation can be used:
$a = W^T h, \quad h = W a$
$a = W^T W a, \quad h = W W^T h$
Where
a: Vector of Authorities’ scores
h: Vector of Hubs’ scores
W: Adjacency matrix in which $w_{i,j} = 1$ if i points to j
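The HITS updates $a = W^T h$ and $h = W a$ can be sketched as a simple power iteration. This is a minimal illustration in plain Python, not the authors' implementation: the function name `hits`, the fixed iteration count, the Euclidean normalisation, and the toy graph are all assumptions made for the example.

```python
import math


def hits(W, iterations=50):
    """Return (authority, hub) score vectors for adjacency matrix W.

    W[i][j] is 1 if page i points to page j, else 0.
    """
    n = len(W)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # a = W^T h: a page is a good authority if good hubs point to it.
        a = [sum(W[i][j] * h[i] for i in range(n)) for j in range(n)]
        # h = W a: a page is a good hub if it points to good authorities.
        h = [sum(W[i][j] * a[j] for j in range(n)) for i in range(n)]
        # Normalise to unit Euclidean length so the scores stay bounded.
        norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
        norm_h = math.sqrt(sum(x * x for x in h)) or 1.0
        a = [x / norm_a for x in a]
        h = [x / norm_h for x in h]
    return a, h


# Hypothetical toy graph: pages 0 and 1 link out, pages 2 and 3 are linked to.
W = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(W)
```

In this toy graph page 2 receives more links than page 3, so it ends up with the larger authority score, while page 0 (which points to both) is the stronger hub.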
Hits
[Figure: authority and hubness weights; panels "Authority" and "Hubness", horizontal axis 1 to 15]
Salsa
Two separate random walks:
Hub walk
Authority walk
Hits vs Salsa
$H = W_r W_c^T$, $A = W_c^T W_r$
W is the adjacency matrix of G
$W_r$ is W with each row divided by the sum of its entries
$W_c$ is W with each column divided by the sum of its entries
Stationary distributions are proportional to the number of in-links (authority walk) and out-links (hub walk)!!
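The claim about the SALSA authority walk can be checked numerically on a small graph. The sketch below builds $A = W_c^T W_r$ and verifies that the in-degree vector, normalised to sum to 1, is left unchanged by one step of the walk. The function name `salsa_authority_chain` and the example graph are assumptions made for illustration only.

```python
def salsa_authority_chain(W):
    """Build the SALSA authority matrix A = Wc^T Wr and the in-degree vector."""
    n = len(W)
    out_deg = [sum(row) for row in W]
    in_deg = [sum(W[k][j] for k in range(n)) for j in range(n)]
    # Wr: each row of W divided by its row sum (rows of zeros stay zero).
    Wr = [[W[i][j] / out_deg[i] if out_deg[i] else 0.0 for j in range(n)]
          for i in range(n)]
    # Wc: each column of W divided by its column sum (zero columns stay zero).
    Wc = [[W[i][j] / in_deg[j] if in_deg[j] else 0.0 for j in range(n)]
          for i in range(n)]
    # A[i][j] = sum_k Wc[k][i] * Wr[k][j]: step back along a link, then forward.
    A = [[sum(Wc[k][i] * Wr[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    return A, in_deg


# Hypothetical 4-page graph with 6 links.
W = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
A, in_deg = salsa_authority_chain(W)
total = sum(in_deg)
pi = [d / total for d in in_deg]  # candidate stationary distribution
pi_next = [sum(pi[i] * A[i][j] for i in range(len(W))) for j in range(len(W))]
```

Here `pi_next` should coincide with `pi` up to rounding, which is the "proportional to in-links" property for the authority walk; the hub walk is symmetric, with out-degrees in place of in-degrees.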
Google’s PageRank
“Random Surfer Model”: the rank of a page equals the probability of sitting on that page
$r(i) = \sum_{j \in B(i)} \frac{r(j)}{N(j)}$
Where
B(i): set of pages inlinking to i
N(j): number of outgoing links from j
Google’s PageRank
P, Web Graph Matrix:
$P_{i,j} = \frac{1}{\mathrm{outdeg}(i)}$ if $i \to j$, $0$ otherwise
Dangling nodes, i.e. Web pages with no outlinks
Cyclic paths: the surfer gets bored and jumps to another place
v is a personalization vector, α is the damping factor
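A minimal sketch of how the pieces on this slide fit together in a power iteration. It assumes, as one common convention rather than something stated in the slides, that the rank held by dangling nodes is redistributed according to v, and that the random jump also follows v. The function name `pagerank`, the default α = 0.85, and the example graph are illustrative assumptions.

```python
def pagerank(out_links, alpha=0.85, v=None, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank.

    out_links[i] lists the pages that page i links to (no duplicates).
    v is the personalization vector (uniform if omitted); alpha is the
    damping factor.
    """
    n = len(out_links)
    if v is None:
        v = [1.0 / n] * n
    x = list(v)
    for _ in range(max_iter):
        new = [0.0] * n
        dangling_mass = 0.0
        for i, targets in enumerate(out_links):
            if targets:
                # P[i][j] = 1 / outdeg(i) for every link i -> j.
                share = x[i] / len(targets)
                for j in targets:
                    new[j] += share
            else:
                # Dangling node: assumed to hand its rank over according to v.
                dangling_mass += x[i]
        # Follow links with probability alpha, otherwise jump according to v.
        new = [alpha * (new[j] + dangling_mass * v[j]) + (1.0 - alpha) * v[j]
               for j in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new, x))
        x = new
        if delta < tol:
            break
    return x


# Hypothetical 4-page graph; page 3 is a dangling node.
graph = [[1, 2], [2], [0, 3], []]
ranks = pagerank(graph)
# Personalization biased entirely towards page 0.
biased = pagerank(graph, v=[1.0, 0.0, 0.0, 0.0])
```

Passing a non-uniform `v` is all that is needed to obtain a personalized (biased) ranking in this sketch.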
Personalized PageRank
Biased Rank
[Haveliwala 02]
Eurekster
Create and join SearchGroups to focus your search by area of interest
Fast PageRank
PageRank
Standard Algorithm for computing PR: Power Method applied to
Takes several days due to the size of Web Graph
Why do we need a fast link-based rank?
“…The link structure of the Web is significantly more dynamic than the contents on the Web. Every week, about 25% new links are created. After a year, about 80% of the links on the Web are replaced with new ones. This result indicates that search engines need to update link-based ranking metrics very often…”
[ Cho et al., 04 ]
Accelerating PageRank
Web Graph Compression to fit in internal memory [Boldi et al., 04]
Efficient External memory implementation [Haveliwala, 99; Chen et al., 02]
Mathematical approaches
Combination of the above strategies
Accelerating PageRank
Adaptive Power method:
C = set of pages converged, N = set of pages not yet converged
$A = \begin{pmatrix} A_N \\ A_C \end{pmatrix}$
Run PM on $A_N$, detecting converged components. In the paper, many other adaptive strategies!!
Slow-converging pages have high PageRank
[ Kamvar et al., 03 ]
SpeedUp: 22% time reduction, Precision: $10^{-3}$
DataSet: 280,000 nodes
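The idea of freezing converged pages can be sketched as follows. This is an illustrative simplification, not the algorithm of the cited paper: it assumes a graph without dangling nodes, a uniform jump vector, and a per-page relative-change test; the function name `pagerank_rows` and the `adaptive` flag are hypothetical.

```python
def pagerank_rows(in_links, out_deg, alpha=0.85, tol=1e-10,
                  adaptive=True, max_iter=10000):
    """Row-by-row power method; assumes every page has at least one out-link.

    in_links[i] lists the pages linking to i, out_deg[j] is the number of
    out-links of j. With adaptive=True, a page whose value changed by less
    than tol (relative) moves from N to C and is never recomputed.
    Returns the rank vector and the number of component updates performed.
    """
    n = len(in_links)
    x = [1.0 / n] * n
    active = set(range(n))  # N: pages not yet converged
    updates = 0
    for _ in range(max_iter):
        new = list(x)  # pages in C simply keep their current value
        for i in active:
            new[i] = (alpha * sum(x[j] / out_deg[j] for j in in_links[i])
                      + (1.0 - alpha) / n)
            updates += 1
        changed = {i for i in active
                   if abs(new[i] - x[i]) > tol * abs(new[i])}
        x = new
        if adaptive:
            active = changed  # freeze the pages that met the tolerance
        if not changed:
            break
    return x, updates


# Hypothetical graph: 0->1, 0->2, 1->2, 2->0, 2->3, 3->0.
in_links = [[2, 3], [0], [0, 1], [2]]
out_deg = [2, 1, 2, 1]
x_adaptive, updates_adaptive = pagerank_rows(in_links, out_deg, adaptive=True)
x_plain, updates_plain = pagerank_rows(in_links, out_deg, adaptive=False)
```

On a graph this small the saving is negligible; the point is only that both runs agree to within the chosen tolerance while the adaptive run stops recomputing pages once they settle.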
Accelerating PageRank
Extrapolation strategies:
$x^{(0)} = u_1 + a_2 u_2 + \dots + a_m u_m$
$x^{(n)} = A x^{(n-1)} = A^n x^{(0)} = u_1 + a_2 \lambda_2^n u_2 + \dots + a_m \lambda_m^n u_m$
where $u_i$ are the eigenvectors of A, with eigenvalues $\lambda_i$
Periodically subtract off estimates of non-principal eigenvectors from $x^{(k)}$ … Much improved over PM as α → 1
[ Kamvar et al., 03 ]
SpeedUp: 69% time reduction, Precision: $10^{-3}$
DataSet: 80 million nodes
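One simple way to "subtract off" a non-principal eigenvector is component-wise Aitken extrapolation, shown here as an illustration; the strategy actually benchmarked in the cited paper may differ in detail. If three successive iterates are assumed to contain only $u_1$ and one decaying term $a_2 \lambda_2^n u_2$, then with $g = x^{(1)} - x^{(0)}$ and $h = x^{(2)} - 2x^{(1)} + x^{(0)}$, the ratio $g_i^2 / h_i$ equals the unwanted component in $x^{(0)}$. The function name and the test vectors below are assumptions.

```python
def aitken_extrapolate(x0, x1, x2):
    """Estimate the principal eigenvector from three successive iterates.

    Assumes each iterate is u1 plus a single geometrically decaying term.
    Components whose second difference is zero are taken from x2 unchanged.
    """
    result = []
    for a, b, c in zip(x0, x1, x2):
        g = b - a            # first difference
        h = c - 2.0 * b + a  # second difference
        result.append(a - g * g / h if h != 0.0 else c)
    return result


# Synthetic iterates built from a known u1 and one decaying component u2.
u1 = [0.5, 0.3, 0.2]
u2 = [0.1, -0.04, -0.06]
lam = 0.8  # hypothetical second eigenvalue
x0 = [p + q for p, q in zip(u1, u2)]
x1 = [p + lam * q for p, q in zip(u1, u2)]
x2 = [p + lam * lam * q for p, q in zip(u1, u2)]
estimate = aitken_extrapolate(x0, x1, x2)
```

With exactly two components the estimate recovers `u1` up to rounding; on real iterates it is only an approximation, which is why it is applied periodically between ordinary power steps.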
Accelerating PageRank
Block Structure
Reordering web pages according to a lexicographical order
Compute “local Rank”
Create a new starting vector
[Figure labels: Stanford, Berkeley]
[ Kamvar et al., 03 ]
SpeedUp: 75% time reduction, Precision: $10^{-3}$
DataSet: 70 million nodes
Accelerating PageRank
Sparse Linear Permutation
Viewing PR as a linear system problem
Transforming it into a sparse formulation
Exploiting reducibility via permutations
Comparing different scalar and block solvers
[ Del Corso et al., 04 ]
SpeedUp: 89% time reduction, Precision: $10^{-7}$
DataSet: 24M nodes
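To illustrate the linear-system view: if dangling nodes redistribute their rank according to v, the PageRank vector is proportional to the solution y of the sparse system $(I - \alpha P^T) y = v$, so one can solve that system and normalise y to sum to 1. The sketch below uses plain Gauss-Seidel sweeps as one possible scalar solver; the solver choice, the function name `pagerank_linear`, and the assumption of no self-links are made for this example and are not taken from the cited report.

```python
def pagerank_linear(out_links, alpha=0.85, v=None, tol=1e-12, max_sweeps=1000):
    """Solve (I - alpha * P^T) y = v by Gauss-Seidel, then normalise y.

    out_links[i] lists the pages that page i links to (no self-links, no
    duplicates). Rows of P for dangling pages are left as zeros, which
    keeps the system as sparse as the link graph itself.
    """
    n = len(out_links)
    if v is None:
        v = [1.0 / n] * n
    # in_links[i] lists the pages j that link to i.
    in_links = [[] for _ in range(n)]
    for j, targets in enumerate(out_links):
        for i in targets:
            in_links[i].append(j)
    out_deg = [len(targets) for targets in out_links]
    y = list(v)
    for _ in range(max_sweeps):
        change = 0.0
        for i in range(n):
            # Row i of the system; the diagonal entry is 1 without self-links.
            new = v[i] + alpha * sum(y[j] / out_deg[j] for j in in_links[i])
            change += abs(new - y[i])
            y[i] = new  # Gauss-Seidel: reuse the updated value immediately
        if change < tol:
            break
    total = sum(y)
    return [value / total for value in y]


# Hypothetical 4-page graph; page 3 is a dangling node.
x = pagerank_linear([[1, 2], [2], [0, 3], []])
```

The result can be checked against the usual fixed-point equation of the power method, with the dangling mass `x[3]` spread uniformly.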
“Rich Get Richer” phenomenon
“.. From our experimental data, we could observe that the top 20% of the pages with the highest number of incoming links obtained 70% of the new links after 7 months, while the bottom 60% of the pages obtained virtually no new incoming links during that period…”
[ Cho et al., 04 ]
Web Spamming
Spamming PageRank
Spam Farm (SF), rules of thumb:
Use all available own pages in the SF, ↑ $r_{static}$
Accumulate the maximum number of inlinks to the SF, ↑ $r_{in}$
Suppress links pointing outside the SF, $r_{out} = 0$
Avoid dangling nodes within the SF: every page (including the target page t) has some outlinks
An Optimal Link Structure:
$r_{total} = r_{static} + r_{in} - r_{out} - r_{dangling}$
[Garcia-Molina et al., 04]
Spamming PageRank
Setting up sophisticated link structures within a spam farm does not improve the ranking of the target page.
Spamming Hits
Easy to spam
Create a new page p pointing to many authority pages (e.g., Yahoo, Google, etc.)
p becomes a good hub page
… On p, add a link to your home page
Fun Search Results and Demo
Fun Results (aka “Google Bombing”)
Some recent (as of 2004) and popular examples:
weapons of mass destruction - hoax, IE error look-alike saying “weapons of mass destruction cannot be found”.
great president - biography of George W. Bush.
litigious bastards - homepage of the SCO Group.
Buffone, Facce da culo, Discorsi Folli - Silvio Berlusconi
out of touch executives - Google’s own corporate info page
Waffle - John Kerry’s site (blog spamming campaign)
[ wikipedia ]
Will Google still dominate search in 2005?
Every three years, a new search engine takes the lead and has its 15 minutes of fame.
A timeline is at http://www.investors.com/
Open Source alternative [ Nutch ]
Comparing Ranks (Online Demo)
Bibliography
K. Bharat, M. Henzinger: Improved Algorithms for Topic Distillation in a Hyperlinked Environment, SIGIR Conference, 1998
P. Boldi, S. Vigna: The WebGraph Framework I: Compression Techniques, to appear in Proc. of the Thirteenth International World-Wide Web Conference
S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems Vol. 30 No. 1-7, 1998
S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW Conference, 1998
M. Bianchini, M. Gori, F. Scarselli: Inside PageRank, Technical Report DII 1/03, Department of Information Engineering, University of Siena, 2001
Y.Y. Chen, Q. Gan, T. Suel: I/O-Efficient Techniques for Computing Pagerank, Proceedings of the Eleventh International Conference on Information and Knowledge Management
J. Cho, S. Roy: Impact of Web Search Engines on Page Popularity, Proceedings of the World-Wide Web Conference (WWW), May 2004
G.M. Del Corso, A. Gulli, F. Romani: Fast PageRank Computation Via a Sparse Linear System, IIT-CNR TechReport, 2004
C.P.C. Lee, G.H. Golub, S.A. Zenios: A Fast Two-Stage Algorithm for Computing PageRank, Stanford Tech-Report, 2004
R. Lempel, S. Moran: SALSA: The Stochastic Approach for Link-Structure Analysis, ACM Transactions on Information Systems Vol. 19 No. 2, 2001
T.H. Haveliwala: Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search, IEEE Trans. on Knowledge and Data Eng., 2003
T.H. Haveliwala, S.D. Kamvar, G. Jeh: An Analytical Comparison of Approaches to Personalizing PageRank, Preprint, June 2003
S.D. Kamvar, T.H. Haveliwala, C.D. Manning, G.H. Golub: Extrapolation Methods for Accelerating PageRank Computations, WWW Conf., 2003
S.D. Kamvar, T.H. Haveliwala, C.D. Manning, G.H. Golub: Exploiting the Block Structure of the Web for Computing PageRank, Stanford Tech. Rep., 2003
S.D. Kamvar, T.H. Haveliwala, G.H. Golub: Adaptive Methods for the Computation of PageRank, Linear Algebra and its Applications, Special Issue on the Numerical Solution of Markov Chains, Nov. 2003
J. Kleinberg: Authoritative Sources in a Hyperlinked Environment, Journal of the ACM Vol. 46 No. 5, 1999
A. Ntoulas, J. Cho, C. Olston: What's New on the Web? The Evolution of the Web from a Search Engine Perspective, World-Wide Web Conference, May 2004
Z. Gyöngyi, H. Garcia-Molina: Web Spam Taxonomy, Technical Report, Stanford University, 2004
Broder’s Altavista
Patented May 2003
A, Attractor Matrix: sites externally endorsed
N, Non-Attractor Matrix: sites deemed to be avoided
Use a linear combination of A, N and other matrices
Suggests also using non-principal eigenvectors
Accelerating PageRank
Two-stage algorithm
The Markov Chain associated with P is lumpable
Combine D nodes into a block. P1 is the transition matrix
Compute the stationary distribution of P1
Combine ND nodes into a block. P2 is the transition matrix
Compute the stationary distribution of P2
Concatenate the results
D are the dangling nodes, ND the non-dangling nodes
[ Lee et al., 04 ]
SpeedUp: 80% time reduction, Precision: $10^{-9}$
DataSet: 451,000 nodes
Finally…the perfect search engine?
Sergey Brin: “It would be the mind of God. Larry says it would know exactly what you want and give you back exactly what you need.”
Chakrabarti: “The web grew exponentially from almost zero to 800 million pages between 1991 and 1999. In comparison, it took 3.5 million years for the human brain to grow linearly from 400 to 1400 cubic centimeters. How do we work with the web without getting overwhelmed? We look for relevance and quality. Can we design programs to recognize these properties?”