Link analysis for web search

Emrullah Delibas


Post on 16-Apr-2017


Page 1: Link analysis for web search

Emrullah Delibas

Page 2: Link analysis for web search

• The Problem of Ranking
  - Objectives, Challenges
• Early Assumptions & Approaches
• Link-Based Ranking Algorithms
  - InDegree Algorithm
  - Hubs and Authorities: HITS
  - PageRank
  - SALSA
  - Hilltop
• Search Engine Spamming
• Problems with Non-textual Context

Page 3: Link analysis for web search

• “Cornell”
  - Did the searcher want information about the university?
  - The university’s hockey team?
  - The Lab of Ornithology run by the university?
  - Cornell College in Iowa?
  - The Nobel-Prize-winning physicist Eric Cornell?

The same ranking of search results can’t be right for everyone.

Page 4: Link analysis for web search

• Objectives:
  - To categorize web pages
  - To find pages related to given pages
  - To find duplicated websites
  - To calculate the ‘quality’ of a web link
  - To get the most ‘relevant’ web links for a given query
  - To model human judgments indirectly
  - …

• Challenges:
  - Searching by itself is a hard problem for computers to solve in any setting
  - Scale and complexity of the Web
  - Problems of synonymy and polysemy
  - Dynamic and constantly changing nature of Web content
  - …

Page 5: Link analysis for web search

• Back in the 1990s, web search ranked documents mainly by the number of occurrences of the query words in each document.

• The ranking depended only on the relevance of a document to the query.

Simply retrieving the relevant documents was not sufficient, because the number of relevant documents can run into the millions.
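
To make the occurrence-count idea concrete, here is a minimal sketch of ranking by term count. The function names and the three toy documents are made up for illustration; real engines of the period were more elaborate than this.

```python
from collections import Counter


def term_count_score(query, document):
    """Score a document by how many times the query words occur in it."""
    words = Counter(document.lower().split())
    # Counter returns 0 for words that do not occur in the document.
    return sum(words[term] for term in query.lower().split())


def rank_by_term_count(query, documents):
    """Return document ids sorted by descending term-count score."""
    scores = {doc_id: term_count_score(query, text)
              for doc_id, text in documents.items()}
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical toy collection.
docs = {
    "a": "cornell university is in ithaca",
    "b": "cornell cornell cornell hockey",
    "c": "birds of north america",
}
ranking = rank_by_term_count("cornell", docs)
print(ranking)
```

Document "b" wins simply because it repeats the word most often, which is why occurrence counts alone say little about quality.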

Page 6: Link analysis for web search

• Links are assumed to be endorsements, although some links are created for other reasons:
  - Disagreement
  - Self-citation
  - Link to a popular document

• Hyperlinks contain information about the human judgment of a site

• The more incoming links a site has, the more highly it is judged

• The Web is not a random network

- Bray, Tim. "Measuring the web." Computer Networks and ISDN Systems 28.7 (1996): 993-1005.
- Marchiori, Massimo. "The quest for correct information on the web: Hyper search engines." Computer Networks and ISDN Systems 29.8 (1997): 1225-1235.

Page 7: Link analysis for web search

• Hyperlinks are not random; they provide valuable information for:
  - Link-based ranking
  - Structure analysis
  - Detection of communities
  - Spam detection
  - …

Page 8: Link analysis for web search
Page 9: Link analysis for web search

• This approach (InDegree) can be seen as the basis of every link analysis ranking algorithm.

• The link recommendation assumption is that by linking to another page, the author recommends it.
  - So, a page with many incoming links has been highly recommended.

• The ranking is based only on the number of incoming links; links are not weighted by the authority of the pages they come from.
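
Counting incoming links can be sketched in a few lines. The graph representation (a dict from page to the pages it links to), the function name, and the choice to ignore self-links and duplicate links are assumptions of this sketch, not details from the slides.

```python
def indegree_rank(links):
    """Rank pages by their number of incoming links.

    `links` maps each page to the list of pages it links to.
    """
    indegree = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):      # count each linking page at most once
            if target != source:         # assumption: self-links are ignored
                indegree[target] = indegree.get(target, 0) + 1
    # Highest in-degree first; ties broken alphabetically.
    return sorted(indegree.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical four-page web.
toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranking = indegree_rank(toy_web)
print(ranking)
```

Every link counts the same here, whether it comes from a well-linked page or from an obscure one; that is the lack of weighting the slide refers to.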

Page 10: Link analysis for web search
Page 11: Link analysis for web search

Hypertext Induced Topic Selection (HITS)

Page 12: Link analysis for web search

• The basic idea is that relevant pages (“authorities”) are linked to by many other pages (“hubs”).

• The algorithm is now a part of the Ask search engine.

Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999. A preliminary version appears in the Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, Jan. 1998.

Page 13: Link analysis for web search

• It was developed by looking at how humans approach a search, rather than having the machine look through a collection of documents for the query and return the matches.

• For example, consider the query “top automobile makers in the world”.

Page 14: Link analysis for web search

• Rules:
  - A good hub points to many good authorities.
  - A good authority is pointed to by many good hubs.
  - Authorities and hubs have a mutually reinforcing relationship.

Page 15: Link analysis for web search
Page 16: Link analysis for web search

• Objective: build a set of pages Sq such that
  - (i) Sq is relatively small
  - (ii) Sq is rich in relevant pages
  - (iii) Sq contains most (or many) of the strongest authorities
• Solution
  - Generate a root set Qσ from a text-based search engine
  - Expand the root set with the pages that link to it or are linked from it
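
The expansion step can be sketched as follows, assuming the link structure is already available as two lookup tables. The names `build_base_set`, `out_links`, `in_links`, and `max_in_links` are hypothetical, and the cap on in-linking pages is an illustrative way of keeping the set small.

```python
def build_base_set(root_set, out_links, in_links, max_in_links=50):
    """Expand a root set into a larger base set.

    out_links[p]: pages that p links to.
    in_links[p]:  pages that link to p.
    max_in_links: cap on how many in-linking pages are added per root page
                  (illustrative assumption, keeps the set relatively small).
    """
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, []))
        base.update(list(in_links.get(page, []))[:max_in_links])
    return base


# Hypothetical link data for a two-page root set.
out_links = {"r1": ["x", "y"], "r2": ["y"]}
in_links = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}
base = build_base_set(["r1", "r2"], out_links, in_links, max_in_links=2)
print(sorted(base))
```

Pages such as "h1" enter the set only because they link to a root page; these are the candidates for good hubs.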

Page 17: Link analysis for web search
Page 18: Link analysis for web search

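• Let the authority score of page i be x(i), and the hub score of page i be y(i).

• Mutually reinforcing relationship:
  - I step: $x(i) \leftarrow \sum_{j:\, j \to i} y(j)$ (sum of the hub scores of the pages that link to i)
  - O step: $y(i) \leftarrow \sum_{j:\, i \to j} x(j)$ (sum of the authority scores of the pages that i links to)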
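
A minimal sketch of the iteration, assuming the standard reading of the two steps: the I step sets each authority score to the sum of the hub scores of the pages pointing to it, and the O step sets each hub score to the sum of the authority scores of the pages it points to. The function names, the fixed iteration count, and the Euclidean normalization are choices made for this sketch.

```python
import math


def _normalize(scores):
    """Scale scores so that their squares sum to 1 (no-op for all zeros)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {p: v / norm for p, v in scores.items()}


def hits(links, iterations=50):
    """Return (authority, hub) score dicts for the graph `links`.

    `links` maps each page to the list of pages it points to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # I step: authority of a page = sum of hub scores pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        # O step: hub of a page = sum of the updated authority scores
        # of the pages it points to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, []))
                   for p in pages}
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return auth, hub


# Hypothetical graph: two hubs pointing at two authorities.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
auth, hub = hits(toy_links)
print(auth, hub)
```

In this toy graph "a1" ends up as the stronger authority because both hubs point to it, and "h1" as the stronger hub because it points to both authorities.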

Page 19: Link analysis for web search

• 1st iteration

Page 20: Link analysis for web search

• 1st iteration
  - I step

Page 21: Link analysis for web search

• 1st iteration
  - I step
  - O step

Page 22: Link analysis for web search

• 2nd iteration
  - I step

Page 23: Link analysis for web search

• 2nd iteration
  - I step
  - O step

Page 24: Link analysis for web search

• 2nd iteration
  - I step
  - O step
  - …

Page 25: Link analysis for web search

Drawbacks of HITS:
1. Must be built “on the fly”
2. Suffers from topic drift
3. Cannot detect advertisements
4. Can easily be spammed
5. Query-time evaluation is slow

Page 26: Link analysis for web search

Heart of Google

Page 27: Link analysis for web search

• Proposed by Sergey Brin and Lawrence Page

• Uses a recursive scheme similar to Kleinberg’s HITS algorithm

• Unlike HITS, the PageRank algorithm produces a ranking that is independent of the user’s query.

Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. In Proc. 7th International World Wide Web Conference, pages 107–117, 1998.

Page 28: Link analysis for web search

• A page is important if it is pointed to by other important pages.

Page 29: Link analysis for web search

• The PageRank of a page pi is defined in terms of the pages that link to it:
  - M(pi) is the set of pages that link to pi.
  - L(pj) is the number of outbound links on page pj.
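
With these definitions, the simplified PageRank (before any damping is introduced) is commonly written as below. The symbol PR(·) for the PageRank value is an assumption of this sketch, since the transcript does not preserve the original notation.

```latex
PR(p_i) = \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
```

As a worked instance with made-up numbers: if p_i is linked from p_1 with PR(p_1) = 0.4 and L(p_1) = 2, and from p_2 with PR(p_2) = 0.3 and L(p_2) = 3, then PR(p_i) = 0.4/2 + 0.3/3 = 0.3. Each page passes its own score on in equal shares to the pages it links to.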

Page 30: Link analysis for web search
Page 31: Link analysis for web search
Page 32: Link analysis for web search

• The algorithm is robust against spam
  - since it is not easy for a webpage owner to add in-links to his/her page from other important pages.

• PageRank is a global measure and is query independent.

Page 33: Link analysis for web search

• It favors older pages
  - since new pages will not yet have many links.

• PageRank can easily be inflated with “link farms”
  - However, search engines actively try to detect such schemes while indexing.

Page 34: Link analysis for web search

• Rank sinks: occur when pages in a network get caught in infinite link cycles.

• Spider traps: occur when a group of pages has no links from within the group to outside the group.

• Dangling links: occur when a page contains a link that points to a page with no outgoing links.

• Dead ends: pages with no outgoing links.
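
A small sketch of why dead ends matter: in an undamped update, where each page splits its score evenly among its outgoing links, whatever score reaches a dead end is lost in the next step. The function names and the three-page graph are hypothetical.

```python
def find_dead_ends(links):
    """Return the pages that have no outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    return sorted(p for p in pages if not links.get(p))


def propagate_once(links, rank):
    """One undamped step: each page splits its rank among its out-links."""
    new_rank = {p: 0.0 for p in rank}
    for page, targets in links.items():
        for target in targets:
            new_rank[target] += rank[page] / len(targets)
    return new_rank


# Hypothetical graph in which C is a dead end.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": []}
dead_ends = find_dead_ends(toy_web)
rank = {p: 1 / 3 for p in toy_web}
after_one_step = propagate_once(toy_web, rank)
print(dead_ends, sum(after_one_step.values()))
```

The scores start out summing to 1, but after one step they sum to only 2/3: the share that page C held has nowhere to go.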

Page 35: Link analysis for web search
Page 36: Link analysis for web search

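• Damping factor d
  - Models random jumps (teleportation): with probability 1 − d the surfer jumps to a random page instead of following a link
  - $PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$, where PR(p_i) is the PageRank of page p_i
• where N is the total number of pages
• Typically d ≈ 0.85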
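
A minimal power-iteration sketch of PageRank with damping: each page receives a base share (1 − d)/N plus d times the score passed along its incoming links. Treating dead ends as if they linked to every page is an assumption of this sketch (the slides do not say how they are handled), as are the function name and the fixed iteration count.

```python
def pagerank(links, d=0.85, iterations=100):
    """Power-iteration PageRank with damping factor d.

    `links` maps each page to the list of pages it links to.
    Assumption: a dead end spreads its rank evenly over all pages.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank currently held by pages without outgoing links.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        # Teleportation share plus the redistributed dead-end rank.
        new_rank = {p: (1 - d) / n + d * dangling / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for target in targets:
                new_rank[target] += d * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page web.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
print(ranks)
```

Page D has no incoming links, so it keeps only the teleportation share (1 − d)/N, while C, which is linked from three pages, ends up with the highest score.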

Page 37: Link analysis for web search

PageRank
• Computed for all stored web pages, prior to the query
• Computes authorities only
• Fast to compute
• No need for additional normalization

HITS
• Performed on the subset generated by each query
• Computes authorities and hubs
• Easy to compute, but real-time execution is hard
• Normalization is needed

Page 38: Link analysis for web search

Criteria | HITS | PageRank
Complexity analysis | O(kN²) | O(n)
Result quality | Less than the PageRank algorithm | Medium
Relevancy | More, since this algorithm uses the hyperlinks to give good results and also considers the content of the page | Less, since this algorithm ranks the pages at indexing time
Neighborhood | Applied to the local neighborhood of pages surrounding the results of a query | Applied to the entire Web

Grover, Nidhi, and Ritika Wason. "Comparative analysis of pagerank and hits algorithms." International Journal of Engineering Research and Technology. Vol. 1. No. 8 (October-2012). ESRSA Publications, 2012.

Page 39: Link analysis for web search

• Keyword stuffing: overloading a website with keywords relevant to the targeted queries.

• Hidden text: placing content on the website that can only be seen by search engines.

• Doorway page: a page that is heavily optimized for certain keywords and whose only purpose is to redirect to the real website.

• Link farms: websites that are optimized for certain keywords and contain nothing but a huge number of links to other websites.
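
As a rough illustration of how keyword stuffing might show up in page text, the sketch below measures the share of the text taken up by its most frequent word. The function names and the 0.3 threshold are arbitrary assumptions for illustration; this is not a description of how any search engine detects spam.

```python
from collections import Counter


def keyword_density(text):
    """Return the most frequent word and its share of all words in `text`."""
    words = text.lower().split()
    if not words:
        return None, 0.0
    word, count = Counter(words).most_common(1)[0]
    return word, count / len(words)


def looks_stuffed(text, threshold=0.3):
    """Flag text whose most frequent word exceeds `threshold` of all words.

    The default threshold is an arbitrary illustrative value.
    """
    _, density = keyword_density(text)
    return density > threshold


# Hypothetical page texts.
stuffed = "cheap flights cheap hotels cheap cars cheap deals cheap"
normal = "we compare flight prices across several airlines every day"
print(looks_stuffed(stuffed), looks_stuffed(normal))
```

A check this naive would also flag legitimate short texts, and ordinary function words can dominate longer ones, so it only illustrates the idea.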

Page 40: Link analysis for web search

• Flash: rarely processed by search engines.

• Java applets: normally not processed.

• Videos and images: not directly processable by search engines.

• Other rich-media formats (e.g. Silverlight): typically not processed by search engines.