Discovering Knowledge Using Web Structure Mining
1. What is Web?
1.1 Problems With the Web
Difficulty in finding relevant information
Personalization of information
Learning about consumers or individual users
2. Objectives
i. To survey the area of web mining.
ii. To introduce link mining.
iii. To review the HITS and PageRank algorithms.
3. Web Mining: Definition
The process of discovering potentially useful and previously unknown information or knowledge from web data.
3.1 Web Mining: Subtasks
Resource finding
Information selection and pre-processing
Generalization
Analysis
3.1 Web Mining Categories
Web mining has three categories, each defined by the data it works on:
Web Content Mining: text and multimedia documents
Web Structure Mining: hyperlink structure
Web Usage Mining: web log records
3.1.1 Web Content Mining
Scans the data of a web page to determine how relevant its content is to a search query.
Two approaches to web content mining:
Agent-based approach
Database approach
3.1.2 Web Structure Mining
Identifies relationships between web pages.
Focuses on the following problems:
Reducing irrelevant search results
Helping to index information on the web
3.1.3 Web Usage Mining
Focuses on techniques that predict user behavior while users interact with the WWW.
Web log records are analyzed to discover user access patterns.
The challenges can be divided into three phases:
Pre-processing
Pattern discovery
Pattern analysis
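The three phases can be sketched on a toy access log. This is a minimal illustration, not a standard tool: the record layout (user, timestamp, URL), the 30-minute session gap, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical, already-parsed log records: (user_id, unix_time, url).
LOG = [
    ("u1", 0, "/home"), ("u1", 60, "/products"), ("u1", 120, "/cart"),
    ("u2", 30, "/home"), ("u2", 90, "/products"),
    ("u1", 5000, "/home"), ("u1", 5060, "/products"),
]

SESSION_GAP = 1800  # assumed inactivity threshold, in seconds


def sessionize(records, gap=SESSION_GAP):
    """Pre-processing: split each user's clicks into sessions."""
    sessions = []
    last_seen = {}  # user -> (time of last click, index of current session)
    for user, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(user)
        if prev is None or ts - prev[0] > gap:
            sessions.append([url])          # start a new session
            idx = len(sessions) - 1
        else:
            idx = prev[1]
            sessions[idx].append(url)       # continue the current session
        last_seen[user] = (ts, idx)
    return sessions


def transition_counts(sessions):
    """Pattern discovery: count page-to-page transitions within sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


def frequent_patterns(counts, min_support=2):
    """Pattern analysis: keep transitions seen at least min_support times."""
    return {pair: n for pair, n in counts.items() if n >= min_support}


patterns = frequent_patterns(transition_counts(sessionize(LOG)))
print(patterns)
```

Real logs would first need parsing and cleaning (removing robot traffic, image requests, and so on), which this sketch skips.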
4. Link Mining
Link mining sits at the intersection of work in:
Link analysis
Hypertext and web mining
Relational learning and inductive logic programming
Graph mining
Some link mining tasks applicable to web structure mining are:
Link-based classification
Link-based cluster analysis
Link type
Link strength
Link cardinality
(i) Link-based Classification
Predicts the category of a web page, based on:
Words that occur on the page
Links between pages
Anchor text
HTML tags and other possible attributes of the web page
E.g.: Predicting the category of a paper, based on its citations and co-citations.
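One very simple way to use links for classification is a majority vote over the labels of linked items. The sketch below applies it to the citation example; the data, the function name classify_by_links, and the voting rule itself are assumptions for illustration, and real link-based classifiers also use page words, anchor text, and other attributes.

```python
from collections import Counter

# Hypothetical citation links: paper -> papers it cites.
CITES = {"p4": ["p1", "p2", "p3"], "p5": []}

# Hypothetical known categories of already-labelled papers.
LABELS = {"p1": "databases", "p2": "databases", "p3": "networks"}


def classify_by_links(node, links, labels):
    """Predict a category as the most common label among linked nodes.

    Returns None when no linked node has a known label.
    """
    votes = Counter(labels[n] for n in links.get(node, ()) if n in labels)
    return votes.most_common(1)[0][0] if votes else None


print(classify_by_links("p4", CITES, LABELS))
```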
(ii) Link-based Cluster Analysis
Goal: finding naturally occurring subclasses.
Data is segmented into groups: similar objects are grouped together, dissimilar objects go into different groups.
Helps in discovering hidden patterns.
E.g.: Finding diseases with similar transmission patterns.
(iii) Link Type
Predicting the type of link between two entities.
Predicting the purpose of a link, e.g. navigational or advertising.
(iv) Link Strength
Links can be associated with weights.
Strong links: higher weight
Weak links: lower weight
(v) Link Cardinality
Refers to the number of inbound links to a web site.
Link popularity: a combination of factors that weigh the importance of each incoming link.
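Link strength and link cardinality can both be read off a weighted edge list. The sketch below is illustrative only: the sites, the weights, and the use of a plain weight sum as a stand-in for link popularity are assumptions, since the slides do not say how the factors are combined.

```python
from collections import Counter

# Hypothetical weighted links: (source site, target site, link weight).
LINKS = [
    ("a.com", "c.com", 1.0),
    ("b.com", "c.com", 0.5),
    ("c.com", "a.com", 2.0),
    ("d.com", "c.com", 0.25),
]


def inbound_counts(links):
    """Link cardinality: number of inbound links per target site."""
    return Counter(target for _, target, _ in links)


def weighted_inbound(links):
    """Assumed stand-in for link popularity: sum of inbound link weights."""
    totals = Counter()
    for _, target, weight in links:
        totals[target] += weight
    return totals


print(inbound_counts(LINKS))
print(weighted_inbound(LINKS))
```

Here c.com has the most inbound links, but a.com receives the single strongest link, which shows why a raw count and a weighted measure can rank sites differently.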
5. Hyperlink-Induced Topic Search (HITS)
A link analysis algorithm that rates pages.
Identifies two kinds of pages from the Web hyperlink structure:
Authorities: contain valuable information on the subject.
Hubs: contain useful links to the authoritative pages.
[Figure: Hubs (web pages with links to other pages) and Authorities (web pages with content)]
HITS Contd…
Two-step process:
Sampling step: a set of relevant pages is collected.
Iterative step: hubs and authorities are found using the output of the sampling step.
HITS Contd…
Sampling step:
A query submitted to a search engine yields a root set.
The root set is expanded into a base set by adding pages that link to, or are linked from, pages in the root set.
[Figure: Expanding the root set into the base set]
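The expansion from root set to base set can be sketched as follows. The graph, the function name expand_root_set, and the cap on inbound pages per root page (max_in_per_page) are assumptions for illustration; in practice the root set would come from a search engine query.

```python
# Hypothetical link graph: page -> pages it links to.
GRAPH = {
    "r1": ["a", "b"],
    "r2": ["b"],
    "h1": ["r1", "r2"],
    "h2": ["r2"],
    "x": ["y"],
}


def expand_root_set(root, out_links, max_in_per_page=50):
    """Grow a root set into a base set.

    Adds every page a root page links to, plus up to max_in_per_page
    pages that link to each root page.
    """
    base = set(root)

    # Pages pointed to by root pages.
    for page in root:
        base.update(out_links.get(page, ()))

    # Build the reverse index: page -> pages that link to it.
    in_links = {}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, []).append(src)

    # Pages pointing to root pages, capped per root page.
    for page in root:
        base.update(in_links.get(page, [])[:max_in_per_page])

    return base


print(sorted(expand_root_set({"r1", "r2"}, GRAPH)))
```

Pages x and y stay out of the base set because they neither link to nor are linked from a root page.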
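HITS Contd…
Iterative step:
Associate with each page p a non-negative authority weight x<p> and a non-negative hub weight y<p>.
Computing authority weight: x<p> = sum of y<q> over all pages q that link to p.
Computing hub weight: y<p> = sum of x<q> over all pages q that p links to.
The weights are normalized after each update, and the two updates are repeated until the weights converge.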
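The iterative step can be sketched in a few lines. Each authority weight becomes the sum of the hub weights of the pages linking to it, each hub weight becomes the sum of the authority weights of the pages it links to, and both sets of weights are normalized after every round. The graph, the fixed iteration count, and the function names are assumptions for illustration.

```python
import math

# Hypothetical base set: page -> pages it links to.
BASE = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}


def normalize(weights):
    """Scale weights so their squares sum to one."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {p: v / norm for p, v in weights.items()}


def hits(out_links, iterations=50):
    """Return (authority, hub) weight dictionaries for a link graph."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum hub weights of pages that link to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in out_links.items():
            for p in targets:
                new_auth[p] += hub[q]

        # Hub update: sum authority weights of pages that p links to.
        new_hub = {
            p: sum(new_auth[q] for q in out_links.get(p, ()))
            for p in pages
        }

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return auth, hub


authority, hubs = hits(BASE)
print(max(authority, key=authority.get), max(hubs, key=hubs.get))
```

A fixed number of rounds keeps the sketch short; a convergence check on the change in weights between rounds would be the more usual stopping rule.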
Problems With HITS Algorithm
Some problems with the HITS algorithm are:
Mutually reinforced relationships between hosts
Automatically generated links
Non-relevant nodes
Hubs and authorities
Topic drift
Efficiency
6. PageRank Model
A link analysis algorithm.
Assigns each web page a numeric value that measures its importance.
Computes importance from the number of incoming links and the rank of the pages they come from.
PageRank Contd…
The rank of a page is divided evenly among its out-links to contribute to the ranks of the pages they point to.
PageRanks form a probability distribution over web pages, so the sum of all pages' PageRanks will be one.
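PageRank Contd…
PageRank can be calculated by:
PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
T1..Tn are the pages that point to page A.
C(A) is defined as the number of links going out of page A.
d is the damping factor, which is usually set to 0.85.
In this form the PageRanks sum to N, the number of pages, when every page has at least one out-link; replacing (1-d) with (1-d)/N makes them sum to one.
The damping factor d is the probability that, at each page, a random surfer keeps following links; with probability 1-d the surfer gets bored and requests another random page.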
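The PageRank formula can be evaluated by repeated substitution. The sketch below applies PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) to a small hypothetical graph; the graph, the starting value of 1.0 per page, and the fixed iteration count are assumptions for illustration, and pages without out-links are simply left without contribution.

```python
# Hypothetical link graph: page -> pages it links to.
WEB = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def pagerank(out_links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    pr = {p: 1.0 for p in pages}  # assumed starting value

    for _ in range(iterations):
        new_pr = {p: 1.0 - d for p in pages}
        for t, targets in out_links.items():
            if targets:
                # Page t splits its rank evenly among its out-links.
                share = d * pr[t] / len(targets)
                for a in targets:
                    new_pr[a] += share
        pr = new_pr

    return pr


ranks = pagerank(WEB)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C collects rank from both A and B, so it ends up ranked highest, while B, which receives only half of A's rank, ends up lowest. Because the formula uses (1-d) rather than (1-d)/N, the three ranks sum to 3, not 1.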
Applications
HITS was used in the Clever search engine by IBM.
PageRank is used by Google.
References
Knowledge Discovery and Retrieval on World Wide Web Using Web Structure Mining by Sekhar Babu Boddu, V.P Krishna Anne, Rajesekhara Rao Kurra and Durgesh Kumar Mishra, 2010, In proceedings of Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation (AMS), IEEE
Link Mining: A New Data Mining Challenge by Lise Getoor, 2003, SIGKDD Explorations, Volume 4, Issue 2
Authoritative Sources in a Hyperlinked Environment by Jon M. Kleinberg, 1998, In proceedings of ACM-SIAM Symposium on Discrete Algorithms
The PageRank Citation Ranking: Bringing Order to the Web by L. Page, S. Brin, R. Motwani and T. Winograd, 1998, Technical report, Stanford University
wikipedia.org
web-datamining.net
maya.cs.depaul.edu