web mining
![Page 1: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/1.jpg)
Presented by: Akshat Saxena, Anjul Sahu
![Page 2: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/2.jpg)
Definition
Application of data mining techniques to the web to discover interesting patterns.
![Page 3: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/3.jpg)
Introduction
- The web is extremely large
- Data on the web is unstructured
- Good scope for data mining
- Types of data on the web:
  ◦ Content of the actual web page
  ◦ Intrapage structure
  ◦ Interpage structure
  ◦ Usage data
  ◦ User profiles and cookies
![Page 4: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/4.jpg)
Web Mining Taxonomy
![Page 5: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/5.jpg)
Web Content Mining
- Extends the work of search engines
- Improves on the traditional crawler technique
- Uses data mining for efficiency, effectiveness and scalability
- Further divided into:
  ◦ Agent-based approach
  ◦ Database-based approach
- Text mining is/isn't content mining
- Crawlers
- Personalization
![Page 6: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/6.jpg)
Web Content Mining Subtasks
- Resource finding: retrieving intended documents
- Information selection/pre-processing: select and pre-process specific information from the selected documents
- Generalization: discover general patterns within and across web sites
- Analysis: validation and/or interpretation of mined patterns
![Page 7: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/7.jpg)
Text Mining
![Page 8: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/8.jpg)
Web Crawler
- A program that browses the WWW in a methodical, automated manner
- Copies pages into a cache and indexes them
- Starts from a seed URL
- Searches for and finds links and keywords
- Types of crawler:
  ◦ Context focused
  ◦ Focused
  ◦ Incremental
  ◦ Periodic
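
The crawl loop described above can be sketched as a breadth-first traversal from a seed URL. This is a minimal illustration, not a production crawler: the `TOY_WEB` dictionary, the `crawl` function and the `fetch_links` callback are hypothetical names, and an in-memory link table stands in for real HTTP fetching, caching and indexing.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
TOY_WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl from `seed`; returns the pages in visit order."""
    visited = []
    seen = {seed}              # URLs already queued, to avoid revisiting
    frontier = deque([seed])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)    # a real crawler would cache and index the page here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl("a.example/", lambda url: TOY_WEB.get(url, []))
print(order)
```

An incremental or periodic crawler would differ mainly in when and which pages are revisited, which this sketch does not model.
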
![Page 9: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/9.jpg)
Focused Crawler
![Page 10: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/10.jpg)
Focused Crawler
- Visits only pages of interest
- Architecture consists of:
  ◦ Hyperlink classifier
  ◦ Distiller
  ◦ Crawler
- Hub pages: links to relevant pages
- Hard focus: parent node relevant
- Soft focus: probability of relevance
- Harvest rate: precision rate
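
To make soft focus and harvest rate concrete, here is a small sketch under stated assumptions. The `PAGES` table, its relevance scores and the 0.5 threshold are invented for illustration; in a real focused crawler the score would come from the classifier. Links are prioritized by the relevance of the page they were found on, and harvest rate is computed as the fraction of fetched pages that are relevant.

```python
import heapq

# Hypothetical toy data: page -> (relevance score in [0, 1], outgoing links).
PAGES = {
    "seed": (0.9, ["hub", "off1"]),
    "hub": (0.8, ["t1", "t2"]),
    "off1": (0.1, ["off2"]),
    "off2": (0.1, []),
    "t1": (0.9, []),
    "t2": (0.7, []),
}


def focused_crawl(seed, budget, threshold=0.5):
    """Soft-focus crawl: expand the frontier URL whose parent scored highest."""
    frontier = [(-1.0, seed)]   # max-heap emulated with negated priorities
    seen = {seed}
    fetched = []
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        score, links = PAGES[url]          # stand-in for fetch + classifier
        fetched.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    relevant = [u for u in fetched if PAGES[u][0] >= threshold]
    harvest_rate = len(relevant) / len(fetched)
    return fetched, harvest_rate


fetched, rate = focused_crawl("seed", budget=4)
print(fetched, rate)
```

A hard-focus variant would instead discard the links of any page whose score falls below the threshold, rather than merely lowering their priority.
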
![Page 11: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/11.jpg)
Context Focused Crawler
- The focused crawler was static
- Drawbacks:
  ◦ Non-relevant pages may link to relevant ones; these links need to be followed
  ◦ Relevant pages may not link to other relevant ones; this calls for backward crawling
- CFC works in two steps:
  ◦ Construct context graphs and classifiers
  ◦ Crawl using these classifiers
![Page 12: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/12.jpg)
Harvest System
- Uses caching, indexing and crawling
- Acts as a tool for gathering information from other sources
- Components:
  ◦ Gatherer: obtains information
  ◦ Broker: provides index and query interface
- Essence system
- Semantic indexing
![Page 13: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/13.jpg)
Virtual Web View
- The web as a multiple layer database (MLDB)
- A view of the MLDB is a virtual web view
- No spiders used
- Websites send their indices to others
- WebML: DMQL for web mining
- Keywords: covers, covered by, like, close to
- Difficult to implement
![Page 14: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/14.jpg)
Personalization
- Contents of the web are modified as per the user's desires
- Personalized, not targeted
- Uses cookies, user ID, profile information
- Legal issues to be considered
- Includes clustering, classification or even prediction
![Page 15: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/15.jpg)
Personalization
- Types:
  ◦ User preference
  ◦ Collaborative filtering
  ◦ Content-based filtering
- Example: My Yahoo! was first. Now almost every service offers personalization.
![Page 16: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/16.jpg)
Personalization
Yahoo was the first to introduce the concept of a 'personalized portal', i.e. a web site designed to have its look-and-feel as well as its content personalized to the needs of an individual end-user.
Mining MyYahoo usage logs provides Yahoo valuable insight into an individual’s Web usage habits, enabling Yahoo to provide compelling personalized content, which in turn has led to the tremendous popularity of the Yahoo Web site.
![Page 17: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/17.jpg)
Web Structure Mining
- Creating a model of web organization
- Classify web pages
- Create similarity measures between web pages
- PageRank
- The Clever system
- Hyperlink-Induced Topic Search (HITS)
![Page 18: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/18.jpg)
PageRank™
Link analysis algorithm which assigns numerical weight to a webpage.
The numerical weight that it assigns to any given element E is also called the PageRank of E and denoted by PR(E).
The PageRank value for a page u depends on the PageRank value of each page v in the set B_u (the set of all pages linking to page u), divided by the number L(v) of links from page v.
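
Written out, the dependence above is PR(u) = Σ PR(v) / L(v), summed over every page v in B_u. The sketch below iterates exactly this simplified form on a toy graph. The `pagerank` function and the three-page `graph` are hypothetical, and the sketch assumes every page has at least one outgoing link. Practical PageRank implementations also add a damping factor, which is left out here because the slide does not mention it.

```python
def pagerank(links, iterations=50):
    """Iterate PR(u) = sum over v in B_u of PR(v) / L(v).

    `links` maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}   # uniform starting weights
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for v, outs in links.items():
            share = pr[v] / len(outs)           # PR(v) / L(v)
            for u in outs:
                nxt[u] += share                 # v belongs to B_u
        pr = nxt
    return pr


# Hypothetical toy graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(ranks)
```

On this graph the values settle near 0.4 for A and C and 0.2 for B: C collects weight from both A and B, and passes all of it back to A.
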
![Page 19: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/19.jpg)
PageRank
- Increases the effectiveness of search engines
- Based on the number of backlinks
- Rank sink problem exists
![Page 20: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/20.jpg)
Clever System
- Finds both authoritative pages and hubs
- Authoritative: best source
- Hub: links to authoritative pages
- Most valued page returned
- Hyperlink Induced Topic Search:
  ◦ Keywords
  ◦ Authority and hub measures
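
The authority and hub measures can be illustrated with a short iterative sketch. It follows the usual mutual-reinforcement idea: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to. The `hits` function, the toy `graph` and the fixed iteration count are assumptions for illustration, and the keyword-based selection of the starting page set is not modeled.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) score dictionaries for a link graph."""
    pages = set(links) | {dst for outs in links.values() for dst in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page.
        auth = {p: 0.0 for p in pages}
        for src, outs in links.items():
            for dst in outs:
                auth[dst] += hub[src]
        # Hub: sum of authority scores of the pages this page links to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: h1 and h2 act as hubs, a1 and a2 as authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```
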
![Page 21: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/21.jpg)
Alternatives to PageRank
- HITS algorithm
- IBM Clever project
- TrustRank
- But PageRank is the most popular and widely used algorithm among search engines
![Page 22: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/22.jpg)
Web Usage Mining
- Applies mining to web usage data, web logs or clickstream data
- Client perspective
- Server perspective
- Aids in personalization
- Helps in evaluating quality and effectiveness
- Preprocessing, pattern discovery and data structures
![Page 23: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/23.jpg)
Trackers for site usage and analysis
![Page 24: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/24.jpg)
![Page 25: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/25.jpg)
Issues in Web Log
Identify exact user
Exact sequence of pages visited
Security, privacy and legal issues
![Page 26: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/26.jpg)
Preprocessing
- Information is not in a presentable format
- Data cleaning required
- Log: (<src id>, <literal>, <timestamp>)
- Data might be grouped
- Sessions
- Path completion
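
Grouping log entries into sessions might look like the sketch below. It reads each log triple as (source id, requested page, timestamp in seconds), which is an interpretation of the literal field rather than something the slide states. The `sessionize` function, the sample log and the 30-minute inactivity timeout are likewise assumptions; the timeout is a common heuristic, not a fixed rule.

```python
from collections import defaultdict


def sessionize(log, timeout=1800):
    """Split (source_id, page, timestamp) entries into per-source sessions.

    A new session starts when the gap between two consecutive requests
    from the same source exceeds `timeout` seconds.
    """
    by_source = defaultdict(list)
    for src, page, ts in log:
        by_source[src].append((ts, page))
    sessions = []
    for src, hits in by_source.items():
        hits.sort()                       # order each source's requests by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((src, current))
                current = []
            current.append(page)
        sessions.append((src, current))
    return sessions


# Hypothetical cleaned log entries.
log = [
    ("10.0.0.1", "/home", 0),
    ("10.0.0.2", "/home", 10),
    ("10.0.0.1", "/products", 60),
    ("10.0.0.1", "/home", 4000),
]
sessions = sessionize(log)
print(sessions)
```

Path completion, which fills in pages missing from the log because of caching, is not attempted here.
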
![Page 27: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/27.jpg)
Data Structure
- A data structure (DS) is needed to keep track of the patterns identified
- The DS used is a trie
- A rooted tree where each path from the root to a node represents a sequence
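
A minimal sketch of such a trie follows. Each node stores how many inserted page sequences pass through it, so the count at the end of a path gives how often that sequence occurs as a prefix. The class names `TrieNode` and `PatternTrie`, the `support` method and the sample sequences are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # page -> child TrieNode
        self.count = 0       # number of inserted sequences passing through


class PatternTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, sequence):
        """Add one page sequence, incrementing counts along its path."""
        node = self.root
        for page in sequence:
            node = node.children.setdefault(page, TrieNode())
            node.count += 1

    def support(self, prefix):
        """Return how many inserted sequences start with `prefix`."""
        node = self.root
        for page in prefix:
            node = node.children.get(page)
            if node is None:
                return 0
        return node.count


trie = PatternTrie()
for seq in (["A", "B", "C"], ["A", "B", "D"], ["B", "C"]):
    trie.insert(seq)
print(trie.support(["A", "B"]))
```

Sequences that share a prefix share the nodes for it, which is what keeps the structure compact when many sessions begin the same way.
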
![Page 28: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/28.jpg)
Pattern Discovery
- Traversal pattern: pages visited in a session
- Properties:
  ◦ Duplicate references may or may not be allowed
  ◦ Consists of only contiguous page references
  ◦ Pattern may or may not be maximal
- Association rules: pages accessed together
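
As an illustration of association rules over pages accessed together, the sketch below counts page pairs per session and reports each rule's support and confidence. It is restricted to pairs for brevity; the `page_pair_rules` function, the sample sessions and the 0.5 minimum support are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations


def page_pair_rules(sessions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for page pairs."""
    n = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = sorted(set(session))          # ignore order and duplicates
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))
    rules = []
    for (a, b), together in pair_counts.items():
        support = together / n                # share of sessions with both pages
        if support >= min_support:
            rules.append((a, b, support, together / page_counts[a]))
            rules.append((b, a, support, together / page_counts[b]))
    return rules


# Hypothetical sessions, each a list of pages visited.
sessions = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "D"]]
rules = page_pair_rules(sessions)
for rule in rules:
    print(rule)
```

Unlike traversal patterns, these rules ignore the order in which pages were visited; only co-occurrence within a session matters.
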
![Page 29: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/29.jpg)
Pattern Discovery
- Sequential pattern: an ordered set that satisfies a support threshold and is maximal
- Similar to the Apriori algorithm
- Web access pattern: efficient counting
- Episodes: partially ordered by access time; users not identified
- Pattern analysis
![Page 30: Web Mining](https://reader035.vdocuments.us/reader035/viewer/2022062614/546bb23caf7959cf258b4a43/html5/thumbnails/30.jpg)
Queries ‘N Suggestions
References:
- http://maya.cs.depaul.edu/~mobasher/webminer/survey/
- Google.com/Technology
- http://www.almaden.ibm.com/projects/clever.shtml
Thanks !! {akshatsaxena11, anjulsahu}@gmail.com