![Page 1: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/1.jpg)
Web Mining : A Bird’s Eye View
Sanjay Kumar MadriaDepartment of Computer Science
University of Missouri-Rolla, MO [email protected]
![Page 2: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/2.jpg)
Web Mining
Web mining - data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996).
Web mining research – integrate research from several research communities (Kosala and Blockeel, July 2000) such as: Database (DB) Information retrieval (IR) The sub-areas of machine learning (ML) Natural language processing (NLP)
![Page 3: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/3.jpg)
WWW is huge, widely distributed, global information source for Information services:
news, advertisements, consumer information, financial management, education, government, e-
commerce, etc. Hyper-link information Access and usage information Web Site contents and Organization
Mining the World-Wide Web
![Page 4: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/4.jpg)
Mining the World-Wide Web
Growing and changing very rapidly Broad diversity of user communities
Only a small portion of the information on the Web is truly relevant or useful to Web users
How to find high-quality Web pages on a specified topic?
WWW provides rich sources for data mining
![Page 5: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/5.jpg)
Challenges on WWW Interactions
Finding Relevant Information Creating knowledge from Information
available Personalization of the information Learning about customers / individual
users
Web Mining can play an important Role!
![Page 6: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/6.jpg)
Web Mining: more challenging
Searches for Web access patterns Web structures Regularity and dynamics of Web contents
Problems The “abundance” problem Limited coverage of the Web: hidden Web sources,
majority of data in DBMS Limited query interface based on keyword-oriented
search Limited customization to individual users Dynamic and semistructured
![Page 7: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/7.jpg)
Web Mining : Subtasks
Resource Finding Task of retrieving intended web-documents
Information Selection & Pre-processing Automatic selection and pre-processing specific
information from retrieved web resources
Generalization Automatic Discovery of patterns in web sites
Analysis Validation and / or interpretation of mined patterns
![Page 8: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/8.jpg)
Web Mining Taxonomy
Web Mining
Web Content Mining
Web Usage Mining
Web Structure
Mining
![Page 9: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/9.jpg)
Web Content Mining
Discovery of useful information from web contents / data / documents Web data contents:
text, image, audio, video, metadata and hyperlinks. Information Retrieval View ( Structured + Semi-
Structured) Assist / Improve information finding Filtering Information to users on user profiles
Database View Model Data on the web Integrate them for more sophisticated queries
![Page 10: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/10.jpg)
Issues in Web Content Mining
Developing intelligent tools for IR - Finding keywords and key phrases - Discovering grammatical rules and collocations
- Hypertext classification/categorization - Extracting key phrases from text documents
- Learning extraction models/rules - Hierarchical clustering - Predicting (words) relationship
![Page 11: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/11.jpg)
Cont….
Developing Web query systems WebOQL, XML-QL
Mining multimedia data Mining image from satellite (Fayyad, et al. 1996) Mining image to identify small volcanoes on Venus
(Smyth, et al 1996) .
![Page 12: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/12.jpg)
Web Structure Mining
To discover the link structure of the hyperlinks at the inter-document level to generate structural summary about the Website and Web page.
Direction 1: based on the hyperlinks, categorizing the Web pages and generated information.
Direction 2: discovering the structure of Web document itself.
Direction 3: discovering the nature of the hierarchy or network of hyperlinks in the Website of a particular domain.
![Page 13: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/13.jpg)
Web Structure Mining
Finding authoritative Web pages Retrieving pages that are not only relevant,
but also of high quality, or authoritative on the topic
Hyperlinks can infer the notion of authority The Web consists not only of pages, but also
of hyperlinks pointing from one page to another
These hyperlinks contain an enormous amount of latent human annotation
A hyperlink pointing to another Web page, this can be considered as the author's endorsement of the other page
![Page 14: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/14.jpg)
Web Structure Mining
Web pages categorization (Chakrabarti, et al., 1998)
Discovering micro communities on the web
- Example: Clever system (Chakrabarti, et al., 1999), Google (Brin and Page, 1998)
Schema Discovery in Semistructured Environment
![Page 15: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/15.jpg)
Web Usage Mining
Web usage mining also known as Web log mining mining techniques to discover interesting
usage patterns from the secondary data derived from the interactions of the users while surfing the web
![Page 16: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/16.jpg)
Web Usage Mining Applications
Target potential customers for electronic commerce
Enhance the quality and delivery of Internet information services to the end user
Improve Web server system performance Identify potential prime advertisement
locations Facilitates personalization/adaptive sites Improve site design Fraud/intrusion detection Predict user’s actions (allows prefetching)
![Page 17: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/17.jpg)
![Page 18: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/18.jpg)
Problems with Web Logs
Identifying users – Clients may have multiple streams
– Clients may access web from multiple hosts– Proxy servers: many clients/one address– Proxy servers: one client/many addresses
Data not in log– POST data (i.e., CGI request) not recorded– Cookie data stored elsewhere
![Page 19: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/19.jpg)
Cont…
Missing data Pages may be cached Referring page requires client cooperation When does a session end? Use of forward and backward pointers
Typically a 30 minute timeout is used Web content may be dynamic
May not be able to reconstruct what the user saw
Use of spiders and automated agents – automatic request we pages
![Page 20: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/20.jpg)
Cont…
Like most data mining tasks, web log mining requires preprocessing To identify users To match sessions to other data To fill in missing data Essentially, to reconstruct the click stream
![Page 21: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/21.jpg)
Log Data - Simple Analysis
Statistical analysis of users Length of path Viewing time Number of page views
Statistical analysis of site Most common pages viewed Most common invalid URL
![Page 22: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/22.jpg)
Web Log – Data Mining Applications
Association rules Find pages that are often viewed together
Clustering Cluster users based on browsing patterns Cluster pages based on content
Classification Relate user attributes to patterns
![Page 23: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/23.jpg)
Web Logs
Web servers have the ability to log all requests
Web server log formats: Most use the Common Log Format (CLF) New, Extended Log Format allows configuration
of log file
Generate vast amounts of data
![Page 24: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/24.jpg)
Remotehost: browser hostname or IP # Remote log name of user
(almost always "-" meaning "unknown")
Authuser: authenticated username Date: Date and time of the request "request”: exact request lines from
client Status: The HTTP status code returned Bytes: The content-length of response
Common Log Format
![Page 25: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/25.jpg)
Server Logs
![Page 26: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/26.jpg)
Fields
Client IP: 128.101.228.20 Authenticated User ID: - - Time/Date: [10/Nov/1999:10:16:39 -0600] Request: "GET / HTTP/1.0" Status: 200 Bytes: - Referrer: “-” Agent: "Mozilla/4.61 [en] (WinNT; I)"
![Page 27: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/27.jpg)
Web Usage Mining Commonly used approaches (Borges and
Levene, 1999) - Maps the log data into relational tables before an adapted data mining technique is performed. - Uses the log data directly by utilizing special pre-processing techniques.
Typical problems - Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers (McCallum, et al., 2000; Srivastava, et al., 2000).
![Page 28: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/28.jpg)
Request
Method: GET
– Other common methods are POST and HEAD URI: / – This is the file that is being accessed. When a
directory is specified, it is up to the Server to
decide what to return. Usually, it will be the file
named “index.html” or “home.html” Protocol: HTTP/1.0
![Page 29: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/29.jpg)
Status
Status codes are defined by the HTTP
protocol. Common codes include:
– 200: OK
– 3xx: Some sort of Redirection
– 4xx: Some sort of Client Error
– 5xx: Some sort of Server Error
![Page 30: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/30.jpg)
![Page 31: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/31.jpg)
Web Mining
Web Structure
Mining
Web ContentMining
Search ResultMining
Web PageContent Mining
General AccessPattern
Tracking
CustomizedUsage
Tracking
Web UsageMining
Web Mining Taxonomy
![Page 32: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/32.jpg)
Web Mining
Web StructureMining
Web ContentMining
Web Page Content MiningWeb Page Summarization WebOQL(Mendelzon et.al. 1998) …:Web Structuring query languages; Can identify information within given web pages •(Etzioni et.al. 1997):Uses heuristics to distinguish personal home pages from other web pages•ShopBot (Etzioni et.al. 1997): Looks for product prices within web pages
Search ResultMining
Web UsageMining
General AccessPattern Tracking
CustomizedUsage Tracking
Mining the World Wide Web
![Page 33: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/33.jpg)
Web Mining
Mining the World Wide Web
Web UsageMining
General AccessPattern Tracking
CustomizedUsage Tracking
Web StructureMining
Web ContentMining
Web PageContent Mining Search Result Mining
Search Engine Result Summarization•Clustering Search Result (Leouski and Croft, 1996, Zamir and Etzioni, 1997): Categorizes documents using phrases in titles and snippets
![Page 34: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/34.jpg)
Web Mining
Web ContentMining
Web PageContent Mining
Search ResultMining
Web UsageMining
General AccessPattern Tracking
CustomizedUsage Tracking
Mining the World Wide Web
Web Structure Mining Using Links•PageRank (Brin et al., 1998)•CLEVER (Chakrabarti et al., 1998)Use interconnections between web pages to give weight to pages.
Using Generalization•MLDB (1994)Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.
![Page 35: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/35.jpg)
Web Mining
Web StructureMining
Web ContentMining
Web PageContent Mining
Search ResultMining
Web UsageMining
General Access Pattern Tracking
•Web Log Mining (Zaïane, Xin and Han, 1998)Uses KDD techniques to understand general access patterns and trends.Can shed light on better structure and grouping of resource providers.
CustomizedUsage Tracking
Mining the World Wide Web
![Page 36: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/36.jpg)
Web Mining
Web UsageMining
General AccessPattern Tracking
Customized Usage Tracking
•Adaptive Sites (Perkowitz and Etzioni, 1997)Analyzes access patterns of each user at a time.Web site restructures itself automatically by learning from user access patterns.
Mining the World Wide Web
Web StructureMining
Web ContentMining
Web PageContent Mining
Search ResultMining
![Page 37: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/37.jpg)
Web Content Mining
Agent-based Approaches: Intelligent Search Agents Information Filtering/Categorization Personalized Web Agents
Database Approaches: Multilevel Databases Web Query Systems
![Page 38: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/38.jpg)
Intelligent Search Agents
Locating documents and services on the Web: WebCrawler, Alta Vista
(http://www.altavista.com): scan millions of Web documents and create index of words (too many irrelevant, outdated responses)
MetaCrawler: mines robot-created indices
Retrieve product information from a variety of vendor sites using only general information about the product domain: ShopBot
![Page 39: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/39.jpg)
Intelligent Search Agents (Cont’d)
Rely either on pre-specified domain information about particular types of documents, or on hard coded models of the information sources to retrieve and interpret documents: Harvest FAQ-Finder Information Manifold OCCAM Parasite
Learn models of various information sources and translates these into its own concept hierarchy: ILA (Internet Learning Agent)
![Page 40: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/40.jpg)
Information Filtering/Categorization
Using various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. HyPursuit: uses semantic information embedded in link
structures and document content to create cluster hierarchies of hypertext documents, and structure an information space
BO (Bookmark Organizer): combines hierarchical clustering techniques and user interaction to organize a collection of Web documents based on conceptual information
![Page 41: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/41.jpg)
Personalized Web Agents
This category of Web agents learn user preferences and discover Web information sources based on these preferences, and those of other individuals with similar interests (using collaborative filtering) WebWatcher PAINT Syskill&Webert GroupLens Firefly others
![Page 42: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/42.jpg)
Multiple Layered Web Architecture
Generalized Descriptions
More Generalized Descriptions
Layer0
Layer1
Layern
...
![Page 43: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/43.jpg)
Multilevel Databases
At the higher levels, meta data or generalizations are extracted from lower levels organized in structured collections, i.e. relational
or object-oriented database.
At the lowest level, semi-structured information are stored in various Web repositories, such as
hypertext documents
![Page 44: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/44.jpg)
Multilevel Databases (Cont’d)
(Han, et. al.): use a multi-layered database where each layer
is obtained via generalization and transformation operations performed on the lower layers
(Kholsa, et. al.): propose the creation and maintenance of meta-
databases at each information providing domain and the use of a global schema for the meta-database
![Page 45: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/45.jpg)
Multilevel Databases (Cont’d)
(King, et. al.): propose the incremental integration of a portion
of the schema from each information source, rather than relying on a global heterogeneous database schema
The ARANEUS system: extracts relevant information from hypertext
documents and integrates these into higher-level derived Web Hypertexts which are generalizations of the notion of database views
![Page 46: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/46.jpg)
Multi-Layered Database (MLDB) A multiple layered database model
based on semi-structured data hypothesis queried by NetQL using a syntax similar to the relational
language SQL Layer-0:
An unstructured, massive, primitive, diverse global information-base.
Layer-1: A relatively structured, descriptor-like, massive,
distributed database by data analysis, transformation and generalization techniques.
Tools to be developed for descriptor extraction. Higher-layers:
Further generalization to form progressively smaller, better structured, and less remote databases for efficient browsing, retrieval, and information discovery.
![Page 47: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/47.jpg)
Three major components in MLDB
S (a database schema): outlines the overall database structure of the global MLDB presents a route map for data and meta-data (i.e., schema)
browsing describes how the generalization is performed
H (a set of concept hierarchies): provides a set of concept hierarchies which assist the system
to generalize lower layer information to high layeres and map queries to appropriate concept layers for processing
D (a set of database relations): the whole global information base at the primitive
information level (i.e., layer-0) the generalized database relations at the nonprimitive layers
![Page 48: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/48.jpg)
The General architecture of WebLogMiner(a Global MLDB)
Site 1
Site 2
Site 3
Generalized Data
Concept Hierarchies
Higher layers
Resource Discovery(MLDB)
Knowledge Discovery (WLM)Characteristic RulesDiscriminant RulesAssociation Rules
![Page 49: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/49.jpg)
Techniques for Web usage mining
Construct multidimensional view on the Weblog database Perform multidimensional OLAP analysis to find the top N
users, top N accessed Web pages, most frequently accessed time periods, etc.
Perform data mining on Weblog records Find association patterns, sequential patterns, and trends
of Web accessing May need additional information,e.g., user browsing
sequences of the Web pages in the Web server buffer Conduct studies to
Analyze system performance, improve system design by Web caching, Web page prefetching, and Web page swapping
![Page 50: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/50.jpg)
Web Usage Mining - Phases
Three distinctive phases: preprocessing, pattern discovery, and pattern analysis
Preprocessing - process to convert the raw data into the data abstraction necessary for the further applying the data mining algorithm
Resources: server-side, client-side, proxy servers, or database.
Raw data: Web usage logs, Web page descriptions, Web site topology, user registries, and questionnaire.
Conversion: Content converting, Structure converting, Usage converting
![Page 51: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/51.jpg)
User: The principal using a client to interactively retrieve and render resources or resource manifestations.
Page view: Visual rendering of a Web page in a specific client environment at a specific point of time
Click stream: a sequential series of page view request
User session: a delimited set of user clicks (click stream) across one or more Web servers.
Server session (visit): a collection of user clicks to a single Web server during a user session.
Episode: a subset of related user clicks that occur within a user session.
![Page 52: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/52.jpg)
Content Preprocessing - the process of converting text, image, scripts and other files into the forms that can be used by the usage mining.
Structure Preprocessing - The structure of a Website is formed by the hyperlinks between page views, the structure preprocessing can be done by parsing and reformatting the information.
Usage Preprocessing - the most difficult task in the usage mining processes, the data cleaning techniques to eliminate the impact of the irrelevant items to the analysis result.
![Page 53: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/53.jpg)
Pattern Discovery Pattern Discovery is the key component of
the Web mining, which converges the algorithms
and techniques from data mining, machine learning, statistics and pattern recognition etc research categories.
Separate subsections: statistical analysis, association rules, clustering, classification, sequential pattern, dependency Modeling.
![Page 54: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/54.jpg)
Statistical Analysis - the analysts may perform different kinds of descriptive statistical analyses based on different variables when analyzing the session file ; powerful tools in extracting knowledge about visitors to a Web site.
![Page 55: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/55.jpg)
Association Rules - refers to sets of pages that are accessed together with a support value exceeding some specified threshold.
Clustering: a technique to group together users or data items (pages) with the similar characteristics. It can facilitate the development and
execution of future marketing strategies. Classification: the technique to map a data
item into one of several predefined classes, which help to establish a profile of users belonging to a particular class or category.
![Page 56: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/56.jpg)
Pattern Analysis
Pattern Analysis - final stage of the Web usage mining.
To eliminate the irrelative rules or patterns and to extract the interesting rules or patterns from the output of the pattern discovery process.
Analysis methodologies and tools: query mechanism like SQL, OLAP, visualization
etc.
![Page 57: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/57.jpg)
![Page 58: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/58.jpg)
WUM – Pre-Processing Data Cleaning
Removes log entries that are not needed for the mining process
Data Integration Synchronize data from multiple server logs, metadata
User Identification Associates page references with different users
Session/Episode Identification Groups user’s page references into user sessions
Page View IdentificationPath Completion
Fills in page references missing due to browser and proxy caching
![Page 59: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/59.jpg)
WUM – Issues in User Session Identification
A single IP address is used by many users
Different IP addresses in a single session
Missing cache hits in the server logs
different usersdifferent users Proxy Proxy serverserver Web serverWeb server
ISP serverISP server Web serverWeb serverSingle userSingle user
![Page 60: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/60.jpg)
User and Session Identification Issues
Distinguish among different users to a site Reconstruct the activities of the users within
the site Proxy servers and anonymizers Rotating IP addresses connections through
ISPs Missing references due to caching Inability of servers to distinguish among
different visits
![Page 61: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/61.jpg)
WUM – Solutions
Remote AgentA remote agent is implemented in Java Applet
It is loaded into the client only once when the first page is accessed
The subsequent requests are captured and send back to the server
Modified Browser The source code of the existing browser can be modified to gain user
specific data at the client side
Dynamic page rewritingWhen the user first submit the request, the server returns the requested page rewritten to include a session specific ID
Each subsequent request will supply this ID to the server
HeuristicsUse a set of assumptions to identify user sessions and find the missing cache hits in the server log
![Page 62: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/62.jpg)
![Page 63: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/63.jpg)
WUM – Heuristics
The session identification heuristicsTimeout: if the time between pages requests exceeds a certain limit, it is assumed that the user is starting a new sessionIP/Agent: Each different agent type for an IP address represents a different sessionsReferring page: If the referring page file for a request is not part of an open session, it is assumed that the request is coming from a different sessionSame IP-Agent/different sessions (Closest): Assigns the request to the session that is closest to the referring page at the time of the requestSame IP-Agent/different sessions (Recent): In the case where multiple sessions are same distance from a page request, assigns the request to the session with the most recent referrer access in terms of time
![Page 64: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/64.jpg)
Cont.
The path completion heuristicsIf the referring page file of a session is not part of the previous page file of that session, the user must have accessed a cached pageThe “back” button method is used to refer a cached page Assigns a constant view time for each of the cached page file
![Page 65: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/65.jpg)
![Page 66: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/66.jpg)
![Page 67: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/67.jpg)
![Page 68: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/68.jpg)
![Page 69: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/69.jpg)
![Page 70: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/70.jpg)
WUM – Association Rule Generation
Discovers the correlations between pages that are most often referenced together in a single server session
Provide the informationWhat are the set of pages frequently accessed together by Web users?
What page will be fetched next?
What are paths frequently accessed by Web users?
Association rule
A B [ Support = 60%, Confidence = 80% ]
Example
“50% of visitors who accessed URLs /infor-f.html and labo/infos.html also visited situation.html”
![Page 71: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/71.jpg)
Associations & Correlations
Page associations from usage data– User sessions– User transactions
Page associations from content data– similarity based on content analysis
Page associations based on structure– link connectivity between pages
==> Obtain frequent itemsets
![Page 72: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/72.jpg)
Examples:
60% of clients who accessed /products/, also accessed /products/software/webminer.htm.
30% of clients who accessed /special-offer.html, placed an online order in /products/software/.
(Example from IBM official Olympics Site) {Badminton, Diving} ===> {Table Tennis}
(69.7%,.35%)
![Page 73: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/73.jpg)
WUM – Clustering
Groups together a set of items having similar characteristics
User ClustersDiscover groups of users exhibiting similar browsing patternsPage recommendation
User’s partial session is classified into a single clusterThe links contained in this cluster are recommended
![Page 74: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/74.jpg)
Cont..
Page clustersDiscover groups of pages having related content Usage based frequent pages Page recommendation
The links are presented based on how often URL references occur together across user sessions
![Page 75: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/75.jpg)
Website Usage Analysis
Why developing a Website usage / utilization
analyzation tool?
Knowledge about how visitors use Website couldKnowledge about how visitors use Website could
- Prevent disorientation and help designers place
important information/functions exactly where the
visitors look for and in the way users need it
- Build up adaptive Website server
![Page 76: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/76.jpg)
Clustering and Classification
clients who often access /products/software/webminer.html tend to be from
educational institutions.
clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States.
75% of clients who download software from
/products/software/demos/ visit between 7:00 and 11:00 pm on weekends.
![Page 77: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/77.jpg)
Website Usage Analysis
Discover user navigation patterns in using
Website
- Establish a aggregated log structure as a
preprocessor to reduce the search space before
the actual log mining phase
- Introduce a
model for Website usage pattern discovery by
extending the classical mining model, and
establish the processing framework of this model
![Page 78: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/78.jpg)
Sequential Patterns & Clusters
30% of clients who visited /products/software/, had done a search in Yahoo using the keyword “software” before their visit
60% of clients who placed an online order for WEBMINER, placed another online order for software within 15 days
![Page 79: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/79.jpg)
Website Usage Analysis
Website client-server architecture facilitates
recording user behaviors in every steps by
- submit client-side
log files to server when users use clear functions or
exit window/modules
The special design for local and universal
back/forward/clear functions makes user’s
navigation pattern more clear for designer by
- analyzing local back/forward history and incorporate
it with universal back/forward history
![Page 80: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/80.jpg)
Website Usage Analysis
What will be included in SUA
1. Identify and collect log data
2. Transfer the data to server-side and save them in a
structure desired for analysis
3. Prepare mined data by establishing a customized
aggregated log tree/frame
4. Use modifications of the typical data mining
methods, particularly an extension of a traditional
sequence discovery algorithm, to mine user
navigation patterns
![Page 81: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/81.jpg)
Website Usage Analysis
Problem need to be considered:
- How to identify the log data when a user go through
uninteresting function/module
- What marks the end of a user session?
- Client connect Website through proxy servers
Differences in Website usage analysis with common Web usage mining
- Client-side log files available
- Log file’s format (Web log files follow Common Log Format specified as a part of HTTP protocol)
- Not necessary for log file cleaning/filtering (which usually performed in preprocess of Web log mining)
![Page 82: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/82.jpg)
Web Usage Mining - Patterns Discovery Algorithms
(Chen et. al.) Design algorithms for Path Traversal Patterns, finding maximal forward references and large reference sequences.
![Page 83: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/83.jpg)
Path Traversal Patterns
Procedure for mining traversal patterns: (Step 1) Determine maximal forward
references from the original log data (Algorithm MF)
(Step 2) Determine large reference sequences (i.e., Lk, k1) from the set of maximal forward references (Algorithm FS and SS)
(Step 3) Determine maximal reference sequences from large reference sequences
Focus on Step 1 and 2, and devise algorithms for the efficient determination of large reference sequences
![Page 84: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/84.jpg)
Determine large reference sequeces
Algorithm FS: Utilizes the key ideas of algorithm DHP: employs hashing and pruning techniques DHP is very efficient for the generation of candidate
itemsets, in particular for the large two-itemsets, thus greatly improving the performance bottleneck of the whole process
Algorithm SS: employs hashing and pruning techniques to reduce both
CPU and I/O costs by properly utilizing the information in candidate
references in prior passes, is able to avoid database scans in some passes, thus further reducing the disk I/O cost
![Page 85: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/85.jpg)
Patterns Analysis Tools
WebViz [pitkwa94] --- provides appropriate tools and techniques to understand, visualize, and interpret access patterns.
Proposes OLAP techniques such as data cubes for the purpose of simplifying the analysis of usage statistics from server access logs. [dyreua et al]
![Page 86: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/86.jpg)
Patterns Discovery and Analysis Tools
The emerging tools for user pattern discovery use sophisticated techniques from AI, data mining, psychology, and information theory, to mine for knowledge from collected data: (Pirolli et. al.) use information foraging theory to
combine path traversal patterns, Web page typing, and site topology information to categorize pages for easier access by users.
![Page 87: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/87.jpg)
(Cont’d)
WEBMINER : introduces a general architecture for Web usage
mining, automatically discovering association rules and sequential patterns from server access logs.
proposes an SQL-like query mechanism for querying the discovered knowledge in the form of association rules and sequential patterns.
WebLogMiner Web log is filtered to generate a relational database Data mining on web log data cube and web log
database
![Page 88: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/88.jpg)
WEBMINER
SQL-like Query A framework for Web mining, the
applications of data mining and knowledge discovery techniques, association rules and sequential patterns, to Web data: Association rules: using apriori algorithm
40% of clients who accessed the Web page with URL /company/products/product1.html, also accessed /company/products/product2.html
Sequential patterns: using modified apriori algorithm 60% of clients who placed an online order in
/company/products/product1.html, also placed an online order in /company/products/product4.html within 15 days
![Page 89: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/89.jpg)
WebLogMiner
Database construction from server log file: data cleaning data transformation
Multi-dimensional web log data cube construction and manipulation
Data mining on web log data cube and web log database
![Page 90: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/90.jpg)
Mining the World-Wide Web
Design of a Web Log Miner Web log is filtered to generate a relational database A data cube is generated form database OLAP is used to drill-down and roll-up in the cube OLAM is used for mining interesting knowledge
G)p,q( )q(reedegout
)q(R)1(n/)p(R
1Data Cleaning
2Data CubeCreation
3OLAP
4Data Mining
Web log Database Data Cube Sliced and dicedcube
Knowledge
![Page 91: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/91.jpg)
Construction of Data Cubes(http://db.cs.sfu.ca/sections/publication/slides/slides.html)
sum
0-20K20-40K 60K- sum
Comp_Method
… ...
sum
Database
Amount
Province
Discipline
40-60KB.C.
PrairiesOntario
All AmountComp_Method, B.C.
Each dimension contains a hierarchy of values for one attributeA cube cell stores aggregate values, e.g., count, sum, max, etc.A “sum” cell stores dimension summation values.Sparse-cube technology and MOLAP/ROLAP integration.“Chunk”-based multi-way aggregation and single-pass computation.
![Page 92: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/92.jpg)
WebLogMiner Architecture
Web log is filtered to generate a relational database
A data cube is generated from database OLAP is used to drill-down and roll-up in
the cube OLAM is used for mining interesting
knowledge
G)p,q( )q(reedegout
)q(R)1(n/)p(R
1Data Cleaning
2Data CubeCreation
3OLAP
4Data Mining
Web logDatabase
Data Cube Sliced and dicedcube
Knowledge
![Page 93: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/93.jpg)
HITS Algorithm--Topic Distillation on WWW
![Page 94: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/94.jpg)
HITS Method
Hyperlink Induced Topic Search Kleinberg, 1998 A simple approach by finding hubs and authorities View web as a directed graph Assumption: if document A has hyperlink to
document B, then the author of document A thinks that document B contains valuable information
![Page 95: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/95.jpg)
Main Ideas
Concerned with the identification of the most authoritative, or definitive, Web pages on a broad-topic
Focused on only one topic
Viewing the Web as a graph
A purely link structure-based computation, ignoring the textual content
![Page 96: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/96.jpg)
HITS: Hubs and Authority
Hub: web page links to a collection of prominent sites on a common topic
Authority: Pages that link to a collection of authoritative pages on a broad topic; web page pointed to by hubs
Mutual Reinforcing Relationship: a good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities
![Page 97: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/97.jpg)
Hub-Authority Relations
Hubs Authorities
Unrelated page of large in-degree
![Page 98: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/98.jpg)
HITS: Two Main Steps
A sampling component, which constructs a focused collection of several thousand web pages likely to be rich in relevant authorities
A weight-propagation component, which determines numerical estimates of hub and authority weights by an iterative procedure
As the result, pages with highest weights are returned as hubs and authorities for the research topic
![Page 99: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/99.jpg)
HITS: Root Set and Base Set
Using query term to collect a root set (S) of pages from index-based search engine (AltaVista)
Expand root set to base set (T) by including all pages linked to by pages in root set and all pages that link to a page in root set (up to a designated size cut-off)
Typical base set contains roughly 1000-5000 pages
![Page 100: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/100.jpg)
Step 1: Constructing Subgraph
1.1 Creating a root set (S) - Given a query string on a broad topic - Collect the t highest-ranked pages for the query from a text-based search engine
1.2 Expanding to a base set (T) - Add the page pointing to a page in root set - Add the page pointed to by a page in root set
![Page 101: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/101.jpg)
Root Set and Base Set (Cont’d)
ST
S
![Page 102: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/102.jpg)
Step 2: Computing Hubs and Authorities
2.1 Associating weights
- Authority weight xp
- Hub weight yp
- Set all values to a uniform constant initially
2.2 Updating weights
![Page 103: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/103.jpg)
Updating Authority Weight
q such that qpxp = yq
P
q1
q3
q2
xp=yq1+yq2+yq3
Example
![Page 104: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/104.jpg)
Updating Hub Weight
yp = xqq such that pq
Example
yp=xq1+xq2+xq3
P
q1
q3
q2
![Page 105: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/105.jpg)
Flowchart
Updateall x-
values
Initialization Set all values to c, e.g. c =1
Updateall y-
values
Updateall y-
values
Updateall x-
values
1st time
2nd time
![Page 106: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/106.jpg)
Results
All x- and y-values converge rapidly so that termination of the iteration is guaranteed
It can be proved in mathematical approach
Pages with the highest x-values are viewed as the best authorities, while pages with the highest y-values are regarded as the best hubs
![Page 107: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/107.jpg)
Implementation
Search engine: AltaVista Root set: 200 pages Base set: 1000-5000
pages Converging speed: Very rapid, less than 20
times Running time: About 30
minutes
![Page 108: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/108.jpg)
HITS: Advantages
Weight computation is an intrinsic feature from collection of linked pages
Provides a densely linked community of related authorities and hubs
Pure link-based computation once the root set has been assembled, with no further regard to query terms
Provides surprisingly good search result for a wide range of queries
![Page 109: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/109.jpg)
Drawbacks
Limit On Narrow Topics Not enough authoritative pages Frequently returns resources for a
more general topic adding a few edges can potentially
change scores considerably
Topic Drifting- Appear when hubs discuss multiple
topics
![Page 110: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/110.jpg)
To improve precision:- Combining content with link information- Breaking large hub pages into smaller units- Computing relevance weights for pages
To improve speed: - Building a Connectivity Server that provides
linkage information for all pages
Improved Work
![Page 111: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/111.jpg)
Web Structure Mining
Page-Rank Method CLEVER Method Connectivity-Server Method
![Page 112: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/112.jpg)
1. Page-Rank Method
Introduced by Brin and Page (1998) Mine hyperlink structure of web to produce ‘global’
importance ranking of every web page Used in Google Search Engine Web search result is returned in the rank order Treats link as like academic citation Assumption: Highly linked pages are more ‘important’ than
pages with a few links A page has a high rank if the sum of the ranks of its back-
links is high
![Page 113: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/113.jpg)
Page Rank: Computation
Assume: R(u) : Rank of a web page u Fu : Set of pages which u points to
Bu : Set of pages that points to u
Nu : Number of links from u C : Normalization factor E(u) : Vector of web pages as source of rank
Page Rank Computation:
)()(
)( ucEN
vRcuR
uBv v
![Page 114: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/114.jpg)
Page Rank: Implementation
Stanford WebBase project Complete crawling and indexing system of with current repository 24 million web pages (old data)
Store each URL as unique integer and each hyperlink as integer IDs
Remove dangling links by iterative procedures Make initial assignment of the ranks Propagate page ranks in iterative manner Upon convergence, add the dangling links back and
recompute the rankings
![Page 115: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/115.jpg)
Page Rank: Results
Google utilizes a number of factors to rank the search results: proximity, anchor text, page rank
The benefits of Page Rank are the greatest for underspecified queries, example: ‘Stanford University’ query using Page Rank lists the university home page the first
![Page 116: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/116.jpg)
Page Rank: Advantages
Global ranking of all web pages – regardless of their content, based solely on their location in web graph structure
Higher quality search results – central, important, and authoritative web pages are given preference
Help find representative pages to display for a cluster center
Other applications: traffic estimation, back-link predictor, user navigation, personalized page rank
Mining structure of web graph is very useful for various information retrieval
![Page 117: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/117.jpg)
CLEVER Method
CLient–side EigenVector-Enhanced Retrieval Developed by a team of IBM researchers at IBM
Almaden Research Centre Continued refinements of HITS Ranks pages primarily by measuring links between
them Basic Principles – Authorities, Hubs
Good hubs points to good authorities Good authorities are referenced by good hubs
![Page 118: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/118.jpg)
Problems Prior to CLEVER
Textual content that is ignored leads to problems caused by some features of web: HITS returns good resources for more general topic when
query topics are narrowly-focused HITS occasionally drifts when hubs discuss multiple topics Usually pages from single Web site take over a topic and
often use same html template therefore pointing to a single popular site irrelevant to query topic
![Page 119: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/119.jpg)
CLEVER: Solution
Replacing the sums of Equation (1) and (2) of HITS with weighted sums
Assign to each link a non-negative weight Weight depends on the query term and end point Extension 1: Anchor Text
using text that surrounds hyperlink definitions (href’s) in Web pages, often referred as ‘anchor text’
boost weight enhancements of links that occur near instances of query terms
![Page 120: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/120.jpg)
CLEVER: Solution (Cont’d)
Extension 2: Mini Hub Pagelets breaking large hub into smaller units treat contiguous subsets of links as mini-hubs or
‘pagelets’ contiguous sets of links on a hub page are more
focused on single topic than the entire page
![Page 121: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/121.jpg)
CLEVER: The Process
Starts by collecting a set of pages Gathers all pages of initial link, plus any pages
linking to them Ranks result by counting links Links have noise, not clear which pages are best Recalculate scores Pages with most links are established as most
important, links transmit more weigh Repeat calculation no. of times till scores are
refined
![Page 122: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/122.jpg)
CLEVER: Advantages
Used to populate categories of different subjects with minimal human assistance
Able to leverage links to fill category with best pages on web
Can be used to compile large taxonomies of topics automatically
Emerging new directions: Hypertext classification, focused crawling, mining communities
![Page 123: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/123.jpg)
Connectivity Server Method Server that provides linkage information
for all pages indexed by a search engine In its base operation, server accepts a
query consisting of a set of one or more URLs and return a list of all pages that point to pages in (parents) and list of all pages that are pointed to from pages in (children)
In its base operation, it also provides neighbourhood graph for query set
Acts as underlying infrastructure, supports search engine applications
![Page 124: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/124.jpg)
What’s Connectivity Server (Cont’d)
Neighborhood GraphNeighborhood Graph
![Page 125: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/125.jpg)
CONSERV: Web Structure Mining
Finding Authoritative Pages (Search by topic)(pages that is high in quality and relevant to the
topic)
Finding Related Pages (Search by URL)(pages that address same topic as the original
page, not necessarily semantically identical)Algorithms include Companion, Cocitation
![Page 126: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/126.jpg)
CONSERV: Finding Related Page
![Page 127: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/127.jpg)
CONSERV: Companion Algorithm
An extension to HITS algorithmFeatures: Exploit not only links but also their order on a page Use link weights to reduce the influence of pages
that all reside on one host Merge nodes that have a large number of duplicate
links The base graph is structured to exclude
grandparent nodes but include nodes that share child
![Page 128: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/128.jpg)
Companion Algorithm (Cont’d)
Four steps1. Build a vicinity graph for u2. Remove duplicates and near-duplicates in
graph.3. Compute link weights based on host to host
connection4. Compute a hub score and a authority score
for each node in the graph, return the top ranked authority nodes.
![Page 129: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/129.jpg)
Companion Algorithm (Cont’d)Building the Vicinity Graph
Set up parameters: B : no of parents of u, BF : no of children per parent, F : no of children of u, FB : no of parents per child
Stoplist (pages that are unrelated to most queries and have a very high in-degree)
ProcedureGo Back (B) : choose parents (randomly)Back-Forward(BF) : choose siblings (nearest)Go Forward (F) : choose children (first) Forward-Back(FB) : choose siblings (highest
in-degree)
![Page 130: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/130.jpg)
Companion Algorithm (Cont’d)Remove duplicate
Near-duplicate, if two nodes, each has more than 10 links and they have at least 95% of their links in common
Replace two nodes with a node whose links are the union of the links of the two nodes
(mirror sites, aliases)
![Page 131: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/131.jpg)
Companion Algorithm (Cont’d)Assign edge (link) weights
Link on the same host has weight 0 If there are K links from documents on a
host to a single document on diff host, each link has an authority weight of 1/k
If there are k links from a single document on a host to a set of documents on diff host, give each link a hub weight of 1/k
(prevent a single host from having too much influence on the computation)
![Page 132: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/132.jpg)
Companion Algorithm (Cont’d)Compute hub and authority scores
Extension of the HITS algorithm with edge weights Initialize all elements of the hub vector H to 1 Initialze all elements of the authority vector A to 1 While the vectors H and A have not converged: For all nodes n in the vicinity graph N, A[n] := (n',n)edges(N) H[n'] x
authority_weight(n',n) For all n in N, H[n] := (n',n)edges(N) A[n'] x hub_weight(n',n) Normalize the H and A vectors.
![Page 133: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/133.jpg)
CONSERV: Cocitation Algorithm
Two nodes are co-cited if they have a common parent
The number of common parents of two nodes is their degree of co-citation
Determine the related pages by looking for sibling nodes with the highest degree of co-citation
In some cases there is an insufficient level of cocitation to provide meaningful results, chop off elements of URL, restart algorithm.e.g. A.com/X/Y/Z A.com/X/Y
![Page 134: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/134.jpg)
Comparative Study
Page Rank (Google) Assigns initial ranking
and retains them independently from queries (fast)
In the forward direction from link to link
Qualitative result
Hub/Authority (CLEVER, C-Server) Assembles different root
set and prioritizes pages in the context of query
Looks forward and backward direction
Qualitative result
![Page 135: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/135.jpg)
Connectivity-Based Ranking
Query-independent: gives an intrinsic quality score to a page
Approach #1: larger number of hyperlinks pointing to a page, the better the page drawback? each link is equally important
Approach #2: weight each hyperlink proportionally to the quality of the page containing the hyperlink
![Page 136: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/136.jpg)
Query-dependent Connectivity-Based Ranking
Carrier and Kazman For each query, build a subgraph of the
link graph G limited to pages on query topic
Build the neighborhood graph1. A start set S of documents matching query
given by search engine (~200)2. Set augmented by its neighborhood, the set of
documents that either point to or are pointed to by documents in S (limit to ~50)
3. Then rank based on indegree
![Page 137: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/137.jpg)
Idea
We desire pages that are relevant (in the neighborhood graph) and authoritative
As in page rank, not only the in-degree of a page p, but the quality of the pages that point to p. If more important pages point to p, that means p is more authoritative
Key idea: Good hub pages have links to good authority pages
given user query, compute a hub score and an authority score for each document
high authority score relevant content high hub score links to documents with
relevant content
![Page 138: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/138.jpg)
Improvements to Basic Algorithm
Put weights on edges to reflect importance of links, e.g., put higher weight if anchor text associated with the link is relevant to query
Normalize weights outgoing from a single source or coming into a single sink. This alleviates spamming of query results
Eliminate edges between same domain
![Page 139: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/139.jpg)
Discovering Web communities on the web
![Page 140: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/140.jpg)
Introduction
Introduction of the cyber-community
Methods to measure the similarity of web pages on the web graph
Methods to extract the meaningful communities through the link structure
![Page 141: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/141.jpg)
What is cyber-community
A community on the web is a group of web pages sharing a common interest Eg. A group of web pages talking about POP
Music Eg. A group of web pages interested in data-
mining Main properties:
Pages in the same community should be similar to each other in contents
The pages in one community should differ from the pages in another community
Similar to cluster
![Page 142: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/142.jpg)
Two different types of communities
Explicitly-defined communities They are well known ones,
such as the resource listed by Yahoo!
Implicitly-defined communities They are communities
unexpected or invisible to most users
Arts
Music
Classic Pop
Painting
eg.
eg. The group of web pages interested in a particular singer
![Page 143: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/143.jpg)
Two different types of communities
The explicit communities are easy to identify Eg. Yahoo!, InfoSeek, Clever System
In order to extract the implicit communities, we need analyze the web-graph objectively
In research, people are more interested in the implicit communities
![Page 144: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/144.jpg)
Similarity of web pages
Discovering web communities is similar to clustering. For clustering, we must define the similarity of two nodes
A Method I: For page and page B, A is related to B if there is a
hyper-link from A to B, or from B to A
Not so good. Consider the home page of IBM and Microsoft.
Page A
Page B
![Page 145: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/145.jpg)
Similarity of web pages
Method II (from Bibliometrics) Co-citation: the similarity of A and B is
measured by the number of pages cite both A and B
Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B.
Page A Page B
Page A Page B
![Page 146: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/146.jpg)
Methods of clustering
Clustering methods based on co-citation analysis:
Methods derived from HITS (Kleinberg) Using co-citation matrix
All of them can discover meaningful communitiesBut their methods are very expensive to the whole World Wide Web with billions of web pages.
![Page 147: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/147.jpg)
A cheaper method
The method from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins IBM Almaden Research Center
They call their method communities trawling (CT)
They implemented it on the graph of 200 millions pages, it worked very well
![Page 148: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/148.jpg)
Basic idea of CT
Definition of communities
dense directed bipartite sub graphs Bipartite graph: Nodes
are partitioned into two sets, F and C
Every directed edge in the graph is directed from a node u in F to a node v in C
dense if many of the possible edges between F and C are present
Fans Centers
F C
![Page 149: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/149.jpg)
Basic idea of CT
Bipartite cores a complete bipartite
subgraph with at least i nodes from F and at least j nodes from C
i and j are tunable parameters
A (i, j) Bipartite core
Every community have such a core with a certain i and j.
A (i=3, j=3) bipartite core
![Page 150: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/150.jpg)
Basic idea of CT
A bipartite core is the identity of a community
To extract all the communities is to enumerate all the bipartite cores on the web.
Author invent an efficient algorithm to enumerate the bipartite cores. Its main idea is iterate pruning -- elimination-generation pruning
![Page 151: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/151.jpg)
• Complete bipartite graph: there is an edge between each node in F and each node in C
• (i,j)-Core: a complete bipartite graph with at least i nodes in F and j nodes in C
• (i,j)-Core is a good signature for finding online communities
•“Trawling”: finding cores
• Find all (i,j)-cores in the Web graph.
– In particular: find “fans” (or “hubs”) in the graph
– “centers” = “authorities”
– Challenge: Web is huge. How to find cores efficiently?
![Page 152: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/152.jpg)
Main idea: pruning
• Step 1: using out-degrees
– Rule: each fan must point to at least 6 different websites
– Pruning results: 12% of all pages (= 24M pages) are potential fans
– Retain only links, and ignore page contents
![Page 153: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/153.jpg)
Many pages are mirrors (exactly the same page)
They can produce many spurious fans Use a “shingling” method to identify
and eliminate duplicates Results: – 60% of 24M potential-fan pages are
removed – # of potential centers is 30 times of #
of potential fans
Step 2: Eliminate mirroring pages
![Page 154: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/154.jpg)
Step 4: Iterative pruning
To find (i,j)-cores– Remove all pages whose # of out-links is < i– Remove all pages whose # of in-links is < j– Do it iteratively
Step 5: inclusion-exclusion pruning Idea: in each step, we – Either “include” a community” – Or we “exclude” a page from further contention
![Page 155: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/155.jpg)
Check a page x with j out-degree. x is a fan of a (i,j)-core if:
– There are i-1 fans point to all the forward neighbors of x
– This step can be checked easily using the index on fans and centers
Result: for (3,3)-cores, 5M pages remained Final step: – Since the graph is much smaller, we can afford
to “enumerate” the remaining cores
![Page 156: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/156.jpg)
Step 3: using in-degrees of pages Delete pages highly references, e.g.,
yahoo, altavista Reason: they are referenced for many
reasons, not likely forming an emerging community
Formally: remove all pages with more than k inlinks (k = 50,for instance)
Results:– 60M pages pointing to 20M pages– 2M potential fans
![Page 157: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/157.jpg)
Weakness of CT
The bipartite graph cannot suit all kinds of communities
The density of the community is hard to adjust
![Page 158: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/158.jpg)
Experiment on CT
200 millions web pages
IBM PC with an Intel 300MHz Pentium II processor, with 512M of memory, running Linux
i from 3 to 10 and j from 3 to 20
200k potential communities were discovered29% of them cannot be found in Yahoo!.
![Page 159: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/159.jpg)
Summary
Conclusion: The methods to discover communities from the web depend on how we define the communities through the link structure
Future works: How to relate the contents to link structure
![Page 160: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/160.jpg)
Web communities based on dense bipartite graph patterns (WISE’01)
By Krishna Reddy and Masaru Kitsuregawa
![Page 161: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/161.jpg)
Aim/Motivation
To find all the communities within a large collection of web pages.
Proposed solution:
•Analyze linkage patterns
•Find DBG in the given collection of webpages
![Page 162: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/162.jpg)
DefinitionsBipartite graphA BG is a graph which can be partitioned into two non-empty sets T and I. Every directed edge of BG joins a node in T to a node in I
Dense Bipartite graphA DBG is a BG where each node of T establishes an edge with at least alpha nodes of I and each node of I has atleast beta nodes as parents to it CommunityThe set T contains the members of the community if there exist a DBG(T,I,alpha,beta) where alpha>= alpha_t and beta>=beta_t Where alpha_t and beta_t > 0.
![Page 163: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/163.jpg)
DBG(T,I,p,q)
a
b
c
d
s
t
u
p q
![Page 164: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/164.jpg)
Definitions
Cocite: Association among pages based on the existence of common children (URL’s).
Relax Cocite: we allow u,v,w to group if cocite(u,v) and cocite(v,w) are true.
p
q
a
b
c
d
e
f
i)
p
b
q
r
a
c
d
e
fg
![Page 165: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/165.jpg)
Algorithm
1.For a given URL find T(set of URL’s). Relax-cocite factor is 1.
a)While num_iterations<=n At a fixed relax-cocite factor value,find all w’s
such that relax-cocite(w,y) =true T= w U T2. Community extraction
Input contains Page_set,outputDBG(T,I,alpha,beta)
Edge file has <p,q> where p is the parent of q.
![Page 166: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/166.jpg)
Algorithm(contd…)
For each P belongs to T,insert the edge<p,q> in edge_file if q belongs child(q).
Sort edge file based on source.Prepare T1 with<source,freq>.Remove <p,q> from edge_file if freq<alpha.
Sort the edge_file based on destination.Prepare I1 with<q,freq>.Remove<p,q> from edgefile if freq<beta.
The result is a DBG(T,I,alpha,beta).
![Page 167: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/167.jpg)
Advantages/Disadvantages
Extracts all DBG’s in a pageset. Community extracted is significantly large.
DISADV:
Need a URL to start with. Community members need links to be a
part of the community
![Page 168: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/168.jpg)
Efficient Identification of Web Communities
Gary William Flake, Steve Lawrence & C. Lee Giles
![Page 169: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/169.jpg)
Presentation Structure
Introduction or why they did it!
Motivation Background
Theory or how they did it!
Definition Algorithm
Experimentation or how did they do!
Results Conclusions
![Page 170: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/170.jpg)
Motivation
Exploding Web ~ 1,000,000,000 documents Search Engine Limitations
Crawling the web 16% Maximum Coverage!
Updating the web Precision vs Recall
Web Communities Balanced Min Cut Identification is NP-hard
![Page 171: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/171.jpg)
Background
Bibliometrics, Citation analysis, Social Networks
Classical Clustering eg. CORA
HITS hubs and authorities
![Page 172: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/172.jpg)
s-t Max Flow & Min Cut
Floyd & Fulkerson’s Max Flow = Min Cut Theorem
Incremental Shortest Augmentation algorithm in poly-time
•Capacity weights
•Source & Sink
•Water In, Water Out!
G(V,E)
![Page 173: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/173.jpg)
The Idea
The Ideal Community C V
Theorem1: A community C can be identified by calculating the s-t minimum cut using appropriately chosen source and sink nodes.
Proof by Contradiction
![Page 174: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/174.jpg)
The Algorithm
1. Choose Source(s) and Sink(s)2. Generate G(V,E) using crawler3. Find s-t Min Cut
•Virtual Sources & Sinks•Choosing the Source•Choosing the Sink
Source layers Sink layers
![Page 175: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/175.jpg)
Expectation Maximization
Implementation Issues Small size G(V,E) = low recall Dependent on choice of source set
Recurse over Algorithm Community obtained in one iteration used as
input to next iteration
Termination not guaranteed
![Page 176: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/176.jpg)
Experimental Results
Testing neighborhoods … Support Vector Machine (SVM) The Internet Archive Ronald Rivest
Criterion Precision & Recall Seed set size Running time
![Page 177: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/177.jpg)
SVM Community
Characterization Recent: Not listed in any portal Relatively small research community
Seed Set svm.first.gmd.de, svm.research.bell-labs.com,
www.clrc.rhbnc.ac.uk/research/SVM, www.support-vector.net
Performance 4 iterations of EM 11,000 URLs in the graph, 252 member web pages
![Page 178: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/178.jpg)
Internet Archive Community
CharacterizationLarge, internal communities
Seed Set : 11 URLs Performance
2 iterations of EM 7,000 URLs, 289 web pages
![Page 179: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/179.jpg)
Ronald Rivest Community
Characterization Community around an individual
Seed set http://theory.lcs.mit.edu/~rivest
Performance 4 iterations of EM 38,000 URLs, 150 pages Cormen’s pages as 1st and 3rd result
![Page 180: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/180.jpg)
Summary
Actual running time 1 sec on a 500 MHz Intel machine
Max Flow Framework EM Approach Relevancy test
![Page 181: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/181.jpg)
Applications
Focused crawlers Increased Precision & Coverage
Automated population of portal categories Recall Addressed
Improved filtering Keyword Spamming Topical Pruning – eg. Pornography
![Page 182: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/182.jpg)
Future Work
Generalize the notion of Community Parameterize with coupling factor
Low value, weakly connected communities High value, highly connected communities
Ideal community Co-learning and Co-boosting
![Page 183: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/183.jpg)
References
L. Page, S. Brin, "PageRank: Bringing order to the Web," Stanford Digital Libraries working paper 1997-0072.
Chakrabarti, Dom, Kumar, “Mining the link structure of the World Wide Web,” IEEE Computer, 32(8), August 1999
K. Bharat, A. Broder, “The Connectivity Server: Fast access to linkage information on the Web.” In Ashman and Thistlewaite [2], pages 469--477. Brisbane, Australia, 1998
B. Allan, “Finding Authorities and Hubs from Link Structures on the World Wide Web”, ACM, May 2001
Jeffrey Dean “Finding Related Pages in the World Wide Web” http://citeseer.nj.nec.com/dean99finding.html
A. Z. Border,… Graph structure in the web: experiments and models. Proc. 9th WWW Conf., 2000.
S. R. Kumar,… Trawling emerging cyber-communities automatically. Proc. 8th WWW Conf., 1999.
![Page 184: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/184.jpg)
References Principles of Data Mining, Hand, Mannila, Smyth. MIT Press,
2001. Notes from Dr. M.V. Ramakrishna
http://goanna.cs.rmit.edu.au/~rama/cs442/info.html Notes from CS 395T: Large-Scale Data Mining, Inderjit Dhillon
http://www.cs.utexas.edu/users/inderjit/courses/dm2000.html Link Analysis in Web Information Retreival, Monika Henzinger.
Bulletin of the IEEE computer Society Technical Committee on Data Engineering, 2000. research.microsoft.com/research/db/debull/A00sept/henzinge.ps
slides from Data Mining: Concepts and Techniques, Jan and Kamber, Morgan Kaufman, 2001.
![Page 185: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/185.jpg)
1. J. Srivastava, R. Cooley, M. Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data
, SIGKDD Explorations, Vol. 1, Issue 2, 2000. 2. B. Mobasher, R. Cooley and J. Srivastava,
Web Mining: Information and Pattern Discovery on the World Wide Web, Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.
3. B. Mobasher, Namit Jain, Eui-Hong (Sam) Han, Jaideep Srivastava. Web Mining: Pattern Discovery from World Wide Web Transactions. Technical Report TR 96-060, University of Minnesota, Dept. of Computer Science, Minneapolis, 1996
![Page 186: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/186.jpg)
4. R. Cooley, P. N. Tan., and J. Srivastava. (1999). WebSIFT: the Web site information filter system. In Proceedings of the 1999 KDD Workshop on Web Mining, San Diego, CA. Springer-Verlag, in press.
5. R. W. Cooley, Web Usage Mining: Discovery and Application of Interesting Patterns from Web data. PhD Thesis, Dept of Computer Science, University of Minnesota, May 2000.
6. Cooley, R., Mobasher, B., and Srivastava, J. Web Mining: Information and pattern Discovery on the World Wide Web. IEEE Computer, pages 558-566, 1997.
7. Etzioni, O. The world wide web: Quagmire or gold mine. Communications of the ACM, 39(11):65-68, 1996.
8. Kosala, R. and Blockeel, H. Web Mining Research: A summary. SIGKDD Explorations, 2(1):1-15, 2000.
![Page 187: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/187.jpg)
Fayyad, U., Djorgovski, S., and Weir, N. Automating the analysis and cataloging of sky surveys. In Advances in Knowledge Discovery and Data Mining, pages 471-493. AAAI Press, 1996.
Langley, P. User modeling in adaptive interfaces. In Proceedings of the Seventh International Conference on User Modeling, pages 357-370, 1999.
Madria, S.K., Bhowmick, S.S., Ng, W.K., and Lim, E.-P. Research issues in web data mining. In Proceedings of Data Warehousing and Knowledge Discovery, First International Conference, DaWaK ‘99, pages 303-312, 1999.
Masand, B. and Spiliopoulou, M. Webkdd-99: Work-shop on web usage analysis and user profiling. SIGKDD Explorations, 1(2), 2000.
![Page 188: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/188.jpg)
Smyth, P., Fayyad, U.M., Burl, M.C., and Perona, P. Modeling subjective uncertainty in image annotation. In Advances in Knowledge Discovery and Data Mining, pages 517-539, 1996.
Spiliopoulou, M. Data mining for the web. In Principles of Data Mining and Knowledge Discovery, Second European Symposium, PKDD ‘99, pages 588-589, 1999.
Srivastava, J., Cooley, R., Deshpande, M., and Tan, P.-N. Web usage mining: Discovery and applications of usage patterns from web data. SIGMOD Explorations, 1(2), 2000.
Zaiane, O.R., Xin, M., and Han, J. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. IEEE, pages 19-29, 1998.
![Page 189: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/189.jpg)
Page Ranking
o The PageRank Citation Ranking: Bringing Order to the Web (1998), Larry Page, Sergey Brin, R. Motwani, T. Winograd, Stanford Digital Library Technologies Project..
o Authoritative Sources in a Hyperlinked Environment (1998),
Jon. Kleinberg, Journal of the ACM o The Anatomy of a Large-Scale Hypertextual
Web Search Engine (1998) Sergey Brin, Lawrence Page, Computer Networks and ISDN Systems.
o Web Search Via Hub Synthesis (1998) Dimitris Achlioptas,
Amos Fiat, Anna R. Karlin, Frank McSherry. o What is this Page Known for? Computing Web Page Reputatio
ns (2000) Davood Rafiei, Alberto O Mendelzon.
![Page 190: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/190.jpg)
o Link Analysis in Web Infromation Retrieval, Monika Henzinger. Bulletin of the IEEE computer Society Technical Committee on Data Engineering, 2000.
Finding Authorities and Hubs From Link Structures on the World Wide Web, Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, Panayiotis Tsaparas, 2002.
Web Communities and Classification Enhanced hypertext categorization using hyperlinks (1998) Soumen
Chakrabarti, Byron Dom, and Piotr Indyk, Proceedings of SIGMOD-98, ACM International Conference on Management of Data.
Automatic Resource list Compilation by Analyzing Hyperlink Structure and Associated Text (1998) S. Chakrabarti, B. Dom, D. Gibson, J. Keinberg, P. Raghavan, and s. Rajagopalan, Proceedings of the 7th International World Wide Web Conference.
Inferring Web Communities from Link Topology (1998) David Gibson, Jon Kleinberg, Prabhakar Raghavan, UK Conference on Hypertext.
![Page 191: Web Mining : A Bird’s Eye View Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla, MO 65401 madrias@umr.edu](https://reader035.vdocuments.us/reader035/viewer/2022062407/56649de55503460f94adde77/html5/thumbnails/191.jpg)
o Trawling the web for emerging cyber-communities (1999) Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, WWW8 / Computer Networks.
o Finding Related Pages in the World Wide Web
(1999) Jeffrey Dean, Monika R. Henzinger, WWW8 / Computer Networks.
o A System for Collaborative Web Resource Categori
zation and Ranking Maxim Lifantsev.
A Study of Approaches to Hypertext Categorization
(2002) Yiming Yang, Sean Slattery, Rayid Ghani, Journal of Intelligent Information Systems.
o Hypertext Categorization using Hyperlink Patterns
and Meta Data (2001) Rayid Ghani, Sean Slattery, Yiming Yang.