web mining

Web Mining

By: Mudit Dholakia

Guide: Dr. Amit Ganatra Sir

What is web mining?

• Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services.
• Discovering knowledge from and about the WWW is one of the basic abilities of an intelligent agent.

[Figure: the WWW and Knowledge]

Web Mining vs. Data Mining

• Structure (or lack of it)
  • Textual information and linkage structure
• Scale
  • Data generated per day is comparable to the largest conventional data warehouses
• Speed
  • Often need to react to evolving usage patterns in real time (e.g., merchandising)

Web Mining topics

• Web graph analysis
• Power Laws and The Long Tail
• Structured data extraction
• Web advertising
• Systems Issues

Size of the Web

• Number of pages
  • Technically, infinite
  • Much duplication (30-40%)
  • Best estimate of “unique” static HTML pages comes from search engine claims
    • Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion
    • Google recently announced that their index contains 1 trillion pages
• How to explain the discrepancy?

The web as a graph

• Pages = nodes, hyperlinks = edges
  • Ignore content
  • Directed graph
• High linkage
  • 10-20 links/page on average
  • Power-law degree distribution
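A minimal sketch of the graph view, assuming a tiny made-up set of pages (the page names and the `links` list are hypothetical, not from a real crawl). It stores hyperlinks as directed edges, counts in-degrees and out-degrees, and builds the degree histogram one would inspect for a power law on real data.

```python
from collections import Counter

# Hypothetical hyperlinks as (source page, target page) pairs.
links = [
    ("a.html", "b.html"),
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("d.html", "c.html"),
    ("c.html", "a.html"),
]

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    return in_degree, out_degree

def degree_histogram(degree):
    """Map each degree value to the number of pages that have it.

    Pages with degree 0 never appear in the Counter, so they are not counted.
    """
    return Counter(degree.values())

in_degree, out_degree = degree_counts(links)
in_hist = degree_histogram(in_degree)
print(in_degree.most_common(1))  # the most linked-to page
print(sorted(in_hist.items()))   # (in-degree, number of pages) pairs
```

Five edges are far too few to show a power law; the point is only the representation. Page content is ignored entirely, as the slide says.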

Structure of Web graph

Power-law degree distribution

Measures

• Structure
  • In-degrees
  • Out-degrees
  • Number of pages per site
• Usage patterns
  • Number of visitors
  • Popularity, e.g., products, movies, music

The Long Tail

Measures

• Shelf space is a scarce commodity for traditional retailers
  • Also: TV networks, movie theaters, …
• The web enables near-zero-cost dissemination of information about products
• More choice necessitates better filters
  • Recommendation engines (e.g., Amazon)
  • How "Into Thin Air" made "Touching the Void" a bestseller

Searching the Web

[Figure: content aggregators, the Web, content consumers]

Two approaches for analyzing data

• Machine Learning approach
  • Emphasizes sophisticated algorithms, e.g., Support Vector Machines
  • Data sets tend to be small and fit in memory
• Data Mining approach
  • Emphasizes big data sets (e.g., in the terabytes)
  • Data cannot even fit on a single disk!
  • Necessarily leads to simpler algorithms
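One way to picture "simpler algorithms" is a single pass that never holds the data set in memory. The sketch below is an illustration under assumptions: the line format (host first, whitespace separated) and the `count_hosts` name are made up, and a small in-memory `StringIO` stands in for a file too large to load.

```python
import io
from collections import Counter

def count_hosts(lines):
    """Single pass over an iterable of lines.

    Only one counter per distinct host is kept; the lines themselves
    are read one at a time and then discarded.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines
            counts[parts[0]] += 1
    return counts

# StringIO stands in for a very large log file; iterating yields one line at a time.
fake_log = io.StringIO(
    "a.example GET /\n"
    "b.example GET /x\n"
    "\n"
    "a.example GET /y\n"
)
host_counts = count_hosts(fake_log)
print(host_counts.most_common())
```

The same function works unchanged on an open file handle, which is why the one-pass shape scales where an in-memory algorithm does not.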

View of mining system

[Figure: several machines, each with its own CPU, memory (Mem), and disk, …]

Issues

• Web data sets can be very large
  • Tens to hundreds of terabytes
• Cannot mine on a single server!
  • Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
  • Without breaking the bank!

What should it do?

• Finding relevant information
  • Low precision and unindexed information
• Creating new knowledge out of available information on the web
  • A data-triggered process
• Personalizing the information
  • Personal preference in content and presentation of the information
• Learning about the consumers
  • What does the customer want to do?

Direct vs Indirect web mining

• Web mining techniques can be used to solve information overload problems:
  • Directly: address the problem with web mining techniques
    • E.g. a newsgroup agent classifies whether a news item is relevant
  • Indirectly: used as part of a bigger application that addresses the problem
    • E.g. used to create index terms for a web search service

Web Mining Categories

• Web Content Mining
  • Discovering useful information from web page contents/data/documents.
• Web Structure Mining
  • Discovering the model underlying link structures (topology) on the Web, e.g. discovering authorities and hubs.
• Web Usage Mining
  • Extraction of interesting knowledge from logging information produced by web servers.
  • Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.

Types

• Web Mining
  • Web Content Mining
  • Web Structure Mining
  • Web Usage Mining

[Figure: IR system. Inputs: query, documents source. Output: ranked documents.]

[Figure: Clustering system. Inputs: similarity measure, documents source. Output: groups of documents.]

Web Content Data Structure

• Web content consists of several types of data
  • Text, image, audio, video, hyperlinks
• Unstructured – free text
• Semi-structured – HTML
• More structured – data in tables or database-generated HTML pages

Note: much of the Web content data is unstructured text data.

Web Content Mining

• Unstructured Documents
  • Bag of words to represent unstructured documents
    • Takes a single word as a feature
    • Ignores the sequence in which words occur
  • Features could be
    • Boolean: a word either occurs or does not occur in a document
    • Frequency based: frequency of the word in a document
  • Variations of the feature selection include
    • Removing case, punctuation, infrequent words, and stop words
  • Features can be reduced using different feature selection techniques:
    • Information gain, mutual information, cross entropy
    • Stemming, which reduces words to their morphological roots
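The bag-of-words representation above can be sketched in a few lines. This is an illustration only: the stop-word list, the sample sentence, and the function names are made up, and stemming and the feature selection measures are left out.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}

def tokenize(text):
    """Remove case and punctuation, then drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def bag_of_words(text):
    """Frequency-based features: word -> number of occurrences.

    Word order is discarded, as in the bag-of-words model.
    """
    return Counter(tokenize(text))

def boolean_features(text, vocabulary):
    """Boolean features: does each vocabulary word occur in the text?"""
    present = set(tokenize(text))
    return {word: word in present for word in vocabulary}

doc = "The Web is a graph. Mining the Web graph is mining of links."
counts = bag_of_words(doc)
flags = boolean_features(doc, ["web", "links", "audio"])
print(counts)
print(flags)
```

Dropping infrequent words would be one more filter on `counts`, for example keeping only words whose count across a collection exceeds a chosen threshold.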

Web Content Mining

• Semi-Structured Documents
  • Uses richer representations for features
    • Due to the additional structural information in the hypertext document (typically HTML and hyperlinks)
  • Uses common data mining methods (whereas unstructured might use more text mining methods)
• Application: hypertext classification or categorization and clustering, learning relations between web documents, learning extraction patterns or rules, and finding patterns in semi-structured data.

Web Content Mining: DB View

• The database techniques on the Web are related to the problems of managing and querying the information on the Web.
• DB view tries to infer the structure of a Web site or transform a Web site to become a database
  • Better information management
  • Better querying on the Web
• Can be achieved by:
  • Finding the schema of Web documents
  • Building a Web warehouse
  • Building a Web knowledge base
  • Building a virtual database

Web Content Mining: DB View

• DB view mainly uses the Object Exchange Model (OEM)
  • Represents semi-structured data by a labeled graph
  • The data in the OEM is viewed as a graph, with objects as the vertices and labels on the edges
  • Each object is identified by an object identifier (oid), and its value is either atomic or complex
• Process typically starts with manual selection of Web sites for doing Web content mining
• Main applications:
  • The task of finding frequent substructures in semi-structured data
  • The task of creating a multi-layered database
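A minimal sketch of an OEM-style labeled graph, under stated assumptions: the `OEMObject` class, the `&n` oids, and the sample site are hypothetical. A complex value is modeled as a list of (edge label, child oid) pairs. Counting edge labels is only the simplest stand-in for finding frequent substructures.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

Atomic = Union[str, int, float]

@dataclass
class OEMObject:
    oid: str
    # Either an atomic value, or a complex value given as a list of
    # (edge label, child oid) pairs.
    value: Union[Atomic, List[Tuple[str, str]]] = field(default_factory=list)

    def is_atomic(self) -> bool:
        return not isinstance(self.value, list)

def children(store: Dict[str, OEMObject], oid: str, label: str) -> List[OEMObject]:
    """Follow the edges with the given label out of object `oid`."""
    obj = store[oid]
    if obj.is_atomic():
        return []
    return [store[child] for lbl, child in obj.value if lbl == label]

def label_counts(store: Dict[str, OEMObject]) -> Counter:
    """Count how often each edge label occurs in the whole graph."""
    counts = Counter()
    for obj in store.values():
        if not obj.is_atomic():
            counts.update(lbl for lbl, _ in obj.value)
    return counts

# Hypothetical site: a root object with two pages, each with a title.
store = {
    "&1": OEMObject("&1", [("page", "&2"), ("page", "&3")]),
    "&2": OEMObject("&2", [("title", "&4"), ("link", "&3")]),
    "&3": OEMObject("&3", [("title", "&5")]),
    "&4": OEMObject("&4", "Home"),
    "&5": OEMObject("&5", "About"),
}

titles = [t.value
          for page in children(store, "&1", "page")
          for t in children(store, page.oid, "title")]
print(titles)
print(label_counts(store))
```

Because the schema lives in the edge labels rather than in a fixed table layout, two pages may carry different sets of labels, which is what makes the data semi-structured.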

Taxonomies

• Ranking
• Graph Search
• Communities
• Hyperlink-Induced Topic Search (HITS)
• SEO
• Hubs & Authorities

Web Structure Mining

• Interested in the structure of the hyperlinks within the Web
• Inspired by the study of social networks and citation analysis
• Can discover specific types of pages (such as hubs, authorities, etc.) based on the incoming and outgoing links
• Applications:
  • Discovering micro-communities in the Web
  • Measuring the “completeness” of a Web site
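Hubs and authorities can be illustrated with a short sketch of the hubs-and-authorities iteration (HITS). It rests on assumptions: the edge list and page names are hypothetical, the iteration count is fixed rather than tested for convergence, and scores are scaled to unit Euclidean length.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Hypothetical link structure: h* pages point at a* pages.
edges = [
    ("h1", "a1"), ("h1", "a2"),
    ("h2", "a1"), ("h2", "a2"),
    ("h3", "a1"),
]
hub, auth = hits(edges)
print(max(auth, key=auth.get))  # page with the highest authority score
print(max(hub, key=hub.get))    # page with the highest hub score
```

A good hub links to many good authorities and a good authority is linked from many good hubs, so the two scores reinforce each other across iterations.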

Web Usage Mining

• Tries to predict user behavior from interaction with the Web
• Wide range of data (logs)
  • Web client data
  • Proxy server data
  • Web server data
• Two common approaches
  • Map the usage data of the Web server into relational tables before applying an adapted data mining technique
  • Use the log data directly by utilizing special pre-processing techniques
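The first approach (mapping server logs into relational tables) might look like the sketch below. It is an assumption-laden illustration: the log lines are made up, they are assumed to follow the Common Log Format, and the `requests` table layout is hypothetical. Python's standard `re` and `sqlite3` modules do the work.

```python
import re
import sqlite3

# Pattern for Common Log Format lines (assumed format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Made-up log lines; the last one is deliberately malformed.
log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /products.html HTTP/1.1" 200 2048',
    '10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    'not a log line',
]

def parse(line):
    """Turn one log line into a row tuple, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return (m["host"], m["time"], m["method"], m["path"], int(m["status"]))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests "
    "(host TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
)
rows = [row for row in map(parse, log_lines) if row is not None]
conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?, ?)", rows)

# Once the data is relational, ordinary SQL answers usage questions.
page_hits = conn.execute(
    "SELECT path, COUNT(*) FROM requests "
    "GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
print(page_hits)
```

The second approach would skip the table and run pre-processing directly over the parsed lines.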

Web Usage Mining

[Figure: Pre-Processing → Pattern Discovery → Pattern Analysis. The stages produce, in turn, a user session file, rules and patterns, and interesting knowledge.]

XML View

[Figure: layers Layer 0, Layer 1, ..., Layer n, with generalized and more generalized descriptions at the higher layers]

Use of Multi-Layer Meta Web

• Benefits of Multi-Layer Meta-Web:
  • Multi-dimensional Web info summary analysis
  • Approximate and intelligent query answering
  • Web high-level query answering (WebSQL, WebML)
  • Web content and structure mining
  • Observing the dynamics/evolution of the Web
• Is it realistic to construct such a meta-Web?
  • Benefits even if it is partially constructed
  • Benefits may justify the cost of tool development, standardization and partial restructuring

Web Search Products and Services

• Alta Vista
• DB2 text extender
• Excite
• Fulcrum
• Glimpse (Academic)
• Google!
• Infoseek Internet
• Infoseek Intranet
• Inktomi (HotBot)
• Lycos
• PLS
• Smart (Academic)
• Oracle text extender
• Verity
• Yahoo!

Web Usage Mining

• Typical problems:
  • Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
• Often Usage Mining uses some background or domain knowledge
  • E.g. site topology, Web content, etc.
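Splitting requests into server sessions is often done with an inactivity timeout. The sketch below assumes a 30-minute timeout and identifies a user by host address alone. Both are simplifying assumptions, and the second is exactly what caching and proxy servers break. The request data and names are hypothetical.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity; an assumed heuristic

# Hypothetical requests: (host, timestamp in seconds, path).
requests = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.1", 60, "/products.html"),
    ("10.0.0.2", 100, "/index.html"),
    ("10.0.0.1", 4000, "/index.html"),
    ("10.0.0.1", 4100, "/cart.html"),
]

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group each host's requests into sessions split by inactivity gaps.

    Returns a list of (host, [paths in visit order]) tuples.
    """
    by_host = defaultdict(list)
    for host, ts, path in sorted(requests, key=lambda r: (r[0], r[1])):
        by_host[host].append((ts, path))

    sessions = []
    for host, visits in by_host.items():
        current = [visits[0][1]]
        last_ts = visits[0][0]
        for ts, path in visits[1:]:
            if ts - last_ts > timeout:
                # Gap too long: close the session and start a new one.
                sessions.append((host, current))
                current = []
            current.append(path)
            last_ts = ts
        sessions.append((host, current))
    return sessions

sessions = sessionize(requests)
for host, paths in sessions:
    print(host, paths)
```

Background knowledge such as site topology could refine this, for instance by starting a new session when a request cannot be reached by a link from the previous page.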

Web Usage Mining

• Applications: two main categories
  • Learning a user profile (personalized)
    • Web users would be interested in techniques that learn their needs and preferences automatically
  • Learning user navigation patterns (impersonalized)
    • Information providers would be interested in techniques that improve the effectiveness of their Web site

References

• www.cs.jyu.fi/ai/vagan/Web_Mining.ppt
• www.infolab.stanford.edu/~ullman/mining/webMiningOverview.ppt
• www.psl.cs.columbia.edu/classes/.../Presentation_Jagriti_Mishra.ppt


Thank You
