Web Mining Research: A Survey

Raymond Kosala and Hendrik Blockeel, ACM SIGKDD, July 2000. Presented by Shan Huang, 4/24/2007.


Page 1: Web Mining Research : A Survey

Web Mining Research: A Survey

Raymond Kosala and Hendrik Blockeel

ACM SIGKDD, July 2000

Presented by Shan Huang, 4/24/2007

Page 2: Web Mining Research : A Survey

Outline

- Introduction
- Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Conclusion & Exam Questions

Page 3: Web Mining Research : A Survey

Four Problems

- Finding relevant information
  - Low precision
  - Unindexed information
- Creating new knowledge out of available information on the web
- Personalizing the information
  - Catering to personal preference in content and presentation
- Learning about the consumers
  - What does the customer want to do?
  - Using web data to effectively market products and/or services

Page 4: Web Mining Research : A Survey

Other Approaches

- Web mining is not the only approach
- Database approach (DB)
- Information retrieval (IR)
- Natural language processing (NLP)
  - In-depth syntactic and semantic analysis
- Web document community
  - Standards, manually appended meta-information, maintained directories, etc.

Page 5: Web Mining Research : A Survey

Direct vs Indirect Web Mining

Web mining techniques can be used to solve the information overload problems:

- Directly
  - Attack the problem with web mining techniques
  - E.g. a newsgroup agent classifies news as relevant
- Indirectly
  - Used as part of a bigger application that addresses the problems
  - E.g. used to create index terms for a web search service

Page 6: Web Mining Research : A Survey

The Research

Converging research from: Database, information retrieval, and artificial intelligence (specifically NLP and machine learning)

Paper focuses on research from the machine learning point of view

Page 7: Web Mining Research : A Survey

Outline

- Introduction
- Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Conclusion & Exam Questions

Page 8: Web Mining Research : A Survey

Web Mining: Definition

“Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”

- Can be viewed as four subtasks
- Not the same as Information Retrieval
- Not the same as Information Extraction

Page 9: Web Mining Research : A Survey

Web Mining: Subtasks

- Resource finding
  - Retrieving intended documents
- Information selection/pre-processing
  - Select and pre-process specific information from selected documents
- Generalization
  - Discover general patterns within and across web sites
- Analysis
  - Validation and/or interpretation of mined patterns
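The four subtasks can be read as a small pipeline. The sketch below is only an illustration under stated assumptions: the `PAGES` dictionary stands in for the Web, and the function names, the keyword filter, and the document-frequency "pattern" are all hypothetical choices, not something the survey prescribes.

```python
import re
from collections import Counter

# Hypothetical in-memory stand-in for the Web: URL -> raw page text.
PAGES = {
    "http://example.org/a": "<p>Data mining on the Web</p>",
    "http://example.org/b": "<p>Web usage mining of server logs</p>",
    "http://example.org/c": "<p>Cooking recipes</p>",
}

def find_resources(pages, keyword):
    """Resource finding: retrieve the documents that mention the keyword."""
    return {url: text for url, text in pages.items()
            if keyword.lower() in text.lower()}

def preprocess(docs):
    """Information selection/pre-processing: strip tags and tokenize."""
    return {url: re.sub(r"<[^>]+>", " ", text).lower().split()
            for url, text in docs.items()}

def generalize(token_docs):
    """Generalization: count in how many documents each term occurs."""
    doc_freq = Counter()
    for tokens in token_docs.values():
        doc_freq.update(set(tokens))
    return doc_freq

def analyze(doc_freq, min_docs):
    """Analysis: keep only the patterns supported by enough documents."""
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)

patterns = analyze(generalize(preprocess(find_resources(PAGES, "mining"))), 2)
```

Each stage consumes the previous stage's output, which mirrors the order in which the subtasks are listed.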

Page 10: Web Mining Research : A Survey

Web Mining: Not IR or IE

- Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
- Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)
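The two goals in the IR definition correspond to the standard measures of recall (retrieve all relevant documents) and precision (retrieve few non-relevant ones). A minimal sketch, assuming documents are identified by hypothetical string IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list: 4 documents returned, 3 documents truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```

Here two of the four returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).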

Page 11: Web Mining Research : A Survey

Web Mining: Not IR or IE

- Information extraction (IE) aims to extract the relevant facts from given documents, while IR aims to select the relevant documents
- IE systems for the general Web are not feasible
- Most focus on specific Web sites or content

Page 12: Web Mining Research : A Survey

Web Mining and Machine Learning

- As a broad subfield of artificial intelligence, machine learning is concerned with the development of algorithms and techniques that allow computers to "learn".
- Web mining is not the same as learning from the Web.
  - Some applications of machine learning on the Web are not Web mining
  - Some methods used for Web mining are not machine learning methods
- However, there is a close relationship between Web mining and machine learning.

Page 13: Web Mining Research : A Survey

Outline

- Introduction
- Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Conclusion & Exam Questions

Page 14: Web Mining Research : A Survey

Web Mining Categories

- Web Content Mining
  - Discovering useful information from web contents/data/documents
  - IR view for finding
  - DB view for modeling
- Web Structure Mining
  - Discovering the model underlying link structures (topology) on the Web
  - E.g. discovering authorities and hubs
- Web Usage Mining
  - Make sense of data generated by surfers
  - Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
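The "authorities and hubs" example can be illustrated with a hub/authority iteration in the style of HITS. This is a simplified sketch, not the formulation from any particular paper: the graph is a hypothetical dictionary of outgoing links, and the fixed iteration count is an assumption.

```python
def hits(links, iterations=20):
    """links: dict mapping each page to the list of pages it links to.

    Returns (hub, authority) score dictionaries. A good hub links to good
    authorities; a good authority is linked to by good hubs.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical four-page graph: h1 and h2 are link pages, a and b are targets.
hub, auth = hits({"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []})
```

In this toy graph, `a` ends up with the highest authority score because both link pages point to it, and `h1` with the highest hub score because it points to both targets.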

Page 15: Web Mining Research : A Survey

Web Content Data Structure

- Unstructured: free text
- Semi-structured: HTML
- More structured: table or database-generated HTML pages
- Multimedia data: receives less attention than text or hypertext

Page 16: Web Mining Research : A Survey
Page 17: Web Mining Research : A Survey

Web Mining: The Agent Paradigm

- User Interface Agents
  - Information retrieval agents, information filtering agents, and personal assistant agents
- Distributed Agents
  - Distributed agents for knowledge discovery or data mining
  - Problem solving by a group of agents
- Mobile Agents

Page 18: Web Mining Research : A Survey

Web Mining: The Agent Paradigm

- Content-based approach
  - The system searches for items that match based on an analysis of the content using the user preferences
- Collaborative approach
  - The system tries to find users with similar interests
  - Recommendations given based on what similar users did
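The collaborative approach can be sketched in a few lines: find the users most similar to the target user, then suggest what they viewed and the target has not. Jaccard similarity, the `visits` data, and the parameter `k` are assumptions made for this illustration; the slides do not name a specific similarity measure.

```python
def jaccard(a, b):
    """Overlap between two item sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, visits, k=1):
    """visits: dict user -> set of items the user viewed.

    Returns items viewed by the k most similar users but not by the target.
    """
    neighbours = sorted(
        ((jaccard(visits[target], items), user)
         for user, items in visits.items() if user != target),
        reverse=True,
    )
    suggestions = set()
    for _, user in neighbours[:k]:
        suggestions |= set(visits[user]) - set(visits[target])
    return suggestions

# Hypothetical browsing histories.
visits = {"ann": {"p1", "p2"}, "bob": {"p1", "p2", "p3"}, "cat": {"p9"}}
suggested = recommend("ann", visits)
```

A content-based system would instead compare item descriptions against the user's stated or learned preferences, with no reference to other users.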

Page 19: Web Mining Research : A Survey

Outline

- Introduction
- Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Conclusion & Exam Questions

Page 20: Web Mining Research : A Survey

Web Content Mining: IR View

- Unstructured Documents
  - Bag of words, or phrase-based feature representation
  - Features can be boolean or frequency based
  - Features can be reduced using different feature selection techniques
  - Word stemming, combining morphological variations into one feature
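A bag-of-words representation with boolean or frequency-based features might look like the sketch below. The `crude_stem` helper is a toy suffix stripper written for this illustration only; it is not a real stemming algorithm and will mis-stem many words.

```python
import re
from collections import Counter

def crude_stem(word):
    """Toy stemmer (assumption for illustration): strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text, boolean=False):
    """Map a text to {stemmed term: feature value}.

    boolean=True  -> 1 if the term occurs at all
    boolean=False -> number of occurrences (frequency based)
    """
    tokens = [crude_stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(tokens)
    if boolean:
        return {term: 1 for term in counts}
    return dict(counts)

freq_features = bag_of_words("Link links linked linking")
bool_features = bag_of_words("Link links linked linking", boolean=True)
```

Stemming merges the four morphological variants into the single feature `link`, which is one simple way of reducing the feature space.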

Page 21: Web Mining Research : A Survey
Page 22: Web Mining Research : A Survey

Web Content Mining: IR View

- Semi-Structured Documents
  - Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
  - Uses common data mining methods (whereas unstructured might use more text mining methods)
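As an illustration of features drawn from document structure, the sketch below separates words by where they appear in the HTML (title, headings, anchor text) and collects hyperlink targets. It uses Python's standard `html.parser`; the choice of tags and the feature names are assumptions for this example.

```python
from html.parser import HTMLParser

class StructureFeatures(HTMLParser):
    """Collect words per structural role, plus outgoing link targets."""

    def __init__(self):
        super().__init__()
        self.features = {"title": [], "heading": [], "anchor": [], "links": []}
        self._context = []  # stack of currently open tags of interest

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.features["links"].append(href)
        if tag in ("title", "h1", "h2", "a"):
            self._context.append(tag)

    def handle_endtag(self, tag):
        if self._context and self._context[-1] == tag:
            self._context.pop()

    def handle_data(self, data):
        if not self._context:
            return  # plain body text is ignored in this sketch
        tag = self._context[-1]
        field = {"title": "title", "a": "anchor"}.get(tag, "heading")
        self.features[field].extend(data.lower().split())

def extract_features(html):
    parser = StructureFeatures()
    parser.feed(html)
    parser.close()
    return parser.features

page = ('<html><head><title>Web Mining</title></head><body><h1>Survey</h1>'
        '<p>Body text <a href="http://example.org/kdd">SIGKDD page</a></p>'
        '</body></html>')
features = extract_features(page)
```

A learner could then weight title or anchor words differently from body words, which is not possible with a flat bag of words.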

Page 23: Web Mining Research : A Survey
Page 24: Web Mining Research : A Survey

Web Content Mining: DB View

- Tries to infer the structure of a Web site or transform a Web site to become a database
  - Better information management
  - Better querying on the Web
- Can be achieved by:
  - Finding the schema of Web documents
  - Building a Web warehouse
  - Building a Web knowledge base
  - Building a virtual database

Page 25: Web Mining Research : A Survey

Web Content Mining: DB View

- Mainly uses the Object Exchange Model (OEM)
  - Represents semi-structured data (some structure, no rigid schema) by a labeled graph
- Process typically starts with manual selection of Web sites for content mining
- Main application: building a structural summary of semi-structured data (schema extraction or discovery)
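To make "labeled graph" and "structural summary" concrete, the sketch below models semi-structured data as nested dictionaries and lists (a tree-shaped stand-in for a labeled graph, not a full OEM implementation) and summarizes it as the set of distinct label paths. The data and the path-set summary are assumptions chosen for illustration.

```python
def label_paths(node, prefix=()):
    """Return the set of distinct label paths reachable from the root.

    Dictionaries contribute labeled edges; lists hold sibling objects that
    share the same incoming label; anything else is an atomic value.
    """
    paths = set()
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(node, list):
        for child in node:
            paths |= label_paths(child, prefix)
    return paths

# Hypothetical data: the two members do not share the same fields,
# which is what "no rigid schema" means in practice.
site = {
    "department": {
        "name": "CS",
        "member": [
            {"name": "A", "email": "a@example.org"},
            {"name": "B", "homepage": "http://example.org/b"},
        ],
    }
}
summary = label_paths(site)
```

The summary lists every path that occurs somewhere in the data, so `email` and `homepage` both appear under `member` even though no single member has both.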

Page 26: Web Mining Research : A Survey
Page 27: Web Mining Research : A Survey

Outline

- Introduction
- Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Conclusion & Exam Questions

Page 28: Web Mining Research : A Survey

Web Structure Mining

- Interested in the structure between Web documents (not within a document)
- Inspired by the study of social networks and citation analysis
- Example: PageRank (Google)
- Application:
  - Discovering micro-communities in the Web
  - Measuring the “completeness” of a Web site
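The PageRank idea can be sketched as a power iteration over the link graph. This is a simplified textbook-style version, not a description of any production system; the damping factor of 0.85 is a commonly quoted value and, like the iteration count and the even redistribution from pages with no outgoing links, an assumption of this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to.

    Returns a dict page -> rank, with ranks summing to 1.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical three-page graph: a and b both link to c, and c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Page `c` ranks highest because it has two incoming links, and `b` ranks lowest because nothing links to it.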

Page 29: Web Mining Research : A Survey

Outline

- Introduction
- Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Conclusion & Exam Questions

Page 30: Web Mining Research : A Survey

Web Usage Mining

- Tries to predict user behavior from interaction with the Web
- Wide range of data (logs)
  - Web client data
  - Proxy server data
  - Web server data
- Two common approaches
  1. Map usage data into relational tables before using adapted data mining techniques
  2. Use log data directly by utilizing special pre-processing techniques
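The first approach, mapping usage data into relational tables, could look like the sketch below. It assumes server log lines in Common Log Format and loads them into an in-memory SQLite table; the table name, column names, and the sample query are hypothetical.

```python
import re
import sqlite3

# Common Log Format: host ident user [time] "method url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def load_log(lines):
    """Parse log lines into a relational table; malformed lines are skipped."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (host TEXT, time TEXT, method TEXT, url TEXT, status INTEGER)"
    )
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            conn.execute(
                "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
                (m["host"], m["time"], m["method"], m["url"], int(m["status"])),
            )
    return conn

def top_pages(conn, limit=3):
    """Most requested pages among successful (status 200) requests."""
    return conn.execute(
        "SELECT url, COUNT(*) AS n FROM hits WHERE status = 200 "
        "GROUP BY url ORDER BY n DESC, url LIMIT ?",
        (limit,),
    ).fetchall()
```

Once the data is in tables, ordinary SQL or standard data mining tools can be applied; the second approach would instead work on the parsed log records directly.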

Page 31: Web Mining Research : A Survey

Web Usage Mining

- Typical problems:
  - Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
- Usage mining often uses some background or domain knowledge
  - E.g. site topology, Web content, etc.
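One widely used heuristic for splitting requests into sessions is an inactivity timeout. The sketch below assumes each request already carries some user key (for instance an IP address combined with a user agent) and a timestamp in seconds; both the key and the 30-minute timeout are assumptions, and proxies or caches can still make several people look like one user or hide requests entirely.

```python
from collections import defaultdict

def sessionize(requests, timeout=1800):
    """requests: iterable of (user_key, timestamp_seconds, url).

    Returns dict user_key -> list of sessions, each session a list of URLs.
    A new session starts when the gap between two consecutive requests of
    the same user exceeds `timeout` seconds.
    """
    by_user = defaultdict(list)
    for user, ts, url in requests:
        by_user[user].append((ts, url))

    sessions = {}
    for user, events in by_user.items():
        events.sort()  # order each user's requests by time
        user_sessions, last_ts = [], None
        for ts, url in events:
            if last_ts is None or ts - last_ts > timeout:
                user_sessions.append([])
            user_sessions[-1].append(url)
            last_ts = ts
        sessions[user] = user_sessions
    return sessions
```

Background knowledge such as site topology could refine this, for example by starting a new session when a requested page is not reachable from the previous one.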

Page 32: Web Mining Research : A Survey

Web Usage Mining

Two main categories:

1. Learning a user profile (personalized)
   - Web users would be interested in techniques that learn their needs and preferences automatically
2. Learning user navigation patterns (impersonalized)
   - Information providers would be interested in techniques that improve the effectiveness of their Web site or bias users towards the goals of the site

Page 33: Web Mining Research : A Survey

Outline

- Introduction
- Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining
- Conclusion & Exam Questions

Page 34: Web Mining Research : A Survey

Conclusions

- Tried to resolve confusion with regard to the term Web Mining
  - Differentiated from IR and IE
- Suggested three Web mining categories: Content, Structure, and Usage Mining
- Briefly described approaches for the three categories
- Explored connection with agent paradigm

Page 35: Web Mining Research : A Survey

Exam Question #1

Question: Outline the main characteristics of Web information.

Answer: Web information is huge, diverse, and dynamic.

Page 36: Web Mining Research : A Survey

Exam Question #2

Question: How can data mining techniques be used in Web information analysis? Give at least two examples.

Answer:

- Classification: classifying server log data with a decision tree or Naïve Bayes classifier to discover the profiles of users belonging to a particular class.
- Clustering: grouping users who exhibit similar browsing patterns.
- Association analysis: relating pages that are most often referenced together in a single server session.
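The association analysis example can be sketched as counting how often two pages appear in the same session and keeping the pairs that meet a minimum support. The session data and the `min_support` threshold are assumptions for this illustration; a full association rule miner would also handle larger itemsets and rule confidence.

```python
from collections import Counter
from itertools import combinations

def page_pairs(sessions, min_support=2):
    """Return {(page_a, page_b): count} for page pairs that occur together
    in at least `min_support` sessions. Each session is a list of URLs."""
    counts = Counter()
    for session in sessions:
        # Count each unordered pair at most once per session.
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical server sessions.
sessions = [["/a", "/b", "/c"], ["/a", "/b"], ["/b", "/c"], ["/a"]]
frequent = page_pairs(sessions)
```

Pairs that pass the threshold are candidates for cross-linking or for being recommended together.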

Page 37: Web Mining Research : A Survey

Exam Question #3

Question: What are the three main areas of interest for Web mining?

Answer: (1) Web Content (2) Web Structure (3) Web Usage

Page 38: Web Mining Research : A Survey

Thank you!