Data Mining for Web Intelligence

Presentation by Julia Erdman

Data Mining the Web

Searching, comprehending, and using the semi-structured data on the web is significantly more challenging than data mining in a commercial database system

Web data is more sophisticated and dynamic than data in a commercial database system

Data mining helps search engines find high-quality web pages

Why Data Mining?

Challenges of data mining the web:

Web page complexity far exceeds the complexity of any traditional text document collection

The Web constitutes a highly dynamic information source

The Web serves a broad spectrum of user communities

Only a small portion of the Web’s pages contain truly relevant or useful information

Why Data Mining?

Approaches to accessing information on the web:

Keyword-based search or topic-directory browsing, e.g. Google, Yahoo

Querying deep Web sources, e.g. Amazon.com, Realtor.com

Random surfing

Design Challenges

Traditional schemes for accessing data on the web are based on text-oriented, keyword-based web pages

The current access schemes must be replaced with more sophisticated schemes in order to exploit the Web completely

Access Limitations

Lack of high-quality keyword-based searches:

A search can return many answers, e.g. when searching popular categories like sports or politics

Overloaded keyword semantics can return many low-quality answers, e.g. a search for "jaguar" could refer to an animal, a car, or a sports team

A search can miss many highly related pages that do not contain the posed keywords
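
The keyword limitations above can be seen in a minimal sketch. The three-page corpus and the `keyword_search` helper are hypothetical, invented only for illustration; a real search engine ranks pages far more carefully.

```python
# Toy corpus: page names and texts are invented for illustration.
PAGES = {
    "zoo": "the jaguar is a large cat native to the americas",
    "dealer": "test drive the new jaguar sedan at our showroom",
    "bigcats": "panthera onca hunts along rivers in the amazon",
}


def keyword_search(pages, keyword):
    """Return the sorted names of pages whose text contains the keyword as a word."""
    keyword = keyword.lower()
    return sorted(name for name, text in pages.items()
                  if keyword in text.lower().split())


hits = keyword_search(PAGES, "jaguar")
print(hits)
```

The result mixes the animal and the car (overloaded semantics), while the "bigcats" page, which describes the same animal under its scientific name, is missed because it never contains the keyword.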

Access Limitations

Lack of effective deep-Web access:

There are at least 100,000 searchable databases on the Web with high-quality, well-maintained information, but they are not effectively accessible

There is an extremely large collection of autonomous and heterogeneous databases, each supporting specific query interfaces with different schemas and query constraints

Access Limitations

Lack of automatically constructed directories:

A topic- or type-oriented Web information directory creates an organized picture of a web sector

Developers must organize these directories manually

Manual organization is costly, provides only limited coverage, and is not easily scalable or adaptable

Access Limitations

Lack of semantics-based query primitives:

Most keyword-based searches allow only a small set of search options

Access Limitations

Lack of feedback on human activities:

Web links may not be updated frequently, regularly, or at all

Changes in access frequency do not automatically adjust search results

Access Limitations

Lack of multidimensional analysis and data mining support:

Users cannot drill deeply into sites to find the data they are looking for

Mining Web search-engine data

Current keyword-based search engines have several deficiencies:

A widely covered topic can contain hundreds of thousands of documents

Highly relevant documents may not contain the keywords used in the search

Analyzing the Web’s link structure

When one web page contains a link to another, this can be considered an endorsement of the linked page

Collected endorsements of the same page from many different web authors mark it as an authoritative web page

A hub is a single web page that contains a collection of links to authoritative web pages
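
The hub and authority idea above can be sketched as an iterative scoring loop in the style of Kleinberg's HITS algorithm. This is a minimal sketch, not necessarily the exact method the slides have in mind; the `hits` function name and the small `LINKS` graph are assumptions made for illustration.

```python
def hits(links, iterations=50):
    """Compute hub and authority scores for a directed link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages that link to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[dst] for dst in links.get(p, ())) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: each key endorses the pages in its list.
LINKS = {
    "portal": ["news", "docs", "wiki"],
    "blog": ["docs", "wiki"],
    "news": ["docs"],
}
hub, auth = hits(LINKS)
best_hub = max(hub, key=hub.get)
best_auth = max(auth, key=auth.get)
print(best_hub, best_auth)
```

In this toy graph "docs" collects the most endorsements and comes out as the top authority, while "portal", which links to every endorsed page, comes out as the top hub.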

Classifying Web documents automatically

Generally, human readers classify Web documents, but an automatic classification is highly desirable

Hyperlinks contain high-quality semantic clues to a page’s topic, which can help achieve accurate classifications

However, links to unrelated sites can cloud the classification, e.g. many sites link to weather.com but generally are not weather sites

Automatic classification can determine which class a web page belongs to, but not which classes it does not belong to
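
One way to picture link-based classification, including the weather.com problem, is a majority vote over the topics of linked pages that skips near-universal link targets. This is a rough sketch under stated assumptions: the `classify_by_links` helper, the `max_share` threshold, and every site name ending in `.example` are hypothetical, and the voting heuristic is a stand-in for a real classifier.

```python
from collections import Counter


def classify_by_links(page, links, labels, max_share=0.5):
    """Guess a page's topic by majority vote over the labeled pages it links to.

    Targets linked from more than max_share of all source pages are skipped,
    since near-universal links (the weather.com case) say little about topic.
    """
    n_sources = len(links)
    in_degree = Counter(dst for targets in links.values() for dst in set(targets))
    votes = Counter()
    for dst in set(links.get(page, ())):
        if in_degree[dst] / n_sources > max_share:
            continue  # too widely linked to be informative
        if dst in labels:
            votes[labels[dst]] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical outgoing links and known topic labels.
LINKS = {
    "fan-site": ["weather.com", "scores.example", "league.example"],
    "food-blog": ["weather.com", "recipes.example"],
    "local-paper": ["weather.com", "headlines.example"],
}
LABELS = {
    "weather.com": "weather",
    "scores.example": "sports",
    "league.example": "sports",
    "recipes.example": "cooking",
    "headlines.example": "news",
}
for name in LINKS:
    print(name, classify_by_links(name, LINKS, LABELS))
```

Because every page links to weather.com, that link is ignored, so "food-blog" is labeled by its one informative link rather than being pulled toward the weather topic.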

Mining Web page semantic structures and page contents

Fully automatic extraction of Web page structures and semantic contents can be difficult due to the limitations of automated natural-language parsing

Semiautomatic methods can recognize a portion of such structures

Then further analysis can see how the contents fit into these structures

Mining Web page semantic structures and page contents

To identify the structures to extract, either an expert manually specifies the structures, or techniques must be developed to automatically produce the structures

Or developers can use Web page classes for automatic extraction

Semantic page structure and content recognition will provide for more in-depth analysis of Web pages

Mining Web dynamics

Contents, structures, and access patterns change on the Web

Storing historical data about Web pages assists in finding changes in content and links

But due to the phenomenal breadth of the Web, it is impossible to store images and updates

Mining web log records can provide quality results

This data needs to be analyzed and transformed into useful, significant information
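
As a minimal sketch of turning raw web log records into usable information, the example below counts successful requests per page. It assumes the server writes the Common Log Format; the `page_hits` helper and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+$'
)


def page_hits(lines):
    """Count successful (2xx) requests per path, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("status").startswith("2"):
            counts[match.group("path")] += 1
    return counts


# Hypothetical log records using documentation-only IP addresses.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Nov/2002:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '192.0.2.2 - - [10/Nov/2002:10:00:05 +0000] "GET /papers.html HTTP/1.0" 200 2210',
    '192.0.2.1 - - [10/Nov/2002:10:01:12 +0000] "GET /papers.html HTTP/1.0" 200 2210',
    '192.0.2.3 - - [10/Nov/2002:10:02:40 +0000] "GET /missing.html HTTP/1.0" 404 312',
    'not a log line',
]
counts = page_hits(SAMPLE_LOG)
print(counts.most_common())
```

Access counts like these are one simple kind of usage signal; feeding them back into ranking is exactly the step the slides note is missing from current search engines.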

Building a multilayered, multidimensional Web

Systematically analyze a set of Web pages

Group closely related local Web pages or an individual page into a cluster, called a semantic page

The analysis provides a descriptor for the cluster

Then create a semantics-based, evolving, multidimensional, multilayered Web information directory
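
The first layer of such a directory can be sketched very roughly as follows. The sketch assumes that pages sharing a host and directory are "closely related local pages" and uses the most frequent terms as the cluster descriptor; both choices, the `build_semantic_pages` helper, and the example URLs are assumptions for illustration, not the method from the paper.

```python
import posixpath
from collections import Counter, defaultdict
from urllib.parse import urlparse


def build_semantic_pages(pages, top_terms=1):
    """Group pages by host and directory, and describe each group by its top terms.

    pages maps a URL to the page's text.
    """
    clusters = defaultdict(list)
    for url in pages:
        parsed = urlparse(url)
        key = parsed.netloc + posixpath.dirname(parsed.path)
        clusters[key].append(url)

    directory = {}
    for key, urls in clusters.items():
        terms = Counter(word for url in urls for word in pages[url].lower().split())
        directory[key] = {
            "pages": sorted(urls),
            "descriptor": [term for term, _ in terms.most_common(top_terms)],
        }
    return directory


# Hypothetical pages on a made-up university site.
PAGES = {
    "http://cs.example.edu/db/index.html": "database research group",
    "http://cs.example.edu/db/papers.html": "database mining papers",
    "http://cs.example.edu/ai/index.html": "learning agents",
}
directory = build_semantic_pages(PAGES)
for key, entry in directory.items():
    print(key, entry["descriptor"], len(entry["pages"]))
```

Higher layers would then generalize these descriptors further, which is where the multidimensional, multilayered structure comes from.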

Questions? Comments?

Han, J. & Chang, K. C.-C. "Data mining for Web intelligence." IEEE Computer, vol. 35, no. 11, Nov. 2002, pp. 64-70.
