Data Mining for Web Intelligence
Presentation by
Julia Erdman
Data Mining the Web
Searching, comprehending, and using the semi-structured data on the web poses a significantly greater challenge than data mining in a commercial database system
The data from the web is more sophisticated and dynamic
Data mining helps search engines find high-quality web pages
Why Data Mining?
Challenges of data mining the web:
Web page complexity far exceeds the complexity of any traditional text document collection
The Web constitutes a highly dynamic information source
The Web serves a broad spectrum of user communities
Only a small portion of the Web’s pages contain truly relevant or useful information
Why Data Mining?
Approaches to accessing information on the web:
Keyword-based search or topic-directory browsing (e.g., Google, Yahoo)
Querying deep Web sources (e.g., Amazon.com, Realtor.com)
Random surfing
Design Challenges
Traditional schemes for accessing data on the web are based on text-oriented, keyword-based web pages
The current access schemes must be replaced with more sophisticated schemes in order to exploit the Web completely
Access Limitations
Lack of high-quality keyword-based searches:
A search can return many answers (e.g., when searching popular categories like sports or politics)
Overloading keyword semantics can return many low-quality answers (e.g., a search for "jaguar" could be for an animal, a car, or a sports team)
A search can miss many highly related pages that do not contain the posed keywords
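The two failure modes above can be illustrated with a minimal sketch of exact keyword matching. The page names and texts are made up for illustration, and `keyword_search` is a hypothetical helper, not how any real search engine works.

```python
# Toy collection of pages (hypothetical names and texts).
toy_pages = {
    "zoo": "the jaguar is a big cat native to the americas",
    "dealer": "the new jaguar sedan is a luxury car",
    "nfl": "jaguar fans cheer for the football team",
    "auto": "reviews of luxury automobiles and sedans",
}

def keyword_search(query, pages):
    """Return the names of pages whose text contains every query term."""
    terms = query.lower().split()
    return sorted(
        name for name, text in pages.items()
        if all(term in text.split() for term in terms)
    )

# Overloaded keyword: one term matches three unrelated topics.
ambiguous = keyword_search("jaguar", toy_pages)

# Missed page: "auto" is about cars but never uses the word "car".
missed = keyword_search("car", toy_pages)
```

Here `ambiguous` mixes the animal, the car, and the sports team, while `missed` omits the "auto" page even though it is clearly related.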
Access Limitations
Lack of effective deep-Web access:
There are at least 100,000 searchable databases on the Web with high-quality, well-maintained information, but they are not effectively accessible
There is an extremely large collection of autonomous and heterogeneous databases, each supporting specific query interfaces with different schema and query constraints
Access Limitations
Lack of automatically constructed directories:
A topic- or type-oriented Web information directory creates an organized picture of a web sector
Developers must organize these directories manually, which is costly, provides only limited coverage, and is not easily scalable or adaptable
Access Limitations
Lack of semantics-based query primitives:
Most keyword-based searches allow only a small set of search options
Access Limitations
Lack of feedback on human activities:
Web links may not be updated frequently, regularly, or at all
Changes in access frequency do not automatically adjust search results
Access Limitations
Lack of multidimensional analysis and data mining support:
Users cannot drill deeply into sites in order to find the data they are looking for
Mining Web search-engine data
Current keyword-based search engines have several deficiencies:
A widely covered topic can contain hundreds of thousands of documents
Highly relevant documents may not contain the keywords used in the search
Analyzing the Web’s link structure
When one web page contains a link to another, this can be considered an endorsement of the linked page
When many different web authors endorse the same page, their collected endorsements mark it as an authoritative web page
A hub is a single web page that contains a collection of links to authoritative web pages
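The hub and authority ideas can be sketched as a mutually reinforcing score iteration, in the style of the well-known HITS algorithm. The slides do not name a specific algorithm, so the iteration below, the function name `hub_authority_scores`, and the toy link graph are all assumptions for illustration.

```python
import math

def hub_authority_scores(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for target in targets:
                auth[target] += hub[src]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph: two link collections pointing at content pages.
toy_links = {
    "hub1": ["a", "b"],
    "hub2": ["a", "b", "c"],
    "a": [],
    "b": [],
    "c": ["a"],
}
hub_scores, auth_scores = hub_authority_scores(toy_links)
top_authority = max(auth_scores, key=auth_scores.get)
top_hub = max(hub_scores, key=hub_scores.get)
```

In this toy graph the page with the most endorsements ends up as the top authority, and the page linking to the most authorities ends up as the top hub.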
Classifying Web documents automatically
Generally, human readers classify Web documents, but an automatic classification is highly desirable
Hyperlinks contain high-quality semantic clues to a page’s topic, which can help achieve accurate classifications
However, links to unrelated sites can cloud the classification (e.g., many sites link to weather.com but are generally not weather sites)
Automatic classification can determine which class a web page belongs to, but not which classes it does not belong to
Mining Web page semantic structures and page contents
Fully automatic extraction of Web page structures and semantic contents can be difficult due to the limitations of automated natural-language parsing
Semiautomatic methods can recognize a portion of such structures
Then further analysis can see how the contents fit into these structures
Mining Web page semantic structures and page contents
To identify the structures to extract, either an expert manually specifies the structures, or techniques must be developed to automatically produce the structures
Or developers can use Web page classes for automatic extraction
Semantic page structure and content recognition will provide for more in-depth analysis of Web pages
Mining Web dynamics
Contents, structures, and access patterns change on the Web
Storing historical data about Web pages assists in finding changes in content and links
But due to the phenomenal breadth of the Web, it is impossible to store images and updates
Mining Web log records can provide quality results
This data needs to be analyzed and transformed into useful, significant information
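As a rough illustration of turning raw Web log records into usable information, the sketch below counts successful requests per page. It assumes the server writes lines in the Common Log Format. The sample lines, the regular expression, and the `count_page_hits` helper are made up for this example.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+'
)

def count_page_hits(lines):
    """Count successful (status 200) requests per requested path."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group(5) == "200":
            counts[match.group(4)] += 1
    return counts

# Hypothetical log records.
sample_log = [
    '10.0.0.1 - - [10/Nov/2002:13:55:36 -0500] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Nov/2002:13:56:01 -0500] "GET /papers/web-mining.html HTTP/1.0" 200 5120',
    '10.0.0.1 - - [10/Nov/2002:13:57:12 -0500] "GET /papers/web-mining.html HTTP/1.0" 200 5120',
    '10.0.0.3 - - [10/Nov/2002:13:58:40 -0500] "GET /missing.html HTTP/1.0" 404 0',
]
hit_counts = count_page_hits(sample_log)
most_visited = hit_counts.most_common(1)[0]
```

Access counts like these are one simple form of the feedback on human activity that the earlier slides note is missing from keyword-based search.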
Building a multilayered, multidimensional Web
Systematically analyze a set of Web pages
Group closely related local Web pages or an individual page into a cluster, called a semantic page
The analysis provides a descriptor for the cluster
Then create a semantics-based, evolving, multidimensional, multilayered Web information directory
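The grouping and descriptor steps might be sketched as follows. The slides do not specify a similarity measure, so this example assumes Jaccard overlap of page terms with an arbitrary threshold. The page names, texts, and both helper functions are hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_pages(pages, threshold=0.3):
    """Greedily group pages whose terms overlap with a cluster's first page."""
    clusters = []
    for name, text in pages.items():
        terms = set(text.split())
        for cluster in clusters:
            representative = set(pages[cluster[0]].split())
            if jaccard(terms, representative) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([name])
    return clusters

def describe_cluster(cluster, pages, top_n=3):
    """Use the terms shared by the most pages as the cluster descriptor."""
    counts = Counter(w for name in cluster for w in set(pages[name].split()))
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical local pages, already reduced to their key terms.
site_pages = {
    "db/intro": "database query index transaction",
    "db/sql": "database query sql index",
    "bio/cells": "cell protein genome biology",
    "bio/genes": "genome protein gene biology",
}
semantic_pages = cluster_pages(site_pages)
descriptors = [describe_cluster(c, site_pages) for c in semantic_pages]
```

Each resulting cluster plays the role of a semantic page, and its descriptor could serve as an entry in a higher layer of the directory.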
Questions? Comments?
Han, J. & Chang, K. C.-C. "Data mining for Web intelligence." IEEE Computer, vol. 35, no. 11, Nov. 2002, pp. 64-70.