dissertation proposal - california state polytechnic ...zma/research/dissertation proposa… ·...

Dissertation Proposal

WEB MINING FOR KNOWLEDGE DISCOVERY

Zhongming Ma

Ph.D. Candidate in Information Systems

School of Accounting and Information Systems

David Eccles School of Business

The University of Utah

Co-chairs

Dr. Gautam Pant and Dr. Olivia Sheng

Committee members

Dr. Paul Hu

Dr. Ellen Riloff

Dr. Wei Gao

1 DISSERTATION PROPOSAL

1.1 Knowledge Discovery on the Web

Knowledge discovery from databases (KDD) refers to “the non-trivial process of

identifying valid, novel, potentially useful, and ultimately understandable patterns in data”

[Fayyad et al. 1996]. KDD has achieved a broad range of applications including pattern

recognition and predictive analytics in many different areas, such as engineering, business, and

science. Knowledge discovery has two types of goals: verification and discovery. In general, the

former goal refers to verifying a user’s hypothesis and the latter can be further divided into

prediction (i.e., predicting unknown or future values) and description (i.e., presenting identified

results such as patterns in a human-understandable form) [Fayyad et al. 1996].

The Web has become a universal repository holding a tremendous amount of data that can be

accessed from anywhere in the world, and it has experienced continuous growth in both content

and users. Therefore, the Web presents immense opportunities for discovering knowledge.

However, unlike conventional databases, the data on the Web is mostly unstructured. This

makes knowledge discovery on the Web more challenging than KDD on traditional databases.

On the Web, the knowledge discovery process requires considerable effort on identifying,

selecting, and processing data possibly from multiple sources and in different (often free-form

text) formats. Manual analysis that turns such large volumes of Web data into knowledge is

impractical and thus knowledge discovery on the Web becomes an attempt to address the

accentuated problem of data overload. We adapt the KDD process presented in [Fayyad et al.

1996] for Web mining and present the process of Web mining for knowledge discovery as

follows.

Figure 1. Process of Web mining for knowledge discovery

Web mining is a step in the KDD process and it aims to analyze data and discover

knowledge from the Web. The Web data includes all kinds of Web documents, hyperlinks

among Web pages, and Web usage logs. Depending on the type of Web data being mined, Web

mining can be broadly divided into three categories: Web content mining, Web structure mining,

and Web usage mining [Srivastava et al. 2000].

Web content mining is the process of discovering knowledge from Web page content (most

often text), and it typically uses techniques based on data mining and text mining [Liu 2006].

Important Web content mining problems include data/information extraction [e.g. Hammer et

al. 1997], Web information integration [e.g. Knoblock et al. 1998], online opinion extraction,

Web search [e.g. Brin and Page 1998], and processing (e.g., clustering or categorizing) search

results according to page content [e.g. Zamir and Etzioni 1999; Dumais and Chen 2001]

[Liu 2006].

Web structure mining tries to discover useful information, such as the importance of pages, from

the structure of hyperlinks on the basis of social network analysis (SNA) techniques and

graph theory. Its research topics include ranking pages [e.g. Brin and Page 1998; Chakrabarti

et al. 1999] and finding Web communities [e.g. Gibson et al. 1998].

Web usage mining is the automatic discovery of user access patterns from Web logs [Cooley

et al. 1997]. The identified visit patterns can help in understanding the overall access patterns

and trends for all users [e.g. Zaïane et al. 1998] and allow for Web site design that is

responsive to business goals and customer needs, such as user-level customization [e.g.

Eirinaki and Vazirgiannis 2003].

My dissertation consists of two related topics/parts: personalized search and business

relationship discovery, both of which are in the area of Web mining for knowledge discovery.

The first topic presents and evaluates an automatic personalized search framework that

categorizes search results under a user’s interests, in order to examine how the proposed

personalized search approach compares with non-categorized and non-personalized baseline

systems. This research falls under Web content mining. The second topic proposes an approach to

identifying an intercompany network using company citations from Web content (more

specifically, online news stories) and discovers business relationships between companies from

the network on the basis of SNA and machine learning techniques. Therefore the second topic

covers both Web content mining and Web structure mining. The main research question we

explore is whether structural attributes derived from the intercompany network, which in turn is

derived from company citations in online news, can identify business relationships. As shown in

Figure 2, at a high level, the first topic connects Web content to people, and the second uses Web

content to discover connections between companies. Thus the two topics are connected through

mining of Web content. However, the two topics generate different types of knowledge –

interest-based personalized search results versus news-driven intercompany relationships – and

hence entail different uses of Web data, processing steps, and Web mining techniques. In the next two

sections we briefly introduce the two topics.

Figure 2. Process View of the Two Topics of the Dissertation

1.2 Personalized Search

Most search engines, including the most popular ones such as Google and Yahoo!, ignore

users’ search context, such as users’ interests. As a result, the same query from different users

with different information needs retrieves the same search results displayed in the same way.

Hence, they use a “one size fits all” [Lawrence 2000] approach. We note that currently Google is

attempting to address this problem with some level of voluntary personalization. Personalization

techniques that consider users’ context during search can improve search efficiency [Pitkow et

al. 2002]. We propose and implement an automatic approach to categorizing search results

according to a user’s interests to help users find relevant information more quickly. Our

approach is particularly well suited to a workplace scenario, where much of the information

that the proposed system needs about the professional interests and skills of knowledge workers is

available to the employer. Personalizing on the basis of such information within an organization can be

expected to raise fewer privacy concerns than a general-purpose search engine gathering

data on user interests. Moreover, unlike other approaches, our approach does not burden the

user with providing implicit or explicit feedback.

Figure 3. Knowledge Discovery Process for Interest-Based Personalized Search

We customize the general process of Web mining for KDD in Figure 1 and present the

process of interest-based personalized search for knowledge discovery in Figure 3 where

processes covered by the horizontal double-arrow-lines correspond to equivalent ones in Figure

1. The proposed approach includes a mapping framework that automatically maps user interests

into a group of categories from the Open Directory Project (ODP) taxonomy. A text classifier is

built from the content of the mapped ODP categories and later is used at query-time to

categorize search results under user interests. For a workplace scenario in which employees’

professional interests and skills can be automatically extracted from their resumes or the company’s

database, this approach is fully automatic in that users do not need to provide implicit or explicit

feedback during the search. The use of ODP is also transparent to the users. The absence of explicit

or implicit feedback and the use of the ODP taxonomy without the user’s awareness differentiate

this work from many others, such as [Gauch et al. 2003; Liu et al. 2004; Chirita et al. 2005]. In

addition, we study three search systems with different interfaces for displaying search results.

The first system (LIST) shows search results in a page-by-page list. The second (CAT)

categorizes and displays results under certain ODP categories. The third (PCAT) is what we

propose, and it categorizes and displays results under user interests. We compare PCAT

with LIST and PCAT with CAT on the basis of different query lengths and different types of

search tasks.
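
The query-time categorization step described above can be illustrated with a minimal sketch. It is not the classifier used in this work: it swaps in a simple nearest-profile rule over bag-of-words term counts with cosine similarity. The interest names, the category text standing in for mapped ODP content, and the 0.1 threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(interest_docs):
    """One term vector per user interest, built from its (assumed) category text."""
    return {name: term_vector(" ".join(docs)) for name, docs in interest_docs.items()}


def categorize(results, profiles, threshold=0.1):
    """Group result snippets under the most similar interest, or under 'Other'."""
    grouped = {name: [] for name in profiles}
    grouped["Other"] = []
    for snippet in results:
        vec = term_vector(snippet)
        best, score = max(
            ((name, cosine(vec, profile)) for name, profile in profiles.items()),
            key=lambda pair: pair[1],
        )
        grouped[best if score >= threshold else "Other"].append(snippet)
    return grouped


# Hypothetical interests and category text (stand-ins for mapped ODP content).
profiles = build_profiles({
    "Data Mining": ["clustering classification association rules data mining algorithms"],
    "Finance": ["stock market investment portfolio bond trading"],
})
results = [
    "Fast clustering algorithms for large data",
    "Bond trading strategies for the stock market",
    "Best pizza recipes",
]
grouped = categorize(results, profiles)
for interest, snippets in grouped.items():
    print(interest, snippets)
```

Results that match no interest profile fall into a catch-all group, so a user still sees every result even when it lies outside the stated interests.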

The contribution of this research is an automatic approach to

personalizing Web searches given a set of user interests. The main findings

are that (1) PCAT is better than LIST for one-word queries and Information Gathering

tasks, and (2) PCAT outperforms CAT for free-form queries and for both Information Gathering and

Finding tasks, in terms of the time spent on finding relevant results. We conclude that

no system is universally better than the others – the performance of a system depends on

parameters such as query length and type of task.

1.3 Business Relationship Discovery

Business news contains rich and current information about companies and the

relationships among them. Reading news is very time consuming and requires a reader to possess

certain skills, the most basic of which is a good understanding of the language in which the news

is written. The huge volume of news stories makes the manual identification of relationships

among a large number of companies nontrivial and unscalable. The previous literature using

news to automatically discover business relationships among companies is sparse. Many

researchers in areas such as organizational behavior and sociology employ SNA techniques to

investigate the nature and implications of business relationships on the basis of explicitly

specified company relationships provided by reliable data sources [e.g. Levine 1972; Walker et

al. 1997; Uzzi 1999; Gulati and Gargiulo 1999]. In contrast, researchers in bibliometrics and

computer science tend to identify links between nodes using implicit signals, such as article

citations, URL links, and email communications, derived from large and noisy data sources.

They study problems such as identifying importance of individual nodes (e.g., Web pages,

journal articles) in a network [e.g. Garfield 1979; Brin and Page 1998; Kleinberg 1999] and

finding communities on the Web [e.g. Kautz et al. 1997; Gibson et al. 1998], instead of

discovering business relationships between companies. We present an approach to the automatic

discovery of company relationships from online business news using machine learning and SNA

techniques. Figure 4 illustrates the knowledge discovery process for business relationship

discovery from Web data (i.e., online news).

Figure 4. Knowledge Discovery Process for Business Relationship Discovery

Given that a news story pertaining to a company often cites one or more other companies,

we construct a directed and weighted intercompany network on the basis of the citations from a

large amount of online news by considering company citations as directed links from the focal

companies to the cited companies. Further, we identify four types of attributes from the network

structure using SNA techniques: dyadic degree-based, node degree-based, node

centrality-based, and structural equivalence-based attributes. Those attributes differ

in their coverage of the network. With those network attributes, we study two types of company

relationships using machine learning methods. This news-driven, SNA-based business

relationship discovery approach is scalable and language-neutral. Research along this line

consists of two studies that differ in their target business relationships and we describe them as

follows.

The first one concentrates on predicting a company revenue relation (CRR). Given a pair

of companies, CRR refers to the relative size of two companies’ annual revenues. We find that

degree-based and centrality-based attributes derived from network structure can predict CRR

with reasonable precision, recall, and accuracy (all above 70%) for all directly linked company

pairs in the network.
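
The construction of the directed, weighted intercompany network and the simplest degree-style attributes can be sketched as follows. This is an illustrative sketch only: the company names are placeholders, and the weighted in-degree, weighted out-degree, and per-direction citation counts shown here are assumed stand-ins for the degree-based attributes, whose exact definitions are not given in this section.

```python
from collections import defaultdict


def build_network(citations):
    """Aggregate (focal, cited) citation pairs into weighted directed links."""
    weights = defaultdict(int)
    for focal, cited in citations:
        if focal != cited:  # ignore a story citing its own focal company
            weights[(focal, cited)] += 1
    return dict(weights)


def node_degrees(weights):
    """Weighted in-degree and out-degree for every company in the network."""
    degrees = defaultdict(lambda: {"in": 0, "out": 0})
    for (source, target), weight in weights.items():
        degrees[source]["out"] += weight
        degrees[target]["in"] += weight
    return dict(degrees)


def dyadic_weights(weights, a, b):
    """Citation counts in each direction between a pair of companies."""
    return {"a_to_b": weights.get((a, b), 0), "b_to_a": weights.get((b, a), 0)}


# Hypothetical citations: each pair is (focal company, cited company).
citations = [("A", "B"), ("A", "B"), ("B", "A"), ("C", "A")]
network = build_network(citations)
degrees = node_degrees(network)
print(network)
print(degrees["A"], dyadic_weights(network, "A", "B"))
```

Attributes of this kind, computed per company or per company pair, could then be fed to a standard classifier as features for a relationship such as CRR.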

The contributions of this study are as follows. (1) Our approach can serve as a data filtering step for

studying the revenue relations among a very large number of companies. (2) Since the revenue

information for public companies is available quarterly, our approach can be used as a prediction

tool for revenues. (3) Our approach can be applied to discover the revenue relations for private or

foreign companies as well.

In the second study we examine the competitor relationship between companies. We discover

the competitor relationship between a pair of connected companies in the intercompany network

on the basis of the four types of attributes. In particular, we study the classification of

company pairs for an imbalanced data set in which the number of competitor pairs is much smaller

than that of non-competitor pairs. To evaluate the classification performance of our approach, we use

two gold standards, Hoovers.com and Mergentonline.com, which are professional company profile

websites that contain manually identified competitors for each company.

Given that neither of the gold standards is complete in its coverage of competitors, we

estimate the coverage of each gold standard. Finally we present metrics to estimate how much

our approach can extend each of the gold standards.
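
The imbalance between competitor and non-competitor pairs affects how classification performance should be read. The sketch below uses synthetic labels (5 competitor pairs out of 100, an assumed ratio chosen only for illustration) to show why precision and recall on the competitor class are more informative than overall accuracy. It does not reproduce the coverage or extension metrics of this work.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, and accuracy with respect to the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


# Synthetic data: 1 = competitor pair, 0 = non-competitor pair.
y_true = [1] * 5 + [0] * 95

# A classifier that never predicts "competitor" still looks accurate.
all_negative = binary_metrics(y_true, [0] * 100)

# A classifier that finds 3 of the 5 competitor pairs with 2 false alarms.
some_found = binary_metrics(y_true, [1, 1, 1, 0, 0] + [1, 1] + [0] * 93)

print(all_negative)
print(some_found)
```

The two classifiers differ by only one percentage point in accuracy, while their recall on competitor pairs differs from 0 to 0.6.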

The contributions of this work are as follows. We present an automatic approach to

discovering competitor relationships between companies. Our approach is particularly useful

as an initial data filtering step to identify a group of potential competitors for each of many

companies. We study an imbalanced data set problem and report the classification performance

for competitor pairs in both the imbalanced data set and the whole data set. Most importantly, we

report the estimated extension that our approach provides to each of the two gold standards.

1.4 Overview of Dissertation

At a high level, the dissertation is organized as follows. Part I, which consists of Chapters 2

to 5, covers the first topic of the dissertation: interest-based personalized search. Part II, which

includes Chapters 6 to 9, covers the two related studies in business relationship discovery. More

specifically we highlight each chapter as follows.

Chapter 2 introduces the research on personalized search and reviews related prior work. We

detail our approach of personalized search in Chapter 3. Experiments are covered in Chapter 4

and result analyses and conclusions are discussed in Chapter 5. For the topic of business

relationship discovery, we introduce it and review prior literature in Chapter 6. Chapter 7

describes how to identify attributes from the network structure and explains the data and data

processing procedures. We concentrate on predicting CRR in Chapter 8 and on discovering competitor

relationships in Chapter 9. Finally we conclude the dissertation in Chapter 10.

1.5 Proposed Plan

The timeline of my dissertation is as follows.

Feb. 13, 2007 Proposal defense

Mar. 16, 2007 Sending dissertation draft to committee members and to Thesis Office for

format approval

Mar. 30, 2007 Update on the dissertation draft

Apr. 3 or 10, 2007 Dissertation defense
