Dissertation Proposal
WEB MINING FOR KNOWLEDGE DISCOVERY
Zhongming Ma
Ph.D. Candidate in Information Systems
School of Accounting and Information Systems
David Eccles School of Business
The University of Utah
Co-chairs
Dr. Gautam Pant and Dr. Olivia Sheng
Committee members
Dr. Paul Hu
Dr. Ellen Riloff
Dr. Wei Gao
1 DISSERTATION PROPOSAL
1.1 Knowledge Discovery on the Web
Knowledge discovery from databases (KDD) refers to “the non-trivial process of
identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
[Fayyad et al. 1996]. KDD has found a broad range of applications, including pattern
recognition and predictive analytics, in many areas such as engineering, business, and
science. Knowledge discovery has two types of goals: verification and discovery. In general, the
former refers to verifying a user’s hypothesis, and the latter can be further divided into
prediction (i.e., predicting unknown or future values) and description (i.e., presenting identified
results such as patterns in a human-understandable form) [Fayyad et al. 1996].
The Web has become a universal repository holding a tremendous amount of data that can be
accessed from anywhere in the world, and it has experienced continuous growth in both its content
and its users. Therefore, the Web presents immense opportunities for discovering knowledge.
However, unlike conventional databases, the data on the Web is mostly unstructured. This
makes knowledge discovery on the Web challenging compared with KDD on traditional databases.
On the Web, the knowledge discovery process requires considerable effort in identifying,
selecting, and processing data, possibly from multiple sources and in different (often free-form
text) formats. Manually turning such large volumes of Web data into knowledge is
impractical, and thus knowledge discovery on the Web becomes an attempt to address the
accentuated problem of data overload. We adapt the KDD process presented in [Fayyad et al.
1996] for Web mining and present the process of Web mining for knowledge discovery in
Figure 1.
Figure 1. Process of Web mining for knowledge discovery
Web mining is a step in the KDD process that aims to analyze data and discover
knowledge from the Web. Web data includes all kinds of Web documents, hyperlinks
among Web pages, and Web usage logs. Depending on the type of Web data being mined, Web
mining can be broadly divided into three categories: Web content mining, Web structure mining,
and Web usage mining [Srivastava et al. 2000].
Web content mining is the process of discovering knowledge from Web page content
(often text), and it often uses techniques based on data mining and text mining [Liu 2006].
Important Web content mining problems include data/information extraction [e.g. Hammer et
al. 1997], Web information integration [e.g. Knoblock et al. 1998], online opinion extraction,
Web search [e.g. Brin and Page 1998], and processing (e.g., clustering or categorizing) search
results according to page content [e.g. Zamir and Etzioni 1999; Dumais and Chen 2001]
[Liu 2006].
Web structure mining tries to discover useful information, such as the importance of pages, from
the structure of hyperlinks on the basis of social network analysis (SNA) techniques and
graph theory. Its research topics include ranking pages [e.g. Brin and Page 1998; Chakrabarti
et al. 1999] and finding Web communities [e.g. Gibson et al. 1998].
Web usage mining is the automatic discovery of user access patterns from Web logs [Cooley
et al. 1997]. The identified visit patterns can help in understanding the overall access patterns
and trends for all users [e.g. Zaïane et al. 1998] and allow for Web site design that is
responsive to business goals and customer needs, such as user-level customization [e.g.
Eirinaki and Vazirgiannis 2003].
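To make the Web structure mining category above more concrete, the following is a minimal sketch of link-based page ranking in the spirit of PageRank [Brin and Page 1998]. It is an illustration only: the function name, the damping value of 0.85, the fixed iteration count, and the three-page toy graph are assumptions and are not taken from this proposal.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by hyperlink structure using simple power iteration.

    `links` maps each page to the list of pages it links to.
    The damping factor and iteration count are illustrative defaults.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Toy graph with placeholder page names.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_links)
```

In this toy graph, page "c" receives links from both other pages and therefore ends up with the highest score, which is the intuition behind using hyperlink structure as a signal of page importance.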
My dissertation consists of two related parts: personalized search and business
relationship discovery, both of which fall in the area of Web mining for knowledge discovery.
The first topic presents and evaluates an automatic personalized search framework that
categorizes search results under a user’s interests, and examines whether the proposed
personalized search approach outperforms non-categorized and non-personalized baseline
systems. This research falls under Web content mining. The second topic proposes an approach to
identifying an intercompany network using company citations from Web content (more
specifically, online news stories) and discovers business relationships between companies from
the network on the basis of SNA and machine learning techniques. Therefore, the second topic
covers both Web content mining and Web structure mining. The main research question we
explore is whether structural attributes derived from the intercompany network, which in turn is
derived from company citations in online news, can identify business relationships. As shown in
Figure 2, at a high level, the first topic connects Web content to people, and the second uses Web
content to discover connections between companies. Thus the two topics are connected through
mining of Web content. However, the two topics generate different types of knowledge
(interest-based personalized search results versus news-driven intercompany relationships) and
hence entail different uses of Web data, processing, and Web mining techniques. In the next two
sections we briefly introduce the two topics.
Figure 2. Process View of the Two Topics of the Dissertation
1.2 Personalized Search
Most search engines, including the most popular ones such as Google and Yahoo!, ignore
users’ search context, such as users’ interests. As a result, the same query from different users
with different information needs retrieves the same search results displayed in the same way.
Hence, they use a “one size fits all” [Lawrence 2000] approach. We note that currently Google is
attempting to address this problem with some level of voluntary personalization. Personalization
techniques that consider users’ context during search can improve search efficiency [Pitkow et
al. 2002]. We propose and implement an automatic approach to categorizing search results
according to a user’s interests to help users find relevant information more quickly. Our
approach is particularly well suited to a workplace scenario in which much of the information
that the proposed system needs about the professional interests and skills of knowledge workers is
available to the employer. Personalizing on the basis of such information within an organization can be
expected to raise fewer privacy concerns than a general-purpose search engine gathering
data on user interests. Moreover, unlike other approaches, our approach does not impose any
burden of implicit or explicit feedback on the user.
Figure 3. Knowledge Discovery Process for Interest-Based Personalized Search
We customize the general process of Web mining for KDD in Figure 1 and present the
process of interest-based personalized search for knowledge discovery in Figure 3 where
processes covered by the horizontal double-arrow-lines correspond to equivalent ones in Figure
1. The proposed approach includes a mapping framework that automatically maps user interests
to a group of categories from the Open Directory Project (ODP) taxonomy. A text classifier is
built from the content of the mapped ODP categories and is later used at query time to
categorize search results under user interests. For a workplace scenario where the employees’
professional interests and skills can be automatically extracted from their resumes or the company’s
database, this approach is fully automatic in that users do not need to provide implicit or explicit
feedback during the search. Also, the use of ODP is transparent to the users. The lack of explicit
or implicit feedback and the use of the ODP taxonomy without a user’s awareness of it differentiate
this work from many others, such as [Gauch et al. 2003; Liu et al. 2004; Chirita et al. 2005]. In
addition, we study three search systems with different interfaces for displaying search results.
The first system (LIST) shows search results in a page-by-page list. The second (CAT)
categorizes and displays results under certain ODP categories. The third (PCAT) is what we
propose, and PCAT categorizes and displays results under user interests. We compare PCAT
with LIST and PCAT with CAT on the basis of different query lengths and different types of
search tasks.
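The query-time categorization step described above can be sketched roughly as follows. This is not the classifier used in the dissertation: it assumes that each user interest has already been mapped to some category text, represents texts as plain term-frequency vectors, and assigns each result snippet to the most similar interest by cosine similarity. The profile texts, the snippets, the threshold of 0.1, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def categorize(results, interest_profiles, threshold=0.1):
    """Group result snippets under the most similar interest, else 'Other'."""
    profiles = {
        name: Counter(tokenize(text)) for name, text in interest_profiles.items()
    }
    grouped = {name: [] for name in profiles}
    grouped["Other"] = []
    for snippet in results:
        vec = Counter(tokenize(snippet))
        best, score = max(
            ((name, cosine(vec, prof)) for name, prof in profiles.items()),
            key=lambda item: item[1],
        )
        grouped[best if score >= threshold else "Other"].append(snippet)
    return grouped


# Hypothetical interest profiles standing in for mapped category content.
interest_profiles = {
    "Databases": "database sql query index transaction relational storage",
    "Machine Learning": "learning classifier training model neural regression",
}
results = [
    "Tuning a SQL query with a better index",
    "Training a classifier model on text",
    "Cheap flights to Salt Lake City",
]
grouped = categorize(results, interest_profiles)
```

Results that match no interest fall into an "Other" group, so the user still sees them; only the grouping changes, not the set of results returned by the underlying search engine.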
The contribution of this research is an automatic approach to
personalizing Web searches given a set of user interests. The main findings
are that (1) PCAT is better than LIST for one-word queries and Information Gathering
tasks, and (2) PCAT outperforms CAT for free-form queries and for both Information Gathering and
Finding tasks, in terms of the time spent on finding relevant results. We conclude that
no system is universally better than the others; the performance of a system depends on
parameters such as query length and type of task.
1.3 Business Relationship Discovery
Business news contains rich and current information about companies and the
relationships among them. Reading news is very time consuming and requires a reader to possess
certain skills, the most basic of which is a good understanding of the language in which the news
is written. The huge volume of news stories makes the manual identification of relationships
among a large number of companies nontrivial and unscalable. The previous literature using
news to automatically discover business relationships among companies is sparse. Many
researchers in areas such as organizational behavior and sociology employ SNA techniques to
investigate the nature and implications of business relationships on the basis of explicitly
specified company relationships provided by reliable data sources [e.g. Levine 1972; Walker et
al. 1997; Uzzi 1999; Gulati and Gargiulo 1999]. In contrast, researchers in bibliometrics and
computer science tend to identify links between nodes using implicit signals, such as article
citations, URL links, and email communications, derived from large and noisy data sources.
They study problems such as identifying importance of individual nodes (e.g., Web pages,
journal articles) in a network [e.g. Garfield 1979; Brin and Page 1998; Kleinberg 1999] and
finding communities on the Web [e.g. Kautz et al. 1997; Gibson et al. 1998], instead of
discovering business relationships between companies. We present an approach to the automatic
discovery of company relationships from online business news using machine learning and SNA
techniques. Figure 4 illustrates the knowledge discovery process for business relationship
discovery from Web data (i.e., online news).
Figure 4. Knowledge Discovery Process for Business Relationship Discovery
Given that a news story pertaining to a company often cites one or more other companies,
we construct a directed and weighted intercompany network on the basis of the citations from a
large amount of online news by considering company citations as directed links from the focal
companies to the cited companies. Further, we identify four types of attributes from the network
structure using SNA techniques: dyadic degree-based, node degree-based,
node centrality-based, and structural equivalence-based attributes. These attributes differ
in their coverage of the network. With these network attributes, we study two types of company
relationships using machine learning methods. This news-driven, SNA-based business
relationship discovery approach is scalable and language-neutral. Research along this line
consists of two studies that differ in their target business relationships and we describe them as
follows.
The first one concentrates on predicting a company revenue relation (CRR). Given a pair
of companies, CRR refers to the relative size of two companies’ annual revenues. We find that
degree-based and centrality-based attributes derived from network structure can predict CRR
with reasonable precision, recall, and accuracy (all above 70%) for all directly linked company
pairs in the network.
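The network construction and the CRR target described above can be sketched as follows. This is a simplified illustration under stated assumptions: stories are given as (focal company, cited companies) pairs, edge weights are raw citation counts, self-citations are ignored, the attributes shown are simple dyadic and node degree counts, and CRR is encoded as a binary label. The exact attribute definitions and encoding used in the dissertation are not given in this section, and all names and company placeholders below are hypothetical.

```python
from collections import defaultdict


def build_network(stories):
    """Build a directed, weighted citation network from news stories.

    `stories` is an iterable of (focal_company, cited_companies) pairs.
    The weight of edge (a, b) counts citations of b in stories about a.
    """
    weights = defaultdict(int)
    for focal, cited in stories:
        for other in cited:
            if other != focal:  # ignoring self-citations is an assumption
                weights[(focal, other)] += 1
    return dict(weights)


def pair_features(weights, a, b):
    """Illustrative dyadic and node degree attributes for the pair (a, b)."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for (src, dst), w in weights.items():
        out_deg[src] += w
        in_deg[dst] += w
    return {
        "a_to_b": weights.get((a, b), 0),
        "b_to_a": weights.get((b, a), 0),
        "in_degree_a": in_deg[a],
        "in_degree_b": in_deg[b],
        "out_degree_a": out_deg[a],
        "out_degree_b": out_deg[b],
    }


def crr_label(revenue_a, revenue_b):
    """Binary CRR target: 1 if a's annual revenue exceeds b's, else 0."""
    return 1 if revenue_a > revenue_b else 0


# Placeholder companies "A", "B", "C"; this is not real news data.
stories = [("A", ["B", "C"]), ("A", ["B"]), ("B", ["A"]), ("C", ["B"])]
network = build_network(stories)
features = pair_features(network, "A", "B")
```

Feature dictionaries of this kind, paired with the label from `crr_label`, would form the training examples for a classifier; the choice of learning method is left open here.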
The contributions of this study are as follows: (1) our approach can serve as a data filtering step for
studying the revenue relations among a very large number of companies; (2) since revenue
information for public companies is available quarterly, our approach can be used as a prediction
tool for revenues; and (3) our approach can be applied to discover the revenue relations for private or
foreign companies as well.
In the second study we examine the competitor relationship between companies. We discover
the competitor relationship between a pair of connected companies in the intercompany network
on the basis of the four types of attributes. In particular, we study the classification of
company pairs for an imbalanced dataset in which the number of competitor pairs is much smaller
than that of non-competitor pairs. To evaluate the classification performance of our
approach, we use two gold standards, Hoovers.com and Mergentonline.com, which are professional
company profile websites that contain manually identified competitors for each company.
Given that neither of the gold standards is complete in its coverage of competitors, we
estimate the coverage of each gold standard. Finally we present metrics to estimate how much
our approach can extend each of the gold standards.
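Why the imbalance matters for evaluation can be seen in a small sketch. The example below uses the standard definitions of precision, recall, and accuracy on made-up labels (5 competitor pairs among 100 company pairs); the label counts, the two prediction lists, and the function name are assumptions for illustration and do not reflect the dissertation's data or results.

```python
def evaluate(actual, predicted, positive=1):
    """Return precision, recall, and accuracy for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": correct / len(pairs) if pairs else 0.0,
    }


# Made-up labels: 1 marks a competitor pair, 0 a non-competitor pair.
actual = [1] * 5 + [0] * 95

# Baseline that never predicts "competitor".
always_negative = [0] * 100

# Hypothetical classifier: finds 3 of 5 competitors, with 3 false alarms.
some_hits = [1, 1, 1, 0, 0] + [1] * 3 + [0] * 92

baseline = evaluate(actual, always_negative)
candidate = evaluate(actual, some_hits)
```

Both prediction lists reach the same accuracy, yet only the second one recovers any competitor pairs. This is why performance on the minority (competitor) class is worth reporting separately from overall accuracy.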
The contributions of this work include an automatic approach to
discovering competitor relationships between companies. Our approach is particularly useful
as an initial data filtering step to identify a group of potential competitors for each of many
companies. We study an imbalanced dataset problem and report the classification performance
for competitor pairs in both the imbalanced dataset and the whole dataset. Most importantly, we
report the estimated extension that our approach provides to each of the two gold standards.
1.4 Overview of Dissertation
At a high level, the dissertation is organized as follows. Part I, which consists of Chapters 2
to 5, covers the first topic of the dissertation: Interest-based Personalized Search. Part II, which
includes Chapters 6 to 9, covers the two related studies in business relationship discovery. More
specifically, we highlight each chapter as follows.
Chapter 2 introduces the research on personalized search and reviews related prior work. We
detail our approach to personalized search in Chapter 3. Experiments are covered in Chapter 4,
and result analyses and conclusions are discussed in Chapter 5. For the topic of business
relationship discovery, we introduce it and review prior literature in Chapter 6. Chapter 7
describes how to identify attributes from the network structure and explains the data and data
processing procedures. We concentrate on predicting CRR in Chapter 8 and on discovering competitor
relationships in Chapter 9. Finally, we conclude the dissertation in Chapter 10.
1.5 Proposed Plan
The timeline of my dissertation is as follows.
Feb. 13, 2007        Proposal defense
Mar. 16, 2007        Sending dissertation draft to committee members and to Thesis Office for
                     format approval
Mar. 30, 2007        Update on the dissertation draft
Apr. 3 or 10, 2007   Dissertation defense