Dissertation Proposal
WEB MINING FOR KNOWLEDGE DISCOVERY
Zhongming Ma
Ph.D. Candidate in Information Systems
School of Accounting and Information Systems
David Eccles School of Business
The University of Utah
Co-chairs
Dr. Gautam Pant and Dr. Olivia Sheng
Committee members
Dr. Paul Hu
Dr. Ellen Riloff
Dr. Wei Gao
TABLE OF CONTENTS

ABSTRACT

1. DISSERTATION PROPOSAL
1.1 Knowledge Discovery on the Web
1.2 Personalized Search
1.3 Business Relationship Discovery
1.4 Overview of Dissertation
1.5 Proposed Plan

DRAFT DISSERTATION PART I PERSONALIZED SEARCH

2 INTRODUCTION AND LITERATURE REVIEW
2.1 Introduction
2.2 Literature Review
2.2.1 Query Expansion
2.2.2 Result Processing
2.2.3 Representing Context Using Taxonomy
2.2.4 Taxonomy of Web Activities
2.2.5 Text Categorization

3 OUR APPROACH
3.1 Step 1: Obtaining an Interest Profile
3.2 Step 2: Generating Category Profiles
3.3 Step 3: Mapping Interests to ODP Categories
3.3.1 Mapping Method 1: Simple Term Match
3.3.2 Mapping Method 2: Most Similar Category Profile
3.3.3 Mapping Method 3: Most Similar Category Profile while Augmenting Interest with Potentially Related Nouns
3.3.4 Mapping Method 4: Most Similar Category Profile while Augmenting Interest with Potentially Related Noun Phrases
3.4 Step 4: Resolving Mapped Categories
3.5 Step 5: Categorizing Search Results
3.6 Implementation

4 EXPERIMENTS
4.1 Studied Domains and Domain Experts
4.2 Professional Interests, Search Tasks, and Query Length
4.2.1 Professional Interests (Interest Profiles)
4.2.2 Search Tasks
4.2.3 Query Length
4.3 Subjects
4.4 Experiment Process
5 EVALUATIONS AND DISCUSSIONS
5.1 Comparing Mean Log Search Time by Query Length
5.2 Comparing Mean Log Search Time for Information Gathering Tasks
5.3 Comparing Mean Log Search Time for Site Finding Tasks
5.4 Comparing Mean Log Search Time for Finding Tasks
5.5 Questionnaire and Hypotheses
5.5.1 Questionnaire
5.5.2 Hypotheses
5.6 Hypothesis Test Based on Questionnaire
5.7 Comparing Indices of Relevant Results
5.8 Discussions

DRAFT DISSERTATION PART II BUSINESS RELATIONSHIP DISCOVERY

6 INTRODUCTION AND LITERATURE REVIEW
6.1 Introduction
6.2 Literature Review

7 NETWORK-BASED ATTRIBUTES AND DATA
7.1 Notation in Directed Graphs
7.2 Notation in Directed, Weighted Graphs
7.2.1 Dyadic and Node Degree-based Attributes
7.2.2 Centrality-based Attributes
7.2.3 Structural Equivalence (SE) based Attributes
7.4 Raw Data
7.5 Preliminary Data Processing
7.6 Node and Link Identification
7.7 Attribute Distributions
7.7.1 Node Indegree Distribution
7.7.2 Link Weight Distribution
7.7.3 Revenue Distribution
7.7.4 Revenue Node Weighted Indegree Distribution

8 PREDICTING COMPANY REVENUE RELATIONS (CRR)
8.1 Measurements of CRR
8.2 Research Questions
8.3 Research Methods
8.3.1 Classification Methods
8.3.2 Discriminant Analysis with Logistic Regression
8.4 Results and Analyses
8.4.1 Positive CRR and Top Links by DWND
8.4.2 Positive CRR by DWND
8.4.3 Predicting CRR
8.4.4 Predicting Top-N Companies by Revenue
8.4.5 Discriminant Variate
8.5 Discussions
9 DISCOVERING COMPETITOR RELATIONSHIPS
9.1 Approach Outline and Research Questions
9.2 Datasets
9.2.1 Dataset I
9.2.2 Datasets II and III
9.3 Examining the Competitor Coverage & Competitor Density of the Intercompany Network
9.3.1 Examining the Competitor Coverage
9.3.2 Examining the Competitor Density
9.4 Competitor Discovery
9.4.1 Evaluation Metrics
9.4.2 Classification Methods for Dataset I
9.4.3 Classification Methods for Dataset II
9.4.4 Classification Performance for Dataset I
9.4.5 Classification Performance for Dataset II
9.4.6 Estimated Overall Classification Performance for Dataset III
9.5 Competitor Extension
9.5.1 Estimating the Coverage of a Gold Standard
9.5.2 Estimating the Extension of Our Approach to a Gold Standard
9.6 Explorations on Competitors vs. Non-competitor Pairs
9.6.1 Comparing SE Similarities between Competitors and Non-Competitor Pairs
9.6.2 Comparing Annual Revenues between Competitors and Non-Competitor Pairs
9.7 Discussions

10 CONCLUSIONS

REFERENCES
1 DISSERTATION PROPOSAL
1.1 Knowledge Discovery on the Web
Knowledge discovery from databases (KDD) refers to "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" [Fayyad et al. 1996]. KDD has found a broad range of applications, including pattern recognition and predictive analytics, in areas such as engineering, business, and science. Knowledge discovery has two types of goals: verification and discovery. In general, the former refers to verifying a user's hypothesis, while the latter can be further divided into prediction (i.e., predicting unknown or future values) and description (i.e., presenting identified results, such as patterns, in a human-understandable form) [Fayyad et al. 1996].
The Web has become a universal repository holding a tremendous amount of data that can be accessed from anywhere in the world, and it has experienced continuous growth in both content and users. The Web therefore presents immense opportunities for discovering knowledge. However, unlike conventional databases, data on the Web is mostly unstructured, which makes knowledge discovery on the Web more challenging than KDD on traditional databases. On the Web, the knowledge discovery process requires considerable effort to identify, select, and process data, possibly from multiple sources and in different (often free-form text) formats. Manually turning such large volumes of Web data into knowledge is impractical, and thus knowledge discovery on the Web becomes an attempt to address the accentuated
problem of data overload. We adapt the KDD process presented in [Fayyad et al. 1996] for Web mining and present the process of Web mining for knowledge discovery as follows.
Figure 1. Process of Web mining for knowledge discovery
Web mining is a step in the KDD process that aims to analyze data and discover knowledge from the Web. Web data includes all kinds of Web documents, hyperlinks among Web pages, and Web usage logs. Depending on the type of Web data being mined, Web mining can be broadly divided into three categories: Web content mining, Web structure mining, and Web usage mining [Srivastava et al. 2000].
Web content mining is the process of discovering knowledge from Web page content (i.e., often text), and it often uses techniques based on data mining and text mining [Liu 2006]. Important Web content mining problems include data/information extraction [e.g., Hammer et al. 1997], Web information integration [e.g., Knoblock et al. 1998], online opinion extraction, Web search [e.g., Brin and Page 1998], and processing (e.g., clustering or categorizing) search results according to page content [e.g., Zamir and Etzioni 1999; Dumais and Chen 2001] [Liu 2006].
Web structure mining tries to discover useful information, such as the importance of pages, from the structure of hyperlinks, on the basis of social network analysis (SNA) techniques and graph theory. Its research topics cover ranking pages [e.g., Brin and Page 1998; Chakrabarti et al. 1999], finding Web communities [e.g., Gibson et al. 1998], etc.
Web usage mining is the automatic discovery of user access patterns from Web logs [Cooley et al. 1997]. The identified visit patterns can help in understanding the overall access patterns and trends for all users [e.g., Zaïane et al. 1998] and allow for Web site design that is responsive to business goals and customer needs, such as user-level customization [e.g., Eirinaki and Vazirgiannis 2003].
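As a concrete illustration of Web structure mining, the PageRank idea cited above [Brin and Page 1998] can be sketched in a few lines. The four-page hyperlink graph below is a made-up example, and the function is a simplified power-iteration version, not a production implementation.

```python
# Minimal PageRank sketch: page importance derived purely from the
# hyperlink graph, via power iteration.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:            # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy Web: every page other than C links (directly or not) into C.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
# "C" accumulates the most rank, since three pages link to it.
```

On this toy graph the ranks sum to one and page "C" comes out on top, which matches the intuition that heavily cited pages are important.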
My dissertation consists of two related topics/parts: personalized search and business relationship discovery, both of which lie in the area of Web mining for knowledge discovery. The first topic presents and evaluates an automatic personalized search framework that categorizes search results under a user's interests, in order to examine how the proposed personalized search approach outperforms non-categorized and non-personalized baseline systems. This research falls under Web content mining. The second topic proposes an approach to identifying an intercompany network using company citations from Web content (more specifically, online news stories) and discovers business relationships between companies from the network on the basis of SNA and machine learning techniques. The second topic therefore covers both Web content mining and Web structure mining. The main research question we explore is whether structural attributes derived from the intercompany network, which in turn is derived from company citations in online news, can identify business relationships. As
shown in Figure 2, at a high level, the first topic connects Web content to people, and the second uses Web content to discover connections between companies. Thus the two topics are connected through the mining of Web content. However, the two topics generate different types of knowledge (interest-based personalized search results versus news-driven intercompany relationships) and hence entail different choices of Web data, processing, and Web mining techniques. In the next two sections we briefly introduce the two topics.
Figure 2. Process View of the Two Topics of the Dissertation
1.2 Personalized Search
Most search engines, including the most popular ones such as Google and Yahoo!, ignore users' search context, such as their interests. As a result, the same query from users with different information needs retrieves the same search results, displayed in the same way. Hence, these engines take a "one size fits all" [Lawrence 2000]
approach. We note that Google is currently attempting to address this problem with some level of voluntary personalization. Personalization techniques that consider users' context during search can improve search efficiency [Pitkow et al. 2002]. We propose and implement an automatic approach to categorizing search results according to a user's interests, to help users find relevant information and find it more quickly. Our approach is particularly well suited to a workplace scenario, where much of the information the proposed system needs, namely the professional interests and skills of knowledge workers, is available to the employer. Personalizing based on such information within an organization can be expected to raise fewer privacy concerns than a general-purpose search engine gathering data on user interests. Moreover, unlike other approaches, our approach imposes no burden of implicit or explicit feedback on the user.
Figure 3. Knowledge Discovery Process for Interest-Based Personalized Search
We customize the general process of Web mining for KDD in Figure 1 and present the process of interest-based personalized search for knowledge discovery in Figure 3, where the processes spanned by the horizontal double-arrow lines correspond to their equivalents in Figure 1. The proposed approach includes a mapping framework that automatically maps user interests onto a group of categories from the Open Directory Project (ODP) taxonomy. A text classifier is built from the content of the mapped ODP categories and is later used at query time to categorize search results under user interests. For a workplace scenario where employees' professional interests and skills can be automatically extracted from their resumes or the company's database, this approach is fully automatic in that users do not need to provide implicit or explicit feedback during the search. Also, the use of ODP is transparent to the users. The lack of explicit or implicit feedback and the use of the ODP taxonomy without the user's awareness of it differentiate this work from many others, such as [Gauch et al. 2003; Liu et al. 2004; Chirita et al. 2005]. In addition, we study three search systems with different interfaces for displaying search results. The first system (LIST) shows search results in a page-by-page list. The second (CAT) categorizes and displays results under certain ODP categories. The third (PCAT) is what we propose: PCAT categorizes and displays results under user interests. We compare PCAT with LIST and PCAT with CAT on the basis of different query lengths and different types of search tasks.
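To make the mapping step concrete, the following is a minimal sketch of Mapping Method 2 (most similar category profile): the interest's term vector is compared with each category's term profile by cosine similarity. The category names and term profiles are illustrative stand-ins, not real ODP data.

```python
# Sketch: map a user interest to the ODP category whose term profile is
# most similar (cosine similarity over raw term counts).
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)     # missing keys count as zero
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def map_interest(interest: str, category_profiles: dict) -> str:
    vec = Counter(interest.lower().split())
    return max(category_profiles,
               key=lambda c: cosine(vec, category_profiles[c]))

# Hypothetical category profiles built from category content.
profiles = {
    "Computers/Data_Mining": Counter(
        "data mining knowledge discovery patterns".split()),
    "Business/Accounting": Counter(
        "accounting audit revenue financial reporting".split()),
}
best = map_interest("knowledge discovery from data", profiles)
# → "Computers/Data_Mining"
```

In the actual framework, Methods 3 and 4 would additionally augment the interest with related nouns or noun phrases before the comparison.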
The contribution of this research is an automatic approach to personalizing Web search given a set of user interests. The main findings include (1) PCAT is better than LIST for one-word queries and for Information Gathering tasks, and (2) PCAT outperforms CAT for free-form queries and for both
Information Gathering and Finding types of tasks, in terms of the time spent on finding relevant results. We conclude that no system is universally better than the others; the performance of a system depends on parameters such as query length and type of task.
1.3 Business Relationship Discovery
Business news contains rich and current information about companies and the relationships among them. Reading news is very time consuming and requires a reader to possess certain skills, the most basic of which is a good understanding of the language in which the news is written. The huge volume of news stories makes the manual identification of relationships among a large number of companies nontrivial and unscalable. The previous literature on using news to automatically discover business relationships among companies is sparse. Many researchers in areas such as organizational behavior and sociology employ SNA techniques to investigate the nature and implications of business relationships on the basis of explicitly specified company relationships provided by reliable data sources [e.g., Levine 1972; Walker et al. 1997; Uzzi 1999; Gulati and Gargiulo 1999]. In contrast, researchers in bibliometrics and computer science tend to identify links between nodes using implicit signals, such as article citations, URL links, and email communications, derived from large and noisy data sources. They study problems such as identifying the importance of individual nodes (e.g., Web pages, journal articles) in a network [e.g., Garfield 1979; Brin and Page 1998; Kleinberg 1999] and finding communities on the Web [e.g., Kautz et al. 1997; Gibson et
al. 1998], rather than discovering business relationships between companies. We present an approach for the automatic discovery of company relationships from online business news using machine learning and SNA techniques. Figure 4 illustrates the knowledge discovery process for business relationship discovery from Web data (i.e., online news).
Figure 4. Knowledge Discovery Process for Business Relationship Discovery
Given that a news story pertaining to a company often cites one or more other companies, we construct a directed and weighted intercompany network on the basis of the citations in a large collection of online news stories, treating company citations as directed links from the focal companies to the cited companies. We then identify four types of attributes from the network structure using SNA techniques: dyadic degree-based, node degree-based, node centrality-based, and structural equivalence-based attributes. These attributes differ in their coverage of the network. With these network attributes, we study two types of company relationships using machine learning methods. This news-driven, SNA-based business relationship discovery
approach is scalable and language-neutral. Research along this line consists of two studies that differ in their target business relationships; we describe them as follows. The first concentrates on predicting a company revenue relation (CRR). Given a pair of companies, CRR refers to the relative size of the two companies' annual revenues. We find that degree-based and centrality-based attributes derived from the network structure can predict CRR with reasonable precision, recall, and accuracy (all above 70%) for all directly linked company pairs in the network.
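The network construction described above can be sketched as follows: each company citation adds weight to a directed link, and a node's weighted indegree (DWND) is one of the degree-based attributes. The citation list here is fabricated for illustration.

```python
# Sketch: build a directed, weighted intercompany network from
# (focal company, cited company) citation pairs, then compute a
# degree-based node attribute (weighted indegree).
from collections import defaultdict

citations = [  # one tuple per citation extracted from a news story
    ("AcmeCo", "BetaInc"), ("AcmeCo", "BetaInc"),
    ("GammaLLC", "BetaInc"), ("BetaInc", "AcmeCo"),
]

weight = defaultdict(int)            # link weight per directed pair
for focal, cited in citations:
    weight[(focal, cited)] += 1

weighted_indegree = defaultdict(int)
for (focal, cited), w in weight.items():
    weighted_indegree[cited] += w
# BetaInc is cited three times, so its weighted indegree is 3.
```

Centrality- and structural-equivalence-based attributes would be computed from the same weighted adjacency structure.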
The contributions of this study are that (1) our approach can serve as a data filtering step for studying the revenue relations among a very large number of companies; (2) since revenue information for public companies is available only quarterly, our approach can serve as a more timely prediction tool for revenues; and (3) our approach can also be applied to discover the revenue relations of private or foreign companies.
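As a hypothetical illustration of how degree attributes relate to CRR (cf. Section 8.4.2, "Positive CRR by DWND"), the following baseline guesses that, for a linked pair, the company with the larger weighted indegree also has the larger annual revenue. All numbers are invented, and this rule is a sketch, not the dissertation's actual classifier.

```python
# Toy attribute values: weighted indegree (DWND) and annual revenue.
dwnd = {"AcmeCo": 14, "BetaInc": 3, "GammaLLC": 9}
revenue = {"AcmeCo": 120.0, "BetaInc": 8.5, "GammaLLC": 40.0}

def predict_crr(a, b):
    """Return the company predicted to have the larger revenue."""
    return a if dwnd[a] >= dwnd[b] else b

pairs = [("AcmeCo", "BetaInc"), ("BetaInc", "GammaLLC"),
         ("AcmeCo", "GammaLLC")]
correct = sum(
    predict_crr(a, b) == (a if revenue[a] >= revenue[b] else b)
    for a, b in pairs
)
accuracy = correct / len(pairs)
```

On this toy data the rule is right on every pair; the dissertation instead reports precision, recall, and accuracy of trained classifiers over all directly linked pairs.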
In the second study we examine the competitor relationship between companies. We discover the competitor relationship between pairs of connected companies in the intercompany network on the basis of the four types of attributes. In particular, we study the classification of company pairs in an imbalanced dataset, where the number of competitor pairs is much smaller than that of non-competitor pairs. To evaluate the classification performance of our approach, we use two gold standards, Hoovers.com and Mergentonline.com, professional company profile websites that contain manually identified competitors for each company. Given that neither gold standard is complete in its coverage of competitors, we estimate the coverage of each. Finally, we present metrics to estimate how much our approach can extend each of the gold standards.
The contributions of this work include an automatic approach to discovering competitor relationships between companies. Our approach is particularly useful as an initial data filtering step to identify a group of potential competitors for each of many companies. We study an imbalanced dataset problem and report the classification performance for competitor pairs in both the imbalanced dataset and the whole dataset. Most importantly, we report the estimated extension of our approach to each of the two gold standards.
1.4 Overview of Dissertation
At a high level, the dissertation is organized as follows. Part I, which consists of Chapters 2 to 5, covers the first topic of the dissertation: Interest-based Personalized Search. Part II, which includes Chapters 6 to 9, covers the two related studies in business relationship discovery. More specifically, we highlight each chapter as follows. Chapter 2 introduces the research on personalized search and reviews related prior work. We detail our approach to personalized search in Chapter 3. Experiments are covered in Chapter 4, and result analyses and conclusions are discussed in Chapter 5. For the topic of business relationship discovery, we introduce it and review the prior literature in Chapter 6. Chapter 7 describes how to identify attributes from the network structure and explains the data and data processing procedures. We concentrate on predicting CRR in Chapter 8 and on discovering competitor relationships in Chapter 9. Finally, we conclude the dissertation in Chapter 10.
1.5 Proposed Plan
The timeline of my dissertation is as follows.

Feb. 13, 2007: Proposal defense
Mar. 16, 2007: Sending dissertation draft to committee members and to the Thesis Office for format approval
Mar. 30, 2007: Update on the dissertation draft
Apr. 3 or 10, 2007: Dissertation defense
DRAFT DISSERTATION PART I INTEREST-BASED PERSONALIZED SEARCH
2 INTRODUCTION AND LITERATURE REVIEW
2.1 Introduction
The Web provides an extremely large and dynamic source of information, and the
continuous creation and updating of Web pages magnifies information overload on the
Web. Both casual and non-casual users (such as knowledge workers) often use search
engines to find a “needle” in this constantly growing “haystack.” Sellen et al. [2002],
who define a knowledge worker as someone “whose paid work involves significant time
spent in gathering, finding, analyzing, creating, producing or archiving information,”
report that 59% of the tasks performed on the Web by a sample of knowledge workers
fall into the categories of Information Gathering and Finding, which require an active use
of Web search engines.
Most existing Web search engines return a list of search results based on a user's query but ignore the user's specific interests and/or search context. Therefore, the identical query from different users or in different contexts will generate the same set of results, displayed in the same way for all users, a so-called "one size fits all" [Lawrence 2000] approach. Furthermore, the number of search results returned by a search engine is often so large that the results must be partitioned into multiple result pages. In addition,
individual differences in information needs, polysemy (multiple meanings of the same word), and synonymy (multiple words with the same meaning) pose problems [Deerwester et al. 1990], in that a user may have to go through many irrelevant results or try several queries before finding the desired information. The problems encountered in searching are exacerbated further when search engine users employ short queries [Jansen et al. 1998]. However, personalization techniques that put a search in the context of the user's interests may alleviate some of these issues.
In this study, which focuses on knowledge workers' search for information online in a workplace setting, we assume that some information about the knowledge workers, such as their professional interests and skills, is known to the employing organization and can be extracted automatically with an information extraction (IE) tool or database queries. The organization then can use such information as an input to a system based on our proposed approach and provide knowledge workers with a personalized search tool that will reduce their search time and boost their productivity.
For a given query, a personalized search can provide different results for different users or organize the same results differently for each user. It can be implemented on either the server side (search engine) or the client side (organization's intranet or user's computer). Personalized search implemented on the server side is computationally expensive when millions of users are using the search engine, and it also raises privacy concerns because information about users is stored on the server. A personalized search on the client side can be achieved by query expansion and/or result processing [Pitkow et al. 2002]. By adding extra query terms associated with user interests or search context, the query expansion approach can retrieve different sets of
results. Result processing includes result filtering, such as the removal of some results, and result reorganization, such as re-ranking, clustering, and categorizing the results.
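A minimal sketch of the query expansion side of client-side personalization, assuming interest profiles are available as plain phrases (the helper below is hypothetical, not from an existing library):

```python
# Sketch: expand a raw query with terms drawn from the user's interest
# profile before sending it to the search engine.
def expand_query(query: str, interests: list, max_terms: int = 3) -> str:
    extra = []
    for interest in interests:
        for term in interest.lower().split():
            # skip terms already in the query or already collected
            if term not in query.lower().split() and term not in extra:
                extra.append(term)
    return (query + " " + " ".join(extra[:max_terms])).strip()

expanded = expand_query("classifier",
                        ["text categorization", "machine learning"])
# → "classifier text categorization machine"
```

The expanded query biases retrieval toward the user's interests; result processing (filtering, re-ranking, categorizing) would then operate on whatever the engine returns.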
Our proposed approach is a form of client-side personalization based on an interest-to-taxonomy mapping framework and result categorization. It piggybacks on a standard search engine such as Google [1] and categorizes and displays search results on the basis of known user interests. As a novel feature of our approach, the mapping framework automatically maps the known user interests onto a set of categories in a Web directory, such as the Open Directory Project [2] (ODP) or the Yahoo! [3] directory. An advantage of this mapping framework is that after user interests have been mapped onto the categories, a large amount of manually edited data under these categories is freely available to build text classifiers that correspond to these user interests. The text classifiers then can categorize search results according to the user's various interests at query time. The same text classifiers may be used to categorize e-mails and other digital documents, which suggests our approach may extend to the broader domain of content management.
The main research questions that we explore are as follows: (1) What is an appropriate framework for mapping a user's professional interests and skills onto a group of concepts in a taxonomy such as a Web directory? (2) How does a personalized categorization system (PCAT) based on our proposed approach perform compared with a list interface system (LIST) similar to a conventional search engine? (3) How does PCAT perform compared with a non-personalized categorization system (CAT) that categorizes results without any personalization? The third question attempts to separate
[1] http://www.google.com
[2] http://www.dmoz.com
[3] http://www.yahoo.com
the effect of categorization from the effect of personalization in the proposed system. We
explore the second and third questions along two dimensions: type of task and query
length.
Figure 5 illustrates the inputs and outputs of these three systems. LIST requires two inputs, a search query and a search engine, and its output, similar to that of a conventional search engine, is a page-by-page list of search results. Using a large taxonomy (the ODP Web directory), CAT classifies search results and displays them under certain taxonomy categories; in other words, it uses the ODP taxonomy as an additional input. Finally, PCAT adds another input: a set of user interests. The mapping framework in PCAT automatically identifies a group of categories from the ODP taxonomy as relevant to the user's interests. Using data from these relevant categories, the system generates text classifiers to categorize search results under the user's various interests at query time.
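Sketched at its simplest, query-time categorization places each result snippet under the interest whose classifier scores it highest. The keyword-overlap scorer below is a toy stand-in for the text classifiers that would actually be trained from the mapped ODP category data.

```python
# Sketch: assign a search result snippet to the best-scoring interest.
# Term sets per interest stand in for trained per-interest classifiers.
interest_terms = {
    "data mining": {"mining", "patterns", "clustering", "classification"},
    "accounting":  {"audit", "revenue", "ledger", "tax"},
}

def categorize(snippet: str) -> str:
    words = set(snippet.lower().split())
    # score = number of interest terms appearing in the snippet
    return max(interest_terms, key=lambda i: len(words & interest_terms[i]))

label = categorize("new clustering and classification patterns in large data")
# → "data mining"
```

In PCAT the scores would come from statistical text classifiers, so results could be grouped under several interests in the display rather than forced into a single bucket.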
Figure 5. Input and output of the three systems
We compare PCAT with LIST and with CAT in two sets of controlled experiments. Compared with LIST, PCAT works better for searches with short queries and for Information Gathering tasks. In addition, PCAT outperforms CAT for both Information Gathering and Finding tasks and for searches with free-form queries. Subjects indicated that PCAT enabled them to identify relevant results and complete the given tasks more quickly and easily than did LIST or CAT.
2.2 Related Literature
This section reviews prior studies pertaining to personalized search. We also
consider several studies using the ODP taxonomy to represent a search context, review
studies on the taxonomy of Web activities, and end by briefly discussing text
categorization.
According to Lawrence [2000], next generation search engines will increasingly
use context information. Pitkow et al. [2002] also suggest that a contextual computing
approach that enhances user interactions through a greater understanding of the user, the
context, and the applications may prove a breakthrough in personalized search efficiency.
They further identify two primary ways to personalize search, query expansion and result
processing, which can complement each other [Pitkow et al. 2002].
2.2.1 Query Expansion
We use an approach similar to query expansion to find terms related to user
interests in our interest mapping framework. Query expansion refers to the process of
augmenting a query from a user with other words or phrases to improve search
effectiveness. It originally was applied in information retrieval (IR) to solve the problem
of word mismatch that arises when search engine users employ different terms than those
used by content authors to describe the same concept [Xu and Croft 1996]. Because the
word mismatch problem can be reduced through the use of longer queries, query
expansion may offer a solution [Xu and Croft 1996].
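Query expansion of this kind can be sketched in a few lines. The sketch below is illustrative only: the hard-coded co-occurrence counts stand in for whatever corpus or result-snippet statistics a real system would gather.

```python
# Illustrative query expansion: augment a user query with the terms that
# most frequently co-occur with it in some reference source. The
# co-occurrence counts below are invented for this example.

def expand_query(query, cooccurrence, n=2):
    """Append the n terms that most often co-occur with the query."""
    related = cooccurrence.get(query, {})
    # Sort candidate terms by co-occurrence frequency, highest first.
    top = sorted(related, key=related.get, reverse=True)[:n]
    return " ".join([query] + top)

# Toy co-occurrence counts (invented).
stats = {"jaguar": {"car": 12, "cat": 9, "os": 3}}
print(expand_query("jaguar", stats))  # "jaguar car cat"
```

The expanded query is then sent to the search engine in place of the original one, which is how expansion reduces the word-mismatch problem described above.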
In line with query expansion, current literature provides various definitions of
context. In the Inquirus 2 project [Glover et al. 1999], a user manually chooses a context
in the form of a category, such as “research papers” or “organizational homepages,”
before starting a search. Y!Q4, a large-scale contextual search system, allows a user to
choose a context in the form of a few words or a whole article through three methods: a
novel information widget executed in the user’s Web browser, Yahoo! Toolbar5 or
Yahoo! Messenger6 [Kraft et al. 2005]. In the Watson project, Budzik and Hammond
[2000] derive context information from the whole document a user views. Instead of
using a whole document, Finkelstein et al. [2002] limit the context to the text surrounding
a user-marked query term(s) in the document. That text is part of the whole document, so
their query expansion is based on a local context analysis approach [Xu and Croft 1996].
4 http://yq.search.yahoo.com.
5 http://toolbar.yahoo.com.
6 http://beta.messenger.yahoo.com.
Leroy et al. [2003] define context as the combination of titles and descriptions of clicked
search results after an initial query. In all these studies, queries get expanded on the basis
of the context information, and results are generated according to the expanded queries.
2.2.2 Result Processing
Relatively few studies deal with result processing, which includes result
filtering and reorganization. Domain filtering eliminates documents irrelevant to given
domains from the search results [Oyama et al. 2004]. For example, Ahoy!, a homepage
finder system, uses domain-specific filtering to eliminate most results returned by one or
more search engines but retain a few pages that are likely to be personal homepages
[Shakes et al. 1997]. Tan and Teo [1998] propose a system that filters out news items that
may not be of interest to a given user according to that user’s explicit (e.g., satisfaction
ratings) and implicit (e.g., viewing order, duration) feedback to create personalized news.
Another approach to result processing is reorganization, which involves re-ranking,
clustering, and categorizing search results. For example, Teevan et al. [2005] construct a
user profile (context) over time from rich resources, including issued queries, visited Web
pages, and composed or read documents and e-mails. When the user sends a query, the
system re-ranks the search results on the basis of the learned profile. Shen et al. [2005a]
use previous queries and summaries of clicked results in the current session to re-rank
results for a given query. Similarly, UCAIR [Shen et al. 2005b], a client-side
personalized search agent, employs both query expansion on the basis of the immediately
preceding query and result re-ranking on the basis of summaries of viewed results. Other
works also consider re-ranking according to a user profile [Gauch et al. 2003; Sugiyama
et al. 2004; Speretta and Gauch 2005; Chirita et al. 2005; Kraft et al. 2005]. Gauch et al.
[2003] and Sugiyama et al. [2004] learn a user’s profile from his or her browsing history,
whereas Speretta and Gauch [2005] build the profile on the basis of search history, and
Chirita et al. [2005] require the user to specify the profile entries manually.
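As a minimal sketch of profile-based re-ranking (not the code of any system cited above), search results can be scored by the cosine similarity between each result's text and a user profile, both represented as simple term-frequency vectors; the profile text and results below are invented.

```python
# Illustrative re-ranking by similarity to a user profile.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank(results, profile_text):
    """Order result texts by decreasing similarity to the profile."""
    profile = Counter(profile_text.lower().split())
    scored = [(cosine(Counter(r.lower().split()), profile), r) for r in results]
    return [r for _, r in sorted(scored, key=lambda x: -x[0])]

results = ["java programming language tutorial",
           "java island travel guide",
           "coffee brewing tips"]
ranked = rerank(results, "programming language software java")
print(ranked[0])  # the programming result rises to the top
```

Real systems score richer representations (snippets, anchor text, click history), but the re-ranking step itself reduces to this kind of similarity sort.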
Scatter/Gather [Cutting et al. 1992] is one of the first systems to present
documents in clusters. Another system, Grouper [Zamir and Etzioni 1999], uses snippets
of search engine results to cluster the results. Tan [2002] presents a user-configurable
clustering approach that clusters search results using their titles and snippets; the user
can manually modify these clusters.
Finally, in comparing seven interfaces that display search results, Dumais and
Chen [2001] report that all interfaces that group results into categories are more effective
than conventional interfaces that display results as a list. They also conclude that the best
performance occurs when both category names and individual page titles and summaries
are presented. We closely follow these recommendations for the two categorization
systems we study (PCAT and CAT). In recent work, Käki [2005] also finds that result
categorization is helpful when the search engine fails to provide relevant results at the top
of the list.
2.2.3 Representing Context Using Taxonomy
In our approach we map user interests to categories in the ODP taxonomy. Figure
6 shows a portion of the ODP taxonomy, in which Computers is a depth-one category and
C++ and Java are categories at depth four. We refer to Computers/Programming/
Languages as the parent category of category C++ or Java. Hence various concepts
(categories) are related through a hierarchy in the taxonomy. Currently, the ODP is a
manually edited directory of 4.6 million URLs that have been categorized into 787,774
categories by 68,983 human editors. The ODP taxonomy has been applied to
personalization of Web search in some prior studies [Pitkow et al. 2002; Gauch et al.
2003; Liu et al. 2004; Chirita et al. 2005].
Figure 6. ODP taxonomy
For example, the Outride personalized search system (acquired by Google)
performs both query modification and result processing. It builds a user profile (context)
on the basis of a set of personal favorite links, the user’s last 1000 unique clicks, and the
ODP taxonomy, then modifies queries according to that profile. It also re-ranks search
results on the basis of usage and the user profile. The main focus of the Outride system is
capturing a user’s profile through his or her search and browsing behaviors [Pitkow et al.
2002]. The OBIWAN system [Gauch et al. 2003] automatically learns a user’s interest
profile from his or her browsing history and represents those interests with concepts in
the Magellan taxonomy. It maps each visited Web page to the five taxonomy concepts with the
highest similarities; thus, the user profile consists of accumulated categories generated
over a collection of visited pages. Liu et al. [2004] also build a user profile that consists
of previous search query terms and five words that surround each query term in each
Web page clicked after the query is issued. The user profile then is used to map the user’s
search query onto three depth-two ODP categories. In contrast, Chirita et al. [2005] use a
system in which a user manually selects ODP categories as entries in his or her profile.
When re-ranking search results, they measure the similarity between a search result and
the user profile using the node distance in a taxonomy concept tree, which means the
search result must be associated with an ODP category. A difficulty with their study is that
many parameter values are set without explanation. The current Google
personalized search7 also explicitly asks users to specify their interests through the
Google directory.
Similar to Gauch et al. [2003], we represent user interests with taxonomy
concepts, but we do not need to collect browsing history. Unlike Liu et al. [2004], we do
not need to gather previous search history, such as search queries and clicked pages, or
know the ODP categories corresponding to the clicked pages. Whereas Gauch et al.
[2003] map a visited page onto five ODP categories and Liu et al. [2004] map a search
query onto three categories, we automatically map a user interest onto an ODP category.
A difference between [Chirita et al. 2005] and our approach is that when mapping a
7 http://labs.google.com/personalized.
user’s interest onto a taxonomy concept, we employ text, i.e., the page titles and
summaries associated with the concept in the taxonomy, whereas they use the taxonomy
category title and its position in the concept tree when computing the tree-node distance.
Also, in contrast to UCAIR [Shen et al. 2005b], which uses contextual information in the
current session (short-term context) to personalize search, our approach personalizes
search according to the user’s long-term interests, which may be extracted from his or her
resume.
Haveliwala [2002] and Jeh and Widom [2003] extend the PageRank algorithm
[Brin and Page 1998] to generate personalized ranks. Using 16 depth-one categories in
ODP, Haveliwala [2002] computes a set of topic-sensitive PageRank scores. The original
PageRank is a global measure of the query- or topic-insensitive popularity of Web pages,
measured solely by a linkage graph derived from a large part of the Web. Haveliwala’s
experiments indicate that, compared with the original PageRank, a topic-sensitive
PageRank achieves greater precision in top-ten search results. Topic-sensitive PageRank
also can be used for personalization after a user’s interests have been mapped onto
appropriate depth-one categories of the ODP, which can be achieved through our
proposed mapping framework. Jeh and Widom [2003] present a scalable personalized
PageRank method, in which they identify a linear relationship between basis vectors and
the corresponding personalized PageRank vectors. At query time, their method constructs
an approximation to the personalized PageRank vector from the pre-computed basis
vectors.
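The topic-sensitive variant can be sketched as ordinary PageRank power iteration in which the teleportation mass is restricted to the topic's pages rather than spread uniformly over all pages. The tiny link graph and topic set below are invented for illustration.

```python
# Minimal sketch of topic-sensitive (personalized) PageRank.

def personalized_pagerank(links, topic_pages, d=0.85, iters=50):
    """Power iteration with teleportation restricted to topic_pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    # Teleport vector: uniform over the topic pages, zero elsewhere.
    v = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0)
         for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - d) * v[p] + d * incoming
        rank = new
    return rank

# Invented four-page graph; the user's topic contains only page "A".
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = personalized_pagerank(links, topic_pages={"A"})
```

Setting the teleport vector from the depth-one ODP categories onto which a user's interests have been mapped is exactly the hand-off point between the mapping framework described here and Haveliwala's scores.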
2.2.4 Taxonomy of Web Activities
We study the performance of the three systems (described in Section 1) by
considering different types of Web activities. Sellen et al. [2002] categorize Web
activities into six categories: Finding (locate something specific), Information Gathering
(answer a set of questions; less specific than Finding), Browsing (visit sites without
explicit goals), Transacting (execute a transaction), Communicating (participate in chat
rooms or discussion groups), and Housekeeping (check the accuracy and functionality of
Web resources). As Craswell et al. [2001] define a Site Finding task specifically as "one
where the user wants to find a particular site, and their query names the site," we consider
it a type of Finding task. It should be noted that some Web activities, especially
Information Gathering, can involve several searches. On the basis of the intent behind
Web queries, Broder [2002] classifies Web searches into three classes: Navigational
(reach a particular site), Informational (acquire information from one or more Web
pages), and Transactional (perform some Web-mediated activities). As the taxonomy of
search activities suggested by Sellen et al. [2002] is broader than that of Broder [2002],
in this article we choose to study the two major types of activities studied in [Sellen et al.
2002].
2.2.5 Text Categorization
In our study, CAT and PCAT systems employ text classifiers to categorize search
results. Text categorization (TC) is a supervised learning task that classifies new
documents into a set of predefined categories [Yang and Liu 1999]. As a joint discipline
of machine learning and IR, TC has been studied extensively, and many different
classification algorithms (classifiers) have been introduced and tested, including the
Rocchio method, naïve Bayes, decision tree, neural networks, and support vector
machines [Sebastiani 2002]. A standard information retrieval metric, cosine similarity
[Salton and McGill 1986], computes the cosine of the angle between vector representations of
two text fragments or documents. In TC, a document can be assigned to the category with
the highest similarity score. Due to its simplicity and effectiveness, cosine similarity has
been used by many studies for TC [e.g. Yang and Liu 1999; Sugiyama et al. 2004; Liu et
al. 2004].
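Cosine-similarity categorization of this kind can be sketched as follows; the two category profiles are invented toy vectors, not real ODP data.

```python
# Sketch of cosine-similarity text categorization: assign a document to
# the predefined category whose profile vector it is most similar to.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Invented category profiles standing in for ODP-derived ones.
profiles = {
    "Programming": Counter("java c++ compiler code programming".split()),
    "Finance": Counter("stock bond market trading fund".split()),
}

def classify(doc):
    """Return the category with the highest similarity to the document."""
    vec = Counter(doc.lower().split())
    return max(profiles, key=lambda c: cosine(vec, profiles[c]))

print(classify("new java compiler released"))  # "Programming"
```

The simplicity of this classifier is the point: it needs only the term vectors already built for the category profiles, so classification can happen at query time.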
In summary, to generate user profiles for personalized search, previous studies
have asked users for explicit feedback, such as ratings and preferences, or collected
implicit feedback, such as search and browsing history. However, users are often unwilling to
provide explicit feedback, even when they anticipate a long-run benefit [Carroll and
Rosson 1987]. Implicit feedback has shown promising results for personalizing search
using short-term context [Leroy et al. 2003, Shen et al. 2005b]. However, generating user
profiles for long-term context through implicit feedback will take time and may raise
privacy concerns. In addition, a user profile generated from implicit feedback may
contain noise, because the user preferences have been estimated from behaviors and not
explicitly specified. In our approach, two user-related inputs, a search query and the user’s
professional interests and skills, are explicitly given to the system, so some prior work
[Leroy et al. 2003; Gauch et al. 2003; Liu et al. 2004; Sugiyama et al. 2004; Kraft et al.
2005] that relies on modeling user interests through search or browsing behavior is not
readily applicable.
3 OUR APPROACH
Our approach begins with the assumption that some user interests are known and
therefore is well suited for a workplace setting in which employees’ resumes often are
maintained in a digital form or information about users’ professional interests and skills
is stored in a database. An information extraction (IE) tool or database queries can extract
such information as input to complement the search query, search engine, and contents of the
ODP taxonomy. However, we do not include such an IE program in this study and instead
assume that the interests have already been given. Our interest-category mapping framework tries to
automatically identify an ODP category associated with each of the given user interests.
Then our system uses URLs organized under those categories as training examples to
classify search results into various user interests at query time. We expect the result
categorization to help the user quickly focus on results of interest and decrease total time
spent in searching. The result categorization also may lead to the discovery of
serendipitous connections between the concepts being searched and the user’s other
interests. This form of personalization therefore should reduce search effort and possibly
provide interesting and useful resources the user would not notice otherwise. We focus on
work-related search performance, but our approach could be easily extended to include
personal interests as well. We illustrate a process view of our proposed approach in
Figure 7 and present our approach in five steps. Steps 3 and 4 cover the mapping
framework.
Figure 7. Process view of proposed approach
3.1 Step 1: Obtaining an Interest Profile
Step 1 (Figure 7) pertains to how the interests can be extracted from a resume.
Our study assumes that user interests are available to our personalized search system in
the form of a set of words and phrases which we call a user’s interest profile.
3.2 Step 2: Generating Category Profiles
As we explained previously, ODP is a manually edited Web directory with
millions of URLs placed under different categories. Each ODP category contains URLs
that point to external Web pages that human editors consider relevant to the category.
Those URLs are accompanied by manually composed titles and summaries that we
believe accurately represent the corresponding Web page content. The category profile of
an ODP category thus is built by concatenating the titles and summaries of the URLs
listed under the category. The constructed category profiles provide a solution to the
cold-start problem, which arises from the difficulty of creating a profile for a new user
from scratch [Maltz and Ehrlich 1995], and they later serve to categorize the search results.
Gauch et al. [2003], Menczer et al. [2004], and Srinivasan et al. [2005] use similar
concatenation to build topic profiles. In our study, we combine up to 30 pairs of manually
composed titles and summaries of URL links under an ODP category as the category
profile.8 In support of this approach, Shen et al. [2004] report that classification using
manually composed summarization in the LookSmart Web directory achieves higher
accuracy than the use of the content of Web pages. To build the category profile, we
pick the first 30 URLs in the sequence provided by ODP. Because ODP can list more than
30 URLs under a category, using only the titles and summaries of the first 30 keeps the
amount of information comparable across category profiles. When generating profiles for
categories in the Magellan taxonomy, Gauch et al. [2003] show that between 5 and 60
documents per category provide reasonably accurate classification.
At depth-one, ODP contains 17 categories (for a depth-one category, Computers,
see Figure 6). We select five of these (Business, Computers, Games, Reference, and
Science) that are likely to be relevant to our subjects and their interests. These five broad
categories comprise a total of 8,257 categories between depths one and four. We generate
category profiles by removing stop words and applying the Porter stemmer9 [Porter
1980]. We also filter out any terms that appear only once in a profile to avoid noise and
remove any profiles that contain fewer than two terms. Finally, the category profile is
8 A category profile does not include titles or summaries of its child (subcategory) URLs.
9 http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/porter.java.
represented as a term vector [Salton and McGill, 1986] with term frequencies (tf) as
weights. Shen et al. [2004] also use a tf-based weighting scheme over manually
composed summaries in the LookSmart Web directory to represent a Web page.
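Step 2 can be sketched as follows. The titles and summaries are invented stand-ins for real ODP entries, the stop-word list is truncated, and Porter stemming is omitted for brevity.

```python
# Simplified sketch of category-profile construction: concatenate up to
# 30 (title, summary) pairs under a category, drop stop words, and keep
# term frequencies as weights.
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "and", "to", "for"}  # truncated list

def build_category_profile(titles_and_summaries, max_urls=30):
    text = " ".join(t + " " + s for t, s in titles_and_summaries[:max_urls])
    terms = [w for w in text.lower().split() if w not in STOP_WORDS]
    tf = Counter(terms)
    # Drop terms that appear only once in the profile to reduce noise.
    return Counter({t: f for t, f in tf.items() if f > 1})

# Invented (title, summary) pairs for one hypothetical category.
urls = [("Java Tutorial", "a tutorial for the Java language"),
        ("Java FAQ", "answers about the Java language")]
profile = build_category_profile(urls)
```

The resulting `Counter` is the tf-weighted term vector that the later mapping and categorization steps compare against.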
3.3 Step 3: Mapping Interests to ODP Categories
Next, we need a framework to map a user’s interests onto appropriate ODP
categories. The framework then can identify category profiles for building text classifiers
that correspond to the user’s interests. Some prior studies [Pitkow et al. 2002; Liu et al.
2004] and the existing Google personalized search use only the few hundred ODP
categories up to depth two, but for our study, categories up to depth two may lack
sufficient specificity. For example, Programming, a depth-two category, is too broad to
capture a user interest in a specific programming language such as C++, Java, or Perl.
Therefore, we map user interests to ODP categories up to depth four. As we mentioned in
Step 2, a total of 8,257 such categories can be used for interest mapping. We employ four
different mapping methods to evaluate the mapping performance by testing and
comparing them individually, as well as in different combinations. When generating an
output category, a mapping method includes the parent category of the mapped category;
for example, if the mapped category is C++, the output will be Computers/Programming/
Languages/C++.
3.3.1 Mapping Method 1 (m1-category-label): Simple Term Match
The first method uses a string comparison to find a match between an interest and
the label of the category in ODP. If an interest is the same as a category label, the
category is considered a match to the interest. Plural forms of terms are transformed to
their singular forms by a software tool from the National Library of Medicine.10
Therefore, the interest "search engine" is matched with the ODP category "Search
Engines,” and the output category is “Computers/Internet/Searching/Search Engines.”
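A sketch of m1-category-label follows; a crude suffix rule stands in for the NLM singularization tool, and the category paths are a tiny invented subset of ODP.

```python
# Sketch of m1-category-label: exact string match between an interest and
# ODP category labels after naive singularization.

def singularize(term):
    """Crude stand-in for the NLM tool: strip a trailing 's'."""
    return term[:-1] if term.endswith("s") and not term.endswith("ss") else term

def normalize(phrase):
    return " ".join(singularize(w) for w in phrase.lower().split())

def m1_match(interest, categories):
    """Return all category paths whose last label equals the interest."""
    key = normalize(interest)
    return [path for path in categories
            if normalize(path.split("/")[-1]) == key]

categories = ["Computers/Internet/Searching/Search Engines",
              "Computers/Programming/Languages/Java"]
print(m1_match("search engine", categories))
```

Because the match is exact on the label, this method produces few but highly precise mappings, which is exactly the behavior Table 2 reports.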
3.3.2 Mapping Method 2 (m2-category-profile): Most Similar Category Profile
The cosine similarities between an interest and each of the category profiles are
computed, and the ODP category with the highest similarity is selected as the
output.
3.3.3 Mapping Method 3 (m3-category-profile-noun): Most Similar Category Profile
While Augmenting Interest With Potentially Related Nouns
The m1-category-label and m2-category-profile methods will fail if the category labels and
profiles do not contain any of the words that form a given interest, so it may be
worthwhile to augment the interest concept by adding a few semantically similar or
related terms. According to Harris [1985], terms in a language do not occur arbitrarily but
appear at a certain position relative to other terms. On the basis of the concept of co-
occurrence, Riloff and Shepherd [1997] present a corpus-based bootstrapping algorithm
10 http://umlslex.nlm.nih.gov/nlsRepository/nlp/doc/userDoc/index.html.
that starts with a few given seed words that belong to a specific domain and discovers
more domain-specific, semantically related lexicons from a corpus. As in query
expansion, it is desirable to augment the original interest with a few semantically similar
or related terms.
For m3-category-profile-noun, one of our programs conducts a search on Google
using an interest as a search query and finds the N nouns that most frequently co-occur in
the top ten search results (page titles and snippets). We find co-occurring nouns because
most terms in interest profiles are nouns (for terms from some sample user interests, see
Table 1). Terms semantically similar or related to those of the original interest thus can
be obtained without having to ask a user for input such as feedback or a corpus. A noun is
identified by looking up the word in a lexical reference system,11 WordNet [Miller et al.
1990], to determine whether the word has the part-of-speech tag of noun. The similarities
between a concatenated text (a combination of the interest and N most frequently co-
occurring nouns) and each of the category profiles then are computed to determine the
category with the highest similarity as the output of this method.
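The noun-extraction part of m3-category-profile-noun can be sketched as below; the snippets are invented, and a small hand-made word set stands in for the WordNet noun lookup.

```python
# Simplified sketch of m3: from the titles and snippets of the top search
# results for an interest, keep words that a lexicon tags as nouns and
# take the N most frequent ones to augment the interest.
from collections import Counter

# Toy stand-in for a WordNet part-of-speech lookup.
NOUN_LEXICON = {"tutorial", "language", "compiler", "code", "sun"}

def top_cooccurring_nouns(snippets, n=2):
    words = " ".join(snippets).lower().split()
    nouns = [w for w in words if w in NOUN_LEXICON]
    return [w for w, _ in Counter(nouns).most_common(n)]

# Invented snippets, as if returned for the interest "Java".
snippets = ["Java tutorial from Sun", "the Java language tutorial",
            "free Java compiler and tutorial"]
print(top_cooccurring_nouns(snippets))
```

The interest concatenated with these nouns is then compared against the category profiles exactly as in m2, with the augmented text usually sharing more terms with the right profile.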
3.3.4 Mapping Method 4 (m4-category-profile-np): Most Similar Category Profile
While Augmenting Interest With Potentially Related Noun Phrases
Although similar to m3-category-profile-noun, this method finds the M most
frequently co-occurring noun phrases on the first result page from up to ten search
results. We developed a shallow parser program to parse sentences in the search results
into NPs (noun phrases), VPs (verb phrases), and PPs (prepositional phrases), where an NP
11 http://wordnet.princeton.edu/.
can appear in different forms, such as a single noun, a concatenation of multiple nouns,
an article followed by a noun, or any number of adjectives followed by a noun.
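A minimal stand-in for such a chunker is sketched below; it operates on hand-supplied (word, POS-tag) pairs rather than running a real tagger, and recognizes only the NP forms just described (optional article, any number of adjectives, one or more nouns).

```python
# Toy NP chunker over pre-tagged tokens; tags follow Penn Treebank style
# (DT = article, JJ = adjective, NN = noun) and are supplied by hand here.

def chunk_noun_phrases(tagged):
    phrases, current, seen_noun = [], [], False
    for word, tag in tagged:
        if tag == "NN":
            current.append(word)
            seen_noun = True
        elif tag in ("DT", "JJ") and not seen_noun:
            current.append(word)          # article/adjectives before a noun
        else:
            if seen_noun:                 # close the phrase in progress
                phrases.append(" ".join(current))
            current, seen_noun = [], False
            if tag in ("DT", "JJ"):
                current.append(word)      # this word may start a new NP
    if seen_noun:
        phrases.append(" ".join(current))
    return phrases

tagged = [("the", "DT"), ("new", "JJ"), ("search", "NN"), ("engine", "NN"),
          ("ranks", "VB"), ("web", "NN"), ("pages", "NN")]
print(chunk_noun_phrases(tagged))  # ['the new search engine', 'web pages']
```

Counting the phrases this chunker emits over result snippets, instead of single nouns, is what distinguishes m4-category-profile-np from m3-category-profile-noun.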
Table 1 lists some examples of frequently co-occurring nouns and NPs identified by m3-
category-profile-noun and m4-category-profile-np. Some nouns identified by m3-category-
profile-noun do not appear as single-noun NPs in the output of m4-category-profile-np,
because such a noun may combine with other terms to form a longer phrase in m4-
category-profile-np.
Table 1.
Frequently Co-occurring Nouns and NPs

Domain   | Interest                     | Two co-occurring nouns | Co-occurring NP
Computer | C++                          | programme, resource    | general c
         | IBM DB2                      | database, software     | database
         | Java                         | tutorial, sun          | sun
         | Machine Learning             | information, game      | ai topic
         | Natural Language Processing  | intelligence, speech   | intelligence
         | Object Oriented Programming  | concept, link          | data
         | Text Mining                  | information, data      | text mine tool
         | UML                          | model, tool            | acceptance *
         | Web Site Design              | html, development      | library resource web development
Finance  | Bonds                        | saving, rate           | saving bond
         | Day Trading                  | resource, article      | book
         | Derivatives                  | trade, international   | gold
         | Mutual Funds                 | news, stock            | account
         | Offshore Banking             | company, formation     | bank account
         | Risk Management              | open, source *         | software risk evaluation *
         | Stocks Exchange              | trade, information     | official site
         | Technical Analysis           | market, chart          | market pullback
         | Trading Cost                 | service, cap           | product

* Some co-occurring nouns or NPs may not be semantically similar or related.
3.4 Step 4: Resolving Mapped Categories
For a given interest, each mapping method in step 3 may generate a different
mapped ODP category, and m1-category-label may generate multiple ODP categories for
the same interest because the same category label sometimes is repeated in the ODP
taxonomy. For example, the category “Databases” appears in several different places in
the taxonomy hierarchy, such as “Computers/Programming/Databases” and
“Computers/Programming/Internet/Databases.”
Using 56 professional interests in the computer domain, manually extracted from
several resumes of professionals collected from ODP (eight of these interests appear in
the first column of Table 1), Table 2 compares the performance of each individual
mapping method. After verification by a domain expert, m1-category-label generated
mapped categories for 29 of the 56 interests, and only two of those did not contain the
right category. We note that m1-category-label has much higher precision than the other three
Dietterich 1997] has shown that an ensemble of classifiers can outperform each classifier
in that ensemble. Since the mapping methods can be viewed as classification techniques
that classify interests into ODP categories, a combination of the mapping methods may
outperform any one method.
Table 2.
Individual Mapping Method Comparison (Based on 56 Computer Interests)
Mapping method                           m1      m2      m3      m4
Number of correctly mapped interests     27      29      25      19
Number of incorrectly mapped interests    2      25      30      36
Number of total mapped interests         29      54      55      55
Precision (correct / total mapped)      93.0%   53.7%   45.5%   34.5%
Recall (correct / 56)                   48.2%   51.8%   44.6%   33.9%
F1                                      63.5%   52.7%   45.0%   34.2%
Figure 8 lists the detailed pseudo-code of the procedure used to automatically resolve
a final set of categories for an interest profile from the four mapping methods. M1
denotes the set of mapped categories generated by m1-category-label; M2, M3, and M4
are defined analogously. Because of its high precision, we prioritize the categories
generated by m1-category-label, as shown in step (2): if a category generated by m1-
category-label is the same as, or a parent category of, a category generated by any other
method, we include it in the list of final resolved categories. Because m1-category-label
uses an “exact match” strategy, it does not
not always generate a category for a given interest. In step (3), if methods m2-category-
profile, m3-category-profile-noun, and m4-category-profile-np generate the same mapped
category, we select that category, irrespective of whether m1-category-label generates
one. Steps (2) and (3) attempt to produce a category for an interest by considering
overlapping categories from different methods. If no such overlap is found, we look for
overlapping categories generated for different interests in step (6), because if more than
one interest is mapped to the same category, that category is likely to be relevant. In step (8), we
try to represent all remaining categories at a depth of three or less by truncating the
categories at depth four, thereby hoping to find overlapping categories through the parent
categories. Step (9) is similar to step (5), except that all remaining categories are at a
depth of three or less.
(1) For each interest i in the interest profile:
        Given i, the four mapping methods generate M1, M2, M3, and M4
(2)     For each category c in M1:
            If c is the same as or a parent of a category in M2, M3, or M4,
            add c to the list of final categories, then go to step (1)
        End For
(3)     If M2, M3, and M4 contain the same category c, add c to the list of
        final categories, then go to step (1)
(4)     Put every category in M1, M2, M3, and M4 into a list of candidate
        categories12
    End For
(5) For each category c in candidate categories:
        Count the frequency of c
    End For
(6) For each depth-four category c in candidate categories:
        If the frequency of c >= threshold, add c to the final categories.
        (We chose the threshold equal to the number of mapping methods - 1,
        i.e., three in our tests because we used four mapping methods. A
        frequency of three or more means a candidate category overlaps
        between at least two different interests, and we choose that
        overlapping candidate category to represent these interests.)
    End For
(7) Remove all candidate categories for the interests mapped in step (6)
(8) Resolve all remaining depth-four categories to depth three by truncating
    the category at depth four. For example, after truncating to depth three,
    reference/knowledge management/publications/articles is resolved as
    reference/knowledge management/publications
(9) For each category c in candidate categories:
        Count the frequency of c
    End For
(10) For each depth-three category c in candidate categories:
         If the frequency of c >= threshold, add c to the final categories
     End For
Figure 8. Category resolving procedures
12 Candidate categories cannot be used as final resolved categories unless the frequency of a candidate category is greater than or equal to the threshold in step (6).
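The resolving procedure in Figure 8 can be sketched in code as follows. This is a simplified illustration of our own: the function and variable names are ours, and the frequency test is applied to all candidates rather than only to depth-four ones before truncation.

```python
from collections import Counter

def truncate(category, depth=3):
    """Truncate a slash-delimited ODP category to the given depth."""
    return "/".join(category.split("/")[:depth])

def resolve(interest_maps, threshold=3):
    """Resolve a final set of categories from per-interest mapping results.

    interest_maps: one (M1, M2, M3, M4) tuple per interest, where each
    Mi is a set of ODP category strings produced by mapping method i.
    """
    final, candidates = [], []
    for m1, m2, m3, m4 in interest_maps:
        others = m2 | m3 | m4
        # Step (2): prefer a high-precision m1 category that equals,
        # or is a parent of, a category from any other method.
        hit = next((c for c in m1
                    if any(o == c or o.startswith(c + "/") for o in others)),
                   None)
        if hit:
            final.append(hit)
            continue
        # Step (3): m2, m3, and m4 agree on the same category.
        agreed = m2 & m3 & m4
        if agreed:
            final.append(agreed.pop())
            continue
        # Step (4): otherwise keep every category as a candidate.
        candidates.extend(m1 | m2 | m3 | m4)
    # Steps (5)-(7): accept candidates that overlap across interests.
    counts = Counter(candidates)
    final.extend(c for c in counts if counts[c] >= threshold)
    remaining = [c for c in candidates if counts[c] < threshold]
    # Steps (8)-(10): truncate to depth three and test overlap again.
    counts3 = Counter(truncate(c) for c in remaining)
    final.extend(c for c in counts3 if counts3[c] >= threshold)
    return final
```

With the default threshold of three, a candidate category survives only when it overlaps across at least two different interests, mirroring the rationale given in step (6).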
To determine appropriate values for N (number of nouns) and M (number of NPs)
for m3-category-profile-noun and m4-category-profile-np, we tested different
combinations of values ranging from 1 to 3 with the 56 computer interests. According to
the number of correctly mapped interests, choosing the two most frequently co-occurring
nouns and the one most frequently co-occurring NP offers the best mapping result (see Table
1 for some examples of identified nouns and NPs). With the 56 interests, Table 3
compares the number of correctly mapped interests when different mapping methods are
combined. Using all four mapping methods provides the best results; 39 of the 56
interests were correctly mapped onto ODP categories. The resolving procedures in Figure
8 are thus based on all four mapping methods. When using only three methods, we adjusted the
procedures accordingly, such as setting the thresholds in steps (6) and (10) to two instead
of three.
Table 3.
Comparison of Combined Mapping Methods
Combination of mapping methods          m1+m2+m3  m1+m2+m4  m1+m3+m4  m1+m2+m3+m4
Number of correctly mapped interests       34        35        32          39
Precision*                               60.7%     62.5%     57.1%       69.6%
* Recall and F1 were the same as precision because the number of mapped interests was 56.
Table 4 lists mapped and resolved categories for some interests in computer and
finance domains.
Table 4.
Resolved Categories
Domain    Interest                      ODP category
Computer  C++                           computers/programming/languages/c++
          IBM DB2                       computers/software/databases/ibm db2
          Java                          computers/programming/languages/java
          Machine Learning              computers/artificial intelligence/machine learning
          Natural Language Processing   computers/artificial intelligence/natural language
          Object Oriented Programming   computers/software/object-oriented
          Text Mining                   reference/knowledge management/knowledge discovery/text mining
          UML                           computers/software/data administration*
          Web Site Design               computers/internet/web design and development
Finance   Bonds                         business/investing/stocks and bonds/bonds
          Day Trading                   business/investing/day trading
          Derivatives                   business/investing/derivatives
          Mutual Funds                  business/investing/mutual funds
          Offshore Banking              business/financial services/offshore services
          Risk Management               business/management/software*
          Stocks Exchange               business/investing/stocks and bonds/exchanges
          Technical Analysis            business/investing/research and analysis/technical analysis
          Trading Cost                  business/investing/derivatives/brokerages
* Because the mapping and resolving steps are automatic, some resolved categories are erroneous.
After the automatic resolving procedures, mapped categories for some interests
may not be resolved because different mapping methods generate different categories.
Unresolved interests can be handled by having the user manually map them onto the ODP
taxonomy. An alternative approach could use an unresolved user interest as a query to a
search engine (in a manner similar to m3-category-profile-noun and m4-category-profile-
np), then combine the search results, such as page titles and snippets, to compose an ad
hoc category profile for the interest. Such a profile could flexibly represent any interest
and avoid the limitation of a taxonomy, namely that it contains only a finite set of categories. It would
be worthwhile to examine the effectiveness of such ad hoc category profiles in a future
study. In this article, user interests are fully mapped and resolved to ODP categories.
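Such an ad hoc profile could be composed as a simple term-frequency vector over result titles and snippets; the following is a minimal sketch of our own, in which the tokenization and the shape of the search results are assumptions:

```python
from collections import Counter

def ad_hoc_profile(results):
    """Compose an ad hoc category profile for an unresolved interest
    from (title, snippet) pairs returned by a search engine."""
    profile = Counter()
    for title, snippet in results:
        # Accumulate term frequencies over all titles and snippets.
        profile.update((title + " " + snippet).lower().split())
    return profile
```

The resulting Counter could then be compared against page contents with the same cosine-similarity machinery used for regular category profiles.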
These four steps are performed just once for each user, possibly during a software
installation phase, unless the user’s interest profile changes. To reflect such a change in
interests, our system can automatically update the mapping periodically or allow a user to
request an update from the system. As shown in Figure 7, the first four steps can be
performed on a client-side server, such as a machine on the organization’s intranet, and
the category profiles can be shared by each user’s machine.
Finally, user interests, even long-term professional ones, are dynamic in nature. In
the future, we will explore more techniques to learn about and fine-tune interest mapping
and handle the dynamics of user interests.
3.5 Step 5: Categorizing Search Results
When a user submits a query, our system obtains search results from Google and
downloads the content of up to the top 50 results, which correspond to the first five result
pages. The average number of result pages viewed by a typical user for a query is 2.35
[Jansen et al. 2000], and a more recent study [Jansen et al. 2005] reports that about 85-
92% of users view no more than two result pages. Hence, our system covers
approximately double the number of results normally viewed by a search engine user. On
the basis of page content, the system categorizes the results into various user interests. In
PCAT, we employ a user’s original interests as class labels, rather than the ODP category
labels, because the mapped and resolved ODP categories are associated with user
interests. Therefore, the use of ODP (or any other Web directory) is transparent to the
user. A Web page that corresponds to a search result is categorized by (1) computing the
cosine similarity between the page content and each of the category profiles of the
mapped and resolved ODP categories that correspond to user interests and (2) assigning
the page to the category with the maximum similarity if the similarity is greater than a
threshold. If a search result does not fall into any of the resolved user interests, it is
assigned to the “Other” category.
The focus of our study is to explore the use of PCAT, an implementation based on
the proposed approach, and compare it with LIST and CAT. With regard to interest
mapping and result categorization (classification problems), we choose the simple and
effective cosine similarity instead of comparing different classification algorithms and
selecting the best one.
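The two-step categorization rule described above can be sketched as follows. This is a minimal illustration using raw term frequencies rather than any particular term weighting; the labels and profiles a caller supplies are placeholders:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(freq * b[term] for term, freq in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def categorize(page_tokens, profiles, threshold=0.1):
    """Assign a page to the most similar interest profile, or "Other".

    page_tokens: tokens from the (stopped and stemmed) page content.
    profiles: dict mapping an interest label to a Counter of terms
    from the category profile resolved for that interest.
    """
    page = Counter(page_tokens)
    best_label, best_sim = "Other", threshold
    for label, profile in profiles.items():
        sim = cosine(page, profile)
        if sim > best_sim:  # must exceed the similarity threshold
            best_label, best_sim = label, sim
    return best_label
```

A page whose maximum similarity does not exceed the threshold falls into the "Other" category, exactly as described in step (2).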
3.6 Implementation
We developed three search systems13 with different interfaces to display search
results; the online searching portion was implemented as a wrapper on the Google search
engine using the Google Web API.14 Although the current implementation of our
approach uses a single search engine (Google), following the metasearch approach
[Dreilinger and Howe 1997], it can be extended to handle results from multiple engines.
Because Google has become the most popular search engine15, we use Google’s search
results to feed the three systems. That is, the systems have the same set of search results
for the same query; recall that LIST can be considered very similar to Google. For
simplicity, we limit the search results in each system to Web pages in HTML format. In
addition, for a given query, each of the systems retrieves up to 50 search results.
PCAT and CAT download the contents of Web pages that correspond to search results
and categorize them according to user interests and ODP categories, respectively. For
faster processing, the systems use multithreading for simultaneous HTTP connections
and download up to 10KB of text for each page. It took our program about five seconds
to fetch 50 pages. We note that our page-fetching program is not an industry-strength
module, and much better concurrent download speeds have been reported in other work
[Hafri and Djeraba 2004, Najork and Heydon 2001]. Hence, we feel that our page-
fetching time can be greatly reduced in a production implementation. After fetching the
pages, the systems remove stop words and perform word stemming before computing the
cosine similarity between each page content and a category profile. Each Web page is
13 In experiments, we named the systems A, B, or C; in this article, we call them PCAT, LIST, or CAT, respectively.
14 http://www.google.com/apis/.
15 http://www.comscore.com/press/release.asp?press=873.
assigned to the category (and its associated interest for PCAT) with the greatest cosine
similarity. However, if the similarity is not greater than a similarity threshold, the page is
assigned to the “Other” category. We determined the similarity threshold by testing query
terms from “irrelevant” domains (not relevant to any of the user’s interests). For example,
given that our user interests are related to computer and finance, we tested ten irrelevant
queries, such as “NFL,” “Seinfeld,” “allergy,” and “golden retriever.” For these irrelevant
queries, when we set the threshold at 0.1, at least 90% (often 96% or higher) of retrieved
results were categorized under the “Other” category. Thus we chose 0.1 as our similarity
threshold. The time for classifying results according to user interests in PCAT is
negligible (tens of milliseconds). However, the time for CAT is three orders of magnitude greater
than that for PCAT because the number of potential categories for CAT is 8,547, whereas
the number of interests is less than 8 in PCAT.
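The concurrent fetching with a 10KB cap can be sketched as follows. This is a minimal illustration, not the industry-strength crawler discussed above; the worker count and the opener hook (which lets a caller substitute a stub for testing) are our own choices:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

MAX_BYTES = 10 * 1024  # download at most 10KB of text per page

def fetch_one(url, opener=urlopen, timeout=5):
    """Fetch up to MAX_BYTES of one page; return "" on any failure."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.read(MAX_BYTES).decode("utf-8", errors="replace")
    except Exception:
        return ""

def fetch_all(urls, workers=10, opener=urlopen):
    """Fetch pages over simultaneous connections, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch_one(u, opener), urls))
```

Capping each download at MAX_BYTES bounds both the fetch time and the cost of the subsequent similarity computation.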
Figure 9 displays a sample output from PCAT for the query “regular expression.”
Once a user logs in with his or her unique identification, PCAT displays a list of the
user’s interests at the top of the GUI. After a query is issued, search results are categorized
into various interests and displayed in the result area, as shown in Figure 9. A number
next to the interest indicates how many search results are classified under that interest; if
there is no classified search result, the interest will not be displayed in the result area.
Under each interest (category), PCAT (CAT) shows no more than three results on the
main page. If more than three results occur under an interest or category, a “More” link
appears next to the number of results. (In Figure 9, there is a “More” link for the interest
of “Java.”) Upon clicking this link, the user sees all of the results under that interest in a
new window, as shown in Figure 10.
Figure 9. Sample output of PCAT. Category titles are user interests mapped and
resolved to ODP categories
Figure 10. “More” window to show all of the results under the interest “Java”
Figure 11. Sample output of LIST
Figure 11 displays a sample output of LIST for the same query “regular
expression” and shows all search results in the result area as a page-by-page list. Clicking
a page number causes a result page, with up to ten results, to appear in the result area of
the same window. For the search task in Figure 11, the first relevant document is shown
as the sixth result on page 2 in LIST.
Figure 12 displays a sample output for CAT, in which the category labels in the
result area are ODP category names sorted alphabetically, such that output categories
under “business” are displayed before those under “computers.”
Figure 12. Sample output of CAT. Category labels are ODP category titles
We now describe some of the features of the implemented systems that would not
appear in a production system but are meant only for experimental use. We predefined a
set of search tasks that the subjects used to conduct searches during the experiments; each task
specified what information and how many Web pages needed to be found (Section 4.2.2
describes the search tasks in more detail). Each search result consists of a page title,
snippet, URL, and a link called “relevant”16 next to the title. Except for the “relevant”
link, the items are the same as those found in typical search engines. A subject can click
the hyperlinked page title to open the page in a regular Web browser, such as Internet
Explorer. The subject determines whether a result is relevant to a search task by looking
at the page title, snippet, URL, and/or the content of the page.
Many of our search tasks require subjects to find one relevant Web page for a
task, but some require two. In Figure 9, the task requires finding two Web pages, as
indicated by the number “2” at the end of the task description. Once the user finds
enough relevant pages, he or she can click the “Next” button to proceed to the next task;
clicking on “Next” before enough relevant page(s) have been found prompts a warning
message, which allows the user to either give up or continue the current search task.
We record search time, or the time spent on a task, as the difference between the time that
the search results appear in the result area and the time that the user finds the required
number of relevant result(s).
4 EXPERIMENTS
We conducted two sets of controlled experiments to examine the effects of
personalization and categorization. In experiment I we compare PCAT with LIST, that is,
a personalized system that uses categorization versus a system similar to a typical search
16 When a user clicks on the “relevant” link, the corresponding search result is treated as the answer or solution for the current search task. This clicked result is considered as relevant, and is not necessarily the most relevant among all search results.
engine. Experiment II compares PCAT with CAT in order to study the difference
between personalization and non-personalization, given that categorization is common to
both systems. These experiments were designed to examine whether subjects’ mean log
search time17 for different types of search tasks and query lengths varied between the
compared systems. The metric evaluates the efficiency of each system, because all three
systems return the same set of search results for the same query. Before experiment I, we
conducted a preliminary experiment comparing PCAT and LIST with several subjects
who did not later participate in either experiment I or II. The preliminary experiment
helped us make decisions relating to experiment and system design. Next, we introduce
our experiments I and II in detail.
4.1 Studied Domains and Domain Experts
Because we were interested in personalizing search according to a user’s
professional interests, we chose two representative professional domains, computer and
finance, that appear largely disjoint.
For the computer domain, two of the authors, who are researchers in the area of
information systems, served as the domain experts. Both experts also have industrial
experience related to computer science. For the finance domain, one expert has a
doctoral degree and the other has a master’s degree in finance.
17 Mean log search time is the average log-transformed search time for a task across a group of subjects using the same system. We transformed the original search times (measured in seconds) with a base-2 log to make the log search times closer to a normal distribution. In addition, taking the average makes the mean log search times more normally distributed.
4.2 Professional Interests, Search Tasks, and Query Length
4.2.1 Professional interests (interest profiles)
For each domain, the two domain experts manually chose several interests and
skills that could be considered fundamental, which enabled us to form a generic interest
profile shared by all subjects within the domain. Moreover, the fundamental nature of
these interests allowed us to recruit more subjects, leading to greater statistical
significance in our results. By defining some fundamental skills in the
computer domain, such as programming language, operating system, database, and
applications, the two computer domain experts identified six professional interests:
algorithms, artificial intelligence, C++, Java, Oracle, and Unix. Similarly, the two finance
experts provided seven fundamental professional interests: bonds, corporate finance, day
trading, derivatives, investment banking, mutual funds, and stock exchange.
4.2.2 Search tasks
The domain experts generated search tasks on the basis of the chosen interest
areas but also considered different types of tasks, i.e., Finding and Information Gathering.
The content of those search tasks includes finding a software tool, locating a person’s or
organization’s homepage, finding pages to learn about a certain concept or technique,
collecting information from multiple pages, and so forth. Our domain experts predefined
26 non-demo search tasks for each domain, as well as 8 and 6 demo tasks for the
computer and finance domains, respectively. The demo tasks were similar but not
identical to the non-demo tasks and therefore offered subjects some familiarity with both
systems before they started to work on the non-demo tasks. Non-demo tasks are used in
post-experiment analysis, while demo tasks are not. All demo and non-demo search tasks
belong to the categories of Finding and Information Gathering [Sellen et al. 2002], as
discussed in Section 2.2.4, and within the Finding tasks, we included some Site Finding
tasks [Craswell et al. 2001].
4.2.3 Query length
Using different query lengths, we specified four types of queries for search tasks in
each domain:
1. One-word query (e.g., jsp, underinvestment)
2. Two-word query (e.g., neural network, security line)
3. Three-word query (e.g., social network analysis)
4. Free-form query, which had no limitations on the number of words used
For a given task a user was free to enter any query word(s) of his or her own choice
that conformed to the associated query-length requirement, and the user could issue
multiple queries for the same task. For example, Table 5 shows some sample search
tasks, types of search tasks, and their associated query lengths.
Table 5.
Examples of Search Tasks, Types of Tasks, and Query Lengths
Domain    Search task                                             Type of search task    Query length
Computer  You need an open source IDE (Integrated Development     Finding                one-word
          Environment) for C++. Find a page that provides any
          details about such an IDE.
Computer  You need to provide a Web service to your clients.      Information Gathering  two-word
          Find two pages that describe Web services support
          using Java technology.
Finance   Find a portfolio management spreadsheet program.        Finding                three-word
Finance   Find the homepage of New York Stock Exchange.           Site Finding           free-form
Table 6 lists the distributions of search tasks and their associated query lengths.
For each domain, we divided the 26 non-demo search tasks and demo tasks into two
groups, such that the two groups have the same number of tasks and distribution of query
lengths. During each experiment, subjects searched for the first group of tasks using one
system and the second group of tasks using the other.
Table 6.
Distribution of Search Tasks and their Associated Query Lengths
Experiment  Domain\Query length  One-word  Two-word  Three-word  Free-form  Total tasks
I & II      Computer                 6         6          4          10          26
            Finance                  8         6          6           6          26
We chose these different query lengths for several reasons. First, numerous
studies show that users tend to submit short Web queries with an average length of two
words. A survey by the NEC Research Institute in Princeton reports that up to 70% of
users typically issue a query with one word in Web searches, and nearly half of the
Institute’s staff—who should be Web-savvy (knowledge workers and researchers)—fail
to define their searches precisely with query terms [Butler 2000]. By collecting search
histories for a two-month period from 16 faculty members across various disciplines at a
university, Käki [2005] found that the average query length was 2.1 words. Similarly,
Jansen et al. [1998] find through their analysis of transaction logs on Excite that on
average a query contains 2.35 words. In yet another study, Jansen et al. [2000] report that
the average length of a search query is 2.21 words. From their analysis of users’ logs in
the Encarta encyclopedia, Wen et al. [2002] report that the average length of Web queries
is less than 2 words.
Second, we chose different query lengths to simulate different types of Web
queries and examine how these different types affect system performance. A prior study
follows a similar approach; in comparing the IntelliZap system with four popular search
engines, Finkelstein et al. [2002] set the length of queries to one, two, and three words
and allow users to type in their own query terms.
Third, in practice, queries are often incomplete or may not incorporate enough
contextual information, which leads to many irrelevant results and/or relevant results that
do not appear at the top of the list. A user then has two obvious options: Enter a different
query to start a new search session or go through the long result list page by page, both of
which consume time and effort. From a study with 33,000 respondents, Sullivan [2000]
finds that 76% of users employ the same search engine and engage in multiple search
sessions on the same topic. To investigate this problem of incomplete or vague queries,
we associate search tasks with different query lengths. We believe that categorization
will present results in such a way as to help disambiguate such queries. Unlike Leroy et
al. [2003], who extract extra
query terms from users’ behaviors during consecutive searches, we do not modify users’
queries but rather observe how a result-processing approach (personalized categorization
of search results) can improve search performance.
4.3 Subjects
Prior to the experiments, we sent e-mails to students in the business school and the
computer science department of our university, as well as to some professionals in the
computer industry, to solicit their participation. In these e-mails, we explicitly listed the
predefined interests and skills we expected potential subjects to have. We also asked
several questions, including the following two self-reported ones:
1. When searching online for topics in the computer or finance domain, how would
you rate your search performance (with a search engine) in general?
(a) slow (b) normal (c) fast
2. How many hours do you spend on online browsing and searching per week (not
limited to your major)?
(a) [0, 7) (b) [7, 14) (c) [14+)
We verified their responses to ensure each subject possessed the predefined skills
and interests. After the experiments we did not manually verify the correctness of
subject-selected relevant documents. However, in our preliminary experiment with
different subjects, we manually examined all of the relevant documents chosen by
subjects and confirmed that, on average, nearly 90% of their choices were correct.
We assume that, with sufficient background, the subjects were capable of identifying
the relevant pages. Because we used PCAT in both experiments, no subject from
experiment I participated in experiment II. We summarize some demographic
characteristics of the subjects in tables 7-1 through 7-3.
Table 7-1.
Educational Status of Subjects
Experiment  Domain\Status  Undergraduate  Graduate  Professional  Total
I           Computer             3            7           4         14
            Finance              4           16           0         20
II          Computer             3           11           2         16
            Finance              0           20           0         20
Table 7-2.
Self-reported Performance on Search within a Domain
Experiment  Domain\Performance  Slow  Normal  Fast
I           Computer              0      8      6
            Finance               2     15      3
II          Computer              1      8      7
            Finance               2     11      7
Table 7-3.
Self-reported Time (hours) Spent Searching and Browsing Per Week
Experiment  Domain\Time (hours)  [0, 7)  [7, 14)  [14+)
I           Computer                1        9       4
            Finance                 5       10       5
II          Computer                2        7       7
            Finance                 2       11       7
To compare the two studied systems for each domain, we divided the subjects into
two groups, such that subjects in one group were as similar as possible to those in the
other with respect to their self-reported search performance, weekly browsing and
searching time, and educational status. We computed the mean log search
time for a task by averaging the log search times for each group.
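The mean log search time (the base-2 log transform of per-subject search times, averaged across a group) can be computed as in this small sketch; the sample times are invented:

```python
import math
from statistics import mean

def mean_log_search_time(times_seconds):
    """Average base-2 log of the search times (in seconds) recorded
    for one task across a group of subjects on the same system."""
    return mean(math.log2(t) for t in times_seconds)
```

For example, search times of 4 and 16 seconds yield log times of 2 and 4, and a mean log search time of 3.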
4.4 Experiment Process
In experiment I, all subjects used both PCAT and LIST and searched for the same
demo and non-demo tasks. As we show in Table 8, the program automatically switched between PCAT and LIST according to the task number and the group identified by the user id, so users in different groups always used different systems for the same task. The same
system-switching mechanism was adopted in experiment II to switch between PCAT and
CAT.
Table 8.
Distribution of System Uses by Tasks and User Groups

Group      First half   Second half   Non-demo      Non-demo
           demo tasks   demo tasks    tasks 1–13    tasks 14–26
Group one  PCAT         LIST          PCAT          LIST
Group two  LIST         PCAT          LIST          PCAT
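The switching rule in Table 8 can be sketched as a small function. This is an illustrative sketch: the function name and the `n_demo_tasks` default are our assumptions (the number of demo tasks is not stated here), and for experiment II one would pass `systems=("PCAT", "CAT")`:

```python
def assigned_system(group, is_demo, task_number, systems=("PCAT", "LIST"),
                    n_demo_tasks=4):
    """Which system a subject uses for a task, following the rule in Table 8.

    group: 1 or 2. Group one starts with the first system in `systems`;
    group two starts with the second. Halfway through the demo tasks, and
    between non-demo tasks 13 and 14, the assignment flips.
    """
    if is_demo:
        first_block = task_number <= n_demo_tasks // 2
    else:
        first_block = task_number <= 13  # non-demo tasks 1-13 vs. 14-26
    use_first = first_block if group == 1 else not first_block
    return systems[0] if use_first else systems[1]
```

This counterbalancing ensures that, for every task, the two groups use different systems, so per-task comparisons are between groups rather than within subjects.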
5 EVALUATIONS
In this section, we compare two pairs of systems (PCAT vs. LIST, PCAT vs.
CAT) on the basis of the mean log search time along two dimensions: query length and
type of task. We also test five hypotheses using the responses to a post-experiment
questionnaire provided to the subjects. Finally, we examine the differences in the indices of the relevant results across all tasks for the two pairs of systems.
5.1 Comparing Mean Log Search Time by Query Length
We first compared the two systems by query length. Tables 9-1 and 9-2 contain the average mean log search times across tasks with the same query length, ±1 standard error, for the different systems in the two experiments (lower values are better). The last column of each table provides the average mean log search time across all 26 search tasks, ±1 standard error. In most of the comparisons between PCAT and LIST (Table 9-1) or PCAT and CAT (Table 9-2) for a given domain and query length, PCAT has a lower average mean log search time. We conducted two-tailed t-tests to determine whether PCAT was significantly faster than LIST or CAT for different
domains and query lengths. Table 10 shows the degrees of freedom and p-values for the t-tests. The numbers in bold in Tables 9 and 10 highlight the systems with statistically significant differences (p < 0.05) in average mean log search times.
Table 9-1.
Average Mean Log Search Time across Tasks Associated with Four Types of Query (PCAT vs. LIST); entries are mean ± 1 standard error

Experiment I     One-word  Two-word  Three-word  Free-form  Total
Computer-PCAT    …         …         …           …          …
Computer-LIST    …         …         …           …          …
Finance-PCAT     …         …         …           …          3.97 ± 0.34
Finance-LIST     …         …         …           …          5.10 ± 0.26
Table 9-2.
Average Mean Log Search Time across Tasks Associated with Four Types of Query (PCAT vs. CAT); entries are mean ± 1 standard error

Experiment II    One-word     Two-word  Three-word  Free-form    Total
Computer-PCAT    4.14 ± 0.26  …         …           3.88 ± 0.19  4.30 ± 0.15
Computer-CAT     4.96 ± 0.26  …         …           4.94 ± 0.34  5.17 ± 0.17
Finance-PCAT     …            …         …           4.10 ± 0.35  4.46 ± 0.14
Finance-CAT      …            …         …           5.10 ± 0.25  5.11 ± 0.16
Table 10.
The t-test Comparisons (degrees of freedom, p-value)

Experiment         Domain    One-word   Two-word   Three-word  Free-form  Total
I (PCAT vs. LIST)  Computer  10, 0.058  10, 0.137  6, 0.517    18, 0.796  50, 0.116
                   Finance   14, 0.015  10, 0.370  10, 0.752   10, 0.829  50, 0.096
II (PCAT vs. CAT)  Computer  10, 0.147  10, 0.050  6, 0.309    18, 0.013  50, 0.0005
                   Finance   14, 0.193  10, 0.152  10, 0.237   10, 0.041  50, 0.003
In Table 10, for both the computer and finance domains, PCAT has a lower mean log search time than LIST for one-word query tasks with greater than 90% statistical significance. The two systems are not statistically significantly different for tasks associated with two-word, three-word, or free-form queries. Compared with a long query, a one-word query may be more vague or incomplete, so a search engine may not place relevant pages among its top results, whereas PCAT may show the relevant result at the top of a user-interest category. The user therefore can “jump” directly to the right category in PCAT and locate the relevant document quickly.
Compared with CAT, PCAT has a significantly lower mean log search time for
free-form queries (p < 0.05). The better performance of PCAT can be attributed to two
main factors. First, the number of categories in the result area for CAT is often large
(about 20), so even if the categorization is accurate, the user still must commit additional
search effort to sift through the various categories. Second, the categorization of CAT
might not be as accurate as that of PCAT because of the much larger number (8,547) of
potential categories, which can be expected to be less helpful in disambiguating a vague
or incomplete query. The fact that category labels in CAT are longer than those in PCAT
may also have a marginal effect on the time needed for scanning them.
For all 26 search tasks, PCAT has a lower mean log search time than LIST or CAT with 90% or higher statistical significance, except for the computer domain in experiment I, which has a p-value of 0.116. When computing the p-values across all tasks, we note that the result depends on the distribution of query lengths and task types. Therefore, it is important to drill down into the systems' performance for each type of task.
For reference, Table 11 shows the systems' performance in terms of the number of tasks that had a lower mean log search time for each query length. For example, the entry “4 vs. 2” for one-word queries in the computer domain of experiment I indicates that four out of the six one-word query tasks had a lower mean log search time with PCAT, whereas two had a lower mean log search time with LIST.
Table 11.
Numbers of Tasks with a Lower Mean Log Search Time

Experiment         Domain    One-word  Two-word  Three-word  Free-form  Total
I (PCAT vs. LIST)  Computer  4 vs. 2   6 vs. 0   3 vs. 1     6 vs. 4    19 vs. 7
                   Finance   6 vs. 2   5 vs. 1   3 vs. 3     3 vs. 3    17 vs. 9
II (PCAT vs. CAT)  Computer  4 vs. 2   5 vs. 1   3 vs. 1     10 vs. 0   22 vs. 4
                   Finance   6 vs. 2   6 vs. 0   5 vs. 1     6 vs. 0    23 vs. 3
5.2 Comparing Mean Log Search Time for Information Gathering Tasks
According to Sellen et al. [2002], during information gathering, a user finds
multiple pages to answer a set of questions. Figure 13 compares the mean log search
times of the ten search tasks in the computer domain in experiment I that required the
user to find two relevant results for each task. We sorted the tasks by the differences in
their mean log search times between PCAT and LIST. PCAT allowed the users to finish eight of the ten Information Gathering tasks more quickly than did LIST (t(18), p = 0.005), possibly because PCAT groups similar results into the same category. Therefore, if one page in a category is relevant, the other results in that category
are likely to be relevant as well. This spatial localization of relevant results enables
PCAT to perform this type of task faster than LIST. For the computer domain,
experiment II has a similar result, in that PCAT is faster than CAT (t(18), p = 0.007).
Since the finance domain contains only two Information Gathering tasks (too few to
make a statistically robust argument), we only report the mean log search times for the
tasks in Table 12. We observe that the general trend of the results for the finance domain
is the same as for the computer domain (i.e., PCAT has lower search time than LIST or
CAT).
[Figure: x-axis Task (1–10); y-axis Mean Log Search Time (0.0–7.0); series PCAT, LIST]
Figure 13. Mean log search times for Information Gathering tasks (computer domain)
Table 12.
Mean Log Search Times for Information Gathering Tasks (Finance Domain)

                              Experiment I      Experiment II
                              PCAT    LIST      PCAT    CAT
Information Gathering task 1  6.33    6.96      6.23    7.64
Information Gathering task 2  4.62    5.13      4.72    5.61
5.3 Comparing Mean Log Search Time for Site Finding Tasks
In the computer domain, there were six tasks related to finding particular sites,
such as “Find the home page for the University of Arizona AI Lab.” All six tasks were
associated with free-form queries, and we note that the queries from all subjects
contained site names. Therefore, according to Craswell et al. [2001], those tasks were
Site Finding tasks. Table 13 shows the average mean log search times for the Site Finding
tasks, ±1 standard error. There is no significant difference (t(10), p = 0.508) between
PCAT and LIST, as shown in Table 14. This result seems reasonable because for this
type of search task, LIST normally shows the desired result at the top of the first result
page when the site name is in the query. Even if PCAT tended to rank the desired result at the top of a certain category, users often found it faster with the LIST layout, possibly because with PCAT they had to move to the proper category first and then look for the relevant result. However, there is a significant difference between PCAT
and CAT (t(10), p = 0.019); again, the larger number of output categories in CAT may
have required more time for a user to find the relevant site, given that both CAT and
PCAT arrange the output categories alphabetically.
Table 13.
Average Mean Log Search Times (±1 Standard Error) for Six Site Finding Tasks in the Computer Domain

Experiment         System  Average mean log search time
I (PCAT vs. LIST)  PCAT    …
                   LIST    …
II (PCAT vs. CAT)  PCAT    3.51 ± 0.12
                   CAT     4.46 ± 0.32
5.4 Comparing Mean Log Search Time for Finding Tasks
As Table 14 shows, for 16 Finding tasks in the computer domain, we do not
observe a statistically significant difference in the mean log search time between PCAT
and LIST (t(30), p = 0.592), but the difference between PCAT and CAT is significant
(t(30), p = 0.013). However, PCAT has lower average mean log search time than both
LIST and CAT. Similarly, for the 24 Finding tasks in the finance domain, PCAT achieves a lower mean log search time than both LIST (t(46), p = 0.101) and CAT (t(46), p = 0.002). Of the 16 Finding tasks in the computer domain, 6 are Site Finding tasks, whereas the finance domain has only 2 of its 24. To a certain extent, this composition explains our observations about Finding tasks in the computer domain. We conclude that PCAT had a significantly lower mean log search time for Finding tasks than CAT but not than LIST.
Table 14.
The t-tests for Finding Tasks (degrees of freedom, p-value)

Experiment         Domain    Type of task                      Degrees, p-value
I (PCAT vs. LIST)  Computer  Site Finding                      10, 0.508
                   Computer  Finding (including Site Finding)  30, 0.592
                   Finance   Finding (including Site Finding)  46, 0.101
II (PCAT vs. CAT)  Computer  Site Finding                      10, 0.019
                   Computer  Finding (including Site Finding)  30, 0.013
                   Finance   Finding (including Site Finding)  46, 0.002
5.5 Questionnaire and Hypotheses
After a subject finished the search tasks with the two systems, he or she filled out
a questionnaire with five multiple-choice questions designed to compare the two systems
in terms of their usefulness and ease of use. We use their answers to test several
hypotheses relating to the two systems.
5.5.1 Questionnaire
Subjects completed a five-item, seven-point questionnaire in which their
responses could range from (1) strongly disagree to (7) strongly agree. (The phrase
“system B” was replaced by “system C” in experiment II. As explained in footnote 13,
systems A, B, and C refer to PCAT, LIST, and CAT, respectively.)
Q1. System A allows me to identify relevant documents more easily than system B.
Q2. System B allows me to identify relevant documents more quickly than system A.
Q3. I can finish search tasks faster with system A than with system B.
Q4. It’s easier to identify one relevant document with system B than with system A.
Q5. Overall I prefer to use system A over system B.
5.5.2 Hypotheses
We developed five hypotheses corresponding to these five questions. (The phrase
“system B” was replaced by “system C” for experiment II.)
H1. System A allows users to identify relevant documents more easily than system B.
H2. System B allows users to identify relevant documents more quickly than system A.
H3. Users can finish search tasks more quickly with system A than with system B.
H4. It is easier to identify one relevant document with system B than with system A.
H5. Overall, users prefer to use system A over system B.
5.6 Hypothesis Test Based on Questionnaire
Table 15 shows the mean responses to each question in the questionnaire. Based on the seven scale options described in Section 5.5.1, we computed the numbers in this table by replacing “strongly disagree” with 1, “strongly agree” with 7, and so on.
Table 15.
Mean Responses to Questionnaire Items. Degrees of Freedom: 13 for Computer and 19 for Finance in Experiment I; 15 for Computer and 19 for Finance in Experiment II.

Experiment         Domain    Q1       Q2       Q3       Q4       Q5
I (PCAT vs. LIST)  Computer  6.21***  2.36***  5.43*    2.71*    5.57**
                   Finance   5.25     3.65*    5.45***  3.65**   5.40**
II (PCAT vs. CAT)  Computer  6.25***  2.00***  6.06***  2.50***  6.31***
                   Finance   6.20***  1.90***  6.20***  2.65*    6.50***

*** p < 0.001, ** p < 0.01, * p < 0.05.
Because each question in Section 5.5.1 corresponds to a hypothesis in Section 5.5.2, we conducted a two-tailed t-test on the subjects' responses to each question to test the corresponding hypothesis. We calculated p-values by comparing the subjects' responses with the neutral response, “neither agree nor disagree,” which has a value of 4. The table shows that for both the computer and finance domains, H1, H3, and H5 are supported with at least 95% significance, and H2 and H4 are not supported.18 The only exception is that we find only 90% significance (p = 0.083) for H1 in the finance domain of experiment I. According to
18 For example, the mean response in the computer domain for H2 was 2.36 with p < 0.001. On our scale, 2 means “disagree” and 3 means “mildly disagree,” so a score of 2.36 indicates that the subjects disagreed with H2. Hence, we claim that H2 is not supported. The same holds for H4.
these responses on the questionnaire, we conclude that users perceive PCAT as a system
that allows them to identify relevant documents more easily and quickly than LIST or
CAT.
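The test above is a one-sample, two-tailed t-test of the mean response against the neutral value 4. A minimal sketch of the t statistic is shown below; the function name and the sample data in the usage note are ours, and in practice the p-value would come from the t distribution with n - 1 degrees of freedom (e.g., via `scipy.stats.ttest_1samp`):

```python
import math

def t_statistic(responses, mu=4.0):
    """One-sample t statistic for Likert responses against a neutral mean mu."""
    n = len(responses)
    mean = sum(responses) / n
    # Unbiased sample variance (n - 1 in the denominator).
    var = sum((x - mean) ** 2 for x in responses) / (n - 1)
    return (mean - mu) / math.sqrt(var / n)
```

For instance, hypothetical responses [6, 5, 7, 6] have mean 6 and yield t ≈ 4.90, far from the neutral value 4.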
Several results reported in a recent work [Käki 2005] are similar to our findings. In particular:
- Categories are helpful when document ranking in a list interface fails, which fits with our explanation of why PCAT is faster than LIST for short queries.
- When the desired results are found at the top of the list, the list interface is faster, in line with our result and analysis pertaining to Site Finding tasks.
- Categories make it easier to access multiple results, consistent with our report for the Information Gathering tasks.
However, the categorization employed in [Käki 2005] does not use examples to
build a classifier. The author simply identifies some frequent words and phrases in search
result summaries and uses them as category labels. Hence, each frequent word or phrase
becomes a category (label). A search result is assigned to a category if the result’s
summary contains the category label. Käki [2005] also does not analyze or compare the
two interfaces according to different types of tasks. Moreover, Käki [2005: Figure 4]
shows, though without explicit explanations, that categorization is always slower than a
list. This result contradicts our findings and several prior studies [e.g., Dumais and Chen
2001]. We notice that the system described by Käki [2005] uses a list interface to show
the search results by default, so a user may always look for a desired page from the list
interface first and switch to the category interface only if he or she does not find it within
a reasonable time.
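The label-based categorization just described can be sketched as below. This is an illustrative reconstruction, not Käki's actual system; the function name and the frequency threshold are our assumptions:

```python
from collections import Counter

def label_categories(summaries, min_freq=2):
    """Frequent summary words become category labels; a result joins a
    category when its summary contains that label (no trained classifier)."""
    # Count each word once per summary.
    counts = Counter(w for s in summaries for w in set(s.lower().split()))
    labels = [w for w, c in counts.items() if c >= min_freq]
    return {lab: [i for i, s in enumerate(summaries)
                  if lab in s.lower().split()]
            for lab in labels}
```

Unlike PCAT's example-trained classifiers, this scheme needs no training data, but its categories are only as coherent as the surface vocabulary of the result summaries.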
5.7 Comparing Indices of Relevant Results
To better understand why PCAT was perceived as faster and easier to use by the
subjects as compared with LIST or CAT, we looked at the indices of relevant results in
the different systems. An expert from each domain completed all search tasks using
PCAT and LIST. Using the relevant results they identified, we compare the indices of the relevant search results for the two systems, as shown in Figs. 14-1 and 14-2.
[Figure: x-axis Task (1–26); y-axis Index (0–35); series PCAT (Computer), LIST (Computer)]
Figure 14-1. Indices of relevant results in PCAT and LIST (computer domain)
[Figure: x-axis Task (1–26); y-axis Index (0–45); series PCAT (Finance), LIST (Finance)]
Figure 14-2. Indices of relevant results in PCAT and LIST (finance domain)
We sort the tasks by the index differences between LIST and PCAT in ascending
order. Thus, the task numbers on the x-axis are not necessarily the original task numbers
in our experiments. Because PCAT organizes the search results into different categories
(interests), the index of a result reflects the relative position of that result under a
category. In LIST, a relevant result’s index number equals its relative position on the
particular page on which it appears plus ten (i.e., the number of results per page) times
the number of preceding pages. Thus, a result that appears in the fourth position on the
third page would have an index number of 24 (4 + 10 × 2). If users had to find two
relevant results for a task, we took the average of the indices. In Figure 14-1, PCAT and
LIST share the same indices in 10 of 26 tasks, and PCAT has lower indices than LIST in
15 tasks. In Figure 14-2, PCAT and LIST share the same indices in 7 of 26 tasks, and
PCAT has smaller indices than LIST in 18 tasks.
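The LIST index computation described above can be written out as a small helper; the function names are ours:

```python
def list_index(position_on_page, page_number, results_per_page=10):
    """Index of a relevant result in the LIST interface: its position on the
    page plus results-per-page times the number of preceding pages."""
    return position_on_page + results_per_page * (page_number - 1)

def task_index(positions_pages):
    """Average index for a task that requires more than one relevant result."""
    idx = [list_index(pos, page) for pos, page in positions_pages]
    return sum(idx) / len(idx)
```

For example, a result in the fourth position on the third page has index 4 + 10 × 2 = 24, matching the example in the text.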
Similarly, Figs. 15-1 and 15-2 show the indices of the relevant search results of PCAT and CAT in experiment II. The data for PCAT in Figs. 15 are the same as those in Figs. 14, and we sort the tasks by the index differences between PCAT and CAT in ascending order. In Figure 15-1, for the computer domain, PCAT and CAT share the same indices in 15 of 26 tasks, and CAT has lower indices in 6 tasks. In Figure 15-2, for the finance domain, the two systems share the same indices in 10 of 26 tasks, and CAT has lower indices in 14 of 26 tasks.
[Figure: x-axis Task (1–26); y-axis Index (0–4.5); series PCAT (Computer), CAT (Computer)]
Figure 15-1. Indices of relevant results in PCAT and CAT (computer domain)
[Figure: x-axis Task (1–26); y-axis Index (0–9); series PCAT (Finance), CAT (Finance)]
Figure 15-2. Indices of relevant results in PCAT and CAT (finance domain)
The indices for PCAT in Figs. 14 and 15, and for CAT in Figs. 15, reflect an assumption that a user first jumps to the right category and then finds a relevant page by looking through the results under that category. This assumption may not always hold, so Figs. 14 may be optimistic in favor of PCAT. However, if the time taken to locate the right category is small (as is probably the case for PCAT), the figures provide a possible explanation for some of the results we observe, such as the lower search times for PCAT with one-word queries and Information Gathering tasks in experiment I. CAT, on the other hand, has smaller index numbers for relevant results than PCAT, which may seem to contradict the better performance (lower search time) of PCAT in experiment II. We note that, due to its non-personalized nature, CAT has a much larger number of potential categories than PCAT. Therefore, a user can be expected to take longer to locate the right category (before jumping to the relevant result in it) than with PCAT.
5.8 CONCLUSIONS
This article presents an automatic approach to personalizing Web searches given a
set of user interests. The approach is well suited for a workplace setting, where
information about professional interests and skills can be obtained automatically from an
employee’s resume or a database using an IE tool or database queries. We present a
variety of mapping methods, which we combine into an interest-to-taxonomy mapping
framework. The mapping framework automatically maps and resolves a set of user
interests with a group of categories in the ODP taxonomy. Our approach then uses data
from ODP to build text classifiers to automatically categorize search results according to
various user interests. This approach has several advantages, in that it does not (1) collect
a user’s browsing or search history, (2) ask a user to provide explicit or implicit feedback
about the search results, or (3) require a user to manually specify the mappings between
his or her interests and taxonomy categories. In addition to mapping interests into
categories in a Web directory, our mapping framework can be applied to other types of
data, such as queries, documents, and e-mails. Moreover, the use of the taxonomy is transparent to the user.
We implemented three search systems: A (personalized categorization system, PCAT), B (list interface system, LIST), and C (non-personalized categorization system,
CAT). PCAT followed our proposed approach and categorized search results according
to a user’s interests, whereas LIST simply displayed search results in a page-by-page list,
similar to conventional search engines, and CAT categorized search results using a large
number of ODP categories without personalization. We experimentally compared two
pairs of systems with different interfaces (PCAT vs. LIST and PCAT vs. CAT) in two
domains, computer and finance. We recruited 14 subjects for the computer domain and
20 subjects for the finance domain to compare PCAT with LIST in experiment I, and 16
in the computer domain and 20 in finance to compare PCAT with CAT in experiment II.
There was no common subject across the experiments. Based on the mean log search
times obtained from our experiments, we examined search tasks associated with four
types of queries. We also considered different types of search tasks to tease out the
relative performances of the compared systems as the nature of the task varied.
We find that PCAT outperforms LIST for searches with short queries (especially
one-word queries) and for Information Gathering tasks; by providing personalized
categorization results, PCAT is also better than CAT for searches with free-form queries
and for both Information Gathering and Finding tasks. From subjects’ responses to five
questionnaire items, we conclude that, overall, users identify PCAT as a system that
allows them to find relevant pages more easily and quickly than LIST or CAT.
Considering the fact that most users (even non-casual users) often cannot issue
appropriate queries or provide query terms to fully disambiguate what they are looking
for, a PCAT approach could help users find relevant pages with less time and effort. In
comparing two pairs of search systems with different presentation interfaces, we realize
that no system with a particular interface is universally more efficient than the other, and
the performance of a search system depends on parameters such as the type of search task
and the query length.
5.9 LIMITATIONS AND FUTURE DIRECTIONS
Our search tasks were generated on the basis of user interests. We recognize some limitations of this experimental setup in adequately capturing the workplace scenario. The first limitation is that some of the user interests may not be known in a real-world application, and hence some search tasks may not reflect the known user interests. Second, a worker may search for information that is unrelated to his or her job. In both cases, tasks may not match any of the known interests. However, these limitations reflect the general fact that personalization can help only to the extent that something is known about the user. A future direction of research is to model the dynamics of user interests over time.
For the purposes of a comparative study, we carefully separated the personalized
system (PCAT) from the non-personalized (CAT) by maintaining a low overlap between
the two systems. This allows us to understand the merits of personalization alone.
However, we can envision a new system that is a combination of the current CAT and
PCAT systems.
In particular, the new system would replace the “Other” category in PCAT with the ODP categories that match the results currently placed in the “Other” category.
A study of such a PCAT+CAT system can be a future direction for this research. An
interesting and related direction is a smart system that can automatically choose a proper
interface (e.g., categorization, clustering, list) to display search results on the basis of the
nature of the query, the search results, and the user interest profile (context).
As shown in Figs. 5 and 8, for PCAT in experiments I and II and CAT in
experiment II, we rank the categories alphabetically but always leave the “Other”
category at the end.19 There are various alternatives for the order in which categories are
displayed, such as by the number of (relevant) results in each category or by the total
relevance of results under each category. We recognize that choosing different methods
may provide different individual and relative performances. Also, CAT tends to show
more categories on the main page than PCAT. On one hand, more categories on a page
may be a negative factor for locating a relevant result. On the other hand, more categories
provide more results on the same page, which may speed up the discovery of a relevant result compared with clicking a “More” link to open another window (as in the PCAT system). We think that the issues of category ordering and the number of categories on a page deserve further examination.
From the subjects' log files, we observed that when some subjects could not find a relevant document under a relevant category due to result misclassification, they moved to another category or tried a new query. Such situations can be expected to increase the search time for categorization-based systems. Thus, another direction of future research is to compare different result classification techniques based on their effect on the mean log search time.
It would be worthwhile to study the performance of result categorization using other types of data, such as titles and snippets (from search engine results) instead of page content, which would save the time spent fetching Web pages. In addition, it may be interesting to examine how a user could improve his or her performance in Internet
19 For the computer domain in experiment I, PCAT shows C++ and Java before other alphabetically ordered interests, and the “Other” category is at the end.
searches in a collaborative (e.g., intranet) environment. In particular, we would like to
measure the benefit the user can derive from the search experiences of other people with
similar interests and skills in a workplace setting.
ACKNOWLEDGMENTS
During program development, in addition to the software tools mentioned in
prior sections of the paper, we employed BrowserLauncher20 by Eric Albert and XOM21
(an XML API) by Elliotte Rusty Harold. We thank them for their work. We would also like
to thank the Associate Editor and anonymous reviewers for their helpful suggestions.
20 http://browserlauncher.sourceforge.net/.
21 http://www.cafeconleche.org/XOM/.
DRAFT DISSERTATION PART II BUSINESS RELATIONSHIP DISCOVERY
In Part II, Chapter 6 covers research motivation, presents our approach at a high
level, and reviews prior literature. Chapter 7 introduces network-based attributes and
explains data and data processing related topics. These two chapters are fundamental to
the next two chapters. Chapter 8 studies CRR prediction and Chapter 9 focuses on
competitor discovery. Hereafter, we use the following pairs of terms interchangeably:
network and graph, node and company, and link and company pair (or pair of companies).
6 INTRODUCTION AND LITERATURE REVIEW
6.1 Introduction
Business news contains rich and current information about companies and the
relationships among them. Online business news from media companies (e.g., Reuters),
content providers (e.g., Yahoo!), and company Web sites offer readers timely
assessments of dynamic company relationships. Reading news, however, is very time
consuming and requires a reader to possess certain skills, the most basic of which is a
good understanding of the language in which the news is written. Moreover, the huge
volume of news stories makes manual identification of relationships among a large
number of companies, without automated news analysis, nontrivial and unscalable.
For professional or personal finance–related interests, many people regularly spend
significant amounts of time scanning the news to monitor companies’ recent financial
milestones. For tasks such as investment or market research, researchers often need to
compare a pair of companies or identify top-performing companies on the basis of
revenue. Company revenue relationships are dynamic, and information about them
may not be readily or continuously available. Public companies typically update their
earnings or balance sheet data on a quarterly basis, whereas the availability of
financials for private, initial public offering (IPO), or foreign companies is even more limited.
Scanning the competitive environment of a company or a group of companies is
essential for supply chain, marketing, investment, and strategic partnership
management. Once its competitors have been identified, a company can examine their
product lines, marketing strategies, R&D directions, key personnel, customers,
suppliers, and so on to potentially improve its competitive advantage. Analysts and
managers may resort to various options for discovering and monitoring competitor
relationships, including asking business associates (e.g., customers or suppliers),
reading news, searching the Web, attending business conventions, and consulting
company profile resources such as Hoover’s22 and Mergent23.
While the availability of company profiling resources has reduced the search effort
and made some business relationship information easily accessible, the other
above-mentioned approaches, due to their largely manual nature, remain time
consuming and limited in scale. Moreover, because they may use different criteria for
collecting and identifying information, businesses that provide company profiles also suffer
22 Hoover’s, Inc., http://www.hoovers.com.
23 Mergent Inc., http://www.mergentonline.com.
from a scalability problem due to limited resources, manpower, and budget, leading
to incomplete and inconsistent information. For example, Hoover’s considers
Interchange Corp. a competitor of Google, while Mergent does not specify this
relationship. Conversely, Mergent includes Tercica Inc. as a competitor of
GlaxoSmithKline plc, while Hoover’s does not. Therefore, it is important to explore
approaches that automatically discover important business relationships and thereby
complement and extend existing time-consuming efforts. An automated approach also
allows for timely updates of business relationships, avoiding the information
staleness that can mar manual approaches.
Social network analysis (SNA) refers to a set of research procedures for
identifying and quantifying structures in a social network on the basis of relationships
among the nodes [Richards and Barnett 1993]. A social network consists of a set of
nodes, such as individuals or organizations, connected through edges that represent
various relationships (e.g., friendship, affiliation) [Wasserman and Faust 1994]; such
relationships tend to be simple to identify yet voluminous to analyze. Analyzing
quantitative measures of the information represented by the nodes and edges of social
networks has proved a feasible and effective way to discover network structures in
diverse fields, such as social and behavioral science, anthropology, psychology
[Scott 2000], and information science.
In this study, we present an approach that applies SNA and machine learning
techniques for automated discovery of business relationships. In particular, we study two
different relationships, CRR and competitor relationships, as two illustrative examples of
our approach. Figure 16 illustrates the main steps for discovery of the two relationships at
a high level. First, starting with a collection of news stories organized by company, and
noting that a news story pertaining to a company often cites one or more other companies,
we identify company citations in news stories, treat them as links from the focal
(source) companies to the cited (target) companies, and construct a directed,
weighted intercompany network. Next, we identify four types of network attributes
based on network topology; the four types differ in their coverage of the
intercompany network. Finally, we feed these attributes to classification
methods to predict the CRR and discover competitor relationships between pairs of companies.
This approach is effective and scalable for business relationship screening and can be
extended to automated discovery of a broad range of business relationships. Moreover,
the approach is language neutral (i.e., we do not analyze the vocabulary or grammar of
news stories to find relationships), which can help extend it to
news written in languages other than English.
Figure 16. A High-Level Process View for Studying CRR and Competitor Relationships
6.2 Literature Review
Many researchers in areas such as organization behavior and sociology have
investigated the nature and implications of social networks created by business
relationships. For example, Levine [1972], using a network of interlocked directorates
between major banks and large industrial companies, constructs a map of the “sphere of
influence” that provides a quick (though approximate) overview of the relations (e.g.,
well-linked bank–company ties) in the network. Walker et al. [1997] examine an
interfirm network on the basis of cooperative relationships from a commercial directory
of biotechnology firms. Using regression techniques with ten independent variables, they
demonstrate that network structure strongly influences the choices of a biotechnology
startup in terms of establishing new relationships (licensing, joint venture, and R&D
partnership) with other companies. Uzzi [1999] investigates how social relationships and
networks affect a firm’s acquisition and cost of capital. Gulati and Gargiulo [1999]
demonstrate that an existing interorganizational network structure affects the formation of
new alliances, which eventually modifies the existing network. A major difference
between those prior studies and ours is that the prior works construct a social network
from explicitly given relationships in gold-standard data sources, whereas we predict a
business relationship, namely the CRR, between two companies using structural
attributes derived from a citation-based intercompany network.
Research in information retrieval and bibliometrics has previously applied SNA
and graph-theoretic techniques to networks of documents. Such work treats implicit
signals, such as URL links, email communications, or article citations, as links between
nodes and studies problems such as identifying the importance of individual nodes in
the network [e.g., Brin and Page 1998; Kleinberg 1999; Garfield 1979] and identifying
communities on the Web [e.g., Kautz et al. 1997; Gibson et al. 1998], rather than
discovering business relationships between companies.
For example, articles such as scholarly publications can be considered to be
connected with one another through citations. A citation index indexes the citations
among such articles [Garfield 1979]. Using a citation index, a researcher can find not
only articles that a given article cites but also articles that cite the given article. CiteSeer
[Giles et al. 1998] is an example of an autonomous citation indexing system that
retrieves, indexes, and builds bibliographic and citation databases from research articles
on the Web. Furthermore, analyses of the networks created by citations have led to
various measures of prestige and the impact of published articles and the journals in
which they appear. Some measures closely resemble measurements of Web page
“popularity” [Brin and Page 1998] used by Web search engines such as Google.
Park [2003] identifies hyperlink network analysis as a subset of SNA, in which nodes are
Web sites and the relationships are URL links among sites. In such a network, the
linkages among sites reflect the authority, prestige, or trust of the sites [Kleinberg 1999,
Palmer et al. 2000]. Brin and Page [1998] propose the PageRank algorithm, which ranks
the nodes (pages) of the WWW network with directed URL links among pages and uses
the ranks of pages to order search results. Kleinberg [1999] presents the Hyperlink-Induced
Topic Search (HITS) algorithm, which computes “hub” and “authority” importance
measures for each node (page), also on the basis of the link structure of the WWW.
Bernstein et al. [2002] apply a commercial information extraction system to
extract company entities from Yahoo business news and posit that two companies have a
relationship (link) if they appear in the same piece of news (co-occurrence approach).
The resulting network, which consists of 1,790 identified companies and in which links
between companies are undirected and unweighted (binary weight), illustrates some central
industry players. They further filter nodes to produce a smaller
network of 315 companies and 1,047 links, count how many other
companies are connected to each company, rank all companies by these counts, and
show that some of the 30 top-ranked companies in the computer industry are also
Fortune 1000 companies. Hence, their result indicates that companies with high revenues
tend to be linked to many other companies in a network derived purely from news stories.
Their work resembles our study in that they use online business news to
construct an intercompany network. However, unlike Bernstein et al. [2002], we qualify
the links in the constructed network by both direction and weight. Furthermore, unlike
the abovementioned research, we employ various graph-based metrics to predict the
CRR between any pair of companies linked in a network that contains tens of thousands
of such company pairs.
7 NETWORK-BASED ATTRIBUTES AND DATA
In this chapter, we first introduce relevant notation for directed graphs, followed by
notation for directed, weighted graphs. We then describe the data and data processing
procedures and, to provide statistical insight into the data, report the distributions of the
various network attributes.
7.1 Notation in Directed Graphs
Figure 17. Directed Graph
Figure 17 presents a directed graph (digraph) that consists of four nodes joined by
eight directed links. More formally, a digraph Gd = (N, L) consists of a set of nodes N and
a set of links L, where
N = (n1, n2, …, nm) and
L = (l1, l2, …, lk), where li = <nsource, ntarget>.
The node indegree, NID(ni), in a digraph is the number of nodes linked to ni; the
node outdegree, NOD(ni), is the number of nodes linked from ni [Wasserman and Faust
1994]. Node indegree, or a metric based on it, has often been used to represent authority
and prestige in prior works [e.g., Brin and Page 1998, Kleinberg 1999]. In Figure 17,
NID(n1) = 3 and NOD(n1) = 2, while NID(n4) = 1 and NOD(n4) = 2.
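In code, these degree counts fall out of a single pass over the link list. A minimal Python sketch on a hypothetical four-node digraph; the text does not enumerate the exact links of Figure 17, so the link set below is illustrative, chosen only to reproduce the stated degrees:

```python
from collections import Counter

# Hypothetical directed links <source, target>; not the actual link set of
# Figure 17, which the text does not enumerate. Four nodes, eight links.
links = [("n1", "n2"), ("n1", "n4"), ("n2", "n1"), ("n2", "n3"),
         ("n3", "n1"), ("n3", "n2"), ("n4", "n1"), ("n4", "n3")]

# NID(ni): number of nodes linked to ni; NOD(ni): number linked from ni.
nid = Counter(target for _, target in links)
nod = Counter(source for source, _ in links)

print(nid["n1"], nod["n1"])  # 3 2
print(nid["n4"], nod["n4"])  # 1 2
```

The same counts generalize to any edge list, since indegree and outdegree are just frequencies of a node appearing as target or source.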
7.2 Notation in Directed, Weighted Graphs
Web portals such as Yahoo! Finance and Google Finance provide news stories
arranged by company. A news story pertaining to a company (source company) often
cites one or more other companies, referred to as target companies. We consider each
company citation a directed link (outlink) from the source company to a target, and
each citation adds a unit of weight to the link. The link weight between two
companies is thus the accumulated citation count across a set of news stories.
Figure 18 depicts a digraph in which each link carries a weight. The graph, a very small
portion of the intercompany network, consists of five companies (nodes) joined by 15
directed, weighted links. More formally, a weighted digraph Gwd = (N, L, W) includes
N, L, and a weight vector W associated with the set of links, where W = (w1, w2, …, wk).
We derive from the intercompany network various attributes that characterize
either a node (one value per node) or a pair of nodes (one value per pair). We
divide these attributes into four types (see Table 16) on the basis of the range of the
network covered in computing them, and describe them as follows.
Figure 18. Directed, Weighted Graph
DELL: Dell Inc., INCX: Interchange Corp., GOOG: Google Inc., JPM: JP Morgan Chase
& Co., YHOO: Yahoo! Inc.
7.2.1 Dyadic and Node Degree-based Attributes
We first introduce a group of dyadic degree-based attributes as follows.
Dyadic weighted indegree (DWID): DWID(ni, nj) is the weight of the link from nj
to ni.
In Figure 18, DWID(YHOO, GOOG) = 478.
Dyadic weighted outdegree (DWOD): DWOD(ni, nj) is the weight of the link
from ni to nj.
Again based on Figure 18, DWOD(YHOO, GOOG) = 512. We note that
DWID(YHOO, GOOG) and DWOD(YHOO, GOOG) are both large (compared with
other pairs) and almost equal. News stories about two competing
companies can be expected to frequently cite the other company, and the volume
of citations for each company can be expected to be almost equal when there is no
absolute winner (e.g., monopoly).
Dyadic weighted netdegree (DWND)
DWND(ni, nj) = DWOD(ni, nj) – DWID(ni, nj) (1)
Hence, DWND(YHOO, GOOG) = 512 – 478 = 34 shows a net flow of citations
toward GOOG for the pair <YHOO, GOOG>.
The positive net flow to GOOG may indicate its slight dominance as reflected in
news citations.
Dyadic weighted inoutdegree (DWIOD)
DWIOD(ni, nj) = DWOD(ni, nj) + DWID(ni, nj) (2)
Again, DWIOD(YHOO, GOOG) = 990, which is relatively large compared
with the other links in the example network. A large DWIOD value may indicate a strong
relationship between the given pair of companies.
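The dyadic attributes reduce to simple arithmetic over link weights. A minimal Python sketch using the two Figure 18 link weights quoted in the text (478 and 512); the weight table below contains only that pair, not the full figure:

```python
# Link weights quoted in the text for the Figure 18 example:
# YHOO -> GOOG carries weight 512, GOOG -> YHOO carries weight 478.
weights = {("YHOO", "GOOG"): 512, ("GOOG", "YHOO"): 478}

def dwid(ni, nj):
    """DWID(ni, nj): weight of the link from nj to ni."""
    return weights.get((nj, ni), 0)

def dwod(ni, nj):
    """DWOD(ni, nj): weight of the link from ni to nj."""
    return weights.get((ni, nj), 0)

def dwnd(ni, nj):
    """Dyadic weighted netdegree, Eq. (1)."""
    return dwod(ni, nj) - dwid(ni, nj)

def dwiod(ni, nj):
    """Dyadic weighted inoutdegree, Eq. (2)."""
    return dwod(ni, nj) + dwid(ni, nj)

print(dwnd("YHOO", "GOOG"))   # 512 - 478 = 34
print(dwiod("YHOO", "GOOG"))  # 512 + 478 = 990
```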
The dyadic nature of these attributes captures the flow of citations and hence
potential relationships between a pair of companies. However, dyadic attributes consider
only a pair of connected nodes. To take into account a given node’s neighbors, we
consider the following node degree-based attributes.
Node weighted indegree (NWID)
NWID(ni) = Σj DWID(ni, nj) (3)
This measures the flow of citations from all companies in the network to the
given company. We expect “important” companies to draw a large total
number of citations in news from other companies.
Node weighted outdegree (NWOD)
NWOD(ni) = Σj DWOD(ni, nj) (4)
This measures the flow of citations from the given company to all other
companies in the network.
Node weighted inoutdegree (NWIOD)
NWIOD(ni) = NWID(ni) + NWOD(ni) (5)
This measures the overall flow of citations both to and from the given company
(ni). In essence, this attribute measures the overall connectivity of the given company
and all neighbor companies in the network independent of the direction of citations.
In Figure 18, for node n1 (YHOO), the NWID, NWOD, and NWIOD values are
513, 541, and 1,054, respectively. If a pair of companies has a large DWIOD value as
well as large individual NWIOD values, the two companies may have a
strong relationship and both be important players.
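Equations (3)–(5) are sums over a node's incident link weights. A sketch on a small hypothetical weighted digraph (the full set of Figure 18 weights is not enumerated in the text, so the values below are illustrative only):

```python
# Hypothetical weighted digraph: (source, target) -> accumulated citation count.
# Illustrative values, not the actual Figure 18 weights.
weights = {("A", "B"): 10, ("B", "A"): 7, ("C", "A"): 5, ("A", "C"): 2}

def nwid(ni):
    """Eq. (3): total weight of links into ni."""
    return sum(w for (src, tgt), w in weights.items() if tgt == ni)

def nwod(ni):
    """Eq. (4): total weight of links out of ni."""
    return sum(w for (src, tgt), w in weights.items() if src == ni)

def nwiod(ni):
    """Eq. (5): NWID(ni) + NWOD(ni)."""
    return nwid(ni) + nwod(ni)

print(nwid("A"), nwod("A"), nwiod("A"))  # 12 12 24
```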
7.2.2 Centrality-based Attributes
In addition to the dyadic and node degree-based measurements, we also use a
network analysis package [JUNG 2006] to compute scores on the basis of three different
centrality/importance measuring schemas: PageRank [Brin and Page 1998], HITS
[Kleinberg 1999], and betweenness centrality [Brandes 2001]. These schemas extend
beyond immediate neighbors to compute the importance or centrality of a given node in
the whole network. The PageRank algorithm computes a popularity score for each Web
page on the basis of the probability that a “random surfer” will visit the page [Brin and
Page 1998]. The HITS algorithm generates a pair of scores, “hub” and “authority,” for
each page. Both HITS and PageRank compute principal eigenvectors of matrices derived
from graph representations of the Web [Kleinberg 1999], so our use of them for a graph
whose nodes are companies differs from their original use. As a node centrality
measurement, betweenness measures the extent to which a node lies between the shortest
paths of other nodes in the graph [Freeman 1979]. The three schemas do not consider link
weights. JUNG [2006] provides the node authority scores for HITS and ignores the link
direction when computing betweenness centrality. The intuition behind these global
centrality attributes is the same as that for the node degree based attributes but the former
are more informative since they consider the entire network for computation instead of
focusing on immediate neighbors.
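The dissertation computes these scores with the JUNG package; purely as an illustration of the underlying idea, here is a compact, unweighted PageRank by power iteration on a hypothetical four-node graph. This is a sketch, not the JUNG implementation:

```python
def pagerank(nodes, links, d=0.85, iters=50):
    """Unweighted PageRank by power iteration with damping factor d.
    Dangling-node mass is spread uniformly. A sketch, not JUNG's code."""
    n = len(nodes)
    out = {u: [v for (s, v) in links if s == u] for u in nodes}
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            if out[u]:
                share = d * pr[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
            else:  # dangling node: distribute its mass uniformly
                for v in nodes:
                    nxt[v] += d * pr[u] / n
        pr = nxt
    return pr

# Hypothetical graph: n1 receives links from all other nodes.
nodes = ["n1", "n2", "n3", "n4"]
links = [("n2", "n1"), ("n3", "n1"), ("n4", "n1"), ("n1", "n2")]
pr = pagerank(nodes, links)
print(max(pr, key=pr.get))  # n1 attracts the most links, so it ranks highest
```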
7.2.3 Structural Equivalence (SE) based Attributes
Lorrain and White [1971] define two nodes as structurally equivalent if they
have the same links to and from other nodes in the network. Because two
nodes are unlikely to be exactly structurally equivalent in our intercompany network, we use a
similarity metric to measure the degree to which two nodes are structurally equivalent.
The intercompany network is represented as an N×N weighted adjacency matrix, where N
is the number of nodes. The SE similarity between two nodes is the normalized dot
product (i.e., cosine similarity) of the two corresponding rows in the matrix, where a
matrix element can be a DWID, DWOD, or DWIOD value, thereby producing
DWID-, DWOD-, or DWIOD-based SE similarity. Intuitively, the DWID-based SE
similarity between company A and company B captures the overlap between companies
whose news stories cite A and companies whose news stories cite B (analogous to co-
citation [Small 1973]); the DWOD-based SE similarity reflects the overlap between
companies that the news stories of A and B cite (analogous to bibliographic coupling [Kessler
1963]). A high overlap between the neighbors of two nodes in our intercompany network
may reflect an overlap in their businesses or markets, which may in turn indicate a
competitor relationship. For example, in the sample graph
of Figure 18, the DWID-based SE similarity between n1 and n3, or YHOO and GOOG, is 0.98,
out of a maximum possible value of 1.
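The SE computation reduces to a cosine between two adjacency-matrix rows. A sketch with illustrative (hypothetical) DWID-based rows, not the actual Figure 18 matrix:

```python
import math

# Hypothetical DWID-based adjacency rows for two companies: entry k is the
# citation weight flowing into the company from company k (illustrative only).
row_a = [0, 478, 35, 0, 12]
row_b = [0, 512, 0, 8, 10]

def cosine(u, v):
    """SE similarity: normalized dot product of two adjacency rows."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

print(cosine(row_a, row_b))  # close to 1: the two rows overlap heavily
```

Rows that cite (or are cited by) the same neighbors with similar weights score near 1; rows with disjoint neighbors score 0.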
To classify whether a pair of companies are competitors, we use the attributes
described above. As noted earlier, some attributes take one value for a pair of
nodes (DWID, DWOD, DWIOD, and the three SE similarities), while others take a
value for each node in the pair (NWID, NWOD, NWIOD, pagerank, hits, and betweenness).
Hence, we use a total of 18 attributes to classify the competitor relationship of a
company pair. Table 16 summarizes these attributes by type and range of network
covered.
Table 16.
Four Types of Network Attributes

Attribute Type | Attributes | Range of Network Covered
Dyadic degree-based | DWID, DWOD, DWIOD | A given node and only one directly connected node
Node degree-based | NWID, NWOD, NWIOD | A given node and all directly connected nodes
Node centrality-based | pagerank, hits, betweenness | Whole network
SE-based | DWID-, DWOD-, DWIOD-based SE similarity | Any two nodes and their directly connected nodes in the whole network
7.4 Raw Data
Now we describe the source and nature of the raw data (news stories) and the
process by which we constructed the intercompany network from them. The first data set
consists of eight months (July 2005–February 2006) of business news for all companies
on Yahoo finance [Yahoo]. Both Chapter 8 (predicting CRR) and Chapter 9 (discovering
competitor relationships) use this data set. In addition, Chapter 8 uses three more
months’ (March–May 2006) news stories from the same data source as a second data set
to validate the major results we obtain from the first. We include all companies across all
nine sectors (Basic Materials, Conglomerates, Consumer Goods, Financial, Healthcare,
Industrial Goods, Services, Technology, and Utilities) in Yahoo finance whose annual
revenue records appeared in the company statistics section in Yahoo finance as of early
April 2006 for the first data set and mid-June 2006 for the second. The revenue
values represent total revenues over the previous four quarters. Thus, in Chapter 8 we predict
revenue relationships using news collected before the revenue records became available.
7.5 Preliminary Data Processing
Yahoo finance organizes business news stories by company and date. The news
stories are not limited to those available from yahoo.com but also include those from
other news sources, such as forbes.com, thestreet.com, and businessweek.com. In other
words, URL links corresponding to news titles that have been organized under a company
in Yahoo finance may point to Web pages located at several domains. Taking advantage
of this organizing mechanism provided by Yahoo, we identify all news pertaining to a
given company within a period of time. For example, for news belonging to Google and
dated February 28, 2006, a page containing all news titles and their URLs linking to the
news content is at http://finance.yahoo.com/q/h?s=GOOG&t=2006-02-28, where GOOG
is the stock ticker of Google Inc. We automatically construct similar URLs to gather links
to news stories for each company in Yahoo finance across the eight- and three-month
periods that constitute our two data sets. We then programmatically fetch news stories
corresponding to the links. Yahoo may organize the same piece of news under different
companies; we treat such a news story as belonging to each of the companies that Yahoo
identifies.
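The crawl-list construction sketched above can be written as a few lines of Python. The URL pattern is the 2006-era one quoted in the text and is assumed for illustration only; it does not reflect the live site today:

```python
from datetime import date, timedelta

def news_url(ticker, day):
    """Per-company, per-day news listing URL, following the pattern quoted
    in the text (2006-era Yahoo finance layout; assumed, not current)."""
    return "http://finance.yahoo.com/q/h?s=%s&t=%s" % (ticker, day.isoformat())

def crawl_urls(ticker, start, end):
    """Yield one listing URL per day in [start, end], inclusive."""
    d = start
    while d <= end:
        yield news_url(ticker, d)
        d += timedelta(days=1)

print(news_url("GOOG", date(2006, 2, 28)))
# http://finance.yahoo.com/q/h?s=GOOG&t=2006-02-28
```

Iterating `crawl_urls` over every ticker and every day of the collection period yields the full set of listing pages from which the news links are fetched.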
7.6 Node and Link Identification
A news story identifies a company by its stock ticker on the NYSE,
NASDAQ, or AMEX. If a piece of news pertaining to a company ni mentions another
company nj, we consider there to be a directed link from ni to nj, denoted <ni, nj>. If
company nj is cited several times in the same piece of news, each citation adds to the
accumulated weight of the directed link. We aggregate citation frequency across all
news stories in a data set. Furthermore, we do not count self-references; that is, we
ignore citations to company ni that appear in a news story belonging to ni. For
example, if a news story pertaining to company n1 mentions companies in the
sequence [n2, n1, n3, n4, n4, n2, n5], we derive the set of links and the weight vector as (<n1,
n2>, <n1, n3>, <n1, n4>, <n1, n5>) and (2, 1, 2, 1), respectively. We filter out news stories
that do not mention any other company. After collecting the annual revenues and news
stories for all companies across all nine sectors in Yahoo finance, we obtained a
total of 6,428 companies and 60,532 news stories for the first data set and 6,246
companies and 36,781 news stories for the second. For the first data set, we note
that the early months (i.e., July–September 2005) include fewer news stories than later
months, because Yahoo does not archive as many historical news stories as recent ones.
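The link-derivation rule, including the worked example above, can be sketched as:

```python
from collections import Counter

def links_and_weights(source, citations):
    """Turn a news story's company-citation sequence into weighted directed
    links from the source company, dropping self-references."""
    counts = Counter(c for c in citations if c != source)
    targets = sorted(counts)
    links = [(source, c) for c in targets]
    weights = [counts[c] for c in targets]
    return links, weights

# The worked example from the text: a story of n1 citing
# the sequence [n2, n1, n3, n4, n4, n2, n5].
links, weights = links_and_weights("n1", ["n2", "n1", "n3", "n4", "n4", "n2", "n5"])
print(links)    # [('n1', 'n2'), ('n1', 'n3'), ('n1', 'n4'), ('n1', 'n5')]
print(weights)  # [2, 1, 2, 1]
```

Summing these per-story counts over every story in a data set yields the accumulated link weights of the intercompany network.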
In Table 17, we provide company and news distribution across the nine sectors in the first
data set.
Table 17.
Company and News Distribution across Sectors

Sector | Number of Companies | Percentage of Companies | Number of News Stories | Percentage of News Stories
Basic materials | 522 | 8.12% | 4,398 | 7.27%
Conglomerates | 30 | 0.47% | 1,004 | 1.66%
Consumer goods | 496 | 7.72% | 4,947 | 8.17%
Financial | 1,402 | 21.81% | 5,512 | 9.11%
Healthcare | 706 | 10.98% | 7,481 | 12.36%
Industrial goods | 423 | 6.58% | 2,677 | 4.42%
Services | 1,334 | 20.75% | 13,144 | 21.71%
Technology | 1,386 | 21.56% | 20,723 | 34.23%
Utilities | 129 | 2.00% | 646 | 1.07%
Total | 6,428 | 100% | 60,532 | 100%
7.7 Attribute Distributions
Several variables arising from social phenomena and networks, such as the
distribution of wealth (the Pareto distribution) and the frequency of word usage in English [Adamic
2002], follow a power law distribution. Recent research shows that several aspects of
digital networks such as the Internet follow power law distributions as well. For example,
the rank and frequency of the outdegrees of Internet domains [Faloutsos et al. 1999] and
the indegrees and outdegrees of Web page links [Barabási et al. 2000, Broder et al. 2000,
Kumar et al. 1999] follow power law distributions. In our directed, weighted
intercompany network, we observe similar power law distributions for various node
degree measurements (NID, NOD, NWID, and NWOD) and for link weight. The results
below refer to the first data set; the second data set yields very similar results, which we do not
report. All logarithms used in the distributions are base 10.
7.7.1 Node Indegree Distribution
Figure 19 shows that the distribution of node indegree (NID) follows a power law
distribution with a Pearson correlation at 0.945 (negative sign ignored). The distribution
95
![Page 96: Web mining is the application of data mining techniques to ...zma/research/dissertation proposal chapters... · Web viewThe main findings include (1) PCAT is better than LIST for](https://reader035.vdocuments.us/reader035/viewer/2022070611/5b190c687f8b9a28258c4ff1/html5/thumbnails/96.jpg)
indicates a few nodes (companies) attract most of the citations, similar to social
phenomena such as the distribution of wealth (Pareto distribution) [Adamic 2002]. We
observe similar power law distributions for other node degree measurements, such as
NOD, NWID, and NWOD. For brevity, we do not show their distribution plots herein.
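The power-law check used here, a least-squares fit on log-log axes with the Pearson correlation as the goodness-of-fit measure, can be sketched in Python; the function name and the synthetic degree data are illustrative and not part of the dissertation's actual pipeline.

```python
import math

def loglog_fit(values):
    """Fit a power law to a frequency distribution by least squares
    on log-log axes, returning (slope, pearson_r).

    `values` is a list of positive observations (e.g., node indegrees);
    we first build the degree -> count histogram, as in Figure 19.
    """
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    keys = [k for k in sorted(counts) if k > 0]
    xs = [math.log10(k) for k in keys]
    ys = [math.log10(counts[k]) for k in keys]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # power-law exponent estimate
    r = sxy / math.sqrt(sxx * syy)         # Pearson correlation on log-log axes
    return slope, r
```

A Pearson correlation close to -1 (0.945 with the sign ignored, in the case of NID) indicates a near-linear log-log relationship, the signature of a power law.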
Figure 19. Node Indegree (NID) Distribution
7.7.2 Link Weight Distribution
Figure 20 shows the link weight distribution in our intercompany network. The
link weight also follows the power law distribution with a Pearson correlation at 0.944.
The power law distribution of link weights indicates there are a few very strong links and
many weak ones.
[Figure omitted: log-log plot of Log(Weight) vs. Log(Count)]
Figure 20. Link Weight Distribution
7.7.3 Revenue Distribution
We choose one million dollars as the unit for recording each company's revenue,
group companies with similar logged revenues, and obtain the histogram in
Figure 21, which shows that the (logged) revenues across the 6,428 companies
approximately follow a normal distribution.
[Figure omitted: histogram of Log(Revenue) vs. Count]
Figure 21. Revenue Distribution
7.7.4 Revenue Node Weighted Indegree Distribution
[Figure omitted: scatter plot of Log(Revenue) vs. Log(NWID)]
Figure 22. Scatter Plot of Revenue and Node NWID
Figure 22 represents a plot of the logged revenues and logged node NWID of all
nodes, with a Pearson correlation of 0.534. Unlike the prior three subsections, we find no
clear pattern for the two variables. In addition, we observe similar distributions for
logged revenue with NID, NOD, and NWOD.
8 PREDICTING COMPANY REVENUE RELATIONS
As explained in Section 7.6, the nodes in our intercompany network are
companies mentioned in business news stories. When determining a link
between two nodes, unlike traditional SNA, which uses explicit social relationships (e.g.,
common directorship [Levine 1972], cooperative business relationships [Walker et al.
1997]), we assume a directed link from company A to company B if a news story
pertaining to company A mentions (cites) company B. Moreover, a link from
company A to company B carries a weight equal to the total number of citations of
company B in the set of news stories belonging to company A. The direction and weight
should provide additional information about the flow and strength of business
relationships in the constructed network. Also, by noting the direction, we can separately
examine the effects of links coming into a node and links going out of it. The
weights in our network reflect the accumulated citations between a pair of companies and
enable us to quantitatively identify a relationship between two companies over time. We
identify a “netdegree” measurement (DWND) that combines the direction and weights to
provide an overall view of the relationship between a pair of companies. Hence, this
approach is more comprehensive than prior related literature on several dimensions,
including a richer network (with weights and direction), a new degree-based metric,
larger data sets, and various analyses related to business relationship prediction.
To illustrate business relationship prediction, in this chapter we focus on
predicting a (positive or negative) CRR between any pair of linked companies and further
estimate whether a company’s revenue is in the top-N (where N varies from 100 to 1000)
companies on the basis of the network structure. Before we present our research
questions in detail, we first describe how we measure CRR.
8.1 Measurements for CRR
As we mentioned in the introduction, a positive or negative revenue relation exists
between a pair of companies. However, when the two companies come from different
sectors, their (absolute) revenue values may not be comparable. Therefore, we derive the
following three metrics to determine a positive or negative CRR by taking the size of a
sector into consideration:
Revenue rank, or the rank of the company's revenue in its sector: revenue rank(ni) ∈ [1, |sector(ni)|], where revenue rank(ni) is company ni's rank order in its sector by revenue and |sector(ni)| is the total number of companies in the sector to which company ni belongs.

Normalized revenue rank(ni) = revenue rank(ni) / |sector(ni)|    (6)

Revenue share(ni) = revenue(ni) / Σ nj∈sector(ni) revenue(nj)    (7)

where revenue(ni) is company ni's revenue value (in dollars).
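The three metrics can be sketched as follows; the helper name and input format are illustrative, and we assume rank 1 denotes the highest revenue within a sector (the text does not fix the rank direction).

```python
def revenue_metrics(companies):
    """Compute revenue rank, normalized revenue rank, and revenue
    share for each company.

    `companies` maps company name -> (sector, revenue); the metric
    names follow Equations 6 and 7, but the helper itself is a sketch.
    """
    # Group companies by sector.
    sectors = {}
    for name, (sector, revenue) in companies.items():
        sectors.setdefault(sector, []).append((name, revenue))

    metrics = {}
    for sector, members in sectors.items():
        total = sum(rev for _, rev in members)
        # Assumption: rank 1 = highest revenue within the sector.
        ranked = sorted(members, key=lambda m: m[1], reverse=True)
        for rank, (name, revenue) in enumerate(ranked, start=1):
            metrics[name] = {
                "revenue_rank": rank,
                "normalized_revenue_rank": rank / len(members),
                "revenue_share": revenue / total,
            }
    return metrics
```

Normalizing by sector size (Equation 6) and by sector revenue (Equation 7) is what makes companies from sectors of different sizes comparable.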
In Chapter x, we report the detailed results measured by revenue ranks and briefly
mention some results generated by the other two metrics; the results measured by those
two metrics are very similar to those measured by revenue ranks.
8.2 Research Questions
We want to explore the broad hypothesis that attributes derived from a network
constructed from news stories can indicate meaningful business relationships (in
particular, CRR and top-N by revenue). Therefore, we identify attributes that capture the
pairwise relationships between companies (dyadic degree-based) or estimate the
individual importance of each company (node degree-based and node centrality-based).
In each case, the attributes are computed purely from weighted and directed links formed
by citations in news stories. In turn, based on the problem described previously and the
identified network-based attributes, we ask the following specific research questions:
1. Is DWND, which captures the net flow of citations between a pair of companies,
an effective indicator of positive CRR?
2. How well can the attributes derived purely from network structure, as shown in
Table 16, predict CRR for a pair of companies in the network?
3. How does CRR prediction performance differ among the three groups of
attributes (Table 16), which cover different extents of the network?
4. How well can individual importance measures of each company, such as node
degree- and centrality-based attributes, predict top-N revenue companies?
5. Which of the network structure-based attributes (when combined linearly) are
significant in distinguishing positive and negative CRR?
8.3 Research Methods
Figure 23. Diagram of Methodology and Analysis Approaches
As we discuss in subsequent sections, our methodology, depicted in Figure 23,
generates a directed and weighted intercompany network from business news and uses
the network to address the research questions. For our analysis with pairs of companies,
we use DWND to identify the source and target and ensure each pair is selected only
once: If (ni, nj) is identified as a pair, (nj, ni) cannot be selected. We sort all the links by
their DWND values in descending order and consider only those links whose DWND
values are greater than or equal to 0. For any link <ni, nj> in the network with a DWND
value of 0, we ignore the opposite link <nj, ni>. For the two data sets, we identify 87,340
and 46,725 company pairs, respectively, and use these to predict CRR; we also predict
the top-N companies by revenue and note that the ranges of netdegree values are 0–49
and 0–101 for the first and second data sets, respectively.
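The pair-selection step can be sketched as below, under the assumption that the dyadic netdegree DWND(i, j) equals the accumulated citation weight from i to j minus the weight in the opposite direction; the function and data are illustrative.

```python
def select_pairs(weights):
    """Select each company pair exactly once, oriented so DWND >= 0.

    `weights` maps (source, target) -> accumulated citation count.
    Assumption: DWND(i, j) = w(i, j) - w(j, i), i.e., the net flow
    of citations between the two companies.
    """
    pairs = {}
    for (i, j), w in weights.items():
        dwnd = w - weights.get((j, i), 0)
        if dwnd > 0:
            pairs[(i, j)] = dwnd
        elif dwnd == 0 and (j, i) not in pairs:
            # DWND of 0: keep one direction and ignore the opposite link.
            pairs[(i, j)] = 0
    # Sort by DWND in descending order, as in the text.
    return sorted(pairs.items(), key=lambda kv: -kv[1])
```

Because DWND(j, i) = -DWND(i, j), keeping only links with non-negative DWND guarantees that each unordered pair enters the analysis once.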
8.3.1 Classification Methods
Using Weka [Witten and Frank 2005] as a data analysis tool, we employ two
classification methods to evaluate the CRR prediction performance for company pairs
and top-N by revenue. For our classification methods, we select logistic regression and
C4.5 [Quinlan 1993] decision tree (i.e., J48 classifier in Weka). Logistic regression is
frequently used in business research for problems with a binary class label (as for our
CRR and top-N prediction problems); decision tree is one of the commonly used
classifiers in data mining, because it is highly accurate for binary classification problems,
it does not impose assumptions about the distribution of data, and its results are well
suited for human interpretation [Padmanabhan et al. 2006]. We use two different methods
so that we may compare their performances for our applications. We also employ an
artificial neural network (ANN) as a third classification method and find that it offers
results similar to those of the decision tree; therefore, we do not include the results
obtained using ANN. When using each of the classification methods, we employ 10-fold
cross-validation for performance measurements. In 10-fold cross-validation, the data are
split into ten disjoint, equal-size subsets; nine subsets are used for training, and the
remaining one is held out for validation. This process is repeated ten times to obtain a
robust performance measurement of the predictive models, and our performance results
are the averages over the ten validations [Michael 1997]. In line with standard metrics used
in data mining and information retrieval, we report the average precision, recall, and
accuracy to evaluate the performance of the predictive models:
Precision = TP / (TP + FP)    (8)

Recall = TP / (TP + FN)    (9)

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (10)

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
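Assuming the standard confusion-matrix definitions of these performance measures, the metric computation and the 10-fold split can be sketched as follows (function names illustrative):

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, and accuracy from confusion-matrix counts,
    using the standard definitions assumed for Equations 8-10."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy

def kfold_indices(n, k=10):
    """Split indices 0..n-1 into k disjoint, near-equal folds; each
    fold serves once as the holdout set in k-fold cross-validation."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds
```

In each of the ten rounds, one fold is the holdout; the reported precision, recall, and accuracy are averages over the ten holdout evaluations.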
8.3.2 Discriminant Analysis with Logistic Regression
The main purpose of this paper is to explore the power of structural attributes of
the intercompany network, obtained from news, in predicting CRR. However, we would
also like to investigate the significance (if any) of individual attributes (independent
variables or IVs) in discriminating between positive and negative CRR. Therefore we
perform a discriminant analysis using logistic regression. The linear manner in which
attributes are combined in logistic regression allows for a straightforward understanding
of their individual significance. In particular, from the 87,340 pairs we randomly select
1000 pairs such that each company in the chosen pairs is distinct. As a result, there are
2000 unique companies in the 1000 pairs, and hence these pairs can be considered
independent. With 12 IVs (DWND and DWOD, plus NWID, NWOD, pagerank, hits, and
betweenness scores for both source and target) and CRR as the
dependent variable (DV), we employ binary logistic regression in SPSS (version 12.0) to
find the discriminant variables. In particular, we start with a base model that uses the
mean of the DV and does not include any IVs. Then, from the list of candidate IVs that
show statistically significant differences between the two DV groups, we add one
IV at each step by choosing the IV with the largest score statistic (method
"Forward: LR" in SPSS) until the stepwise estimation procedure stops (e.g., when no
remaining IV is significant) [Hair et al. 2006].
8.4 Results and Analyses
In this section, we first explore how DWND is associated with CRR by
determining whether the net flow of news citations between a pair of companies indicates
the relative size of their revenues. We analyze this attribute because it seems to capture the
overall flow of citations (and importance) from one company to another. Therefore, we
explore whether this citation-based importance reflects the revenue relation between a
pair of companies. We then examine how well the various attributes derived from the network
structure predict CRRs for company pairs. To tease out the effects of the three different
sets of attributes—dyad degree, node degree, and node centrality—we repeat the
prediction experiment with each set of attributes separately. We also predict whether a
given company falls into the set of top-N companies by revenue, for which the
explanatory variables are based on companies’ node-level (node degree and node
centrality) attributes. Finally, we report what IVs are significant in distinguishing CRR.
8.4.1 Positive CRR and Top Links by DWND
We sort all of the links in the network according to their DWND attribute values
(in descending order). Using a set of the top few links from the sorted list, we compute
the percentage that correctly reflects positive CRR. We then successively increase the
number of top links (T); in Table 18, we provide the number and percentage of the top
links (where T varies from 20 to a few hundred) that follow the positive CRR. We
measure the significance of the percentages in Table 18 through a binomial test. Finally,
we note that if the DWND were independent of CRR, the percentages in Table 18 would
be close to 50%.
Table 18.
Positive CRR in Top-N links
Top Links (T) | DWND Range | Number of Links Following Positive CRR | Percentage of Links Following Positive CRR
20 | [24, 49] | 16 | 80.0% *
37 | [19, 49] | 31 | 83.8% ***
64 | [16, 49] | 50 | 78.1% ***
79 | [14, 49] | 58 | 73.4% ***
114 | [12, 49] | 80 | 70.2% ***
135 | [11, 49] | 92 | 68.2% ***
175 | [10, 49] | 115 | 65.7% ***
217 | [9, 49] | 134 | 61.8% ***
289 | [8, 49] | 172 | 59.5% ***
* p < 0.05, *** p < 0.001 (two-tailed).
When the DWND values are relatively high, DWND seems to be a good indicator
of positive revenue relations. We observe a similar result for top links in the second data
set.
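The binomial test on these percentages can be sketched in Python; under the null hypothesis (probability 0.5 that a link follows positive CRR, i.e., DWND carries no information), the two-tailed p-value is twice the smaller tail probability, capped at 1.

```python
from math import comb

def binomial_two_tailed(k, n, p=0.5):
    """Two-tailed binomial test p-value for k successes in n trials
    under null success probability p. For the symmetric p = 0.5 case
    this equals twice the smaller one-sided tail, capped at 1."""
    def pmf(i):
        return comb(n, i) * p**i * (1 - p)**(n - i)
    tail_ge = sum(pmf(i) for i in range(k, n + 1))   # P(X >= k)
    tail_le = sum(pmf(i) for i in range(0, k + 1))   # P(X <= k)
    return min(1.0, 2 * min(tail_ge, tail_le))
```

For instance, 16 of the top 20 links following positive CRR yields a two-tailed p-value between 0.001 and 0.05, consistent with the single star in the first row of Table 18.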
8.4.2 Positive CRR by DWND
As the DWND value decreases, so does the signal indicating the positive CRR
between a pair of companies. To examine this observation further, we segment the links
in the intercompany network into baskets, such that links in each basket have the same
DWND value, and combine links with different DWND values into one basket only if the
basket would otherwise contain fewer than 20 links. In Table 19, we provide the percentages of links
following positive CRR in each basket.
Table 19.
Positive CRR for Links with the Same or Similar DWND
Basket No. | DWND | Percentage of Links Following Positive CRR
1 | 1 | 46.5%
2 | 2 | 48.8%
3 | 3 | 46.8%
4 | 4 | 51.9%
5 | 5 | 51.8%
6 | 6 | 57.1%
7 | 7 | 56.3%
8 | 8 | 52.8%
9 | 9 | 45.2%
10 | 10 | 57.5%
11 | [11, 12] | 55.6%
12 | [13, 17] | 62.5%
13 | [18, 23] | 86.7% ***
14 | [24, 49] | 80.0% *
* p < 0.05, *** p < 0.001 (two-tailed, binomial test).
When DWND values are small (e.g., less than 10), links in the same baskets do
not display a clear trend toward a positive CRR. In other words, for company pairs in
those baskets, pointing to a company with the same or higher revenue rank is about as
likely as pointing to one with lower revenue rank. However, as the DWND values
increase, positive CRR becomes more salient.
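The basket segmentation described above can be sketched as follows; whether leftover links at the top of the range are merged into the preceding basket, as done here, is an assumption, and the data are illustrative.

```python
def make_baskets(dwnd_values, min_size=20):
    """Group links into baskets by DWND, merging adjacent DWND values
    (ascending) whenever a basket would hold fewer than `min_size`
    links; a sketch of the segmentation in Section 8.4.2."""
    counts = {}
    for v in dwnd_values:
        counts[v] = counts.get(v, 0) + 1
    baskets = []              # list of ([dwnd values], link count)
    current, size = [], 0
    for v in sorted(counts):
        current.append(v)
        size += counts[v]
        if size >= min_size:
            baskets.append((current, size))
            current, size = [], 0
    if current:               # assumption: leftovers join the last basket
        if baskets:
            vals, n = baskets.pop()
            baskets.append((vals + current, n + size))
        else:
            baskets.append((current, size))
    return baskets
```

Each basket then supplies one row of Table 19: the fraction of its links that follow positive CRR.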
In summary, DWND can be an indicator of positive CRR for top links, i.e., links
with large DWND values. Overall, 48% of the 87,340 pairs whose DWND values are non-
negative follow positive CRR, suggesting that the indicative power of DWND disappears
when considering all the pairs.
8.4.3 Predicting CRR
We now attempt to predict positive or negative CRR between a pair of companies
using various attributes derived from the intercompany network. The predicted class label
therefore is a binary number whose values correspond to positive (1) and negative (0)
CRR. We first predict CRR using the attributes identified in Section 3, then split these
attributes into three subsets (Table 16) and observe their predictive power.
8.4.3.1 All Three Attribute Groups
To predict the CRR for each pair of companies, we use a total of 12 attributes (2
dyadic degree-based, 4 node degree-based, and 6 node centrality-based). For the node
degree-based and node centrality-based measures, we employ a pair of attributes for the
source and target companies of each link. Of the dyadic degree-based attributes, we do
not use DWID because it can be derived directly from DWND and DWOD (see Section
3.3.2). Table 20-1 shows the results of the two classification methods for the first data set
(87,340 company pairs).
Table 20-1.
Classification Results of CRR with 12 Attributes (First Data Set)
Classification Method | Class Label (CRR) | Number (Percentage) of Pairs | Precision | Recall | Accuracy
Logistic regression | 0 | 45398 (52.0%) | 72.9% | 76.2% | 72.9%
Logistic regression | 1 | 41942 (48.0%) | 72.9% | 69.4% |
Decision tree | 0 | 45398 (52.0%) | 78.4% | 79.6% | 78.0%
Decision tree | 1 | 41942 (48.0%) | 76.3% | 76.9% |
Notes: Attributes are DWND, DWOD, source NWID, source NWOD, target NWID, target NWOD, source pagerank, source hits, source betweenness, target pagerank, target hits, target betweenness.
From Table 20-1 we observe that using attributes derived from a network
constructed from news stories, without resorting to any information about a company’s
sector or revenue, we achieve reasonable precision, recall, and accuracy of approximately
70–80% in predicting the CRR between companies. Our data set consists of an almost
equal number of positive and negative CRR instances (see the third column in Table 20-
1), so the prior probability for a link being positive or negative CRR is approximately
50%. In addition, we use revenue value, normalized revenue rank, and revenue share and
achieve very similar results to those in Table 20-1 in terms of precision, recall, and
accuracy. Finally, we divide the 87,340 pairs into two subsets: (1) all pairs in which both
companies in the pair belong to the same sector and (2) the remaining pairs (different
sectors). We examine the prediction performance for each subset separately using
revenue rank, normalized revenue rank, and revenue share to determine CRR, and again,
the precision, recall, and accuracy fall around the 70–80% range, similar to those in Table
20-1.
Using the ten accuracy values generated through the 10-fold cross-validation, we
find that the average accuracies of the logistic regression and decision tree differ
significantly (two-tailed t-test, p < 0.001), with decision tree proving to be a superior
method.
To check the robustness of our results, we use the same 12 attributes to predict
CRR with the second data set of 46,725 company pairs. Table 20-2 shows
the performance of the two classification methods on this data set; the precision, recall,
and accuracy values are very close to those in Table 20-1.
Table 20-2.
Classification Results of CRR with 12 Attributes (Second Data Set)
Classification Method | Class Label (CRR) | Number (Percentage) of Pairs | Precision | Recall | Accuracy
Logistic regression | 0 | 23861 (51.1%) | 73.6% | 77.2% | 74.2%
Logistic regression | 1 | 22864 (48.9%) | 74.9% | 71.0% |
Decision tree | 0 | 23861 (51.1%) | 77.8% | 78.9% | 77.7%
Decision tree | 1 | 22864 (48.9%) | 77.6% | 76.4% |
8.4.3.2 Each Separate Attribute Group
We are also interested in comparing the performances with different groups of
attributes separately; in Tables 21, 22, and 23, we provide the associated results for the
first data set.
Table 21.
Classification Results of CRR Using DWND and DWOD
Classification Method | Revenue Relation | Precision | Recall | Accuracy
Logistic regression | 0 | 52.1% | 98.7% | 52.1%
Logistic regression | 1 | 54.3% | 1.6% |
Decision tree | 0 | 52.2% | 91.0% | 52.0%
Decision tree | 1 | 50.1% | 9.8% |
Table 22.
Classification Results of CRR Using Source NWID, Source NWOD, Target NWID, and
Target NWOD
Classification Method | Revenue Relation | Precision | Recall | Accuracy
Logistic regression | 0 | 69.3% | 82.8% | 72.0%
Logistic regression | 1 | 76.4% | 60.3% |
Decision tree | 0 | 78.7% | 77.8% | 77.5%
Decision tree | 1 | 76.3% | 77.2% |
Table 23.
Classification Results of CRR Using Source Pagerank, Source Hits, Source Betweenness,
Target Pagerank, Target Hits, and Target Betweenness
Classification Method | Revenue Relation | Precision | Recall | Accuracy
Logistic regression | 0 | 73.2% | 75.2% | 72.8%
Logistic regression | 1 | 72.4% | 70.1% |
Decision tree | 0 | 77.6% | 78.3% | 77.0%
Decision tree | 1 | 76.3% | 75.5% |
The two dyadic degree-based attributes, DWND and DWOD, fail to predict
revenue relations well, whereas the four node degree-based and six node centrality-based
attributes produce results nearly as good as those we obtain from using all 12 attributes
together. When we apply the three groups of attributes separately to the second data set,
we obtain very similar results, except that with the decision tree method, the two dyadic
attributes yield recalls of approximately 51% for both positives and negatives.
The poor performance of dyadic degree-based attributes may be due to their
reliance on the local (pairwise) flow of citations between the two companies. This
localized property of the dyadic attributes may fail to capture the relative importance of
the two companies, which is formed by all the citations they receive from or provide to
many other nodes in the network. The more global node degree- and node centrality-
based measures therefore better predict CRR.
8.4.4 Predicting Top-N Companies by Revenue
We now consider the related problem of predicting whether a company will fall
within the set of top-N companies by revenue (in dollars). Because we are no longer
interested in the direct relation between a pair of companies, we do not use the dyadic
attributes in these predictive methods. We employ five node-level attributes for each
company in the network (listed in the caption of Figure 24). The class label to be
predicted takes a value of 1 if the company is a top-N company by revenue and 0
otherwise. Again, we base all performance measurements on 10-fold cross-validation.
Figures 24 and 25 show the performances of the two classification methods as N varies
from 100 to 1000 with a step size of 100.
[Figure omitted: precision and recall (classes 0 and 1) vs. top-N, N = 100 to 1000]
Notes: NWID, NWOD, pagerank, hits, betweenness.
Figure 24. Precision and Recall for Logistic Regression in Predicting Top-N Companies
[Figure omitted: precision and recall (classes 0 and 1) vs. top-N, N = 100 to 1000]
Figure 25. Precision and Recall for Decision Tree in Predicting Top-N Companies
The two classification methods produce similar results. Performance for
predicting the negatives (i.e., a company is not in the set of top-N companies) is high,
with precision and recall (for both methods) in the range of 89–99%. However, precision
for predicting the positives is in the range of 57–75%, and recall is substantially lower (24–
36%). We observe similar results with the second data set; for the negatives, both
precision and recall are between 88% and 99%, whereas for the positives, precision is
65–76% and recall is 22–35%. Although these positive prediction performances may
seem rather low, they should be judged with the knowledge that the top-N companies,
where N varies from 100 to 1000, constitute only 1.6–16% of the total number of
companies in the two data sets. That is, the problem of correctly identifying a company in
the set of top-N companies by revenue is particularly hard, whereas identifying a
company that is not in the top-N is easier because most companies fall into this category.
Given the high prior probability of negatives, our results for this problem are
encouraging.
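The prevalence argument above can be made concrete: a classifier that assigns positive labels at random has expected precision equal to the positive-class prevalence, e.g., N out of the 6,428 companies in the first data set (the helper name is illustrative).

```python
def random_baseline_precision(n_top, n_total):
    """Expected precision of a classifier that labels companies as
    top-N at random: it equals the positive-class prevalence."""
    return n_top / n_total

# First data set: 6,428 companies.
low = random_baseline_precision(100, 6428)    # about 1.6%
high = random_baseline_precision(1000, 6428)  # about 15.6%
```

Against this baseline of at most roughly 16%, the observed positive-class precision of 57–75% is a substantial improvement.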
8.4.5 Discriminant Variate
At the first step of the discriminant analysis, before adding the first IV into the
model, we find that ten IVs (node degree-based and centrality-based) are significant
(p ≤ 0.05), whereas the (two) dyadic degree-based IVs are not.
The result for the dyadic degree-based IVs is consistent with what we see in Table 21: those
IVs produce very poor prediction results. The first IV included in the discriminant model
is the source_hits score, as it has the largest score statistic. After including source_hits
and repeating the evaluation procedure, the second IV to be added is target_hits. At this
step, all eight IVs that were significant before the first IV was included become
insignificant because of high multicollinearity among the IVs (i.e., hits, pagerank,
betweenness, NWID, and NWOD). This high multicollinearity also explains the similar
performance of the different sets of IVs in Tables 22 and 23. The coefficient β for
source_hits is negative (-1863.7) and for target_hits is positive (1627.5), which indicates
that an increase in source_hits decreases the likelihood of a positive CRR, whereas an
increase in target_hits increases that likelihood. In other words, the
global (hub-like) centrality of the target company is indicative of its higher revenue, and
the reverse is true for the source company. Hence, the global centrality-based hits metrics for the
source and target companies constitute a discriminant variate that can significantly
discriminate between positive and negative CRR. The prediction results obtained using
the discriminant model (with a constant and the two IVs, source_hits and target_hits) are
as follows:
Table 24.
Prediction Results for Discriminant Model with Two IVs
Discriminant Model | Revenue Relation | Precision | Recall | Accuracy
Logistic regression | 0 | 64.6% | 47.3% | 65.5%
Logistic regression | 1 | 56.0% | 70.5% |
Compared with Tables 20-1, 22, and 23, Table 24 shows inferior results, indicating that
adding more IVs can improve prediction performance (the main focus of this paper).
8.5 Discussions
We propose a news-driven, SNA-based business relationship discovery approach
to explore the predictive value of business news in discerning relationships between
companies. Our approach uses citations in news stories to understand the direction and
strength of the relative importance between a pair of companies. In our intercompany
network, nodes are companies, and links are directed and weighted on the basis of the
direction and frequency of citations in news stories. We identify and quantify various
attributes of the network using standard network analysis metrics and suggest modified or
new metrics as needed (e.g., DWND). We then use these attributes to predict the (future)
relative revenue relation between a pair of companies as an example of business
relationships the approach might predict. We also investigate whether we can predict if a
given company falls into the set of top-N companies by revenue. We process and employ
two sets of multi-month data from the online business news available at Yahoo! Finance.
Both data sets reaffirm the robustness of our findings. Applying discriminant analysis, we
identify a set of significant IVs.
Attributes derived purely from the constructed network predict CRR well, which
validates our broad hypothesis that news stories and the citations contained within them
provide cues about real-world relationships. Moreover, our approach is intrinsically
language independent and can be extended to news in various languages.
Similar to many other networks constructed from the Internet, we find that
various attributes of our network, such as NID, NOD, NWID, NWOD, and link weight,
follow the power law distribution. By exploring the relation between DWND and positive
CRR, we find that company pairs with large DWND tend to be associated with positive
CRR. Hence, as expected, the DWND metric (at least for large values) captures the
overall flow of revenue (importance) between a pair of companies.
We study the CRR prediction problem by using all 12 attributes at the same time, as well as subgroups individually. The subgroups reflect the different nature of the attributes, which vary in how much of the network their computation covers. More global measures, such as node degree- and node centrality-based attributes, are better predictors of CRR than the dyadic degree-based attributes, which concentrate only on pairwise relationships and ignore the rest of the network.
With regard to predicting whether a company's revenue falls among the top-N, the precision for predicting the positives (top-N) is much higher than the recall. These results may seem humble until we consider them in the context of the prior distributions in the data sets. Considering that only a small percentage of companies fall into the set of top-N companies by revenue, a precision value in the range of 57-75%, as we achieve, is encouraging. If our predictive models randomly assigned companies to the top-N, the precision for predicting positives would not exceed 16%. With discriminant analysis on the 12 IVs, we identify two global centrality-based IVs (HITS scores for source and target) as significant in distinguishing the CRR.
Our approach thus can not only serve as a data filtering step for analysts but also be useful for tracing and monitoring the dynamics of revenue relations for many companies over time. Having validated the value of news in discerning meaningful, real-world relationships, we continue to explore richer business relationships, such as competitors, with the same network construction approach. Our preliminary results on this new relationship prediction problem (i.e., predicting competitors) have been very encouraging, and we plan to further validate our approach with a variety of business relationships, news in different languages (and from different countries), various types of companies (e.g., private versus public), and over time. Further research might also attempt to derive and evaluate additional graph attributes that synthesize the global and dyadic measures and represent more effective predictors of business relationships between a pair of companies.
9 DISCOVERING COMPETITOR RELATIONSHIPS
9.1 Approach Outline and Research Questions
Figure 26 outlines the five main steps of our approach to competitor discovery. The first two steps have been explained in Section 6.1. In step 3, as a preliminary investigation, we first examine the citation-based intercompany network for both its competitor coverage (coverage of known competitors) and competitor density (the likelihood of finding competitors among the linked company pairs in the network). We benchmark this preliminary investigation against an exhaustive as well as a random search to provide a comparative analysis of a citation-based intercompany network in terms of search cost. We find that competitor relationship discovery is especially challenging in portions of our data set where the number of non-competitor pairs overwhelms the number of competitor pairs. We use a combination of data from Hoover's and Mergent as our gold standards for evaluation purposes.
Figure 26. Process View of the Approach
This study focuses on the following two research questions:
1. How well can we discover competitor relationships between companies using four types of attributes derived from the intercompany network? In particular, using classification techniques suited for imbalanced datasets, we report the classification performance on the imbalanced data set.
2. To what extent can a gold standard cover the set of all competitors, and to what extent does our approach extend the knowledge (i.e., competitors) covered by a gold standard? We use Hoover's and Mergent as gold standards for identifying competitors. However, we are keenly aware that these data sets are neither complete nor consistent, as illustrated by the examples earlier. Hence, we try to estimate their coverage of all competitor pairs and also propose metrics to estimate how much our approach extends the knowledge available in each of the two gold standard data sources.
9.2 Datasets
In the following two subsections, we introduce the two datasets used to evaluate competitor classification performance. The first dataset represents the whole set of pairs in the network, and the second is created to represent the imbalanced part of the whole dataset.
9.2.1 Dataset I (Instance Selection and Labeling of Dataset I with 840 Company Pairs)
We first use DWND (net flow of citations between a pair of companies) to identify all distinct (linked) company pairs in the network by including only pairs with non-negative DWND values, and for any link <ni, nj> with a DWND value of 0, we ignore the opposite link <nj, ni>. For example, in this way we identify a total of eight links in Figure 18. For the whole intercompany network we identify a total of 87,340 company pairs. Next, we sort the pairs by their DWIOD values in descending order. We find that DWIOD ranges between 1 and 990. We choose DWIOD for ordering the company pairs because DWIOD captures the total volume of citations between two companies in the news. We expect that the larger the number of citations in news stories between two companies, the higher the likelihood of a business relationship between them. We find that, in terms of DWIOD values, the data set is skewed in that most of the company pairs have small DWIOD values. In order to examine the competitor relationship, we drill down and group company pairs with the same or similar DWIOD values. In particular, we divide all company pairs into baskets based on their DWIOD values, such that links with different DWIOD values do not appear in the same basket unless the basket contains fewer than 200 pairs. This procedure results in 21 baskets associated with different DWIOD values. We randomly choose 40 pairs from each basket to create 21 sample baskets. The 840 pairs (40 x 21) constitute our dataset I, which we use to examine the classification performance for individual baskets in Section 9.4.
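The pair-selection and basketing procedure above can be sketched in Python. This is a minimal sketch, not the dissertation's code: `make_baskets` and its tie-handling conventions are our own simplifications.

```python
import random
from collections import defaultdict

def make_baskets(citations, min_basket_size=200, sample_size=40, seed=42):
    """Select company pairs by non-negative DWND, order them by DWIOD,
    group them into baskets, and sample pairs from each basket.
    `citations` maps a directed pair (i, j) to its citation count."""
    pairs = {}
    for (i, j), w in citations.items():
        if (j, i) in pairs or (i, j) in pairs:
            continue  # the unordered pair was already handled
        w_rev = citations.get((j, i), 0)
        if w - w_rev >= 0:              # non-negative DWND: keep <i, j>
            pairs[(i, j)] = w + w_rev   # DWIOD: total citation volume
        else:                           # otherwise keep the opposite link
            pairs[(j, i)] = w + w_rev
    # Group pairs by DWIOD and fill baskets in descending DWIOD order;
    # a basket closes once it holds at least `min_basket_size` pairs.
    by_dwiod = defaultdict(list)
    for p, d in pairs.items():
        by_dwiod[d].append(p)
    baskets, current = [], []
    for d in sorted(by_dwiod, reverse=True):
        current.extend(by_dwiod[d])
        if len(current) >= min_basket_size:
            baskets.append(current)
            current = []
    if current:
        baskets.append(current)
    # Randomly sample up to `sample_size` pairs from each basket.
    rng = random.Random(seed)
    samples = [rng.sample(b, min(sample_size, len(b))) for b in baskets]
    return baskets, samples
```

Note that a run of pairs sharing one DWIOD value always stays in the same basket, matching the rule that pairs with different DWIOD values mix only when a basket is still below 200 pairs.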
We manually identify whether each of the 840 company pairs in the 21 sample baskets are competitors, using Hoover's and Mergent respectively. A class label of 1 is assigned to a pair (positive instance) if we find a competitor relationship between the two companies by either Hoover's or Mergent; otherwise, a class label of 0 is assigned (negative instance). Table 25 shows the DWIOD range and size of each basket and the number and percentage of competitor pairs in the 21 sample baskets. The table illustrates a general trend: a higher DWIOD value tends to be associated with a higher percentage of competitor pairs in a sample basket. This matches our previously mentioned intuition that as the overall volume of citations between a pair of companies increases, we may expect the companies to have a business relationship (such as being competitors) with greater likelihood.
Table 25.
Distribution of Competitor Pairs in 21 Sample Baskets
(Counts are positives among the 40 sampled pairs per basket; percentages are relative to 40.)

Basket | DWIOD range | Basket size | Positives by Hoover's | Positives by Mergent | Positives by union | Positives by intersection
1 | [69, 990] | 200 | 26 (65.0%) | 11 (27.5%) | 26 (65.0%) | 11 (27.5%)
2 | [44, 68] | 209 | 19 (47.5%) | 9 (22.5%) | 19 (47.5%) | 9 (22.5%)
3 | [32, 43] | 224 | 17 (42.5%) | 6 (15.0%) | 17 (42.5%) | 6 (15.0%)
4 | [26, 31] | 239 | 14 (35.0%) | 4 (10.0%) | 15 (37.5%) | 3 (7.5%)
5 | [22, 25] | 212 | 14 (35.0%) | 8 (20.0%) | 15 (37.5%) | 7 (17.5%)
6 | [19, 21] | 235 | 17 (42.0%) | 6 (15.0%) | 18 (45.0%) | 5 (12.5%)
7 | [17, 18] | 224 | 8 (20.0%) | 5 (12.5%) | 11 (27.5%) | 2 (5.0%)
8 | [15, 16] | 281 | 13 (32.5%) | 6 (15.0%) | 13 (32.5%) | 6 (15.0%)
9 | [13, 14] | 389 | 10 (25.0%) | 4 (10.0%) | 10 (25.0%) | 4 (10.0%)
10 | 12 | 263 | 16 (40.0%) | 3 (7.5%) | 17 (42.5%) | 2 (5.0%)
11 | 11 | 330 | 8 (20.0%) | 4 (10.0%) | 9 (22.5%) | 3 (7.5%)
12 | 10 | 410 | 8 (20.0%) | 2 (5.0%) | 8 (20.0%) | 2 (5.0%)
13 | 9 | 470 | 8 (20.0%) | 3 (7.5%) | 8 (20.0%) | 3 (7.5%)
14 | 8 | 622 | 13 (32.5%) | 6 (15.0%) | 13 (32.5%) | 6 (15.0%)
15 | 7 | 769 | 10 (25.0%) | 3 (7.5%) | 11 (27.5%) | 2 (5.0%)
16 | 6 | 1,390 | 5 (12.5%) | 3 (7.5%) | 6 (15.0%) | 2 (5.0%)
17 | 5 | 1,543 | 5 (12.5%) | 2 (5.0%) | 5 (12.5%) | 2 (5.0%)
18 | 4 | 4,142 | 4 (10.0%) | 0 (0.0%) | 4 (10.0%) | 0 (0.0%)
19 | 3 | 4,972 | 2 (5.0%) | 2 (5.0%) | 4 (10.0%) | 0 (0.0%)
20 | 2 | 29,603 | 1 (2.5%) | 0 (0.0%) | 1 (2.5%) | 0 (0.0%)
21 | 1 | 40,613 | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%)
Total | | 87,340 | 218 | 87 | 230 | 75
9.2.2 Datasets II and III (Instance Selection and Labeling of Dataset II with 2000 Company Pairs)
In an imbalanced dataset, a majority of instances are labeled as one class, while the minority is labeled as the other class, which is typically the more important class [Kotsiantis, et al. 2006].
Several sample baskets in Table 25 show low percentages of positives and can be considered imbalanced datasets. As prior research [e.g., Weiss and Provost 2003] and our results in Section 9.4 empirically show, typical classification methods fail to detect the minority class in an imbalanced dataset, generating very low precision and recall (e.g., close to 0%) on the positives, which are the competitor pairs and therefore of interest in this study. The main reason for the poor performance on positives is that classifiers by default maximize accuracy, which in turn gives more weight to the majority class than to the minority one [Kotsiantis, et al. 2006]. For example, for a dataset with 1% positives, simply assigning every instance a negative label and not detecting any positives achieves an accuracy of 99%. To handle the imbalanced dataset problem, we first create a larger dataset, dataset II, by proportionally (according to their basket sizes) sampling a total of 2000 pairs from the four imbalanced baskets (#18, 19, 20, and 21), because their corresponding sample baskets have the lowest ratios of positives (≤10%).
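The proportional sampling step can be sketched as follows. This is our own helper, and the largest-remainder rounding is an assumption; the dissertation's exact rounding may differ by a pair per basket.

```python
def proportional_allocation(basket_sizes, total_sample):
    """Allocate `total_sample` draws across baskets in proportion to
    their sizes, using largest-remainder rounding so the allocations
    sum exactly to `total_sample`."""
    grand_total = sum(basket_sizes)
    raw = [total_sample * s / grand_total for s in basket_sizes]
    alloc = [int(r) for r in raw]            # floor of each raw share
    leftover = total_sample - sum(alloc)
    # Hand the remaining draws to the baskets with the largest
    # fractional parts of their raw shares.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i],
                   reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc
```

For the four imbalanced baskets (sizes 4,142; 4,972; 29,603; 40,613) and a total of 2000 pairs, this rounding convention yields allocations within one pair of the sample-basket sizes reported in Table 26.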
We manually label the 2000 pairs using Mergent and Hoover's respectively. The numbers and percentages of competitors by the different gold standards are displayed in Table 26.
Table 26.
Number (percentage) of Positive Pairs in Dataset II

DWIOD | Sample basket size | Positives by Hoover's | Positives by Mergent | Positives by union | Positives by intersection
1 | 1024 | 22 (2.1%) | 15 (1.5%) | 29 (2.8%) | 8 (0.8%)
2 | 747 | 30 (4.0%) | 13 (1.7%) | 39 (5.2%) | 4 (0.5%)
3 | 125 | 12 (9.6%) | 3 (2.4%) | 14 (11.2%) | 1 (0.8%)
4 | 104 | 15 (14.4%) | 7 (6.7%) | 18 (17.3%) | 4 (3.8%)
Total | 2000 | 79 (4.0%) | 38 (1.9%) | 100 (5.0%) | 17 (0.9%)
In future analysis, besides datasets I and II, we also use the 17 baskets (#1-#17) in dataset I together with all pairs in dataset II to produce estimated overall performance results. For convenience, hereafter we call this combination of the two datasets dataset III; it contains 18 baskets, with dataset II serving as the 18th sample basket.
9.3 Examining Competitor Coverage and Competitor Density of the Intercompany Network
In this section we examine two issues: how complete the intercompany network is in its coverage of competitor pairs (i.e., competitor coverage), measured by its links, and how likely the links of the intercompany network are to be competitor pairs (i.e., competitor density). Hence, we are interested in understanding the extent to which the "competitor semantics" is embedded in the links of the constructed network. The higher the competitor coverage and competitor density of the intercompany network, the lower the cost of searching for (and classifying) competitors using the network. We benchmark the competitor coverage of the intercompany network against that of an exhaustive network, a clique in which all nodes are linked to each other. We also compare the competitor density of the intercompany network with that of a random network that has the same numbers of nodes and links as the intercompany network. Table 27 defines the notation for examining competitor coverage and competitor density.
Table 27.
Notation for Competitor Coverage and Competitor Density

Notation | Interpretation
N | Number of unique companies in a sample basket with 40 company pairs
CL | Citation-based links among the N companies in the intercompany network
EL | Exhaustive links among the N companies
CP(CL) | Number of competitor pairs (CP) present in CL
CP(EL) | Number of competitor pairs present in EL
Competitor coverage ratio | CP(CL)/CP(EL), the proportion of all known competitor pairs that are present as links in a citation-based intercompany network
CP40(CL) | Number of competitor pairs present in the 40 links from a sample basket
RL | Randomly generated company links from the N companies
CP40(RL) | Number of competitor pairs present in 40 randomly generated links
CD40(CL) | CP40(CL)/40, competitor density for a small citation-based network that consists of the 40 links from a sample basket
CD40(RL) | CP40(RL)/40, competitor density for a random network that consists of 40 random links
CD(EL) | CP(EL)/(N*(N-1)), competitor density for an exhaustive network (clique) that consists of the exhaustive links
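Under the notation of Table 27, the coverage and density computations can be sketched as follows. The helper names are ours, and the sketch treats all links as unordered pairs, so its exhaustive-network density divides by N(N-1)/2 rather than the table's ordered-pair count N(N-1).

```python
from itertools import combinations

def coverage_and_density(companies, citation_links, competitor_pairs):
    """Competitor coverage ratio and densities over unordered pairs.
    `citation_links` plays the role of CL and `competitor_pairs` is the
    gold-standard set of known competitor pairs."""
    # EL: exhaustive (unordered) links among the N companies.
    el = {frozenset(p) for p in combinations(companies, 2)}
    cl = {frozenset(p) for p in citation_links}
    gold = {frozenset(p) for p in competitor_pairs}
    cp_el = len(el & gold)                            # CP(EL)
    cp_cl = len(cl & gold)                            # CP(CL)
    coverage_ratio = cp_cl / cp_el if cp_el else 0.0  # CP(CL)/CP(EL)
    density_cl = cp_cl / len(cl) if cl else 0.0       # density of CL
    density_el = cp_el / len(el) if el else 0.0       # density of EL
    return coverage_ratio, density_cl, density_el
```

Frozensets make the pair (A, B) identical to (B, A), which matches treating competitor relationships as undirected.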
9.3.1 Examining the Competitor Coverage
From the 40 company pairs in each sample basket in dataset I, we identify N and EL. With the whole intercompany network we further find CL. CP(CL) and CP(EL) are determined by the union of Hoover's and Mergent. Figure 27 shows the competitor coverage ratio for the intercompany network across the 21 sample baskets. We find that the competitor coverage ratio is always greater than 66% and typically in the range of 85-100% across the sample baskets. We also note that CL is a fraction of EL, ranging from 15% to 84% across the sample baskets. Hence, our classification models explore a small subspace of all possible relationships by starting with the intercompany network, and this subspace covers most of the competitor pairs.
Figure 27. Competitor Coverage Ratio
9.3.2 Examining the Competitor Density
Using the union of data from Hoover's and Mergent, we label the 40 company pairs in each sample basket to find CP40(CL). Given N, we randomly generate 40 links from the N unique companies and find CP40(RL). We repeat the procedures of random link generation and link labeling four times and obtain an average CP40(RL). Then we compute the competitor density CD40(CL) and the average CD40(RL) for all sample baskets. Moreover, since we know CP(EL) from Section 9.3.1, we also calculate CD(EL). Figure 28 shows the competitor density for the citation-based intercompany network, the random network, and the exhaustive network across the 21 sample baskets. The curve for the average CD40(RL) is very close to that of CD(EL), indicating that the probability of being a competitor pair among the 40 randomly generated pairs is consistent with that in the exhaustive links. Moreover, CD40(CL) is much higher than the average CD40(RL) and CD(EL) in 20 of the 21 sample baskets. The difference in those probabilities tells us that pairs in the intercompany network, for most of the baskets, are much more likely to be competitor pairs than those in random links. The high competitor density in the intercompany network for most sample baskets should benefit classifiers in competitor classification.
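The random-baseline estimate above can be sketched as follows. The seed and helper names are our own; the repeated draw mirrors the four repetitions described in the text.

```python
import random
from itertools import combinations

def average_random_density(companies, competitor_pairs, n_links=40,
                           repeats=4, seed=0):
    """Average competitor density of randomly generated links, i.e. the
    average CD40(RL) in the text: draw `n_links` random pairs from the
    companies, count competitor pairs among them, and repeat."""
    rng = random.Random(seed)
    all_pairs = list(combinations(sorted(companies), 2))
    gold = {frozenset(p) for p in competitor_pairs}
    densities = []
    for _ in range(repeats):
        chosen = rng.sample(all_pairs, min(n_links, len(all_pairs)))
        hits = sum(1 for p in chosen if frozenset(p) in gold)  # CP40(RL)
        densities.append(hits / len(chosen))                   # CD40(RL)
    return sum(densities) / len(densities)
```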
Figure 28. Probability of being a Competitor Pair (curves: CD40(CL), CD40(RL), and CD(EL))
The results in Sections 9.3.1 and 9.3.2 show that the citation-based intercompany network has high competitor coverage and density and hence can be expected to alleviate the problems associated with searching for competitors in an exhaustive or random space of potential relationships. The results also confirm our intuition that links in the citation-based intercompany network contain signals about competitor relationships rather than being random. We are now ready to explore the learning power of models based on the topological attributes of the intercompany network, described in Section 3, in discovering competitor relationships.
9.4 Competitor Discovery
Our competitor classification models use four types of attributes to classify a company pair as competitors or non-competitors. The class label (dependent variable) in the models is binary, while all of the classification attributes/variables are continuous. This setup allows us to apply a variety of standard binary classification models. As is common in machine learning and data mining, we use part of the data set for training and leave a disjoint testing set to evaluate the discriminating power of the models. This training-testing process is repeated several times with different splits of the data (cross-validation) to ensure the robustness of the observed results. We evaluate the discriminating power of the models using several standard metrics, described next.
9.4.1 Evaluation Metrics
Table 28 is the confusion matrix containing the actual and classified classes for a
classification problem with two class labels. TP is the number of true positives, TN is the
number of true negatives, FP is the number of false positives, and FN is the number of
false negatives.
Table 28.
Confusion Matrix

 | Classified positive | Classified negative
Actual positive | TP | FN
Actual negative | FP | TN
Using the confusion matrix we introduce the common metrics for evaluating and
comparing classification performance as follows:
Precision = TP / (TP + FP)    (11)
Recall (TP rate) = TP / (TP + FN)    (12)
F = ((1 + α) × Precision × Recall) / (α × Precision + Recall)    (13)
FP rate = FP / (FP + TN)    (14)
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (15)
In most classification problems, precision and recall present a trade-off. As a model tries to be conservative when classifying competitors in order to boost precision, it is expected to miss some of the competitors and hence achieve reduced recall. The F measure is based on both precision and recall, and the parameter α denotes the relative importance of recall versus precision. Setting α to 1 produces F1, the harmonic mean of precision and recall. Note that throughout the paper the precision, recall, and F measures are based only on the positive class (competitors), which is the more important class here.
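The confusion-matrix metrics can be computed with a small helper of our own; the zero-denominator guards are our convention, not part of the standard definitions.

```python
def classification_metrics(tp, fn, fp, tn, alpha=1.0):
    """Precision, recall (TP rate), F measure, FP rate, and accuracy
    from a binary confusion matrix. With alpha=1 the F measure is F1,
    the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # also the TP rate
    f = (((1 + alpha) * precision * recall)
         / (alpha * precision + recall)) if precision + recall else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f, fp_rate, accuracy
```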
One of the most common metrics for evaluating classifiers on imbalanced datasets is the Receiver Operating Characteristic (ROC) curve [Kotsiantis, et al. 2006]. It is a two-dimensional curve in which the TP rate (recall) is plotted on the y-axis and the FP rate on the x-axis (for specific examples see Figure 31). An ROC curve thus captures an important tradeoff: the number of correctly identified positives increases at the expense of introducing additional false positives. The area under the ROC curve, called AUC, is also used as an evaluation metric.
9.4.2 Competitor Classification with Dataset I
Using the publicly available Weka API [Witten and Frank 2005], we employ four classification methods, Artificial Neural Network (ANN), Bayes Net (BN), C4.5 decision tree (DT), and logistic regression (LR), to classify whether a pair of companies are competitors. Models based on ANN, BN, and DT are commonly used as classifiers in data mining. LR is frequently used in business research for problems with a binary class label (as in our competitor classification problem). For each sample basket, except sample basket #21, which does not contain any competitor pairs (we handle this basket together with three other baskets as the imbalanced dataset II in the next subsection), we report the average precision and recall generated by 10-fold cross-validation for each classification method. We use different classification methods so that we may compare their performance on our application.
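The evaluation procedure can be sketched in pure Python as a stand-in for Weka's cross-validation machinery. The fold-splitting and shuffling details here are our assumptions, and `train_and_predict` abstracts over any of the four classifiers.

```python
import random

def kfold_cv(instances, labels, train_and_predict, k=10, seed=1):
    """Average precision and recall on the positive class over k-fold
    cross-validation. `train_and_predict(X_train, y_train, X_test)`
    returns a list of 0/1 predictions for X_test."""
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]       # k roughly equal folds
    precisions, recalls = [], []
    for f, test_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f
                     for i in fold]
        preds = train_and_predict(
            [instances[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [instances[i] for i in test_idx])
        tp = sum(1 for i, p in zip(test_idx, preds)
                 if p == 1 and labels[i] == 1)
        fp = sum(1 for i, p in zip(test_idx, preds)
                 if p == 1 and labels[i] == 0)
        fn = sum(1 for i, p in zip(test_idx, preds)
                 if p == 0 and labels[i] == 1)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / k, sum(recalls) / k
```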
9.4.3 Competitor Classification with Dataset II
9.4.3.1 Background on Handling Imbalanced Dataset
Solutions for handling imbalanced datasets in classification problems exist at both the data and algorithmic levels. Several data-level solutions use re-sampling approaches, such as undersampling the majority, oversampling the minority, or oversampling the minority by creating synthetic minority instances [Chawla et al. 2002], in order to change the prior distribution of the original dataset [Kotsiantis, et al. 2006] before learning from it. Another data-level approach is to segment the whole data set into disjoint regions such that the data in certain region(s) are not imbalanced [Weiss 2004].
Some of the popular solutions at the algorithmic level include the following:
Decision threshold adjustment (DTA)
Given a (normalized) probability of an example being positive (or negative), DTA changes the threshold used to decide which class label the instance is assigned [Kotsiantis, et al. 2006].
Cost-sensitive learning (CSL)
This method assigns fixed and unequal costs to different misclassifications, for instance cost(false negative) > cost(false positive), such that the goal of CSL is to minimize the total cost of misclassification [Pazzani et al. 1994].
Recognition-based learning (RBL)
Unlike a two-class classification method, which learns rules for both the positive and negative classes, RBL is a one-class learning method in that it learns only rules that characterize the minority class [Weiss 2004, Kotsiantis, et al. 2006].
In this paper we employ several of the techniques discussed above to handle our imbalanced dataset. We use DWIOD to divide the whole dataset into 21 baskets, many of which turn out to be more "balanced" than the entire data set. Hence the basketing approach matches the "segment data" approach [Weiss 2004] for handling imbalanced data sets. For the few imbalanced baskets, we sample more examples to form our imbalanced dataset II. We then employ two different approaches to attack the imbalanced dataset problem: the simple DTA approach and an undersampling-ensemble (UE) method (explained in Section 9.4.3.3). We do not choose the CSL approach mainly because we do not know the right ratio for the cost of FN versus the cost of FP in the context of the competitor classification problem. However, we note that DTA and CSL are essentially similar in that both create a bias toward positive classifications. For either the DTA or the UE approach we employ the same four classification methods (ANN, BN, C4.5, and LR) from Weka. With dataset II we report various performance metrics suited for imbalanced datasets, including F1, precision, TP rate, FP rate, ROC, AUC, and accuracy. Next we introduce the two approaches in detail.
9.4.3.2 DTA Approach
In this approach we simply adjust the decision threshold used by a classifier to decide whether an instance is classified as positive or negative, given its (normalized) probability of being positive. For example, given Pr(x is positive) = 0.3, the instance x is labeled as negative when the decision threshold is 0.5. However, when the threshold is adjusted to 0.2, x is classified as positive.
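The threshold adjustment in the example above is trivially small in code (a helper of our own):

```python
def classify_with_threshold(p_positive, threshold=0.5):
    """Decision-threshold adjustment (DTA): lowering the threshold
    biases the classifier toward the minority (positive) class."""
    return 1 if p_positive >= threshold else 0
```

With the default threshold of 0.5 an instance with Pr(positive) = 0.3 is labeled negative; with the threshold lowered to 0.2 the same instance is labeled positive, exactly as in the example.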
For training and testing, we follow the strict tuning procedures recommended by Salzberg [1997]. In particular, we randomly select 1500 instances from the imbalanced data set as the training set and use the remaining 500 as the testing set. Next, for each classification method we use 10-fold cross-validation and tune input parameters to obtain the best performance on the F1 measure using just the training set. Finally, we apply each trained classifier, with its respective "best" parameter setting, to the testing set for evaluation purposes. Moreover, for robustness, we randomly divide the 2000 pairs into four disjoint sets of equal size, which form four different pairs of training and testing sets. We then apply the above training-tuning-testing procedures to the four pairs of training and testing sets and report the average results (see the formula in Section 9.4.5). We note that, in each case, training and parameter tuning are based only on the training data set, and evaluation of the trained and tuned classifier is based only on the testing data set. For ANN, we tune the learning rate from 0.1 to 1.0 and the momentum from 0.1 to 0.3; for BN, we choose K2 [Cooper and Herskovitz 1992] and TAN [Friedman et al. 1997] as the algorithms for searching the network structure; for DT, we vary the minimum leaf size from 2 to 10; no parameter tuning is needed for LR. We accept all other parameters at their Weka defaults. We apply the same tuning procedures throughout the study whenever parameter tuning is used.
9.4.3.3 UE Approach
From the original imbalanced dataset II we generate multiple smaller, more balanced sub-datasets by duplicating all minority (positive) instances in each of the subsets and then evenly splitting the majority among those subsets, as depicted in Figure 29. A classifier can be built from each sub-dataset, and an ensemble approach [Estabrooks and Japkowicz 2001] can be used to generate the final classification result. Chan and Stolfo [1998] adopt a similar undersampling method. We choose the majority vote as the ensemble approach; for the majority vote we use either the binary output (0 or 1) of each classifier or the probability output (between 0 and 1) of each classifier, and denote these variants as majority vote by count (MVC) and majority vote by probability (MVP), respectively.
Figure 29. Generating More Balanced Sub Datasets
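The subset-generation and voting steps can be sketched as follows. These are our own helpers; the dissertation trains Weka classifiers on each subset, which the sketch omits.

```python
def make_balanced_subsets(positives, negatives, n_subsets):
    """UE step 1: duplicate all minority (positive) instances into each
    subset and split the majority (negative) instances evenly across
    the subsets."""
    subsets = []
    for s in range(n_subsets):
        majority_slice = negatives[s::n_subsets]  # even split
        subsets.append(list(positives) + majority_slice)
    return subsets

def majority_vote_by_count(binary_votes):
    """MVC: label 1 when more than half of the classifiers vote 1."""
    return 1 if sum(binary_votes) * 2 > len(binary_votes) else 0

def majority_vote_by_probability(probabilities):
    """MVP: label 1 when the mean predicted probability exceeds 0.5."""
    return 1 if sum(probabilities) / len(probabilities) > 0.5 else 0
```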
During the training phrase, with an initial ratio of positives in the subsets we tune
the parameters for each classifier (no parameter tuning needed for LR) and record the
performance of each classifier in an output file. Then we repeat the above procedure each
136
![Page 137: Web mining is the application of data mining techniques to ...zma/research/dissertation proposal chapters... · Web viewThe main findings include (1) PCAT is better than LIST for](https://reader035.vdocuments.us/reader035/viewer/2022070611/5b190c687f8b9a28258c4ff1/html5/thumbnails/137.jpg)
time with a different ratio of positives, which varies from 0.05 to 0.60 with a step size
of 0.05. From all output files, on the basis of the best F1 performance we determine a
set of "best" parameters for each classifier and a best ratio of positives. Finally, we
apply the trained classifiers with their best parameter sets and best ratios of positives
to the testing set for evaluation. As before, we divide the 2,000 pairs into four disjoint
sets of equal size, generate results separately for the four pairs of training and testing
sets, and report the average results.
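The ratio-tuning loop described above can be sketched as follows (illustrative only; `train_fn` and `eval_f1` stand in for whatever builds the sub-dataset classifiers at a given positive ratio and scores them on held-out data):

```python
def tune_positive_ratio(train_fn, eval_f1, ratios=None):
    """Grid-search the positive-class ratio used to build the sub datasets,
    keeping whichever setting yields the best F1 on held-out data."""
    if ratios is None:
        # ratio of positives from 0.05 to 0.60 with a step size of 0.05
        ratios = [round(0.05 * k, 2) for k in range(1, 13)]
    best = (None, -1.0)
    for r in ratios:
        model = train_fn(r)      # build classifier(s) with this ratio
        f1 = eval_f1(model)      # F1 on the validation split
        if f1 > best[1]:
            best = (r, f1)
    return best                  # (best ratio, its F1)
```

In practice the inner loop would also sweep each classifier's own parameters, recording every (ratio, parameters, F1) combination before picking the maximum.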
9.4.4 Classification Performance for Dataset I
[Plot omitted: y-axis 0.0-1.0; x-axis Basket 1-20; series: Precision, Recall, Prior]
Figure 30. Precision and Recall of Dataset I by ANN and Prior Distribution
Figure 30 shows the precision and recall achieved by ANN for individual sample
baskets. For comparison purposes, we also include the prior distribution of positives in
each sample basket. The precision curve is almost always above the prior probability,
except for the last two sample baskets, which have the lowest prior distributions (5% and
2.5%, respectively). Hence, Figure 30 shows that while the classification performance of
ANN is reasonably good for most baskets, the performance is rather strong when the
DWIOD values are large (initial baskets) and rather weak when the DWIOD values are very
small (last few baskets). This result highlights the inherent challenge of accurately
classifying the minority class in imbalanced datasets (the last few baskets). The other
three classification methods (BN, DT, and LR) show similar performance patterns,
although lower performance overall. We show the results of applying special techniques
to the imbalanced parts of the dataset in the next subsection.
9.4.5 Classification Performance for Dataset II
Table 29.
Classification Performance of Dataset II by DTA Approach

                       Without sector information      With sector information**
Data set   Measure     ANN     BN      DT      LR      ANN     BN      DT      LR
Training*  Precision   0.280   0.142   0.119   0.353   0.361   0.277   0.318   0.398
           Recall      0.227   0.277   0.467   0.220   0.443   0.520   0.403   0.410
           FP rate     0.031   0.088   0.182   0.021   0.041   0.071   0.045   0.033
           F1          0.250   0.188   0.190   0.271   0.398   0.362   0.356   0.404
           Accuracy    0.932   0.880   0.801   0.941   0.933   0.908   0.927   0.940
           AUC         0.753   0.703   0.656   0.756   0.870   0.863   0.740   0.865
Test       Precision   0.268   0.125   0.090   0.322   0.372   0.262   0.283   0.380
           Recall      0.220   0.240   0.400   0.190   0.420   0.430   0.360   0.380
           FP rate     0.032   0.088   0.213   0.021   0.037   0.064   0.048   0.033
           F1          0.242   0.164   0.147   0.239   0.394   0.326   0.317   0.380
           Accuracy    0.931   0.878   0.768   0.940   0.936   0.911   0.923   0.938
           AUC         0.736   0.672   0.610   0.723   0.858   0.853   0.741   0.834

* Results for the training set are based on the best F1 performance with parameter tuning.
** Each company's sector as listed on Yahoo! Finance is included as an attribute.
Table 29 reports precision, TP rate (recall), FP rate, F1, accuracy, and AUC on the
training and testing sets for each classification method using the DTA approach. Each
bold number in the table indicates the best performance on a measure across the four
classification models for the testing set. Since we have four pairs of training (1,500
instances) and testing (500 instances) sets, we generate and report overall performance
with the following equations, which are based on the definitions in equations 1 to 5.
Precision = Σ_i TP_i / (Σ_i TP_i + Σ_i FP_i)   (16)
Recall = Σ_i TP_i / (Σ_i TP_i + Σ_i FN_i)   (17)
FP rate = Σ_i FP_i / (Σ_i FP_i + Σ_i TN_i)   (18)
F1 = 2 × Precision × Recall / (Precision + Recall)   (19)
Accuracy = (Σ_i TP_i + Σ_i TN_i) / (Σ_i TP_i + Σ_i TN_i + Σ_i FP_i + Σ_i FN_i)   (20)
TP, TN, FP, and FN are defined as in subsection 5.1, and the subscript i is a number
between 1 and 4 denoting the four disjoint testing sets from dataset II.
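The pooling behind equations 16-20 can be sketched as follows, assuming the per-fold confusion counts are available (the function name and input format are ours):

```python
def pooled_metrics(folds):
    """Combine per-fold confusion counts (TP, FP, FN, TN) into overall
    precision, recall, FP rate, F1, and accuracy, mirroring Eqs. 16-20."""
    TP = sum(f[0] for f in folds)
    FP = sum(f[1] for f in folds)
    FN = sum(f[2] for f in folds)
    TN = sum(f[3] for f in folds)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    fp_rate = FP / (FP + TN)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (TP + TN) / (TP + FP + FN + TN)
    return precision, recall, fp_rate, f1, accuracy
```

Summing the raw counts before dividing (rather than averaging per-fold ratios) keeps the overall measures consistent when folds differ in class balance.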
Table 29 contains results for the same dataset with and without sector information
(sector is encoded as a variable with nine categorical values). Using sector
information greatly improves the classification performance for dataset II across the
four classifiers; for example, the maximum F1 measures (both produced by ANN)
increase by 63%. With sector information we do not observe a significant difference in
the F1 measure across the 20 baskets in dataset I (two-tailed t-test, p = 0.827),
indicating that sector information is more helpful for imbalanced datasets than for
more balanced ones. We find that of all 316 competitor pairs in dataset III (216 in the
17 sample baskets of dataset I and 100 in dataset II), 282 (89.2%) pairs are in the same
sector and 34 (10.8%) are not.
The UE approach with MVC and MVP produces results similar to those in
Table 29. For instance, for MVC the maximum F1 values are 0.381 and 0.204 with and
without sector information, respectively. Although the UE approach is more complex
than the simple DTA approach (undersampling the majority to form multiple smaller
datasets and adjusting the positive ratios in those datasets), the latter produces results
as good as the former in our study. Thus, when estimating to what extent our approach
extends a gold standard in the next section, we use the results from the DTA approach.
Table 29 shows that ANN has the largest AUC values. Figure 31 illustrates the
ROC curves for the four classifiers using sector information. The ROC curves for ANN,
BN, and LR are close to each other, and all three are above (always slightly outperform)
the DT curve. The diagonal line corresponds to randomly labeling instances with
different likelihoods. For example, when a classifier randomly guesses the positive class
10% of the time, it is expected to identify 10% of the positives correctly, for a TP rate
of 0.1. At the same time, it identifies 90% of the negatives correctly, leading to an FP
rate of 0.1 (1 − 0.9) as well. Hence, guessing the positive class 10% of the time yields
the point (0.1, 0.1) in ROC space, and random guessing with all different likelihoods
generates the diagonal line.
[Plot omitted: TP rate (y, 0.0-1.0) vs. FP rate (x, 0.0-1.0); curves for ANN, BN, DT, LR]
Figure 31. ROC Curves of Dataset II for Four Classification Methods
9.4.6 Estimated Overall Classification Performance on the Basis of Dataset III
All of our classification performance measurements until now have been
computed for individual sample baskets. Since the sample baskets consist of random
samples of the original (larger) baskets, the performance results are representative of the
performance on those original baskets. However, we would now like to estimate the
classification performance for all of the baskets combined, i.e., the whole dataset of
87,340 pairs. This estimation requires us to extrapolate the performance observed on a
sample basket to the entire original basket. Hence, we adopt the following equations to
estimate the overall precision, TP rate (recall), FP rate, F1, and accuracy using dataset
III. For the 17 sample baskets from dataset I, the classification results are produced by
10-fold cross validation. For the 18th sample basket we use the results generated from
the four disjoint testing sets
(each with 500 instances). Since the 2,000 pairs are proportionally sampled from four
of the 21 baskets, we can combine the results of the four disjoint test sets as one
"combined basket" (new sample basket #18) in the following equations.
Precision = Σ_i (B_i/S_i)·TP_i / Σ_i (B_i/S_i)·(TP_i + FP_i)   (21)
Recall = Σ_i (B_i/S_i)·TP_i / Σ_i (B_i/S_i)·(TP_i + FN_i)   (22)
FP rate = Σ_i (B_i/S_i)·FP_i / Σ_i (B_i/S_i)·(FP_i + TN_i)   (23)
F1 = 2 × Precision × Recall / (Precision + Recall)   (24)
Accuracy = Σ_i (B_i/S_i)·(TP_i + TN_i) / Σ_i (B_i/S_i)·(TP_i + TN_i + FP_i + FN_i)   (25)
where the sums run over the 18 sample baskets (i = 1, …, 18), B_i is the size of basket i,
and S_i is the size of sample basket i.
Hence, the above equations estimate the overall classification performance by
extending the performance measurements for a sample basket to the corresponding full
basket and then combining the measurements across the 18 baskets in dataset III. For
example, if sample basket S_i, which represents original basket B_i, contains k
classified positives, basket B_i can be expected to contain (B_i/S_i)·k classified
positives. We note that the above equations estimate the overall classification
performance for the whole dataset of 87,340 pairs. The estimation therefore reflects the
performance of an ensemble of 18 classifiers (one for each basket) based on a given
classification method. The estimated overall prior probability for positives is 11.8%
(about 1 in 9 pairs in the original dataset is a competitor pair). We note that ANN
performs best on more measures than the other three methods. However, unlike ANN,
DT, or BN, LR does not require any parameter tuning and produces comparably good
results. We highlight the best performance value under each measure.
Table 30.
Estimated Overall Performances

      Without sector information               With sector information
      Precision Recall  FP rate F1     Accuracy Precision Recall  FP rate F1     Accuracy
ANN   0.419     0.378   0.046   0.397  0.907    0.450     0.513   0.055   0.479  0.910
BN    0.238     0.354   0.095   0.284  0.863    0.388     0.514   0.071   0.442  0.895
DT    0.167     0.463   0.203   0.245  0.770    0.432     0.457   0.053   0.444  0.907
LR    0.388     0.330   0.046   0.357  0.904    0.382     0.437   0.062   0.407  0.897
9.5 Competitor Extension
In the introduction, we noted with an anecdote that our gold standards are
expected to be incomplete. We now suggest metrics to estimate (1) the coverage of
competitors by a gold standard and (2) the extent to which our approach extends each
gold standard.
9.5.1 Estimating the Coverage of a Gold Standard
Figure 32. Competitors Covered by Two Gold Standards
We will need the following notation (from Figure 32) to describe the estimation
procedure:
C: the (unknown) complete set of competitor pairs
H: the set of competitor pairs covered by Hoover’s
M: the set of competitor pairs covered by Mergent
J_HM = H ∩ M, the intersection of H and M
Following the idea proposed in the highly cited study by Lawrence and Giles [1998]
to estimate the coverage of search engines, and assuming H and M to be independent
subsets of C, we can estimate to what extent H covers C based on how much of H covers
M (i.e., J_HM) and the size of M. We therefore define the coverage of the entire
competitor set C by Hoover's (Cov(H)) and Mergent (Cov(M)) as follows:
Cov(H) = |J_HM| / |M|   (26)
Cov(M) = |J_HM| / |H|   (27)
If H and M are not completely independent, the value of J_HM (their intersection)
would tend to be larger than when they are independent. Hence, we may overestimate
the coverage of a gold standard, and the coverage estimation can be considered an upper
bound on the true coverage of the gold standard.
We have previously labeled the positive instances using Hoover's and Mergent
for each of the sample baskets. Hence, for the ith sample basket we can compute the
number of competitor pairs identified by Hoover's (H_i) and by Mergent (M_i)
separately, as well as the size of their intersection (J_HM_i). In a manner similar to
defining equation 11 in subsection 5.6, we estimate the number of positives (for
Hoover's, Mergent, and their intersection) in each original basket by multiplying the
number of positives in the sample basket by the ratio of the basket size to the sample
basket size. Then, based on equations (26) and (27), we calculate the coverage of
Hoover's and Mergent as follows.
Cov(H) = Σ_i (B_i/S_i)·J_HM_i / Σ_i (B_i/S_i)·M_i   (28)
Cov(M) = Σ_i (B_i/S_i)·J_HM_i / Σ_i (B_i/S_i)·H_i   (29)
We find that the estimated coverage of Hoover’s and Mergent is 46.0% and
24.9%, respectively. Our estimation shows that while Hoover’s covers almost twice as
many competitor pairs as Mergent, both data sources individually cover less than 50% of
all competitor pairs.
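Under the independence assumption, the coverage estimates of equations (26) and (27) reduce to a few set operations, as in this sketch (pair IDs are made up for illustration):

```python
def coverage(h_pairs, m_pairs):
    """Lawrence-Giles-style coverage estimate: the fraction of one source's
    pairs found in the other estimates the other's coverage of the unknown
    complete set, assuming the two sources are independent samples."""
    h, m = set(h_pairs), set(m_pairs)
    j = len(h & m)                     # J_HM, pairs listed by both sources
    return j / len(m), j / len(h)      # (Cov(H), Cov(M))
```

Positive correlation between the two sources inflates the intersection, so these values are upper bounds, as noted above.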
9.5.2 Estimating the Extension of Our Approach to a Gold Standard
Figure 33. Competitors Covered by Two Gold Standards and Our Approach
We now present a procedure to estimate how much our automated approach can
extend a gold standard. Our estimation procedure uses the following notation:
O: the set of competitor pairs classified by our approach
O' = C − O, H' = C − H, M' = C − M (complements with respect to C)
J_HMO = H ∩ M ∩ O
J_H'MO = H' ∩ M ∩ O
J_HM'O = H ∩ M' ∩ O
J_HMO' = H ∩ M ∩ O'
J_H'MO is the subset of competitor pairs that are classified positive by our approach
and confirmed to be positive by Mergent but not identified as competitors by Hoover's.
Since Mergent is a sample of all competitor pairs, we estimate the extent to which our
approach extends Hoover's (Ext(O, H)) as follows:
Ext(O, H) = |J_H'MO| / |J_HM|   (30)
Similarly, we estimate the extent to which our approach extends Mergent (Ext(O,
M)) as follows:
Ext(O, M) = |J_HM'O| / |J_HM|   (31)
Based on equations (30) and (31), we compute the extension of our approach to
each gold standard using the results from dataset III with the following equations.
Ext(O, H) = Σ_i (B_i/S_i)·J_H'MO_i / Σ_i (B_i/S_i)·J_HM_i   (32)
Ext(O, M) = Σ_i (B_i/S_i)·J_HM'O_i / Σ_i (B_i/S_i)·J_HM_i   (33)
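One plausible set-based reading of the extension estimates can be sketched as follows (illustrative; it treats each gold standard as a validation sample for the pairs our classifier finds beyond the other):

```python
def extension(ours, hoovers, mergent):
    """Estimate how far the classifier output O extends each gold standard:
    pairs found by O and confirmed by one source but absent from the other,
    relative to the pairs both sources agree on."""
    O, H, M = set(ours), set(hoovers), set(mergent)
    both = len(H & M)                       # J_HM
    ext_h = len((M & O) - H) / both         # new vs. Hoover's, validated by Mergent
    ext_m = len((H & O) - M) / both         # new vs. Mergent, validated by Hoover's
    return ext_h, ext_m
```

For the per-basket version, each intersection count would be scaled by B_i/S_i before summing, as in the coverage estimates.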
Table 31.
Extensions to a Gold Standard

            Without sector information     With sector information
            ANN     BN      DT      LR     ANN     BN      DT      LR
Ext(O, H)   5.9%    7.3%    15.3%   5.0%   12.1%   11.3%   10.1%   10.5%
Ext(O, M)   28.7%   23.4%   37.2%   24.3%  33.8%   37.1%   35.8%   32.9%
Table 31 shows the estimation of how much our approach extends the knowledge
available in each of the two gold standards. The results are shown for different
classification methods, both with and without the use of sector information. Using
sector information, our approach, with any classification method, extends Hoover's and
Mergent by over 10% and 32%, respectively. We note that the values of our extension to
a gold standard are based on classification results that were generated using a particular
set of input parameters and classification methods. As the ROC curves in Figure 31
illustrate, we could achieve a higher TP rate (recall) by adjusting some of those
parameters, thereby obtaining higher values for our extension to a gold standard, but at
the cost of a higher FP rate, which leads to lower precision. The extension results in
Table 31 are associated with an estimated overall precision, recall, and FP rate of 0.419,
0.378, and 0.046 (without sector information) and 0.450, 0.513, and 0.055 (with sector
information), respectively, as shown in Table 30 for the ANN classifier.
9.6 Explorations of Competitors vs. Non-competitor Pairs
In the next two subsections, we report further exploration results on structural
equivalence (SE) similarity between competitor and non-competitor pairs, and on
company annual revenues for competitor pairs with high and low DWIOD values.
9.6.1 SE Similarity Comparison between Competitor and Non-competitor Pairs
For each sample basket of dataset III, we compute and compare the average SE
similarities for competitor and non-competitor pairs. Figure 34 compares the DWID-based
SE similarities of the 18 sample baskets in dataset III. Except for the last basket, which
has the smallest DWIOD values, the average SE similarities for competitor pairs are
greater than those for non-competitor pairs (two-tailed t-test, p=0.003), which indicates
that on average competitor companies are more structurally equivalent than
non-competitors. Similar patterns are observed for the DWOD- and DWIOD-based SE
similarities (two-tailed t-tests, p=0.008 and p=0.001, respectively).
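The basket-level comparison can be sketched as a paired t statistic over per-basket averages (an assumption on our part; the dissertation does not state which t-test variant was used):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(xs, ys):
    """Paired t statistic for per-basket average SE similarities of
    competitor (xs) vs. non-competitor (ys) pairs; for a two-tailed test,
    compare |t| against the t distribution with len(xs) - 1 degrees of freedom."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))
```

With 18 baskets, the p-value would come from the t distribution with 17 degrees of freedom (e.g., via `scipy.stats`).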
[Plot omitted: average SE similarity 0.0-0.7 by Basket 1-18; series: Competitor, Non-competitor]
Figure 34. Average DWID-based SE Similarity Comparison
9.6.2 Comparing Annual Revenues between Competitor Pairs with High and Low
DWIODs
We observe that the average revenue of company pairs with low DWIOD values
(the 100 competitor pairs in dataset II) is significantly lower (two-tailed t-test, p<0.001)
than the average revenue of company pairs with high DWIOD values (the 92 competitor
pairs in the first five sample baskets, #1-#5, of dataset I).
9.7 Discussions
Given that news portals organize news stories by company and a news story
pertaining to a company often cites several other companies, we treat such company
citations as outlinks from the source company to the target companies and construct a
directed, weighted intercompany network. On the basis of 60,532 news stories from an
8-month period collected from Yahoo! Finance, the network consists of 6,428
companies as nodes and 87,340 links (company pairs). Using SNA techniques, we
identify four types of attributes from the network structure. One of the attributes, dyadic
weighted inoutdegree (DWIOD), which captures the overall volume of citations between
two companies in the news, is used to split our dataset of 87,340 company pairs into 21
baskets. We generate three datasets: dataset I consists of 840 pairs randomly sampled
from the 21 baskets; dataset II has 2,000 pairs and represents an imbalanced portion of
the whole set, where the number of competitor pairs is much smaller than the number of
non-competitor pairs; and dataset III is generated from the first two. We use two
company profile Web sites, Hoover's and Mergent, as gold standards to label a company
pair as competitors or not.
Before conducting classification, with dataset I we first examine the competitor
coverage and competitor density of the intercompany network by comparing them with
those of an exhaustive network (clique) and a random network. We find that the
intercompany network covers 66-100% of known competitor pairs while its size (in
number of links) is only 15-84% of the clique's; the competitor density of the
intercompany network is several (2-52) times higher than that of the random network.
We employ four classification models (Artificial Neural Network, Bayes Net,
Decision Tree, and Logistic Regression) to classify competitor relationships and report
classification performance on the basis of 10-fold cross validation. The results on
individual sample baskets in dataset I reveal that typical classification methods fail to
detect the minority class in imbalanced datasets. Thus, with dataset II we compare two
approaches capable of handling the imbalanced-dataset problem and report their
performance. With dataset III we estimate the overall performance for the whole dataset
on the basis of the classification results from datasets I and II. As another aspect of this
research, we estimate to what extent a gold standard covers the whole competitor space.
Finally, we present metrics and estimate to what extent our automatic approach can
extend a gold standard.
In summary, we present a data mining approach that uses company citations in
news to discover competitor relationships. Individual company citations may appear
random (mere co-occurrence); however, given a large number of news articles, our
approach can discover meaningful business relationships.
REFERENCES
Adamic, L. A. 2002. Zipf, Power-laws, and Pareto - A Ranking Tutorial. http://ginger.hpl.hp. com/shl/papers/ranking/ranking.html.
Barabási, A. L., R. Albert, H. Jeong. 2000. Scale-Free Characteristics of Random Networks: The Topology of the World Wide Web. Physica A, 281 69-77.
Bernstein, A., S. Clearwater, S. Hill, F. Provost. 2002. Discovering Knowledge from Relational Data Extracted from Business News. In Proceedings of the KDD 2002 Workshop on Multi-Relational Data Mining, Edmonton, Alberta, Canada.
Brandes, U. 2001. A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology, 25(2) 163-177.
Brin, S., L. Page. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30(1-7) 107-117.
Broder, A. 2002. A Taxonomy of Web Search. ACM SIGIR Forum, 36(2) 3-10.
Broder, A., R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. L. Wiener. 2000. Graph Structure in the Web. In Proceedings of the 9th World Wide Web Conference, 309-320.
Budzik, J., K. Hammond. 2000. User Interactions with Everyday Applications as Context for Just-in-time Information Access. In Proceedings of the 5th International Conference on Intelligent User Interfaces, New Orleans, LA, 44-51.
Butler, D. 2000. Souped-up Search Engines. Nature, 405, 112-115.
Carroll, J., M. B. Rosson. 1987. The Paradox of the Active User. In Interfacing Thought: Cognitive Aspects of Human-Computer Interaction, J.M. Carroll, Ed. MIT Press, Cambridge, MA.
Chan, P., S. Stolfo. 1998. Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, New York City, NY, 164-168.
Chakrabarti, S., B. E. Dom, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, J. Kleinberg. 1999. Mining the Web's link structure. Computer, 32(8), 60-67.
Chawla, N. V., K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 321-357.
Chirita, P. A., W. Nejdl, R. Paiu, C. Kohlschütter. 2005. Using ODP Metadata to Personalize Search. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 178-185.
Cooper, G., E. Herskovitz. 1992. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning 9(4) 309-347.
Craswell, N., D. Hawking, S. Robertson. 2001. Effective Site Finding Using Link Information. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, New Orleans, LA, 250-257.
Cutting, D.R., D.R. Karger, J.O. Pedersen, J.W. Tukey. 1992. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proceedings of the 15th Annual International ACM SIGIR conference on Research and Development in Information Retrieval, Copenhagen, Denmark, 318-329.
Deerwester, S., S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6) 391-407.
Dietterich, T.G. 1997. Machine Learning Research: Four Current Directions. AI Magazine, 18(4) 97-136.
Dreilinger, D., A. E. Howe. 1997. Experiences with Selecting Search Engines Using Metasearch. ACM Transactions on Information Systems, 15(3) 195-222.
Dumais S., H. Chen. 2001. Optimizing Search by Showing Results in Context. In Proceedings of Computer-Human Interaction, Seattle, WA, 277-284.
Eirinaki, M., M. Vazirgiannis. 2003. Web Mining for Web Personalization. ACM Transactions on Internet Technology, 3(1) 1-27.
Estabrooks, A., N. Japkowicz. 2001. A Mixture-of-experts framework for Learning from Unbalanced Data Sets. Proceedings of the 4th International Symposium on Intelligent Data Analysis. Lisbon, Portugal, 34-43.
Faloutsos, M., P. Faloutsos, C. Faloutsos. 1999. On power-law relationships of the Internet topology. In Proceedings ACM SIGCOMM, 251-262.
Fayyad, U. M., G. Piatetsky-Shapiro, P. Smyth. 1996. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, Eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, AAAI Press, Menlo Park, California, 1–30.
Finkelstein, L., E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, E. Ruppin. 2002. Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems, 20(1) 116-131.
Freeman, L. C. 1979. Centrality in Social Networks: Conceptual Clarification. Social Networks, 1, 215-239.
Friedman, N., D. Geiger, M. Goldszmidt. 1997. Bayesian Network Classifiers. Machine Learning 29(2-3) 131-163.
Garfield, E. 1979. Citation Indexing: Its Theory and Application in Science, Technology, and Humanities. Wiley, New York.
Gauch, S., J. Chaffee, A. Pretschner. 2003. Ontology-based Personalized Search and Browsing. Web Intelligence & Agent Systems, 1(3/4) 219-234.
Giles, C. L., K. Bollacker, S. Lawrence. 1998. CiteSeer: An Automatic Citation Indexing System. In Proceedings of the 3rd ACM Conference on Digital Libraries, Pittsburgh, PA, USA, 89-98.
Glover, E., S. Lawrence, W. Brimingham, C. L. Giles. 1999. Architecture of a Metasearch Engine that Supports User Information Needs. In Proceedings of the 8th International Conference on Information Knowledge Management, Kansas City, MO, 210-216.
Gulati, R., M. Gargiulo. 1999. Where Do Interorganizational Networks Come From? American Journal of Sociology, 104(5), 1439-1493.
Hafri, Y., C. Djeraba. 2004. Dominos: A New Web Crawler's Design. In Proceedings of the 4th International Web Archiving Workshop (IWAW), Bath, UK.
Hair, J. F., W. C. Black, B. J. Babin, R. E. Anderson, R. L. Tatham. 2006. Multivariate Data Analysis. 6th edition, Prentice Hall.
Hammer J., H. Garcia-Molina, J. Cho, R. Aranha, A. Crespo. 1997. Extracting Semistructured Information from the Web. In Proceedings of the Workshop on Management of Semistructured Data, Tucson, AZ, 18-25.
Harris, Z. 1985. Distributional Structure. In The Philosophy of Linguistics. Katz, J.J., Ed. Oxford University Press, New York, 26-47.
Haveliwala, T.H. 2003. Topic-Sensitive PageRank. IEEE Transactions on Knowledge and Data Engineering, 15(4) 784-796.
Jansen, B.J., A. Spink, J. Bateman, T. Saracevic. 1998. Real Life Information Retrieval: A Study of User Queries on the Web. ACM SIGIR Forum. 32(1) 5-17.
Jansen, B. J., A. Spink, T. Saracevic. 2000. Real Life, Real Users, and Real Needs: A Study and Analysis of User Queries on the Web. Information Processing and Management, 36(2) 207-227.
Jansen, B. J., A. Spink, J. Pedersen. 2005. A Temporal Comparison of AltaVista Web Searching. Journal of the American Society for Information Science and Technology, 56(6) 559-570.
Jansen, B. J., A. Spink. 2005. An Analysis of Web Searching by European AlltheWeb.com Users. Information Processing and Management, 41, 361-381.
Jeh, G., J. Widom. 2003. Scaling Personalized Web Search. In Proceedings of the 12th international conference on World Wide Web, Budapest, Hungary, 271 - 279.
JUNG. 2006. Java Universal Network/Graph Framework (ver. 1.7.4). http://jung.sourceforge.net
Käki, M. 2005. Findex: Search Result Categories Help Users when Document Ranking Fails. In Proceedings of the SIGCHI conference on Human factors in computing systems, Portland, OR, 131-140.
Kessler, M. M. 1963. Bibliographic Coupling between Scientific Papers. American Documentation 24 123-131.
Kleinberg, J. 1999. Authoritative Sources in a Hyperlinked Environment. Journal of ACM, 46(5) 604-632.
Knoblock, C. A., S. Minton, J. L. Ambite, N. Ashish, P. J. Modi, I. Muslea, A. G. Philpot, S. Tejada. 1998. Modeling Web Sources for Information Integration. In Proceedings of the 15th National Conference on Artificial Intelligence, Madison, WI, 211-218.
Kotsiantis, S., D. Kanellopoulos, P. Pintelas. 2006. Handling Imbalanced Datasets: A Review. GESTS International Transactions on Computer Science and Engineering, 30(1).
Lawrence, S., C. L. Giles. 1998. Searching the World Wide Web. Science, 280 98-100.
Kraft, R., F. Maghoul, C. C. Chang. 2005. Y!Q: Contextual Search at the Point of Inspiration. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, Bremen, Germany, 816-823.
Kumar, R., P. Raghavan, S. Rajagopalan, A. Tomkins. 1999. Trawling the Web for Emerging Cyber-Communities. Computer Networks, 31(11-16) 1481-1493.
Lawrence, S. 2000. Context in Web Search. IEEE Data Engineering Bulletin, 23(3) 25-32.
Leroy, G., A. M. Lally, H. Chen. 2003. The Use of Dynamic Contexts to Improve Casual Internet Searching. ACM Transactions on Information Systems, 21(3) 229-253.
Levine, J. H. 1972. The Sphere of Influence. American Sociological Review, 37(1) 14-27.
Liu, B. 2006. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. 1st edition, Springer.
Liu, F., C. Yu, W. Meng. 2004. Personalized Web Search for Improving Retrieval Effectiveness. IEEE Transactions on Knowledge and Data Engineering, 16(1) 28-40.
Lorrain, F., H. C. White. 1971. Structural Equivalence of Individuals in Social Networks. Journal of Mathematical Sociology, 1 49-80.
Maltz, D., K. Ehrlich. 1995. Pointing the way: active collaborative filtering. In Proceedings of the Conference on Computer-Human Interaction, Denver, CO, 202-209.
Menczer, F., G. Pant, P. Srinivasan. 2004. Topical Web Crawlers: Evaluating Adaptive Algorithms. ACM Transactions on Internet Technology, 4(4) 378-419.
Miller, G. A., R. Beckwith, C. Fellbaum, D. Gross, K. J. Miller. 1990. Introduction to WordNet: an On-line Lexical Database. International Journal of Lexicography, 3(4) 235-244.
Mitchell, T. M. 1997. Machine Learning. WCB/McGraw-Hill.
Najork, M., A. Heydon. 2001. High-performance Web Crawling. In Handbook of Massive Data Sets, J. Abello, P. Pardalos, M. Resende, Eds., Kluwer Academic Publishers, 25-45.
Oyama, S., T. Kokubo, T. Ishida. 2004. Domain-Specific Web Search with Keyword Spices. IEEE Transactions on Knowledge and Data Engineering, 16(1) 17-27.
Padmanabhan, B., Z. Zheng, S. Kimbrough. 2006. An Empirical Analysis of the Value of Complete Information for eCRM Models. MIS Quarterly, 30(2) 247-267.
Palmer, J. W., J. P. Bailey, S. Faraj. 2000. The Role of Intermediaries in the Development of Trust on the WWW: The Use and Prominence of Trusted Third Parties and Privacy Statements. Journal of Computer-Mediated Communication, 5(3).
Park, H. W. 2003. Hyperlink Network Analysis: A New Method for the Study of Social Structure on the Web. Connections, 25(1) 49-61.
Pazzani, M., C. Merz, P. Murphy. 1994. Reducing Misclassification Costs. In Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, 217-225.
Pitkow, J., H. Schutze, T. Cass, R. Cooley, D. Turnbull, A. Edmonds, E. Adar, T. Breuel. 2002. Personalized Search. Communications of the ACM, 45(9) 50-55.
Porter, M. 1980. An Algorithm for Suffix Stripping. Program, 14(3) 130-137.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufman, San Mateo, CA.
Richards, W. D., G. A. Barnett (Eds.) 1993. Progress in Communication Science, 12, Ablex Pub. Corp., Norwood, NJ.
Riloff, E., J. Shepherd. 1997. A Corpus-based Approach for Building Semantic Lexicons. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, Providence, RI, 117-124.
Salton, G., M. J. McGill. 1986. Introduction to Modern Information Retrieval, McGraw-Hill, New York.
Salzberg, S. 1997. On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Data Mining and Knowledge Discovery 1 317-327.
Schapire, R. E. 1999. A Brief Introduction to Boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1401-1406.
Scott, J. 2000. Social Network Analysis: A Handbook, 2nd ed., Sage Publications, London.
Sebastiani, F. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1) 1-47.
Sellen, A.J., R. Murphy, K. L. Shaw. 2002. How Knowledge Workers Use the Web. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Changing our World, Changing Ourselves. Minneapolis, MN, 227-234.
Shakes, J., M. Langheinrich, O. Etzioni. 1997. Dynamic Reference Sifting: A Case Study in the Homepage Domain. In Proceedings of the 6th International World Wide Web Conference, Santa Clara, CA, 189-200.
Small, H. 1973. Co-citation in the Scientific Literature: A New Measure of the Relationship between Two Documents. Journal of the American Society for Information Science, 24(4) 265-269.
Shen, D., Z. Chen, Q. Yang, H. Zeng, B. Zhang, Y. Lu, W. Ma. 2004. Web-page Classification through Summarization. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, South Yorkshire, UK, 242-249.
Shen, X., B. Tan, C. X. Zhai. 2005a. Context-Sensitive Information Retrieval Using Implicit Feedback. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, Salvador, Brazil, 43-50.
Shen, X., B. Tan, C. X. Zhai. 2005b. Implicit User Modeling for Personalized Search. In Proceedings of the 14th ACM international conference on Information and knowledge management, Bremen, Germany, 824-831.
Speretta, M., S. Gauch. 2005. Personalizing Search Based on User Search Histories. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Compiegne University of Technology, France, 622-628.
Srinivasan, P., F. Menczer, G. Pant. 2005. A General Evaluation Framework for Topical Crawlers. Information Retrieval, 8(3) 417-447.
Sugiyama, K., K. Hatano, M. Yoshikawa. 2004. Adaptive Web Search Based on User Profile Constructed without Any Effort from Users. In Proceedings of the 13th International Conference on World Wide Web, New York, NY, 675-684.
Sullivan, D. 2000. NPD Search and Portal Site Study. Search Engine Watch: http://searchenginewatch.com/sereport/article.php/2162791.
Tan, A. H. 2002. Personalized Information Management for Web Intelligence. In Proceedings of World Congress on Computational Intelligence, Honolulu, HI, 1045-1050.
Tan, A. H., C. Teo. 1998. Learning User Profiles for Personalized Information Dissemination. In Proceedings of International Joint Conference on Neural Network, Anchorage, AK, 183-188.
Teevan, J., S. T. Dumais, E. Horvitz. 2005. Personalizing Search via Automated Analysis of Interests and Activities. In Proceedings of 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 449-456.
Uzzi, B. 1999. Embeddedness in the Making of Financial Capital: How Social Relations and Networks Benefit Firms Seeking Financing. American Sociological Review, 64 481-505.
Walker, G., B. Kogut, W. Shan. 1997. Social Capital, Structural Holes and the Formation of an Industry Network. Organization Science, 8(2) 109-125.
Wasserman, S., K. Faust. 1994. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, UK.
Weiss, G. M. 2004. Mining with Rarity: A Unifying Framework. SIGKDD Explorations, 6(1) 7-19.
Weiss, G. M., F. Provost. 2003. Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research 19 315-354.
Wen, J.R., J. Y. Nie, H. J. Zhang. 2002. Query Clustering Using User Logs. ACM Transactions on Information Systems, 20(1) 59-81.
Witten, I. H., E. Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed., Morgan Kaufmann, San Francisco.
Xu, J., W.B. Croft. 1996. Query Expansion Using Local and Global Document Analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, 4-11.
Yang, Y., X. Liu. 1999. A Re-Examination of Text Categorization Methods. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, Berkeley, CA, 42-49.
Zaïane, O. R., M. Xin, J. Han. 1998. Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs. In Proceedings of Advances in Digital Libraries, Santa Barbara, CA, 19-29.
Zamir, O., O. Etzioni. 1999. Grouper: A Dynamic Clustering Interface to Web Search Results. Computer Networks: The International Journal of Computer and Telecommunications Networking, 31(11-16) 1361-1374.