
Dissertation Proposal

WEB MINING FOR KNOWLEDGE DISCOVERY

Zhongming Ma

Ph.D. Candidate in Information Systems

School of Accounting and Information Systems

David Eccles School of Business

The University of Utah

Co-chairs

Dr. Gautam Pant and Dr. Olivia Sheng

Committee members

Dr. Paul Hu

Dr. Ellen Riloff

Dr. Wei Gao


TABLE OF CONTENTS

ABSTRACT

1 DISSERTATION PROPOSAL
1.1 Knowledge Discovery on the Web
1.2 Personalized Search
1.3 Business Relationship Discovery
1.4 Overview of Dissertation
1.5 Proposed Plan

DRAFT DISSERTATION PART I PERSONALIZED SEARCH

2 INTRODUCTION AND LITERATURE REVIEW
2.1 Introduction
2.2 Literature Review
2.2.1 Query Expansion
2.2.2 Result Processing
2.2.3 Representing Context Using Taxonomy
2.2.4 Taxonomy of Web Activities
2.2.5 Text Categorization

3 OUR APPROACH
3.1 Step 1: Obtaining an Interest Profile
3.2 Step 2: Generating Category Profiles
3.3 Step 3: Mapping Interests to ODP Categories
3.3.1 Mapping Method 1: Simple Term Match
3.3.2 Mapping Method 2: Most Similar Category Profile
3.3.3 Mapping Method 3: Most Similar Category Profile while Augmenting Interest with Potentially Related Nouns
3.3.4 Mapping Method 4: Most Similar Category Profile while Augmenting Interest with Potentially Related Noun Phrases
3.4 Step 4: Resolving Mapped Categories
3.5 Step 5: Categorizing Search Results
3.6 Implementation

4 EXPERIMENTS
4.1 Studied Domains and Domain Experts
4.2 Professional Interests, Search Tasks, and Query Length
4.2.1 Professional Interests (Interest Profiles)
4.2.2 Search Tasks
4.2.3 Query Length
4.3 Subjects
4.4 Experiment Process

5 EVALUATIONS AND DISCUSSIONS
5.1 Comparing Mean Log Search Time by Query Length
5.2 Comparing Mean Log Search Time for Information Gathering Tasks
5.3 Comparing Mean Log Search Time for Site Finding Tasks
5.4 Comparing Mean Log Search Time for Finding Tasks
5.5 Questionnaire and Hypotheses
5.5.1 Questionnaire
5.5.2 Hypotheses
5.6 Hypothesis Test Based on Questionnaire
5.7 Comparing Indices of Relevant Results
5.8 Discussions

DRAFT DISSERTATION PART II BUSINESS RELATIONSHIP DISCOVERY

6 INTRODUCTION AND LITERATURE REVIEW
6.1 Introduction
6.2 Literature Review

7 NETWORK-BASED ATTRIBUTES AND DATA
7.1 Notation in Directed Graphs
7.2 Notation in Directed, Weighted Graphs
7.2.1 Dyadic and Node Degree-based Attributes
7.2.2 Centrality-based Attributes
7.2.3 Structural Equivalence (SE) based Attributes
7.4 Raw Data
7.5 Preliminary Data Processing
7.6 Node and Link Identification
7.7 Attribute Distributions
7.7.1 Node Indegree Distribution
7.7.2 Link Weight Distribution
7.7.3 Revenue Distribution
7.7.4 Revenue Node Weighted Indegree Distribution

8 PREDICTING COMPANY REVENUE RELATIONS (CRR)
8.1 Measurements of CRR
8.2 Research Questions
8.3 Research Methods
8.3.1 Classification Methods
8.3.2 Discriminant Analysis with Logistic Regression
8.4 Results and Analyses
8.4.1 Positive CRR and Top Links by DWND
8.4.2 Positive CRR by DWND
8.4.3 Predicting CRR
8.4.4 Predicting Top-N Companies by Revenue
8.4.5 Discriminant Variate
8.5 Discussions

9 DISCOVERING COMPETITOR RELATIONSHIPS
9.1 Approach Outline and Research Questions
9.2 Datasets
9.2.1 Dataset I
9.2.2 Datasets II and III
9.3 Examining the Competitor Coverage & Competitor Density of the Intercompany Network
9.3.1 Examining the Competitor Coverage
9.3.2 Examining the Competitor Density
9.4 Competitor Discovery
9.4.1 Evaluation Metrics
9.4.2 Classification Methods for Dataset I
9.4.3 Classification Methods for Dataset II
9.4.4 Classification Performance for Dataset I
9.4.5 Classification Performance for Dataset II
9.4.6 Estimated Overall Classification Performance for Dataset III
9.5 Competitor Extension
9.5.1 Estimating the Coverage of a Gold Standard
9.5.2 Estimating the Extension of Our Approach to a Gold Standard
9.6 Explorations on Competitors vs. Non-Competitor Pairs
9.6.1 Comparing SE Similarities between Competitors and Non-Competitor Pairs
9.6.2 Comparing Annual Revenues between Competitors and Non-Competitor Pairs
9.7 Discussions

10 CONCLUSIONS

REFERENCES


1 DISSERTATION PROPOSAL

1.1 Knowledge Discovery on the Web

Knowledge discovery from databases (KDD) refers to “the non-trivial process of

identifying valid, novel, potentially useful, and ultimately understandable patterns in data”

[Fayyad et al. 1996]. KDD has achieved a broad range of applications including pattern

recognition and predictive analytics in many different areas, such as engineering,

business, and science. Knowledge discovery has two types of goals, verification and

discovery. In general the former goal refers to verifying a user’s hypothesis and the latter

can be further divided into prediction (i.e., predicting unknown or future values) and

description (i.e., presenting identified results such as patterns in a human-understandable

form) [Fayyad et al. 1996].

The Web has become a universal repository with a tremendous amount of data that

can be accessed from anywhere in the world and has experienced continuous growth

both in content and its users. Therefore, the Web presents immense opportunities for

discovering knowledge. However, unlike conventional databases, the data on the Web is

mostly unstructured. This makes knowledge discovery on the Web challenging as

compared to KDD on traditional databases. On the Web, the knowledge discovery

process requires considerable effort on identifying, selecting, and processing data

possibly from multiple sources and in different (often free-form text) formats. Manual

analysis that turns such large volumes of Web data into knowledge is impractical and

thus knowledge discovery on the Web becomes an attempt to address the accentuated


problem of data overload. We adapt the KDD process presented in [Fayyad et al. 1996]

for Web mining and present the process of Web mining for knowledge discovery as

follows.

Figure 1. Process of Web mining for knowledge discovery

Web mining is a step in the KDD process and it aims to analyze data and discover

knowledge from the Web. The Web data includes all kinds of Web documents,

hyperlinks among Web pages, and Web usage logs. Depending on the type of Web data

being mined, Web mining can be broadly divided into three categories: Web content

mining, Web structure mining, and Web usage mining [Srivastava et al. 2000].

Web content mining is the process of discovering knowledge from Web page content

(i.e., often text), and it often uses techniques based on data mining and text mining

[Liu 2006]. Important Web content mining problems include data/information

extraction [e.g. Hammer et al. 1997], Web information integration [e.g. Knoblock et

al. 1998], online opinion extraction, Web search [e.g. Brin and Page 1998],

processing (e.g., clustering or categorizing) search results according to page content

[e.g. Zamir and Etzioni 1999; Dumais and Chen 2001], etc. [Liu 2006].


Web structure mining tries to discover useful information such as importance of

pages from the structure of hyperlinks on the basis of social network analysis (SNA)

techniques and graph theory. Its research topics cover ranking pages [e.g. Brin and

Page 1998; Chakrabarti et al. 1999], finding Web communities [e.g. Gibson et al.

1998], etc.

Web usage mining is the automatic discovery of user access patterns from Web logs

[Cooley et al. 1997]. The identified visit patterns can help in understanding the

overall access patterns and trends for all users [e.g. Zaïane et al. 1998] and allow for

Web site design that is responsive to business goals and customer needs, such as user-

level customization [e.g. Eirinaki and Vazirgiannis 2003].

My dissertation consists of two related topics/parts: personalized search and

business relationship discovery, both of which are in the area of Web mining for

knowledge discovery. The first topic presents and evaluates an automatic personalized

search framework that categorizes search results under a user’s interests in order to

examine how the proposed personalized search approach outperforms non-categorized

and non-personalized baseline systems. This research falls within Web content mining. The

second topic proposes an approach to identifying an intercompany network using

company citations from Web content (more specifically, online news stories) and

discovers business relationships between companies from the network on the basis of

SNA and machine learning techniques. Therefore the second topic covers both Web

content mining and Web structure mining. The main research question we explore is

whether structural attributes derived from the intercompany network, which in turn is

derived from company citations in online news, can identify business relationships. As


shown in Figure 2, at a high level, the first topic connects Web content to people, and the

second uses Web content to discover connections between companies. Thus the two

topics are connected through mining of Web content. However, the two topics generate

different types of knowledge – interest-based personalized search results versus news-

driven inter-company relationships – and hence involve different choices of Web data,

processing, and Web mining. In the next two sections we briefly introduce the two topics.

Figure 2. Process View of the Two Topics of the Dissertation

1.2 Personalized Search

Most search engines, including the most popular ones such as Google and

Yahoo!, ignore users’ search context, such as users’ interests. As a result, the same query

from different users with different information needs retrieves the same search results

displayed in the same way. Hence, they use a “one size fits all” [Lawrence 2000]


approach. We note that currently Google is attempting to address this problem with some

level of voluntary personalization. Personalization techniques that consider users’ context

during search can improve search efficiency [Pitkow et al. 2002]. We propose and

implement an automatic approach to categorizing search results according to a user’s

interests to help users find relevant information more quickly. Our approach is

particularly well suited for a workplace scenario where much of the information about

the professional interests and skills of knowledge workers needed by the proposed system

is available to the employer. Personalizing based on such information within an

organization can be expected to raise fewer privacy concerns than a general-purpose

search engine gathering data on user interests. Moreover, unlike other

approaches, our approach does not impose any burden of implicit or explicit feedback

from the user.

Figure 3. Knowledge Discovery Process for Interest-Based Personalized Search


We customize the general process of Web mining for KDD in Figure 1 and present the

process of interest-based personalized search for knowledge discovery in Figure 3 where

processes covered by the horizontal double-arrow-lines correspond to equivalent ones in

Figure 1. The proposed approach includes a mapping framework that automatically maps

user interests into a group of categories from Open Directory Project (ODP) taxonomy.

A text classifier is built from the content of the mapped ODP categories and later is used

at query-time to categorize search results under user interests. For a workplace scenario

where the employees’ professional interests and skills can be automatically extracted

from their resume or company’s database, this approach is fully automatic in that users

do not need to provide implicit or explicit feedback during the search. Also the use of

ODP is transparent to the users. The lack of explicit or implicit feedback and the use of

ODP taxonomy without a user’s awareness of it differentiates this work from many

others, such as [Gauch et al. 2003, Liu et al. 2004; Chirita et al. 2005]. In addition, we

study three search systems with different interfaces for displaying search results. The first

system (LIST) shows search results in a page-by-page list. The second (CAT) categorizes

and displays results under certain ODP categories. The third (PCAT) is what we propose,

and PCAT categorizes and displays results under user interests. We compare PCAT

with LIST and PCAT with CAT on the basis of different query lengths and different

types of search tasks.

The main contribution of this research is an automatic approach to

personalizing Web search given a set of user interests. The main

findings include (1) PCAT is better than LIST for one-word queries and Information

Gathering tasks, and (2) PCAT outperforms CAT for free-form queries and for both


Information Gathering and Finding types of tasks in terms of the time spent on finding

relevant results. We conclude that no system is universally better than the others;

the performance of a system depends on parameters such as query length and type

of task.

1.3 Business Relationship Discovery

Business news contains rich and current information about companies and the

relationships among them. Reading news is very time-consuming and requires a reader to

possess certain skills, the most basic of which is a good understanding of the language in

which the news is written. The huge volume of news stories makes the manual

identification of relationships among a large number of companies nontrivial and

unscalable. The previous literature using news to automatically discover business

relationships among companies is sparse. Many researchers in areas such as organization

behavior and sociology employ SNA techniques to investigate the nature and

implications of business relationships on the basis of explicitly specified company

relationships provided by reliable data sources [e.g. Levine 1972; Walker et al. 1997;

Uzzi 1999; Gulati and Gargiulo 1999]. In contrast, researchers in bibliometrics and

computer science tend to identify links between nodes using implicit signals, such as

article citations, URL links, and email communications, derived from large and noisy

data sources. They study problems such as identifying importance of individual nodes

(e.g., Web pages, journal articles) in a network [e.g. Garfield 1979; Brin and Page 1998;

Kleinberg 1999] and finding communities on the Web [e.g. Kautz et al. 1997; Gibson et


al. 1998], instead of discovering business relationships between companies. We present

an approach of automatic discovery of company relationships from online business news

using machine learning and SNA techniques. Figure 4 illustrates the knowledge

discovery process for business relationship discovery from Web data (i.e., online news).

Figure 4. Knowledge Discovery Process for Business Relationship Discovery

Given that a news story pertaining to a company often cites one or more other

companies, we construct a directed and weighted intercompany network on the basis of

citations from a large number of online news stories by treating company citations as

directed links from the focal companies to the cited companies. Further we identify four

types of attributes from the network structure using SNA techniques. More specifically

they are dyadic degree based-, node degree based-, node centrality based-, and structural

equivalence based-attributes. Those attributes differ in their coverage of the network.

With those network attributes, we study two types of company relationships using

machine learning methods. This news-driven, SNA-based business relationship discovery


approach is scalable and language-neutral. Research along this line consists of two

studies that differ in their target business relationships and we describe them as follows.

The first one concentrates on predicting a company revenue relation (CRR). Given a pair

of companies, CRR refers to the relative size of two companies’ annual revenues. We

find that degree-based and centrality-based attributes derived from network structure can

predict CRR with reasonable precision, recall, and accuracy (all above 70%) for all

directly linked company pairs in the network.
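
As a rough illustration of how such a classification could be set up, the sketch below trains a logistic regression classifier on network attributes of company pairs using scikit-learn. The feature names, the toy data, and the labels are assumptions made only for this example; they are not the attributes or results reported in the study.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Each row describes one directed company pair (A, B); the columns are invented
# network attributes, e.g. difference in weighted indegree, difference in
# betweenness centrality, and the link weight between A and B.
X = np.array([[ 5.0,  0.02, 3],
              [-2.0, -0.01, 1],
              [ 8.0,  0.05, 6],
              [-4.0, -0.03, 2]])
y = np.array([1, 0, 1, 0])  # 1 if revenue(A) > revenue(B), 0 otherwise

clf = LogisticRegression()
accuracy = cross_val_score(clf, X, y, cv=2, scoring="accuracy")
print(accuracy.mean())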

The contributions of this study are as follows. (1) Our approach can serve as a data filtering

step for studying the revenue relations among a very large number of companies. (2) Since

the revenue information for public companies is available quarterly, our approach can be

used as a prediction tool for revenues. (3) Our approach can be applied to discover the

revenue relations for private or foreign companies as well.

In the second work we study the competitor relationship between companies. We

discover the competitor relationship between a pair of connected companies in the

intercompany network on the basis of the four types of attributes. In particular, we

study the classification of company pairs for an imbalanced dataset in which the number of

competitor pairs is much smaller than the number of non-competitor pairs. To evaluate the

classification performance of our approach, we use two gold standards, Hoovers.com and

Mergentonline.com, professional company profile websites that list manually identified

competitors for each company. Given that neither of the gold standards is

complete in the coverage of competitors, we estimate the coverage of each gold standard.

Finally we present metrics to estimate how much our approach can extend each of the

gold standards.


The contributions of this work include an automatic approach to

discovering competitor relationships between companies. Our approach is particularly

useful as an initial data filtering step to identify a group of potential competitors

for each of many companies. We study an imbalanced dataset problem and report the

classification performance for competitor pairs in both the imbalanced dataset and the

whole dataset. Most importantly, we report the estimated extension of our approach to each

of two gold standards.

1.4 Overview of Dissertation

At a high level the dissertation is organized as follows. Part I, which consists of

Chapters 2 to 5, covers the first topic of the dissertation: Interest-based Personalized

Search. Part II, which includes Chapters 6 to 9, covers the two related studies in business

relationship discovery. More specifically we highlight each chapter as follows.

Chapter 2 introduces the research on personalized search and reviews related prior

work. We detail our approach of personalized search in Chapter 3. Experiments are

covered in Chapter 4 and result analyses and conclusions are discussed in Chapter 5. For

the topic of business relationship discovery, we introduce it and review prior literature in

Chapter 6. Chapter 7 describes how to identify attributes from the network structure and

explains the data and data processing procedures. We concentrate on predicting CRR in

Chapter 8 and discovering competitor relationships in Chapter 9. Finally we conclude the

dissertation in Chapter 10.


1.5 Proposed Plan

The timeline of my dissertation is as follows.

Feb. 13, 2007 Proposal defense

Mar. 16, 2007 Sending dissertation draft to committee members and to Thesis

Office for format approval

Mar. 30, 2007 Update on the dissertation draft

Apr. 3 or 10, 2007 Dissertation defense


DRAFT DISSERTATION PART I INTEREST-BASED PERSONALIZED SEARCH

2 INTRODUCTION AND LITERATURE REVIEW

2.1 Introduction

The Web provides an extremely large and dynamic source of information, and the

continuous creation and updating of Web pages magnifies information overload on the

Web. Both casual and non-casual users (such as knowledge workers) often use search

engines to find a “needle” in this constantly growing “haystack.” Sellen et al. [2002],

who define a knowledge worker as someone “whose paid work involves significant time

spent in gathering, finding, analyzing, creating, producing or archiving information,”

report that 59% of the tasks performed on the Web by a sample of knowledge workers

fall into the categories of Information Gathering and Finding, which require an active use

of Web search engines.

Most existing Web search engines return a list of search results based on a user’s

query but ignore the user’s specific interests and/or search context. Therefore, the

identical query from different users or in different contexts will generate the same set of

results displayed in the same way for all users, a so-called “one size fits all” [Lawrence

2000] approach. Furthermore, the number of search results returned by a search engine is

often so large that the results must be partitioned into multiple result pages. In addition,


individual differences in information needs, polysemy (multiple meanings of the same

word), and synonymy (multiple words with the same meaning) pose problems [Deerwester et

al. 1990], in that a user may have to go through many irrelevant results or try several

queries before finding the desired information. Problems encountered in searching are

exacerbated further when search engine users employ short queries [Jansen et al.

1998]. However, personalization techniques that put a search in the context of the user’s

interests may alleviate some of these issues.

In this study, which focuses on knowledge workers’ search for information online

in a workplace setting, we assume that some information about the knowledge workers,

such as their professional interests and skills, is known to the employing organization and

can be extracted automatically with an information extraction (IE) tool or database

queries. The organization then can use such information as an input to a system based on

our proposed approach and provide knowledge workers with a personalized search tool that

will reduce their search time and boost their productivity.

For a given query, a personalized search can provide different results for

different users or organize the same results differently for each user. It can be

implemented on either the server side (search engine) or the client side (organization’s

intranet or user’s computer). Personalized search implemented on the server side is

computationally expensive when millions of users are using the search engine and also

raises privacy concerns when information about users is stored on the server. A

personalized search on the client side can be achieved by query expansion and/or result

processing [Pitkow et al. 2002]. By adding extra query terms associated with user

interests or search context, the query expansion approach can retrieve different sets of


results. Result processing includes result filtering, such as removing some results,

and reorganizing, such as re-ranking, clustering, and categorizing the results.

Our proposed approach is a form of client-side personalization based on an

interest-to-taxonomy mapping framework and result categorization. It piggybacks on a

standard search engine such as Google1 and categorizes and displays search results on the

basis of known user interests. As a novel feature of our approach, the mapping

framework automatically maps the known user interests onto a set of categories in a Web

directory, such as the Open Directory Project2 (ODP) or Yahoo!3 directory. An advantage

of this mapping framework is that after user interests have been mapped onto the

categories, a large amount of manually edited data under these categories is freely

available to be used to build text classifiers that correspond to these user interests. The

text classifiers then can categorize search results according to the user’s various interests

at query time. The same text classifiers may be used to categorize e-mails and other

digital documents, which suggests our approach may be extended to a broader domain of

content management.

The main research questions that we explore are as follows: (1) What is an

appropriate framework for mapping a user’s professional interests and skills onto a group

of concepts in a taxonomy such as a Web directory? (2) How does a personalized

categorization system (PCAT) based on our proposed approach perform differently from

a list interface system (LIST), similar to a conventional search engine? (3) How does

PCAT perform differently from a non-personalized categorization system (CAT) that

categorizes results without any personalization? The third question attempts to separate

1 http://www.google.com.
2 http://www.dmoz.com.
3 http://www.yahoo.com.


the effect of categorization from the effect of personalization in the proposed system. We

explore the second and third questions along two dimensions: type of task and query

length.

Figure 5 illustrates the input and output of these three systems. LIST requires two

inputs: a search query and a search engine, and its output, similar to what a conventional

search engine provides, is a page-by-page list of search results. Using a large taxonomy

(ODP Web directory), CAT classifies search results and displays them under some

taxonomy categories; in other words, it uses the ODP taxonomy as an additional input.

Finally, PCAT adds another input: a set of user interests. The mapping framework in

PCAT automatically identifies a group of categories from the ODP taxonomy as relevant

to the user’s interests. Using data from these relevant categories, the system generates

text classifiers to categorize search results under the user’s various interests at query time.


Figure 5. Input and output of the three systems

We compare PCAT with LIST and with CAT in two sets of controlled

experiments. Compared with LIST, PCAT works better for searches with short queries

and for Information Gathering tasks. In addition, PCAT outperforms CAT for both

Information Gathering and Finding tasks and for searches with free-form queries.

Subjects indicate that PCAT enables them to identify relevant results and complete given

tasks more quickly and easily than does LIST or CAT.

2.2 Related Literature

This section reviews prior studies pertaining to personalized search. We also

consider several studies using the ODP taxonomy to represent a search context, review

studies on the taxonomy of Web activities, and end by briefly discussing text

categorization.

According to Lawrence [2000], next-generation search engines will increasingly

use context information. Pitkow et al. [2002] also suggest that a contextual computing

approach that enhances user interactions through a greater understanding of the user, the

context, and the applications may prove a breakthrough in personalized search efficiency.

They further identify two primary ways to personalize search: query expansion and result

processing [Pitkow et al. 2002], which can complement each other.


2.2.1 Query Expansion

We use an approach similar to query expansion for finding terms related to user

interests in our interest mapping framework. Query expansion refers to the process of

augmenting a query from a user with other words or phrases to improve search

effectiveness. It originally was applied in information retrieval (IR) to solve the problem

of word mismatch that arises when search engine users employ different terms than those

used by content authors to describe the same concept [Xu and Croft 1996]. Because the

word mismatch problem can be reduced through the use of longer queries, query

expansion may offer a solution [Xu and Croft 1996].

In line with query expansion, current literature provides various definitions of

context. In the Inquirus 2 project [Glover et al. 1999], a user manually chooses a context

in the form of a category, such as “research papers” or “organizational homepages,”

before starting a search. Y!Q4, a large-scale contextual search system, allows a user to

choose a context in the form of a few words or a whole article through three methods: a

novel information widget executed in the user’s Web browser, Yahoo! Toolbar5 or

Yahoo! Messenger6 [Kraft et al. 2005]. In the Watson project, Budzik and Hammond

[2000] derive context information from the whole document a user views. Instead of

using a whole document, Finkelstein et al. [2002] limit the context to the text surrounding

a user-marked query term(s) in the document. That text is part of the whole document, so

their query expansion is based on a local context analysis approach [Xu and Croft 1996].

4 http://yq.search.yahoo.com.
5 http://toolbar.yahoo.com.
6 http://beta.messenger.yahoo.com.


Leroy et al. [2003] define context as the combination of titles and descriptions of clicked

search results after an initial query. In all these studies, queries get expanded on the basis

of the context information, and results are generated according to the expanded queries.

2.2.2 Result Processing

Relatively fewer studies deal with result processing, which includes result

filtering and reorganizing. Domain filtering eliminates documents irrelevant to given

domains from the search results [Oyama et al. 2004]. For example, Ahoy!, a homepage

finder system, uses domain-specific filtering to eliminate most results returned by one or

more search engines but retains a few pages that are likely to be personal homepages

[Shakes et al. 1997]. Tan and Teo [1998] propose a system that filters out news items that

may not be of interest to a given user according to that user’s explicit (e.g., satisfaction

ratings) and implicit (e.g., viewing order, duration) feedback to create personalized news.

Another approach to result processing is to reorganize, which involves re-ranking,

clustering, and categorizing search results. For example, Teevan et al. [2005] construct a

user profile (context) over time with rich resources including issued queries, visited Web

pages, composed or read documents and e-mails. When the user sends a query, the

system re-ranks the search results on the basis of the learned profile. Shen et al. [2005a]

use previous queries and summaries of clicked results in the current session to re-rank

results for a given query. Similarly, UCAIR [Shen et al. 2005b], a client-side

personalized search agent, employs both query expansion on the basis of the immediately

preceding query and result re-ranking on the basis of summaries of viewed results. Other


works also consider re-ranking according to a user profile [Gauch et al. 2003; Sugiyama

et al. 2004; Speretta and Gauch 2005; Chirita et al. 2005; Kraft et al. 2005]. Gauch et al.

[2003] and Sugiyama et al. [2004] learn a user’s profile from his or her browsing history,

whereas Speretta and Gauch [2005] build the profile on the basis of search history, and

Chirita et al. [2005] require the user to specify the profile entries manually.

Scatter/Gather [Cutting et al. 1992] is one of the first systems to present

documents in clusters. Another system, Grouper [Zamir and Etzioni 1999], uses snippets

of search engine results to cluster the results. Tan [2002] presents a user-configurable

clustering approach that clusters search results using titles and snippets of search results

and the user can manually modify these clusters.

Finally, in comparing seven interfaces that display search results, Dumais and

Chen [2001] report that all interfaces that group results into categories are more effective

than conventional interfaces that display results as a list. They also conclude that the best

performance occurs when both category names and individual page titles and summaries

are presented. We closely follow these recommendations for the two categorization

systems we study (PCAT and CAT). In recent work, Käki [2005] also finds that result

categorization is helpful when the search engine fails to provide relevant results at the top

of the list.

2.2.3 Representing Context Using Taxonomy

In our approach we map user interests to categories in the ODP taxonomy. Figure

6 shows a portion of the ODP taxonomy, in which Computers is a depth-one category and


C++ and Java are categories at depth four. We refer to Computers/Programming/

Languages as the parent category of category C++ or Java. Hence various concepts

(categories) are related through a hierarchy in the taxonomy. Currently, the ODP is a

manually edited directory of 4.6 million URLs that have been categorized into 787,774

categories by 68,983 human editors. The ODP taxonomy has been applied to

personalization of Web search in some prior studies [Pitkow et al. 2002; Gauch et al.

2003; Liu et al. 2004; Chirita et al. 2005].

Figure 6. ODP taxonomy

For example, the Outride personalized search system (acquired by Google)

performs both query modification and result processing. It builds a user profile (context)

on the basis of a set of personal favorite links, the user’s last 1000 unique clicks, and the

ODP taxonomy, then modifies queries according to that profile. It also re-ranks search

results on the basis of usage and the user profile. The main focus of the Outride system is


capturing a user’s profile through his or her search and browsing behaviors [Pitkow et al.

2002]. The OBIWAN system [Gauch et al. 2003] automatically learns a user’s interest

profile from his or her browsing history and represents those interests with concepts in

the Magellan taxonomy. It maps each visited Web page into the five taxonomy concepts with the

highest similarities; thus, the user profile consists of accumulated categories generated

over a collection of visited pages. Liu et al. [2004] also build a user profile that consists

of previous search query terms and five words that surround each query term in each

Web page clicked after the query is issued. The user profile then is used to map the user’s

search query onto three depth-two ODP categories. In contrast, Chirita et al. [2005] use a

system in which a user manually selects ODP categories as entries in his or her profile.

When re-ranking search results, they measure the similarity between a search result and

the user profile using the node distance in a taxonomy concept tree, which means the

search result must associate with an ODP category. A difficulty in their study is that

many parameter values have been set without explanation. The current Google

personalized search7 also explicitly asks users to specify their interests through the

Google directory.

Similar to Gauch et al. [2003], we represent user interests with taxonomy

concepts, but we do not need to collect browsing history. Unlike Liu et al. [2004], we do

not need to gather previous search history, such as search queries and clicked pages, or

know the ODP categories corresponding to the clicked pages. Whereas Gauch et al.

[2003] map a visited page onto five ODP categories and Liu et al. [2004] map a search

query onto three categories, we automatically map a user interest onto an ODP category.

A difference between [Chirita et al. 2005] and our approach is that when mapping a

7 http://labs.google.com/personalized.


user’s interest onto a taxonomy concept, we employ text, i.e., page titles and summaries

associated with the concept in the taxonomy, while they use the taxonomy category title and

its position in the concept tree when computing the tree-node distance. Also, in contrast

to UCAIR [Shen et al. 2005b] that uses contextual information in the current session

(short-term context) to personalize search, our approach personalizes search according to

the user’s long-term interests, which may be extracted from his or her resume.

Haveliwala [2002] and Jeh and Widom [2003] extend the PageRank algorithm

[Brin and Page 1998] to generate personalized ranks. Using 16 depth-one categories in

ODP, Haveliwala [2002] computes a set of topic-sensitive PageRank scores. The original

PageRank is a global measure of the query- or topic-insensitive popularity of Web pages,

measured solely by a linkage graph derived from a large part of the Web. Haveliwala’s

experiments indicate that, compared with the original PageRank, a topic-sensitive

PageRank achieves greater precision in top-ten search results. Topic-sensitive PageRank

also can be used for personalization after a user’s interests have been mapped onto

appropriate depth-one categories of the ODP, which can be achieved through our

proposed mapping framework. Jeh and Widom [2003] present a scalable personalized

PageRank method, in which they identify a linear relationship between basis vectors and

the corresponding personalized PageRank vectors. At query time, their method constructs

an approximation to the personalized PageRank vector from the pre-computed basis

vectors.
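
For illustration only, a topic-sensitive variant of PageRank can be computed with networkx by restricting the teleportation (personalization) vector to pages under a chosen topic; the tiny graph and topic set below are invented.

import networkx as nx

G = nx.DiGraph([("p1", "p2"), ("p2", "p3"), ("p3", "p1"), ("p4", "p1")])
topic_pages = {"p1", "p3"}  # pages assumed to belong to one depth-one topic

# Random jumps land only on topic pages, biasing ranks toward that topic.
personalization = {node: (1.0 if node in topic_pages else 0.0) for node in G}
topic_sensitive_rank = nx.pagerank(G, alpha=0.85, personalization=personalization)
global_rank = nx.pagerank(G, alpha=0.85)  # original, topic-insensitive PageRank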

2.2.4 Taxonomy of Web Activities


We study the performance of the three systems (described in Section 1) by

considering different types of Web activities. Sellen et al. [2002] categorize Web

activities into six categories: Finding (locate something specific), Information Gathering

(answer a set of questions; less specific than Finding), Browsing (visit sites without

explicit goals), Transacting (execute a transaction), Communicating (participate in chat

rooms or discussion groups), and Housekeeping (check the accuracy and functionality of

Web resources). As Craswell et al. [2001] define a Site Finding task specifically as "one

where the user wants to find a particular site, and their query names the site," we consider

it a type of Finding task. It should be noted that some Web activities, especially

Information Gathering, can involve several searches. On the basis of the intent behind

Web queries, Broder [2002] classifies Web searches into three classes: Navigational

(reach a particular site), Informational (acquire information from one or more Web

pages), and Transactional (perform some Web-mediated activities). As the taxonomy of

search activities suggested by Sellen et al. [2002] is broader than that by Broder [2002],

in this article we choose to study the two major types of activities studied in [Sellen et al.

2002].

2.2.5 Text Categorization

In our study, CAT and PCAT systems employ text classifiers to categorize search

results. Text categorization (TC) is a supervised learning task that classifies new

documents into a set of predefined categories [Yang and Liu 1999]. As a joint discipline

of machine learning and IR, TC has been studied extensively, and many different


classification algorithms (classifiers) have been introduced and tested, including the

Rocchio method, naïve Bayes, decision tree, neural networks, and support vector

machines [Sebastiani 2002]. A standard information retrieval metric, cosine similarity

[Salton and McGill 1986], computes the cosine of the angle between vector representations of

two text fragments or documents. In TC, a document can be assigned to the category with

the highest similarity score. Due to its simplicity and effectiveness, cosine similarity has

been used by many studies for TC [e.g. Yang and Liu 1999; Sugiyama et al. 2004; Liu et

al. 2004].
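
As a small, self-contained illustration of this classification scheme (with invented category profiles and an invented document), a search result can be assigned to the category whose profile has the highest cosine similarity:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy category profiles; term-frequency vectors are built below.
category_profiles = {
    "Java": "java tutorial sun programming language virtual machine",
    "Finance": "stock bond mutual fund trading market revenue",
}
document = "an introductory java programming tutorial"

names = list(category_profiles)
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(list(category_profiles.values()) + [document])

# Cosine similarity between the document (last row) and each category profile.
similarities = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
print(names[similarities.argmax()])  # -> "Java"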

In summary, to generate user profiles for personalized search, previous studies

have asked users for explicit feedback, such as ratings and preferences, or collected

implicit feedback, such as search and browsing history. However, users are unwilling to

provide explicit feedback, even when they anticipate a long-run benefit [Caroll and

Rosson 1987]. Implicit feedback has shown promising results for personalizing search

using short-term context [Leroy et al. 2003, Shen et al. 2005b]. However, generating user

profiles for long-term context through implicit feedback will take time and may raise

privacy concerns. In addition, a user profile generated from implicit feedback may

contain noise, because the user preferences have been estimated from behaviors and not

explicitly specified. In our approach two user-related inputs: a search query and the user’s

professional interests and skills are explicitly given to a system, so some prior work

[Leroy et al. 2003; Gauch et al. 2003; Liu et al. 2004; Sugiyama et al. 2004; Kraft et al.

2005] that relies on modeling user interests through searching or browsing behavior is not

readily applicable.


3 OUR APPROACH

Our approach begins with the assumption that some user interests are known and

therefore is well suited for a workplace setting in which employees’ resumes often are

maintained in a digital form or information about users’ professional interests and skills

is stored in a database. An IE tool or database queries can extract such information as

input to complement the search query, search engine, and contents of the ODP taxonomy.

However, we do not include such an IE program in this study and instead assume that the

interests have already been given. Our interest-category mapping framework tries to

automatically identify an ODP category associated with each of the given user interests.

Then our system uses URLs organized under those categories as training examples to

classify search results into various user interests at query time. We expect the result

categorization to help the user quickly focus on results of interest and decrease total time

spent in searching. The result categorization also may lead to the discovery of

serendipitous connections between the concepts being searched and the user’s other

interests. This form of personalization therefore should reduce search effort and possibly

provide interesting and useful resources the user would not notice otherwise. We focus on

work-related search performance, but our approach could be easily extended to include

personal interests as well. We illustrate a process view of our proposed approach in

Figure 7 and present our approach in five steps. Steps 3 and 4 cover the mapping

framework.


Figure 7. Process view of proposed approach

3.1 Step 1: Obtaining an Interest Profile

Step 1 (Figure 7) pertains to how the interests can be extracted from a resume.

Our study assumes that user interests are available to our personalized search system in

the form of a set of words and phrases, which we call the user’s interest profile.

3.2 Step 2: Generating Category Profiles

As we explained previously, ODP is a manually edited Web directory with

millions of URLs placed under different categories. Each ODP category contains URLs

that point to external Web pages that human editors consider relevant to the category.

Those URLs are accompanied by manually composed titles and summaries that we

believe accurately represent the corresponding Web page content. The category profile of

an ODP category thus is built by concatenating the titles and summaries of the URLs


listed under the category. The constructed category profiles provide a solution to the

cold-start problem, which arises from the difficulty of creating a profile for a new user

from scratch [Maltz and Ehrlich 1995] and later serve to categorize the search results.

Gauch et al. [2003], Menczer et al. [2004], and Srinivasan et al. [2005] use similar

concatenation to build topic profiles. In our study, we combine up to 30 pairs of manually

composed titles and summaries of URL links under an ODP category as the category

profile.8 In support of this approach, Shen et al. [2004] report that classification using

manually composed summaries in the LookSmart Web directory achieves higher

accuracy than the use of the content of Web pages. For building the category profile, we

pick the first 30 URLs based on the sequence in which they are provided by ODP. We

note that ODP can have more than 30 URLs listed under a category. In order to use

a similar amount of information for creating profiles for different ODP categories, we only

use the titles and summaries of the first 30 URLs. When generating profiles for categories

in Magellan taxonomy, Gauch et al. [2003] show that a number of documents between 5

and 60 provide reasonably accurate classification.

At depth one, ODP contains 17 categories (for a depth-one category, Computers,

see Figure 6). We select five of these (Business, Computers, Games, Reference, and

Science) that are likely to be relevant to our subjects and their interests. These five broad

categories comprise a total of 8,257 categories between depths one and four. We generate

category profiles by removing stop words and applying the Porter stemmer9 [Porter

1980]. We also filter out any terms that appear only once in a profile to avoid noise and

remove any profiles that contain fewer than two terms. Finally, the category profile is

8 A category profile does not include titles or summaries of its child (subcategory) URLs.
9 http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/porter.java.


represented as a term vector [Salton and McGill, 1986] with term frequencies (tf) as

weights. Shen et al. [2004] also use a tf-based weighting scheme over manually

composed summaries in the LookSmart Web directory to represent a Web page.
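
A minimal sketch of this profile-construction step is given below; it assumes the ODP data for one category is already available as a list of (title, summary) pairs and uses NLTK's stop-word list and Porter stemmer (both are assumptions about tooling, not the study's actual implementation).

import re
from collections import Counter
from nltk.corpus import stopwords       # assumes the NLTK stop-word data is installed
from nltk.stem import PorterStemmer

def build_category_profile(url_entries, max_urls=30, min_tf=2):
    """url_entries: list of (title, summary) pairs listed under one ODP category."""
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    text = " ".join(title + " " + summary for title, summary in url_entries[:max_urls]).lower()
    terms = [stemmer.stem(w) for w in re.findall(r"[a-z]+", text) if w not in stop]
    tf = Counter(terms)
    # drop terms that appear only once in the profile to reduce noise
    return {term: count for term, count in tf.items() if count >= min_tf}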

3.3 Step 3: Mapping Interests to ODP Categories

Next, we need a framework to map a user’s interests onto appropriate ODP

categories. The framework then can identify category profiles for building text classifiers

that correspond to the user’s interests. Some prior studies [Pitkow et al. 2002; Liu et al.

2004] and the existing Google personalized search use the few hundred ODP

categories up to depth two, but for our study, categories up to depth two may

lack sufficient specificity. For example, Programming, a depth-two category, is too broad

to map a user interest in specific programming languages such as C++, Java, or Perl.

Therefore, we map user interests to ODP categories up to depth four. As we mentioned in

Step 2, a total of 8,257 such categories can be used for interest mapping. We employ four

different mapping methods to evaluate the mapping performance by testing and

comparing them individually, as well as in different combinations. When generating an

output category, a mapping method includes the parent category of the mapped category;

for example, if the mapped category is C++, the output will be Computers/Programming/

Languages/C++.


3.3.1 Mapping Method 1 (m1-category-label): Simple Term Match

The first method uses a string comparison to find a match between an interest and

the label of the category in ODP. If an interest is the same as a category label, the

category is considered a match to the interest. Plural forms of terms are transformed to

their singular forms by a software tool from the National Library of Medicine.10

Therefore, the interest "search engine" is matched with the ODP category "Search

Engines,” and the output category is “Computers/Internet/Searching/Search Engines.”
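
A simplified sketch of this label-matching step follows; the plural normalization here is a naive stand-in for the National Library of Medicine tool mentioned above.

def normalize(term):
    term = term.lower().strip()
    # naive singularization: strip a trailing "s" (stand-in for the NLM tool)
    return term[:-1] if term.endswith("s") and not term.endswith("ss") else term

def m1_category_label(interest, category_paths):
    """category_paths: full ODP paths, e.g. 'Computers/Internet/Searching/Search Engines'."""
    target = normalize(interest)
    return [path for path in category_paths
            if normalize(path.split("/")[-1]) == target]

print(m1_category_label("search engine",
                        ["Computers/Internet/Searching/Search Engines"]))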

3.3.2 Mapping Method 2 (m2-category-profile): Most Similar Category Profile

The cosine similarities between an interest and each of the category profiles are

computed, and the ODP category with the highest similarity is selected as the

output.
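
A sketch of this mapping method is shown below; it assumes the interest has been turned into a term-frequency vector in the same way as the category profiles from Step 2.

import math

def cosine(a, b):
    """Cosine similarity between two term-frequency dictionaries."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def m2_category_profile(interest_vector, category_profiles):
    """category_profiles: dict mapping an ODP category path to its tf vector (Step 2)."""
    return max(category_profiles, key=lambda c: cosine(interest_vector, category_profiles[c]))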

3.3.3 Mapping Method 3 (m3-category-profile-noun): Most Similar Category Profile

While Augmenting Interest With Potentially Related Nouns

The m1-category-label and m2-category-profile will fail if the category labels and

profiles do not contain any of the words that form a given interest, so it may be

worthwhile to augment the interest concept by adding a few semantically similar or

related terms. According to [Harris 1985], terms in a language do not occur arbitrarily but

appear at a certain position relative to other terms. On the basis of the concept of co-

occurrence, Riloff and Shepherd [1997] present a corpus-based bootstrapping algorithm

10 http://umlslex.nlm.nih.gov/nlsRepository/nlp/doc/userDoc/index.html.


that starts with a few given seed words that belong to a specific domain and discovers

more domain-specific, semantically related lexicons from a corpus. Similarly to query

expansion, it is desirable to augment the original interest with a few semantically similar

or related terms.

For m3-category-profile-noun, one of our programs conducts a search on Google

using an interest as a search query and finds the N nouns that most frequently co-occur in

the top ten search results (page titles and snippets). We find co-occurring nouns because

most terms in interest profiles are nouns (for terms from some sample user interests, see

Table 1). Terms semantically similar or related to those of the original interest thus can

be obtained without having to ask a user for input such as feedback or a corpus. A noun is

identified by looking up the word in a lexical reference system,11 WordNet [Miller et al.

1990], to determine whether the word has the part-of-speech tag of noun. The similarities

between a concatenated text (a combination of the interest and N most frequently co-

occurring nouns) and each of the category profiles then are computed to determine the

category with the highest similarity as the output of this method.
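
The noun-selection part of this method can be sketched as follows, using WordNet through NLTK to test whether a candidate term can be a noun; the snippets below are invented, and fetching the actual Google results is outside the sketch.

import re
from collections import Counter
from nltk.corpus import wordnet   # assumes the NLTK WordNet data is installed

def frequent_nouns(snippets, n=2):
    """Return the n most frequent terms in the snippets that WordNet knows as nouns."""
    words = re.findall(r"[a-z]+", " ".join(snippets).lower())
    counts = Counter(words)
    nouns = [w for w, _ in counts.most_common() if wordnet.synsets(w, pos=wordnet.NOUN)]
    return nouns[:n]

snippets = ["Java tutorial from Sun", "Sun's Java language tutorial and examples"]
print(frequent_nouns(snippets))   # e.g. ['java', 'tutorial']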

3.3.4 Mapping Method 4 (m4-category-profile-np): Most Similar Category Profile

While Augmenting Interest With Potentially Related Noun Phrases

Although similar to m3-category-profile-noun, this method finds the M most

frequently co-occurring noun phrases on the first result page from up to ten search

results. We developed a shallow parser program to parse sentences in the search results

into NPs (noun phrases), VPs (verb phrases), and PPs (prepositional phrases), where an NP

11 http://wordnet.princeton.edu/.


can appear in different forms, such as a single noun, a concatenation of multiple nouns,

an article followed by a noun, or any number of adjectives followed by a noun.
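
For illustration, a rough approximation of such noun-phrase extraction can be obtained with NLTK's part-of-speech tagger and a simple chunk grammar (optional article, adjectives, then nouns); the study's own shallow parser is not reproduced here.

import nltk   # assumes the punkt and tagger data packages are installed

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # optional article, adjectives, then one or more nouns
chunker = nltk.RegexpParser(grammar)

def noun_phrases(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "NP"]

print(noun_phrases("The shallow parser extracts noun phrases from search result snippets"))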

Table 1 lists some examples of frequently co-occurring nouns and NPs identified by m3-

category-profile-noun and m4-category-profile-np. Certain single-noun NPs generated by

m4-category-profile-np differ from individual nouns identified by m3-category-profile-

noun because a noun identified by m3-category-profile-noun may combine with other terms

to form a phrase in m4-category-profile-np and therefore not appear by itself in the result

generated by m4-category-profile-np.

Table 1.

Frequently Co-occurring Nouns and NPs

Domain   | Interest                    | Two co-occurring nouns | Co-occurring NP
Computer | C++                         | programme, resource    | general c
Computer | IBM DB2                     | database, software     | database
Computer | Java                        | tutorial, sun          | sun
Computer | Machine Learning            | information, game      | ai topic
Computer | Natural Language Processing | intelligence, speech   | intelligence
Computer | Object Oriented Programming | concept, link          | data
Computer | Text Mining                 | information, data      | text mine tool
Computer | UML                         | model, tool            | acceptance *
Computer | Web Site Design             | html, development      | library resource web development
Finance  | Bonds                       | saving, rate           | saving bond
Finance  | Day Trading                 | resource, article      | book
Finance  | Derivatives                 | trade, international   | gold
Finance  | Mutual Funds                | news, stock            | account
Finance  | Offshore Banking            | company, formation     | bank account
Finance  | Risk Management             | open, source *         | software risk evaluation *
Finance  | Stocks Exchange             | trade, information     | official site
Finance  | Technical Analysis          | market, chart          | market pullback
Finance  | Trading Cost                | service, cap           | product

* Some co-occurring nouns or NPs may not be semantically similar or related.

3.4 Step 4: Resolving Mapped Categories

For a given interest, each mapping method in step 3 may generate a different

mapped ODP category, and m1-category-label may generate multiple ODP categories for

the same interest because the same category label sometimes is repeated in the ODP

taxonomy. For example, the category “Databases” appears in several different places in

the hierarchy of the taxonomy, such as “Computers/Programming/Databases” and

“Computers/Programming/Internet/Databases.”

Using 56 professional interests in the computer domain, which were manually

extracted from several resumes of professionals collected from ODP (eight interests are

shown in the first column of Table 1), Table 2 compares the performance of each

individual mapping method. After verification by a domain expert, m1-category-label

generated mapped categories for 29 of 56 interests, and only two did not contain the right

category. We note that m1-category-label has much higher precision than the other three

methods, but it generates the fewest mapped interests. Machine learning research [e.g.,

Dietterich 1997] has shown that an ensemble of classifiers can outperform each classifier

in that ensemble. Since the mapping methods can be viewed as classification techniques

that classify interests into ODP categories, a combination of the mapping methods may

outperform any one method.


Table 2.

Individual Mapping Method Comparison (Based on 56 Computer Interests)

Mapping method                         | m1    | m2    | m3    | m4
Number of correctly mapped interests   | 27    | 29    | 25    | 19
Number of incorrectly mapped interests | 2     | 25    | 30    | 36
Number of total mapped interests       | 29    | 54    | 55    | 55
Precision                              | 93.0% | 53.7% | 45.5% | 34.5%
Recall                                 | 48.2% | 51.8% | 44.6% | 33.9%
F1                                     | 63.5% | 52.7% | 45.0% | 34.2%
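The parentheses after “Precision” and “Recall” in Table 2 originally held formulas that were not reproduced in this draft; the standard definitions below are consistent with the reported figures (for example, for m1, 27/29 ≈ 93.0%, 27/56 ≈ 48.2%, and the harmonic mean of these two values ≈ 63.5%).

```latex
\mathrm{Precision} = \frac{\text{\# correctly mapped interests}}{\text{\# total mapped interests}},\qquad
\mathrm{Recall} = \frac{\text{\# correctly mapped interests}}{56},\qquad
F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}
```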

Figure 8 lists the detailed pseudo-code of the procedure used to automatically resolve a final set of categories for an interest profile with the four mapping methods. M1 represents the set of mapped categories generated by m1-category-label, and M2, M3, and M4 represent the sets generated by the other three methods. Because of its high precision, we prioritize the categories

generated by m1-category-label as shown in step (2); if a category generated by m1-

category-label is the same as or a parent category of a category generated by any other

method, we include the category generated by m1-category-label in the list of final

resolved categories. Because m1-category-label uses an “exact match” strategy, it does

not always generate a category for a given interest. In step (3), if methods m2-category-

profile, m3-category-profile-noun, and m4-category-profile-np generate the same mapped

category, we select that category, irrespective of whether m1-category-label generates

one. Steps (2) and (3) attempt to produce a category for an interest by considering

overlapping categories from different methods. If no such overlap is found, we look for

overlapping categories generated for different interests in step (6), because if more than

one interest is mapped to the same category, that category is likely to represent a genuine user interest. In step (8), we


try to represent all remaining categories at a depth of three or less by truncating each depth-four category, thereby hoping to find overlapping categories through the parent categories. Step (9) is similar to step (5), except that all remaining categories are at a

depth of three or less.

(1) For each interest i in the interest profile
        Given i, the four mapping methods generate M1, M2, M3, and M4
    (2) For each category c in M1
            If c is the same as or a parent of a category in M2, M3, or M4,
            add c to the list of final categories, then go to step (1)
        End For
    (3) If M2, M3, and M4 contain the same category c, add c into the list of
        final categories, then go to step (1)
    (4) Put any category c in M1, M2, M3, and M4 into a list of candidate categories12
    End For
(5) For each category c in candidate categories
        Count the frequency for c
    End For
(6) For each depth-four category c in candidate categories
        If frequency of c >= threshold, add c into final categories.
        (We chose the threshold equal to the number of mapping methods – 1. The
        threshold was three in our tests because we used four mapping methods. A
        frequency of three or larger means there is an overlap of a candidate
        category between at least two different interests; we then choose the
        overlapping candidate category to represent these interests.)
    End For
(7) Remove all candidate categories for the interests mapped in step (6)
(8) Resolve all remaining categories of depth four into depth three by truncating
    the category at depth four. For example, after truncating to depth three from
    depth four, reference/knowledge management/publications/articles is resolved
    as reference/knowledge management/publications
(9) For each category c in candidate categories
        Count the frequency for c
    End For
(10) For each depth-three category c in candidate categories
         If frequency of c >= threshold, add c into final categories
     End For

Figure 8. Category resolving procedures

12 Candidate categories cannot be used as final resolved categories unless the frequency of a candidate category is greater than or equal to the threshold in step (6).
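To make the control flow concrete, the Python sketch below implements a simplified version of the procedure in Figure 8. It is our own rendering: it assumes each mapping method returns a set of slash-separated ODP category paths, and it omits some bookkeeping (for example, step (7) tracks candidates per interest in the original procedure, whereas the sketch simply drops candidates that already appear in the final set).

```python
# Simplified sketch of the category-resolving procedure in Figure 8.
from collections import Counter

def resolve_categories(interest_profile, mapping_methods, threshold=3):
    """interest_profile: list of interest strings.
    mapping_methods: the four mapping functions (m1..m4), each returning a set
    of ODP category paths (possibly empty) for a given interest."""
    final, candidates = set(), []

    def same_or_parent(parent, other):
        return other == parent or other.startswith(parent + "/")

    for interest in interest_profile:                      # step (1)
        m1, m2, m3, m4 = (method(interest) for method in mapping_methods)
        chosen = None
        for c in m1:                                       # step (2)
            if any(same_or_parent(c, o) for o in (m2 | m3 | m4)):
                chosen = c
                break
        if chosen is None and (m2 & m3 & m4):              # step (3)
            chosen = next(iter(m2 & m3 & m4))
        if chosen is not None:
            final.add(chosen)
        else:                                              # step (4)
            candidates.extend(m1 | m2 | m3 | m4)

    # Steps (5)-(6): keep depth-four candidates shared by enough interests.
    for c, freq in Counter(candidates).items():
        if c.count("/") == 3 and freq >= threshold:
            final.add(c)

    # Step (7), simplified: drop candidates already covered by the final set.
    candidates = [c for c in candidates if c not in final]

    # Steps (8)-(10): truncate remaining categories to depth three and retry.
    truncated = ["/".join(c.split("/")[:3]) for c in candidates]
    for c, freq in Counter(truncated).items():
        if freq >= threshold:
            final.add(c)
    return final
```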


To determine appropriate values for N (number of nouns) and M (number of NPs)

for m3-category-profile-noun and m4-category-profile-np, we tested different

combinations of values ranging from 1 to 3 with the 56 computer interests. According to

the number of correctly mapped interests, choosing the two most frequently co-occurring

nouns and one most frequently co-occurring NP offers the best mapping result (see Table

1 for some examples of identified nouns and NPs). With the 56 interests, Table 3

compares the number of correctly mapped interests when different mapping methods are

combined. Using all four mapping methods provides the best results; 39 of the 56

interests were correctly mapped onto ODP categories. The resolving procedures in Figure

8 are thus based on all four mapping methods. When using only three methods, we adjusted the

procedures accordingly, such as setting the thresholds in steps (6) and (10) to two instead

of three.

Table 3.

Comparison of Combined Mapping Methods

Combination of mapping methods       | m1+m2+m3 | m1+m2+m4 | m1+m3+m4 | m1+m2+m3+m4
Number of correctly mapped interests | 34       | 35       | 32       | 39
Precision*                           | 60.7%    | 62.5%    | 57.1%    | 69.6%

* Recall and F1 were the same as precision because the number of mapped interests was 56.


Table 4 lists mapped and resolved categories for some interests in the computer and finance domains.

Table 4.

Resolved Categories

Domain   | Interest                    | ODP category
Computer | C++                         | computers/programming/languages/c++
Computer | IBM DB2                     | computers/software/databases/ibm db2
Computer | Java                        | computers/programming/languages/java
Computer | Machine Learning            | computers/artificial intelligence/machine learning
Computer | Natural Language Processing | computers/artificial intelligence/natural language
Computer | Object Oriented Programming | computers/software/object-oriented
Computer | Text Mining                 | reference/knowledge management/knowledge discovery/text mining
Computer | UML                         | computers/software/data administration *
Computer | Web Site Design             | computers/internet/web design and development
Finance  | Bonds                       | business/investing/stocks and bonds/bonds
Finance  | Day Trading                 | business/investing/day trading
Finance  | Derivatives                 | business/investing/derivatives
Finance  | Mutual Funds                | business/investing/mutual funds
Finance  | Offshore Banking            | business/financial services/offshore services
Finance  | Risk Management             | business/management/software *
Finance  | Stocks Exchange             | business/investing/stocks and bonds/exchanges
Finance  | Technical Analysis          | business/investing/research and analysis/technical analysis
Finance  | Trading Cost                | business/investing/derivatives/brokerages

* Because the mapping and resolving steps are automatic, some resolved categories are erroneous.


After the automatic resolving procedures, mapped categories for some interests

may not be resolved because different mapping methods generate different categories.

Unresolved interests can be handled by having the user manually map them onto the ODP

taxonomy. An alternative approach could use an unresolved user interest as a query to a

search engine (in a manner similar to m3-category-profile-noun and m4-category-profile-

np), then combine the search results, such as page titles and snippets, to compose an ad

hoc category profile for the interest. Such a profile could flexibly represent any interest

and avoid the limitation of a taxonomy, namely that it contains only a finite set of categories. It would

be worthwhile to examine the effectiveness of such ad hoc category profiles in a future

study. In this article, user interests are fully mapped and resolved to ODP categories.

These four steps are performed just once for each user, possibly during a software

installation phase, unless the user’s interest profile changes. To reflect such a change in

interests, our system can automatically update the mapping periodically or allow a user to

request an update from the system. As shown in Figure 7, the first four steps can be

performed on a client-side server, such as a machine on the organization’s intranet, and

the category profiles can be shared by each user’s machine.

Finally, user interests, even long-term professional ones, are dynamic in nature. In

the future, we will explore more techniques to learn about and fine-tune interest mapping

and handle the dynamics of user interests.

3.5 Step 5: Categorizing Search Results


When a user submits a query, our system obtains search results from Google and

downloads the content of up to the top 50 results, which correspond to the first five result

pages. The average number of result pages viewed by a typical user for a query is 2.35

[Jansen et al. 2000], and a more recent study [Jansen et al. 2005] reports that about 85-

92% of users view no more than two result pages. Hence, our system covers

approximately double the number of results normally viewed by a search engine user. On

the basis of page content, the system categorizes the results into various user interests. In

PCAT, we employ a user’s original interests as class labels, rather than the ODP category

labels, because the mapped and resolved ODP categories are associated with user

interests. Therefore, the use of ODP (or any other Web directory) is transparent to the

user. A Web page that corresponds to a search result is categorized by (1) computing the

cosine similarity between the page content and each of the category profiles of the

mapped and resolved ODP categories that correspond to user interests and (2) assigning

the page to the category with the maximum similarity if the similarity is greater than a

threshold. If a search result does not fall into any of the resolved user interests, it is

assigned to the “Other” category.
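A minimal sketch of this categorization step is shown below. It assumes TF-IDF weighting and uses the 0.1 similarity threshold reported in Section 3.6; the function names and data structures are illustrative assumptions, and the stop-word removal and stemming described in Section 3.6 are only partially reproduced (no stemming here).

```python
# Sketch of step 5: assign each result page to the user interest whose mapped
# category profile is most similar, or to "Other" below the threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def categorize_results(pages, interest_profiles, threshold=0.1):
    """pages: dict url -> downloaded page text.
    interest_profiles: dict user-interest label -> text of the category profile
    of its mapped and resolved ODP category.
    Returns dict url -> interest label (or "Other")."""
    interests = list(interest_profiles)
    vectorizer = TfidfVectorizer(stop_words="english")
    vectorizer.fit(list(interest_profiles.values()) + list(pages.values()))
    profile_matrix = vectorizer.transform([interest_profiles[i] for i in interests])

    assignments = {}
    for url, text in pages.items():
        similarities = cosine_similarity(vectorizer.transform([text]), profile_matrix)[0]
        best = similarities.argmax()
        assignments[url] = interests[best] if similarities[best] > threshold else "Other"
    return assignments
```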

The focus of our study is to explore the use of PCAT, an implementation based on

the proposed approach, and compare it with LIST and CAT. With regard to interest

mapping and result categorization (classification problems), we choose the simple and

effective cosine similarity instead of comparing different classification algorithms and

selecting the best one.


3.6 Implementation

We developed three search systems13 with different interfaces to display search

results, and the online searching portion was implemented as a wrapper on the Google search

engine using the Google Web API.14 Although the current implementation of our

approach uses a single search engine (Google), following the metasearch approach

[Dreilinger and Howe 1997], it can be extended to handle results from multiple engines.

Because Google has become the most popular search engine15, we use Google’s search

results to feed the three systems. That is, the systems have the same set of search results

for the same query; recall that LIST can be considered very similar to Google. For

simplicity, we limit the search results in each system to Web pages in HTML format. In

addition, for a given query, each of the systems retrieves up to 50 search results.

PCAT and CAT download the contents of Web pages that correspond to search results

and categorize them according to user interests and ODP categories, respectively. For

faster processing, the systems use multithreading for simultaneous HTTP connections

and download up to 10KB of text for each page. It took our program about five seconds

to fetch 50 pages. We note that our page-fetching program is not an industry-strength module, and much better concurrent download speeds have been reported in other works

[Hafri and Djeraba 2004, Najork and Heydon 2001]. Hence, we feel that our page-

fetching time can be greatly reduced in a production implementation. After fetching the

pages, the systems remove stop words and perform word stemming before computing the

cosine similarity between each page content and a category profile. Each Web page is

13 In experiments, we named the systems A, B, or C; in this article, we call them PCAT, LIST, or CAT, respectively.
14 http://www.google.com/apis/.
15 http://www.comscore.com/press/release.asp?press=873.


assigned to the category (and its associated interest for PCAT) with the greatest cosine

similarity. However, if the similarity is not greater than a similarity threshold, the page is

assigned to the “Other” category. We determined the similarity threshold by testing query

terms from “irrelevant” domains (not relevant to any of the user’s interests). For example,

given that our user interests are related to computer and finance, we tested ten irrelevant

queries, such as “NFL,” “Seinfeld,” “allergy,” and “golden retriever.” For these irrelevant

queries, when we set the threshold at 0.1, at least 90% (often 96% or higher) of retrieved

results were categorized under the “Other” category. Thus we chose 0.1 as our similarity

threshold. The time for classifying results according to user interests in PCAT is

negligible (tens of milliseconds). However, the time for CAT is three orders of magnitude greater than that for PCAT because the number of potential categories for CAT is 8,547, whereas the number of interests in PCAT is fewer than eight.
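As a rough illustration of the multithreaded page fetching described earlier in this section, the Python sketch below fetches up to 50 result URLs concurrently and keeps at most 10KB of text per page. The original module was a custom program (not necessarily Python), and the worker count and timeout below are our assumptions.

```python
# Minimal sketch of concurrent page fetching with a 10KB cap per page.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

MAX_BYTES = 10 * 1024  # up to 10KB of text per page

def fetch_page(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return url, response.read(MAX_BYTES).decode("utf-8", errors="ignore")
    except Exception:
        return url, ""  # failed downloads are simply left empty

def fetch_results(urls, workers=10):
    """Download up to 50 search-result pages concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch_page, urls[:50]))
```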

Figure 9 displays a sample output from PCAT for the query “regular expression.”

Once a user logs in with his or her unique identification, PCAT displays a list of the

user’s interests on top of the GUI. After a query is issued, search results are categorized

into various interests and displayed in the result area, as shown in Figure 9. A number

next to the interest indicates how many search results are classified under that interest; if

there is no classified search result, the interest will not be displayed in the result area.

Under each interest (category), PCAT (CAT) shows no more than three results on the

main page. If more than three results occur under an interest or category, a “More” link

appears next to the number of results. (In Figure 9, there is a “More” link for the interest

of “Java.”) Upon clicking this link, the user sees all of the results under that interest in a

new window, as shown in Figure 10.


Figure 9. Sample output of PCAT. Category titles are user interests mapped and

resolved to ODP categories


Figure 10. “More” window to show all of the results under the interest “Java”


Figure 11. Sample output of LIST

Figure 11 displays a sample output of LIST for the same query “regular

expression” and shows all search results in the result area as a page-by-page list. Clicking

a page number causes a result page, with up to ten results, to appear in the result area of

the same window. For the search task in Figure 11, the first relevant document is shown

as the sixth result on page 2 in LIST.


Figure 12 displays a sample output for CAT, in which the category labels in the

result area are ODP category names sorted alphabetically, such that output categories

under “business” are displayed before those under “computers.”

Figure 12. Sample output of CAT. Category labels are ODP category titles

We now describe some of the features of the implemented systems that would not

appear in a production system but are meant only for experimental use. We predefined a set of search tasks that the subjects used during the experiments; each task specified what information needed to be found and how many Web pages were required (Section 4.2.2

describes the search tasks in more detail.) Each search result consists of a page title,

snippet, URL, and a link called “relevant”16 next to the title. Except for the “relevant”

link, the items are the same as those found in typical search engines. A subject can click

the hyperlinked page title to open the page in a regular Web browser, such as Internet

Explorer. The subject determines whether a result is relevant to a search task by looking

at the page title, snippet, URL, and/or the content of the page.

Many of our search tasks require subjects to find one relevant Web page for a

task, but some require two. In Figure 9, the task requires finding two Web pages, which is

also indicated by the number “2” at the end of the task description. Once the user finds

enough relevant pages, he or she can click the “Next” button to proceed to the next task;

clicking on “Next” before enough relevant page(s) have been found prompts a warning

message, which allows the user to either give up or continue the current search task.

We record search time, or the time spent on a task, as the difference between the time that

the search results appear in the result area and the time that the user finds the required

number of relevant result(s).

4 EXPERIMENTS

We conducted two sets of controlled experiments to examine the effects of

personalization and categorization. In experiment I we compare PCAT with LIST, that is,

a personalized system that uses categorization versus a system similar to a typical search

16 When a user clicks on the “relevant” link, the corresponding search result is treated as the answer or solution for the current search task. This clicked result is considered as relevant, and is not necessarily the most relevant among all search results.


engine. Experiment II compares PCAT with CAT in order to study the difference

between personalization and non-personalization, given that categorization is common to

both systems. These experiments were designed to examine whether subjects’ mean log

search time17 for different types of search tasks and query lengths varied between the

compared systems. The metric evaluates the efficiency of each system, because all three

systems return the same set of search results for the same query. Before experiment I, we

conducted a preliminary experiment comparing PCAT and LIST with several subjects

who did not later participate in either experiment I or II. The preliminary experiment

helped us make decisions relating to experiment and system design. Next, we introduce

our experiments I and II in detail.

4.1 Studied Domains and Domain Experts

Because we were interested in personalizing search according to a user’s

professional interests, we chose two representative professional domains, computer and

finance, that appear largely disjoint.

For the computer domain, two of the authors, who are researchers in the area of

information systems, served as the domain experts. Both experts also have industrial experience related to computer science. For the finance domain, one expert has a

doctoral degree and the other has a master’s degree in finance.

17 Mean log search time is the average log-transformed search time for a task across a group of subjects using the same system. We transformed the original search times (measured in seconds) with base 2 log to make the log search times closer to a normal distribution. In addition, taking the average makes the mean log search times more normally distributed.
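Written out in our own notation, the metric defined in footnote 17 is

```latex
\overline{L}_{t} \;=\; \frac{1}{|G|} \sum_{u \in G} \log_2\!\left( \text{search time, in seconds, of subject } u \text{ on task } t \right)
```

where G is the group of subjects who used a given system for task t.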


4.2 Professional Interests, Search Tasks, and Query Length

4.2.1 Professional Interests (Interest Profiles)

For each domain, the two domain experts manually chose several interests and

skills that could be considered fundamental, which enables us to form a generic interest

profile that would be shared by all subjects within the domain. Moreover, the

fundamental nature of these interests allows us to recruit more subjects, leading to greater

statistical significance in our results. By defining some fundamental skills in the

computer domain, such as programming language, operating system, database, and

applications, the two computer domain experts identified six professional interests:

algorithms, artificial intelligence, C++, Java, Oracle, and Unix. Similarly, the two finance

experts provided seven fundamental professional interests: bonds, corporate finance, day

trading, derivatives, investment banking, mutual funds, and stock exchange.

4.2.2 Search Tasks

The domain experts generated search tasks on the basis of the chosen interest

areas but also considered different types of tasks, i.e., Finding and Information Gathering.

The content of those search tasks includes finding a software tool, locating a person’s or

organization’s homepage, finding pages to learn about a certain concept or technique,

collecting information from multiple pages, and so forth. Our domain experts predefined

26 non-demo search tasks for each domain, as well as 8 and 6 demo tasks for the

computer and finance domains, respectively. The demo tasks were similar to but not

identical to the non-demo tasks and therefore offer subjects some familiarity with both


systems before they started to work on the non-demo tasks. Non-demo tasks are used in

post-experiment analysis, while demo tasks are not. All demo and non-demo search tasks

belong to the categories of Finding and Information Gathering [Sellen et al. 2002], as discussed in Section 2.2.4, and within the Finding tasks, we included some Site Finding

tasks [Craswell et al. 2001].

4.2.3 Query Length

Using different query lengths, we specified four types of queries for search tasks in

each domain:

1. One-word query (e.g., jsp, underinvestment)

2. Two-word query (e.g., neural network, security line)

3. Three-word query (e.g., social network analysis)

4. Free-form query, which had no limitations on the number of words used

For a given task a user was free to enter any query word(s) of his or her own choice

that conformed to the associated query-length requirement, and the user could issue

multiple queries for the same task. For example, Table 5 shows some sample search

tasks, types of search tasks, and their associated query lengths.

Table 5.

Examples of Search Tasks, Types of Tasks, and Query Lengths

Domain   | Search task | Type of search task | Query length
Computer | You need an open source IDE (Integrated Development Environment) for C++. Find a page that provides any details about such an IDE. | Finding | one-word
Computer | You need to provide a Web service to your clients. Find two pages that describe Web services support using Java technology. | Information Gathering | two-word
Finance  | Find a portfolio management spreadsheet program. | Finding | three-word
Finance  | Find the homepage of New York Stock Exchange. | Site Finding | free-form

Table 6 lists the distributions of search tasks and their associated query lengths.

For each domain, we divided the 26 non-demo search tasks and demo tasks into two

groups, such that the two groups have the same number of tasks and distribution of query

lengths. During each experiment, subjects searched for the first group of tasks using one

system and the second group of tasks using the other.

Table 6.

Distribution of Search Tasks and their Associated Query Lengths

Experiment | Domain   | One-word | Two-word | Three-word | Free-form | Total tasks
I & II     | Computer | 6        | 6        | 4          | 10        | 26
I & II     | Finance  | 8        | 6        | 6          | 6         | 26

We chose these different query lengths for several reasons. First, numerous

studies show that users tend to submit short Web queries with an average length of two


words. A survey by the NEC Research Institute in Princeton reports that up to 70% of

users typically issue a query with one word in Web searches, and nearly half of the

Institute’s staff—who should be Web-savvy (knowledge workers and researchers)—fail

to define their searches precisely with query terms [Butler 2000]. By collecting search

histories for a two-month period from 16 faculty members across various disciplines at a

university, Käki [2005] found that the average query length was 2.1 words. Similarly,

Jansen et al. [1998] find through their analysis of transaction logs on Excite that on

average a query contains 2.35 words. In yet another study, Jansen et al. [2000] report that

the average length of a search query is 2.21 words. From their analysis of users’ logs in

the Encarta encyclopedia, Wen et al. [2002] report that the average length of Web queries

is less than 2 words.

Second, we chose different query lengths to simulate different types of Web

queries and examine how these different types affect system performance. A prior study

follows a similar approach; in comparing the IntelliZap system with four popular search

engines, Finkelstein et al. [2002] set the length of queries to one, two, and three words

and allow users to type in their own query terms.

Third, in practice, queries are often incomplete or may not incorporate enough

contextual information, which leads to many irrelevant results and/or relevant results that

do not appear at the top of the list. A user then has two obvious options: Enter a different

query to start a new search session or go through the long result list page by page, both of

which consume time and effort. From a study with 33,000 respondents, Sullivan [2000]

finds that 76% of users employ the same search engine and engage in multiple search

sessions on the same topic. To investigate this problem of incomplete or vague queries,


we associate search tasks with different query lengths. We believe that categorization will present results in such a way as to help disambiguate such queries. Unlike Leroy et al. [2003], who extract extra

query terms from users’ behaviors during consecutive searches, we do not modify users’

queries but rather observe how a result-processing approach (personalized categorization

of search results) can improve search performance.

4.3 Subjects

Prior to the experiments, we sent e-mails to students in the business school and the

computer science department of our university, as well as to some professionals in the

computer industry, to solicit their participation. In these e-mails, we explicitly listed the

predefined interests and skills we expected potential subjects to have. We also asked

several questions, including the following two self-reported ones:

1. When searching online for topics in the computer or finance domain, what do you

think of your search performance (with a search engine) in general?

(a) slow (b) normal (c) fast

2. How many hours do you spend on online browsing and searching per week (not

limited to your major)?

(a) [0, 7) (b) [7, 14) (c) [14+)

We verified their responses to ensure each subject possessed the predefined skills

and interests. After the experiments we did not manually verify the correctness of

subject-selected relevant documents. However, in our preliminary experiment with


different subjects, we manually examined all of the relevant documents chosen by

subjects and confirmed that, on average, nearly 90% of their choices were correct. We assume that, with sufficient background, the subjects were capable of identifying

the relevant pages. Because we used PCAT in both experiments, no subject from

experiment I participated in experiment II. We summarize some demographic

characteristics of the subjects in tables 7-1 through 7-3.

Table 7-1.

Educational Status of Subjects

Experiment | Domain   | Undergraduate | Graduate | Professional | Total
I          | Computer | 3             | 7        | 4            | 14
I          | Finance  | 4             | 16       | 0            | 20
II         | Computer | 3             | 11       | 2            | 16
II         | Finance  | 0             | 20       | 0            | 20

Table 7-2.

Self-reported Performance on Search within a Domain

Experiment | Domain   | Slow | Normal | Fast
I          | Computer | 0    | 8      | 6
I          | Finance  | 2    | 15     | 3
II         | Computer | 1    | 8      | 7
II         | Finance  | 2    | 11     | 7


Table 7-3.

Self-reported Time (hours) Spent Searching and Browsing Per Week

Experiment | Domain   | [0, 7) | [7, 14) | [14+)
I          | Computer | 1      | 9       | 4
I          | Finance  | 5      | 10      | 5
II         | Computer | 2      | 7       | 7
II         | Finance  | 2      | 11      | 7

To compare the two studied systems for each domain, we divided the subjects into

two groups, such that subjects in one group were as closely equivalent to the subjects in

the other as possible with respect to their self-reported search performance, weekly

browsing and searching time, and educational status. We computed the mean log search

time for a task by averaging the log search times for each group.

4.4 Experiment Process

In experiment I, all subjects used both PCAT and LIST and searched for the same

demo and non-demo tasks. As we show in Table 8, the program automatically switched

between PCAT and LIST according to the task numbers and the group identified by user

id, so users in different groups always used different systems for the same task. The same

system-switching mechanism was adopted in experiment II to switch between PCAT and

CAT.

Table 8.

Distribution of System Uses by Tasks and User Groups

Group     | First half demo tasks | Second half demo tasks | Non-demo tasks 1–13 | Non-demo tasks 14–26
Group one | PCAT                  | LIST                   | PCAT                | LIST
Group two | LIST                  | PCAT                   | LIST                | PCAT
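The switching rule in Table 8 can be sketched as a small function; the names and the half-way split are ours, and in experiment II the second system would be CAT instead of LIST.

```python
# Sketch of the counterbalancing rule in Table 8 (illustrative names only).
def assign_system(group, task_number, total_tasks=26,
                  system_a="PCAT", system_b="LIST"):
    """Return the system a subject uses for a task.

    group: 1 or 2, as determined by the user id.
    task_number: 1-based task index; tasks in the first half of the demo or
    non-demo set use one system, tasks in the second half use the other.
    """
    first_half = task_number <= total_tasks // 2
    if group == 1:
        return system_a if first_half else system_b
    return system_b if first_half else system_a
```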

5 EVALUATIONS

In this section, we compare two pairs of systems (PCAT vs. LIST, PCAT vs.

CAT) on the basis of the mean log search time along two dimensions: query length and

type of task. We also test five hypotheses using the responses to a post-experiment

questionnaire provided to the subjects. Finally, we demonstrate the differences in the

indices of the relevant results across all tasks for the two pairs of systems.

5.1 Comparing Mean Log Search Time by Query Length

We first compared the two systems by different query lengths. Tables 9-1 and 9-2

contain the average mean log search times across tasks with the same query length, ± 1 standard error, for the different systems in the two experiments (lower values are better). The last column of each table provides the average mean log search time across all 26 search tasks, ± 1 standard error. For most of the comparisons between PCAT vs.

LIST (Table 9-1) or PCAT vs. CAT (Table 9-2), for a given domain and query length,

PCAT has lower average mean log search times. We conducted two-tailed t-tests to

determine whether PCAT was significantly faster than LIST or CAT for different


domains and query lengths. Table 10 shows the degrees of freedom and p-values for the

t-tests. The numbers in bold in Tables 9 and 10 highlight the systems with statistically significant differences (p < 0.05) in average mean log search times.
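Each cell of Table 10 corresponds to a test of the kind sketched below, in which each sample holds the per-task mean log search times for one system in a given domain and query-length group, so the degrees of freedom equal the two sample sizes minus two (consistent with Table 10, e.g., 6 + 6 − 2 = 10 for the six one-word computer tasks). The sketch assumes the pooled-variance form that scipy uses by default, and the numbers in the usage comment are made up.

```python
# Sketch of the per-cell comparison behind Tables 9 and 10: a two-tailed,
# two-sample t-test on mean log search times (variable names are ours).
from scipy import stats

def compare_systems(mean_log_times_a, mean_log_times_b):
    """Each argument: one mean log search time per task for one system."""
    t_stat, p_value = stats.ttest_ind(mean_log_times_a, mean_log_times_b)
    degrees_of_freedom = len(mean_log_times_a) + len(mean_log_times_b) - 2
    return degrees_of_freedom, p_value

# e.g. compare_systems([3.9, 4.2, 4.0, 3.7, 4.4, 4.1],
#                      [5.0, 5.2, 4.8, 5.3, 4.9, 5.1])  # -> (10, p-value)
```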

Table 9-1.

Average Mean Log Search Time across Tasks Associated with Four Types of Query (PCAT vs. LIST)

Experiment I (PCAT vs. LIST)
Domain-System | One-word    | Two-word | Three-word | Free-form | Total
Computer-PCAT |             |          |            |           |
Computer-LIST |             |          |            |           |
Finance-PCAT  | 3.97 ± 0.34 |          |            |           |
Finance-LIST  | 5.10 ± 0.26 |          |            |           |

Table 9-2.

Average Mean Log Search Time across Tasks Associated with Four Types of Query (PCAT vs. CAT)

Experiment II (PCAT vs. CAT)
Domain-System | One-word | Two-word    | Three-word | Free-form   | Total
Computer-PCAT |          | 4.14 ± 0.26 |            | 3.88 ± 0.19 | 4.30 ± 0.15
Computer-CAT  |          | 4.96 ± 0.26 |            | 4.94 ± 0.34 | 5.17 ± 0.17
Finance-PCAT  |          |             |            | 4.10 ± 0.35 | 4.46 ± 0.14
Finance-CAT   |          |             |            | 5.10 ± 0.25 | 5.11 ± 0.16


Table 10.

The t-test Comparisons (degrees of freedom, p-values)

Experiment        | Domain   | One-word  | Two-word  | Three-word | Free-form | Total
I (PCAT vs. LIST) | Computer | 10, 0.058 | 10, 0.137 | 6, 0.517   | 18, 0.796 | 50, 0.116
I (PCAT vs. LIST) | Finance  | 14, 0.015 | 10, 0.370 | 10, 0.752  | 10, 0.829 | 50, 0.096
II (PCAT vs. CAT) | Computer | 10, 0.147 | 10, 0.050 | 6, 0.309   | 18, 0.013 | 50, 0.0005
II (PCAT vs. CAT) | Finance  | 14, 0.193 | 10, 0.152 | 10, 0.237  | 10, 0.041 | 50, 0.003

In Table 10, for both computer and finance domains, PCAT has a lower mean log

search time than LIST for one-word query tasks with greater than 90% statistical

significance. The two systems are not statistically significantly different for tasks

associated with two-word, three-word, or free-form queries. Compared with a long query,

a one-word query may be more vague or incomplete, so a search engine may not provide

relevant pages in its top results, whereas PCAT may show the relevant result at the top of

a user interest. The user therefore could directly “jump” to the right category in PCAT

and locate the relevant document quickly.

Compared with CAT, PCAT has a significantly lower mean log search time for

free-form queries (p < 0.05). The better performance of PCAT can be attributed to two

main factors. First, the number of categories in the result area for CAT is often large

(about 20), so even if the categorization is accurate, the user still must commit additional

search effort to sift through the various categories. Second, the categorization of CAT

might not be as accurate as that of PCAT because of the much larger number (8,547) of

potential categories, which can be expected to be less helpful in disambiguating a vague


or incomplete query. The fact that category labels in CAT are longer than those in PCAT

may also have a marginal effect on the time needed for scanning them.

For all 26 search tasks, PCAT has a lower mean log search time than LIST or

CAT with 90% or higher statistical significance, except for the computer domain in

experiment I, which has a p-value of 0.116. When computing the p-values across all tasks, we notice that the result depends on the distribution of different query lengths and types of tasks. Therefore, it is important to drill down into the systems’ performance for each

type of task.

For reference, Table 11 illustrates the systems’ performance in terms of the

number of tasks that had a lower mean log search time for each type of query length. For

example, the table entry “4 vs. 2” for one-word query in the computer domain of

experiment I indicates that four out of the six one-word query tasks had a lower mean log search time with PCAT, whereas two had a lower mean log search time with LIST.

Table 11.

Numbers of Tasks with a Lower Mean Log Search Time

Experiment        | Domain   | One-word | Two-word | Three-word | Free-form | Total
I (PCAT vs. LIST) | Computer | 4 vs. 2  | 6 vs. 0  | 3 vs. 1    | 6 vs. 4   | 19 vs. 7
I (PCAT vs. LIST) | Finance  | 6 vs. 2  | 5 vs. 1  | 3 vs. 3    | 3 vs. 3   | 17 vs. 9
II (PCAT vs. CAT) | Computer | 4 vs. 2  | 5 vs. 1  | 3 vs. 1    | 10 vs. 0  | 22 vs. 4
II (PCAT vs. CAT) | Finance  | 6 vs. 2  | 6 vs. 0  | 5 vs. 1    | 6 vs. 0   | 23 vs. 3

5.2 Comparing Mean Log Search Time for Information Gathering Tasks


According to Sellen et al. [2002], during information gathering, a user finds

multiple pages to answer a set of questions. Figure 13 compares the mean log search

times of the ten search tasks in the computer domain in experiment I that required the

user to find two relevant results for each task. We sorted the tasks by the differences in

their mean log search times between PCAT and LIST. On average, PCAT allowed the

users to finish eight of ten Information Gathering tasks more quickly than did LIST

(t(18), p = 0.005), possibly because PCAT already groups the similar results into a given

category. Therefore, if in a category one page is relevant, the other results in that category

are likely to be relevant as well. This spatial localization of relevant results enables

PCAT to perform this type of task faster than LIST. For the computer domain,

experiment II has a similar result, in that PCAT is faster than CAT (t(18), p = 0.007).

Since the finance domain contains only two Information Gathering tasks (too few to

make a statistically robust argument), we only report the mean log search times for the

tasks in Table 12. We observe that the general trend of the results for the finance domain

is the same as for the computer domain (i.e., PCAT has lower search time than LIST or

CAT).


[Figure 13 is a chart of mean log search time (y-axis) for Information Gathering tasks 1–10 (x-axis), comparing PCAT and LIST.]

Figure 13. Mean log search times for Information Gathering tasks (computer domain)

Table 12.

Mean Log Search Times for Information Gathering Tasks (Finance Domain)

Task                         | Experiment I: PCAT | Experiment I: LIST | Experiment II: PCAT | Experiment II: CAT
Information Gathering task 1 | 6.33               | 6.96               | 6.23                | 7.64
Information Gathering task 2 | 4.62               | 5.13               | 4.72                | 5.61

5.3 Comparing Mean Log Search Time for Site Finding Tasks

In the computer domain, there were six tasks related to finding particular sites,

such as “Find the home page for the University of Arizona AI Lab.” All six tasks were

associated with free-form queries, and we note that the queries from all subjects

contained site names. Therefore, according to Craswell et al. [2001], those tasks were

Site Finding tasks. Table 13 shows the average mean log search times for the Site Finding

tasks, ± 1 standard error. There is no significant difference (t(10), p = 0.508) between


PCAT and LIST, as shown in Table 14. This result seems reasonable because for this

type of search task, LIST normally shows the desired result at the top of the first result

page when the site name is in the query. Even if PCAT tended to rank it at the top of a

certain category, users often found the relevant result faster with the LIST layout,

possibly because with PCAT, the users had to move to a proper category first and then

look for the relevant result. However, there is a significant difference between PCAT

and CAT (t(10), p = 0.019); again, the larger number of output categories in CAT may

have required more time for a user to find the relevant site, given that both CAT and

PCAT arrange the output categories alphabetically.

Table 13.

Average Mean Log Search Times for Six Site Finding Tasks in Computer Domain

Experiment        | System | Average mean log search time
I (PCAT vs. LIST) | PCAT   |
I (PCAT vs. LIST) | LIST   |
II (PCAT vs. CAT) | PCAT   | 3.51 ± 0.12
II (PCAT vs. CAT) | CAT    | 4.46 ± 0.32

5.4 Comparing Mean Log Search Time for Finding Tasks

As Table 14 shows, for 16 Finding tasks in the computer domain, we do not

observe a statistically significant difference in the mean log search time between PCAT

and LIST (t(30), p = 0.592), but the difference between PCAT and CAT is significant

(t(30), p = 0.013). However, PCAT has a lower average mean log search time than both


LIST and CAT. Similarly, for the 24 Finding tasks in the finance domain, PCAT achieves a lower mean log search time than both LIST (t(46), p = 0.101) and CAT (t(46), p = 0.002). The computer domain includes 6 Site Finding tasks among its 16 Finding tasks, whereas the finance domain has only 2 (of 24). To a certain extent, this situation confirms our observations about Finding tasks in the computer domain. We conclude that PCAT had a significantly lower mean log search time for Finding tasks than CAT but not than LIST.

Table 14.

The t-tests for Finding Tasks

Experiment        | Domain   | Type of task                     | Degrees of freedom, p-value
I (PCAT vs. LIST) | Computer | Site Finding                     | 10, 0.508
I (PCAT vs. LIST) | Computer | Finding (including Site Finding) | 30, 0.592
I (PCAT vs. LIST) | Finance  | Finding (including Site Finding) | 46, 0.101
II (PCAT vs. CAT) | Computer | Site Finding                     | 10, 0.019
II (PCAT vs. CAT) | Computer | Finding (including Site Finding) | 30, 0.013
II (PCAT vs. CAT) | Finance  | Finding (including Site Finding) | 46, 0.002

5.5 Questionnaire and Hypotheses

After a subject finished the search tasks with the two systems, he or she filled out

a questionnaire with five multiple-choice questions designed to compare the two systems

in terms of their usefulness and ease of use. We use their answers to test several

hypotheses relating to the two systems.


5.5.1 Questionnaire

Subjects completed a five-item, seven-point questionnaire in which their

responses could range from (1) strongly disagree to (7) strongly agree. (The phrase

“system B” was replaced by “system C” in experiment II. As explained in footnote 13,

systems A, B, and C refer to PCAT, LIST, and CAT, respectively.)

Q1. System A allows me to identify relevant documents more easily than system B.

Q2. System B allows me to identify relevant documents more quickly than system A.

Q3. I can finish search tasks faster with system A than with system B.

Q4. It’s easier to identify one relevant document with system B than with system A.

Q5. Overall I prefer to use system A over system B.

5.5.2 Hypotheses

We developed five hypotheses corresponding to these five questions. (The phrase

“system B” was replaced by “system C” for experiment II.)

H1. System A allows users to identify relevant documents more easily than system B.

H2. System B allows users to identify relevant documents more quickly than system A.

H3. Users can finish search tasks more quickly with system A than with system B.

H4. It is easier to identify one relevant document with system B than with system A.

H5. Overall, users prefer to use system A over system B.

5.6 Hypothesis Test Based on Questionnaire


Table 15 shows the mean responses to each of the questions in the questionnaire. Based on the seven scale options described in Section 5.5.1, we computed the numbers in this table by replacing “strongly disagree” with 1, “strongly agree” with 7, and so on.

Table 15.

Mean Responses to Questionnaire Items. Degrees of Freedom: 13 for Computer and 19 for Finance in Experiment I; 15 for Computer and 19 for Finance in Experiment II.

Experiment        | Domain   | Q1      | Q2      | Q3      | Q4      | Q5
I (PCAT vs. LIST) | Computer | 6.21*** | 2.36*** | 5.43*   | 2.71*   | 5.57**
I (PCAT vs. LIST) | Finance  | 5.25    | 3.65*   | 5.45*** | 3.65**  | 5.40**
II (PCAT vs. CAT) | Computer | 6.25*** | 2.00*** | 6.06*** | 2.50*** | 6.31***
II (PCAT vs. CAT) | Finance  | 6.20*** | 1.90*** | 6.20*** | 2.65*   | 6.50***

*** p < 0.001, ** p < 0.01, * p < 0.05.

Because each question in Section 5.5.1 corresponds to a hypothesis in Section 5.5.2, we conducted a two-tailed t-test based on subjects’ responses to each question to test the hypotheses. We calculated p-values by comparing the subjects’ responses with the scale midpoint, “neither agree nor disagree,” which has a value of 4. The table shows that for both computer

and finance domains, H1, H3, and H5 are supported with at least 95% significance, and

H2 and H4 are not supported.18 The only exception to these results is that we find only

90% significance (p = 0.083) for H1 in the finance domain of experiment I. According to

18 For example, the mean choice in the computer domain for H2 was 2.36 with p < 0.001. According to our scale, 2 means “disagree” and 3 means “mildly disagree,” so a score of 2.36 indicates subjects did not quite agree with H2. Hence, we claim that H2 is not supported. The same is true for H4.


these responses on the questionnaire, we conclude that users perceive PCAT as a system

that allows them to identify relevant documents more easily and quickly than LIST or

CAT.
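The test behind Table 15 can be sketched as a one-sample, two-tailed t-test of the responses against the scale midpoint of 4; the variable names and example ratings below are ours, not the experimental data.

```python
# Sketch of the questionnaire analysis for one item.
from scipy import stats

def test_question(responses, midpoint=4):
    """responses: one 1-7 rating per subject for a single question."""
    t_stat, p_value = stats.ttest_1samp(responses, midpoint)
    return t_stat, p_value, len(responses) - 1  # df, matching Table 15's note

# e.g. test_question([6, 7, 5, 6, 6, 7, 5, 6, 7, 6, 5, 7, 6, 6])
```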

Several results reported in a recent work [Käki 2005] are similar to our findings. In particular, (1) categories are helpful when document ranking in a list interface fails, which fits with our explanation of why PCAT is faster than LIST for short queries; (2) when desired results are found at the top of the list, the list interface is faster, in line with our result and analysis pertaining to Site Finding tasks; and (3) categories make it easier to access multiple results, consistent with our report for the Information Gathering tasks.

However, the categorization employed in [Käki 2005] does not use examples to

build a classifier. The author simply identifies some frequent words and phrases in search

result summaries and uses them as category labels. Hence, each frequent word or phrase

becomes a category (label). A search result is assigned to a category if the result’s

summary contains the category label. Käki [2005] also does not analyze or compare the

two interfaces according to different types of tasks. Moreover, Käki [2005: Figure 4]

shows, though without explicit explanations, that categorization is always slower than a

list. This result contradicts our findings and several prior studies [e.g., Dumais and Chen

2001]. We notice that the system described by Käki [2005] uses a list interface to show

the search results by default, so a user may always look for a desired page from the list

interface first and switch to the category interface only if he or she does not find it within

a reasonable time.


5.7 Comparing Indices of Relevant Results

To better understand why PCAT was perceived as faster and easier to use by the

subjects as compared with LIST or CAT, we looked at the indices of relevant results in

the different systems. An expert from each domain completed all search tasks using

PCAT and LIST. Using the relevant results identified by them, we compare the indices of

the relevant search results for the two systems, as we show in Figures 14-1 and 14-2.

[Figure 14-1 is a chart of the index of the relevant result (y-axis) for tasks 1–26 (x-axis) in PCAT and LIST.]

Figure 14-1. Indices of relevant results in PCAT and LIST (computer domain)


[Figure 14-2 is a chart of the index of the relevant result (y-axis) for tasks 1–26 (x-axis) in PCAT and LIST.]

Figure 14-2. Indices of relevant results in PCAT and LIST (finance domain)

We sort the tasks by the index differences between LIST and PCAT in ascending

order. Thus, the task numbers on the x-axis are not necessarily the original task numbers

in our experiments. Because PCAT organizes the search results into different categories

(interests), the index of a result reflects the relative position of that result under a

category. In LIST, a relevant result’s index number equals its relative position on the

particular page on which it appears plus ten (i.e., the number of results per page) times

the number of preceding pages. Thus, a result that appears in the fourth position on the

third page would have an index number of 24 (4 + 10 × 2). If users had to find two

relevant results for a task, we took the average of the indices. In Figure 14-1, PCAT and

LIST share the same indices in 10 of 26 tasks, and PCAT has lower indices than LIST in

15 tasks. In Figure 14-2, PCAT and LIST share the same indices in 7 of 26 tasks, and

PCAT has smaller indices than LIST in 18 tasks.


Similarly, Figures 15-1 and 15-2 show the indices of the relevant search results of PCAT and CAT in experiment II. The data for PCAT in Figures 15-1 and 15-2 are the same as those in Figures 14-1 and 14-2, and we sort tasks by the index differences between PCAT and CAT in ascending order. In Figure 15-1 for the computer domain, PCAT and CAT share the same indices in 15 of 26 tasks, and CAT has lower indices in 6 tasks. In Figure 15-2 for the finance domain, the two systems share the same indices in 10 of 26 tasks, and CAT has lower indices in 14 of 26 tasks.

[Figure 15-1 is a chart of the index of the relevant result (y-axis) for tasks 1–26 (x-axis) in PCAT and CAT.]

Figure 15-1. Indices of relevant results in PCAT and CAT (computer domain)


[Figure 15-2 is a chart of the index of the relevant result (y-axis) for tasks 1–26 (x-axis) in PCAT and CAT.]

Figure 15-2. Indices of relevant results in PCAT and CAT (finance domain)

The indices for PCAT in Figures 14-1, 14-2, 15-1, and 15-2, and for CAT in Figures 15-1 and 15-2, reflect an assumption that a user first jumps to the right category and then finds a relevant page by looking through the results under that category. This assumption may not always hold, so Figures 14-1 and 14-2 may be optimistic in favor of PCAT. However, if the time taken to locate the right category is not large (as is probably the case for PCAT), the figures provide a possible explanation for some of the results we observe, such as the lower search times for PCAT with one-word queries and Information Gathering tasks in experiment I.

However, CAT has smaller index numbers for relevant results than PCAT, which may

seem to contradict the better performance (lower search time) for PCAT in experiment II.

We note that due to its non-personalized nature, CAT has a much larger number of

potential categories as compared to PCAT. Therefore, a user can be expected to take a longer time to locate the right category (before jumping to the relevant result in it) as

compared to PCAT.


5.8 Conclusions

This article presents an automatic approach to personalizing Web searches given a

set of user interests. The approach is well suited for a workplace setting, where

information about professional interests and skills can be obtained automatically from an

employee’s resume or a database using an IE tool or database queries. We present a

variety of mapping methods, which we combine into an interest-to-taxonomy mapping

framework. The mapping framework automatically maps and resolves a set of user

interests with a group of categories in the ODP taxonomy. Our approach then uses data

from ODP to build text classifiers to automatically categorize search results according to

various user interests. This approach has several advantages, in that it does not (1) collect

a user’s browsing or search history, (2) ask a user to provide explicit or implicit feedback

about the search results, or (3) require a user to manually specify the mappings between

his or her interests and taxonomy categories. In addition to mapping interests into

categories in a Web directory, our mapping framework can be applied to other types of

data, such as queries, documents, and e-mails. Moreover, the use of taxonomy is

transparent to the user.

We implemented three search systems: A (personalized categorization system,

PCAT), B (list interface system, LIST), and C (non-personalized categorization system,

CAT). PCAT followed our proposed approach and categorized search results according

to a user’s interests, whereas LIST simply displayed search results in a page-by-page list,

similar to conventional search engines, and CAT categorized search results using a large


number of ODP categories without personalization. We experimentally compared two

pairs of systems with different interfaces (PCAT vs. LIST and PCAT vs. CAT) in two

domains, computer and finance. We recruited 14 subjects for the computer domain and

20 subjects for the finance domain to compare PCAT with LIST in experiment I, and 16

in the computer domain and 20 in finance to compare PCAT with CAT in experiment II.

There was no common subject across the experiments. Based on the mean log search

times obtained from our experiments, we examined search tasks associated with four

types of queries. We also considered different types of search tasks to tease out the

relative performances of the compared systems as the nature of the task varied.

We find that PCAT outperforms LIST for searches with short queries (especially

one-word queries) and for Information Gathering tasks; by providing personalized

categorization results, PCAT also is better than CAT for searches with free-form queries

and for both Information Gathering and Finding tasks. From subjects’ responses to five

questionnaire items, we conclude that, overall, users identify PCAT as a system that

allows them to find relevant pages more easily and quickly than LIST or CAT.

Considering the fact that most users (even non-casual users) often cannot issue

appropriate queries or provide query terms to fully disambiguate what they are looking

for, a PCAT approach could help users find relevant pages with less time and effort. In

comparing two pairs of search systems with different presentation interfaces, we realize

that no system with a particular interface is universally more efficient than the other, and

the performance of a search system depends on parameters such as the type of search task

and the query length.


5.9 LIMITATIONS AND FUTURE DIRECTIONS

Our search tasks were generated on the basis of user interests. We recognize some limitations of this experimental setup in adequately capturing the workplace scenario. The first limitation is that some of the user interests may not be known in a real-world application, and hence some search tasks may not reflect the known user interests. Second, a worker may search for information that is unrelated to his or her job. In both cases, the tasks may not match any of the known interests. However, these limitations reflect the general fact that personalization can only provide benefits based on what is known about the user. A future direction of research is to model the dynamics of user interests over time.

For the purposes of a comparative study, we carefully separated the personalized

system (PCAT) from the non-personalized (CAT) by maintaining a low overlap between

the two systems. This allows us to understand the merits of personalization alone.

However, we can envision a new system that is a combination of the current CAT and

PCAT systems.

In particular, the new system replaces the “Other” category in PCAT by adding

categories of ODP that match the results that are currently placed in the “Other” category.

A study of such a PCAT+CAT system can be a future direction for this research. An

interesting and related direction is a smart system that can automatically choose a proper

interface (e.g., categorization, clustering, list) to display search results on the basis of the

nature of the query, the search results, and the user interest profile (context).


As shown in Figs. 5 and 8, for PCAT in experiments I and II and CAT in

experiment II, we rank the categories alphabetically but always leave the “Other”

category at the end.19 There are various alternatives for the order in which categories are

displayed, such as by the number of (relevant) results in each category or by the total

relevance of results under each category. We recognize that choosing different methods

may provide different individual and relative performances. Also, CAT tends to show

more categories on the main page than PCAT. On one hand, more categories on a page

may be a negative factor for locating a relevant result. On the other hand, more categories

provide more results on the same page, which may speed up the discovery of a relevant result as compared to clicking a “More” link to open another window (as in the PCAT system). We think that the issues of category ordering and the number of categories on a

page deserve further examination.

From the subjects’ log files, we observed that when some subjects could not find a relevant document under a relevant category due to result misclassification, they moved to another category or tried a new query. Such a situation can be expected to increase the search time for categorization-based systems. Thus, another direction of future research is to compare different result classification techniques based on their effect on mean log search time.

It would be worthwhile to study the performance of result categorization using other types of data, such as titles and snippets (from search engine results), instead of page content, which would save the time spent fetching Web pages. In addition, it may be interesting to examine how a user could improve his or her performance in Internet

19 For the computer domain in experiment I, PCAT shows C++ and Java before other alphabetically ordered interests, and the “Other” category is at the end.


searches in a collaborative (e.g., intranet) environment. In particular, we would like to

measure the benefit the user can derive from the search experiences of other people with

similar interests and skills in a workplace setting.

ACKNOWLEDGMENTS

During the program development, in addition to the software tools mentioned in

prior sections of the paper, we employed BrowserLauncher20 by Eric Albert and XOM21

(XML API) by Elliotte Rusty Harold. We thank them for their work. We would also like

to thank the Associate Editor and anonymous reviewers for their helpful suggestions.

20 http://browserlauncher.sourceforge.net/.
21 http://www.cafeconleche.org/XOM/.


DRAFT DISSERTATION PART II BUSINESS RELATIONSHIP DISCOVERY

In Part II, Chapter 6 covers research motivation, presents our approach at a high

level, and reviews prior literature. Chapter 7 introduces network-based attributes and

explains data and data processing related topics. These two chapters are fundamental to

the next two chapters. Chapter 8 studies company revenue relation (CRR) prediction and Chapter 9 focuses on competitor discovery. Hereafter, we use the following pairs of terms interchangeably: network and graph, node and company, and link and company pair (or pair of companies).

6 INTRODUCTION AND LITERATURE REVIEW

6.1 Introduction

Business news contains rich and current information about companies and the

relationships among them. Online business news from media companies (e.g., Reuters),

content providers (e.g., Yahoo!), and company Web sites offer readers timely

assessments of dynamic company relationships. Reading news, however, is very time consuming and requires a reader to possess certain skills, the most basic of which is a good understanding of the language in which the news is written. Moreover, the huge volume of news stories makes the manual identification of relationships among a large number of companies, without automated news analysis, nontrivial and unscalable.


For professional or personal finance–related interests, many people regularly spend significant amounts of time scanning the news to monitor companies’ recent financial milestones. For tasks such as investment or market research, researchers often need to compare a pair of companies or identify top-performing companies on the basis of revenue. Company revenue relationships are dynamic, and information about them may not be readily or continuously available. Public companies typically update their earnings or balance sheet data on a quarterly basis, whereas the availability of private, initial public offering (IPO), or foreign companies’ financials is more limited overall.

Scanning the competitive environment of a company or a group of companies is

essential for supply chain, marketing, investment and strategic partnership

management. Once its competitors have been identified, a company can look for their

product lines, marketing strategies, directions of R&D, key personnel, customers, and

suppliers, and so on to potentially improve its competitive advantage. Analysts and

managers may resort to various options for discovering and monitoring competitor

relationships. These options may include: asking business associates (e.g., customers

or suppliers), reading news, searching on the Web, attending business conventions,

and looking through company profile resources such as Hoover’s22 and Mergent23.

While the availability of company profiling resources has reduced the search effort and made some business relationship information easily accessible, the other above-mentioned approaches, due to their largely manual nature, remain time consuming and limited in scale. Moreover, because they may use different criteria for collecting and identifying information, businesses that provide company profiles also suffer

22 Hoover’s, Inc., http://www.hoovers.com.
23 Mergent Inc., http://www.mergentonline.com.


from a scalability problem due to limited resources, manpower, and budget, leading to incomplete and inconsistent information. For example, Hoover’s considers Interchange Corp. a competitor of Google, while Mergent does not specify this relationship; conversely, Mergent includes Tercica Inc. as a competitor of GlaxoSmithKline plc while Hoover’s does not. Therefore, it is important to explore approaches that automatically discover important business relationships and can complement and extend existing time-consuming efforts. An automated approach also allows for timely updates of business relationships, thus avoiding the information staleness that can mar manual approaches.

Social network analysis (SNA) refers to a set of research procedures for

identifying and quantifying structures in a social network on the basis of relationships

among the nodes [Richards and Barnett 1993]. A social network consists of a set of

nodes, such as individuals or organizations, connected through edges that represent various relationships (e.g., friendship, affiliation) [Wasserman and Faust 1994]; such relationships tend to be simple to identify and yet voluminous to analyze. Analyzing quantitative measures of the information represented by the nodes and edges of social networks has proved a feasible and effective way to discover network structures in diverse fields, such as social and behavioral science, anthropology, psychology [Scott 2000], and information science.

In this study, we present an approach that applies SNA and machine learning

techniques for automated discovery of business relationships. In particular, we study two

different relationships, CRR and competitor relationships, as two illustrative examples of

our approach. Figure 16 illustrates the main steps for discovery of the two relationships at


a high level. First, starting with a collection of news stories that have been organized by company, and given that a news story pertaining to a company often cites one or more other companies, we identify company citations in news stories, treat them as links from the focal (source) companies to the cited (target) companies, and construct a directed, weighted intercompany network. Next, we derive four types of network attributes from the network topology; the four types differ in their coverage of the intercompany network. Finally, we feed these attributes to classification methods to predict CRR and discover competitor relationships between pairs of companies. This approach is effective and scalable for business relationship screening and can be extended to automated discovery of a broad range of business relationships. Moreover, the approach is language neutral (i.e., we do not analyze the vocabulary or grammar of news stories to find relationships), which can help extend it to news written in languages other than English.

Figure 16. A High-Level Process View for Studying CRR and Competitor Relationships

6.2 Literature Review


Many researchers in areas such as organization behavior and sociology have

investigated the nature and implications of social networks created by business

relationships. For example, Levine [1972], using a network of interlocked directorates

between major banks and large industrial companies, constructs a map of the “sphere of

influence” that provides a quick (though approximate) overview of the relations (e.g.,

well-linked bank–company ties) in the network. Walker et al. [1997] examine an

interfirm network on the basis of cooperative relationships from a commercial directory

of biotechnology firms. Using regression techniques with ten independent variables, they

demonstrate that network structure strongly influences the choices of a biotechnology

startup in terms of establishing new relationships (licensing, joint venture, and R&D

partnership) with other companies. Uzzi [1999] investigates how social relationships and

networks affect a firm’s acquisition and cost of capital. Gulati and Gargiulo [1999]

demonstrate that an existing interorganizational network structure affects the formation of new alliances, which eventually modifies the existing network. A major difference between those prior studies and ours is that prior works construct a social network from explicitly given relationships in gold-standard data sources, whereas we predict a business relationship, namely CRR, between two companies using structural attributes derived from a citation-based intercompany network.

Research in information retrieval and bibliometrics has previously exploited SNA and graph-theoretic techniques on networks of documents. These works treat implicit signals, such as URL links, email communications, or article citations, as links between nodes and study problems such as identifying the importance of individual nodes in the network [e.g., Brin and Page 1998; Kleinberg 1999; Garfield 1979] and identifying communities on the Web [e.g., Kautz et al. 1997; Gibson et al. 1998], rather than discovering business relationships between companies.

For example, articles such as scholarly publications can be considered to be

connected with one another through citations. A citation index indexes the citations

among such articles [Garfield 1979]. Using a citation index, a researcher can find not

only articles that a given article cites but also articles that cite the given article. CiteSeer

[Giles et al. 1998] is an example of an autonomous citation indexing system that

retrieves, indexes, and builds bibliographic and citation databases from research articles

on the Web. Furthermore, analyses of the networks created by citations have led to

various measures of prestige and the impact of published articles and the journals in

which they appear. Some measures closely resemble measurements of Web page

“popularity” [Brin and Page 1998] used by Web search engines such as Google.

Park [2003] identifies hyperlink network analysis as a subset of SNA, in which nodes are

Web sites and the relationships are URL links among sites. In such a network, the

linkages among sites reflect the authority, prestige, or trust of the sites [Kleinberg 1999,

Palmer et al. 2000]. Brin and Page [1998] propose the PageRank algorithm to rank the nodes (pages) of the Web graph, formed by directed URL links among pages, and use the ranks of pages to order search results. Kleinberg [1999] presents the Hyperlink-Induced Topic Search (HITS) algorithm to compute “hub” and “authority” importance measures for each node (page), also based on the link structure of the Web.

Bernstein et al. [2002] apply a commercial information extraction system to

extract company entities from Yahoo business news and posit that two companies have a

relationship (link) if they appear in the same piece of news (co-occurrence approach).


The network, which consists of 1,790 identified companies and in which links between

two companies are undirected and unweighted (binary weight), illustrates some central

industry players. They further filter out nodes in the network to produce a smaller

network with 315 companies and 1,047 links, which they use to count how many other

companies are connected with each company, rank all companies by the counts, and

indicate that some of the 30 top-ranked companies in the computer industry are also

Fortune 1000 companies. Hence, their result indicates that companies with high revenues

tend to be linked to many other companies in a network derived purely from news stories.

Their work is somewhat similar to our study, in that they use online business news to

construct an intercompany network. However, unlike Bernstein et al. [2002], we qualify

links in the constructed network by both direction and weight. Furthermore, in contrast to the abovementioned research, we employ various graph-based metrics to predict the CRR between any pair of companies linked in the network, which contains tens of thousands of such company pairs.


7 NETWORK-BASED ATTRIBUTES AND DATA

In this chapter, we first introduce relevant notation for directed graphs, followed by notation for directed, weighted graphs. We then describe the data and data processing procedures. To provide statistical insights into the data, we report distributions of the various network attributes.

7.1 Notation in Directed Graphs

Figure 17. Directed Graph

Figure 17 presents a directed graph (digraph) that consists of four nodes joined by

eight directed links. More formally, a digraph Gd = (N, L) consists of a set of nodes N and

a set of links L, where

N = {n1, n2, …, nm} and

L = {l1, l2, …, lk}, where li = <nsource, ntarget>.

The node indegree, NID(ni), in a digraph is the number of nodes linked to ni; the

node outdegree, NOD(ni), is the number of nodes linked from ni [Wasserman and Faust


1994]. Node indegree, or a metric based on it, has often been used to represent authority and prestige in many prior works [e.g., Brin and Page 1998, Kleinberg 1999]. In this figure, NID(n1) = 3 and NOD(n1) = 2, while NID(n4) = 1 and NOD(n4) = 2.

7.2 Notation in Directed, Weighted Graphs

Web portals such as Yahoo! Finance and Google Finance provide news stories

arranged by company. A news story pertaining to a company (source company) often

cites one or more other companies, referred to as target companies. We consider each company citation to be a directed link (outlink) from the source company to a target company, and each citation adds a unit of weight to the link. The link weight between two companies is thus the accumulated citation count across a set of news stories.

Figure 18 depicts a digraph in which each link carries a weight. It is a very small

portion of the intercompany network that consists of five companies/nodes joined by 15

directed and weighted links. More formally, a weighted digraph Gwd = (N, L, W) includes

N, L, and a weight vector W associated with the set of links, where W = (w1, w2, …, wk).

We derive various attributes from the intercompany network that characterize

either a node (one value for each node) or a pair of nodes (one value for each pair). We divide these attributes into four types (see Table 16) on the basis of the range of the network covered in computing them and describe them as follows.


Figure 18. Directed, Weighted Graph

DELL: Dell Inc., INCX: Interchange Corp., GOOG: Google Inc., JPM: JP Morgan Chase

& Co., YHOO: Yahoo! Inc.

7.2.1 Dyadic and Node Degree-based Attributes

We first introduce a group of dyadic degree-based attributes as follows.

Dyadic weighted indegree (DWID): DWID(ni, nj) is the weight of the link from nj to ni. In Figure 18, DWID(YHOO, GOOG) = 478.

Dyadic weighted outdegree (DWOD): DWOD(ni, nj) is the weight of the link from ni to nj. Again, based on Figure 18, DWOD(YHOO, GOOG) = 512. We note that both DWID(YHOO, GOOG) and DWOD(YHOO, GOOG) are large (as compared to other pairs) and almost equal. News stories about two competing companies can be expected to frequently cite each other, and the volume


of citations for each company can be expected to be almost equal when there is no

absolute winner (e.g., monopoly).

Dyadic weighted netdegree (DWND)

DWND(ni, nj) = DWOD(ni, nj) – DWID(ni, nj) (1)

Hence, DWND(YHOO, GOOG) = 512 – 478 = 34 shows a net flow of citations in

the direction of pointing to GOOG when we consider the pair <YHOO, GOOG>.

The positive net flow to GOOG may indicate its slight dominance as reflected by

news citations.

Dyadic weighted inoutdegree (DWIOD)

DWIOD(ni, nj) = DWOD(ni, nj) + DWID(ni, nj) (2)

Again, DWIOD(YHOO, GOOG) = 990, which is relatively large as compared to other links in the example network. A large DWIOD value may indicate a strong

relationship between the given pair of companies.

The dyadic nature of these attributes captures the flow of citations and hence

potential relationships between a pair of companies. However, dyadic attributes consider

only a pair of connected nodes. To take into account a given node’s neighbors, we

consider the following node degree-based attributes.

Node weighted indegree (NWID)


NWID(ni) = Σj≠i DWID(ni, nj) (3)

This measures the flow of citations from all companies in the network to the

given company. We expect “important” companies to possibly draw a large total

number of citations in news from other companies.

Node weighted outdegree (NWOD)

NWOD(ni) = Σj≠i DWOD(ni, nj) (4)

This measures the flow of citations from the given company to all other

companies in the network.

Node weighted inoutdegree (NWIOD)

NWIOD(ni) = NWID(ni) + NWOD(ni) (5)

This measures the overall flow of citations both to and from the given company

(ni). In essence, this attribute measures the overall connectivity of the given company

and all neighbor companies in the network independent of the direction of citations.

In Figure 18, for node n1 (YHOO), the NWID, NWOD, and NWIOD values are 513, 541, and 1054, respectively. If a pair of companies has a large DWIOD value as

well as large individual NWIOD values, it may suggest that the two companies have a

strong relationship and are both important players.
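To make these degree-based definitions concrete, the following is a minimal Python sketch (the dissertation does not prescribe any implementation) that computes the dyadic and node degree-based attributes from a dictionary of accumulated link weights. Only the two YHOO/GOOG link weights quoted above are used; everything else is hypothetical.

```python
from collections import defaultdict

def degree_attributes(weights):
    """Dyadic and node degree-based attributes of a directed, weighted graph.

    `weights` maps a (source, target) pair to its accumulated citation count.
    """
    nwid = defaultdict(float)   # node weighted indegree, Eq. (3)
    nwod = defaultdict(float)   # node weighted outdegree, Eq. (4)
    for (src, tgt), w in weights.items():
        nwod[src] += w
        nwid[tgt] += w

    def dwid(ni, nj):           # weight of the link from nj to ni
        return weights.get((nj, ni), 0.0)

    def dwod(ni, nj):           # weight of the link from ni to nj
        return weights.get((ni, nj), 0.0)

    def dwnd(ni, nj):           # net flow of citations, Eq. (1)
        return dwod(ni, nj) - dwid(ni, nj)

    def dwiod(ni, nj):          # total flow of citations, Eq. (2)
        return dwod(ni, nj) + dwid(ni, nj)

    nwiod = {n: nwid[n] + nwod[n] for n in set(nwid) | set(nwod)}  # Eq. (5)
    return dwnd, dwiod, dict(nwid), dict(nwod), nwiod

# Only the YHOO/GOOG weights are taken from the text; other links are omitted.
weights = {("YHOO", "GOOG"): 512, ("GOOG", "YHOO"): 478}
dwnd, dwiod, nwid, nwod, nwiod = degree_attributes(weights)
print(dwnd("YHOO", "GOOG"))    # 34
print(dwiod("YHOO", "GOOG"))   # 990
```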


7.2.2 Centrality-based Attributes

In addition to the dyadic and node degree-based measurements, we also use a

network analysis package [JUNG 2006] to compute scores on the basis of three different

centrality/importance measuring schemas: PageRank [Brin and Page 1998], HITS

[Kleinberg 1999], and betweenness centrality [Brandes 2001]. These schemas extend

beyond immediate neighbors to compute the importance or centrality of a given node in

the whole network. The PageRank algorithm computes a popularity score for each Web

page on the basis of the probability that a “random surfer” will visit the page [Brin and

Page 1998]. The HITS algorithm generates a pair of scores, “hub” and “authority,” for

each page. Both HITS and PageRank compute principal eigenvectors of matrices derived

from graph representations of the Web [Kleinberg 1999], so our use of them for a graph

whose nodes are companies differs from their original use. As a node centrality

measurement, betweenness measures the extent to which a node lies on the shortest paths between other nodes in the graph [Freeman 1979]. The three schemas do not consider link

weights. JUNG [2006] provides the node authority scores for HITS and ignores the link

direction when computing betweenness centrality. The intuition behind these global

centrality attributes is the same as that for the node degree based attributes but the former

are more informative since they consider the entire network for computation instead of

focusing on immediate neighbors.
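For illustration only, the sketch below computes the three centrality scores with the networkx package; the study itself used JUNG, and the edge list here is hypothetical. As in the text, link weights are not used, the HITS authority scores are kept, and betweenness is computed on the undirected version of the graph.

```python
import networkx as nx

# Hypothetical, unweighted edge list standing in for the intercompany network.
edges = [("YHOO", "GOOG"), ("GOOG", "YHOO"), ("DELL", "GOOG"),
         ("INCX", "GOOG"), ("JPM", "YHOO")]
G = nx.DiGraph(edges)

pagerank = nx.pagerank(G)                                    # random-surfer popularity
_, authority = nx.hits(G)                                    # keep HITS authority scores
betweenness = nx.betweenness_centrality(G.to_undirected())   # link direction ignored

for node in G:
    print(node, round(pagerank[node], 3),
          round(authority[node], 3), round(betweenness[node], 3))
```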


7.2.3 Structural Equivalence (SE) based Attributes

Lorrain and White [1971] identify two nodes to be structurally equivalent if they

have the same links to and from other nodes in the network. As it is unlikely that two

nodes will be exactly structurally equivalent in our intercompany network, we use a

similarity metric to measure the degree to which two nodes are structurally equivalent.

The intercompany network is represented as a weighted adjacency matrix of size |N| × |N|, where |N| is the number of nodes. The SE similarity between two nodes is the normalized dot product (i.e., cosine similarity) of the two corresponding rows in the matrix, where a matrix element can be a DWID, DWOD, or DWIOD value, thereby producing a DWID-, DWOD-, or DWIOD-based SE similarity. Intuitively, the DWID-based SE

similarity between company A and company B captures the overlap between companies

whose news stories cite A and companies whose news stories cite B (analogous to co-

citation [Small 1973]); the DWOD-based SE similarity reflects the overlap between

companies that news stories of A and B cite (analogous to bibliographic coupling [Kessler 1963]). A high overlap between neighbors of two nodes in our intercompany network

may be reflective of the overlap in their businesses or markets. Intuitively, this

phenomenon may indicate a competitor relationship. For example, in the sample graph of Figure 18, the DWID-based SE similarity between n1 and n3, or YHOO and GOOG, is 0.98, out of a maximum possible value of 1.
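The cosine-similarity computation described above can be sketched as follows with NumPy; the adjacency matrix is a toy example rather than the actual network, and the DWID-based variant corresponds to taking the transpose of the DWOD-based matrix.

```python
import numpy as np

def se_similarity(adj):
    """Structural-equivalence similarity as cosine similarity between matrix rows.

    If adj[i][j] holds the weight of the link from node i to node j (a DWOD-based
    matrix), row i lists the companies that company i's news stories cite, and the
    transpose gives the DWID-based (co-citation-like) variant.
    """
    adj = np.asarray(adj, dtype=float)
    norms = np.linalg.norm(adj, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # isolated nodes get similarity 0, not NaN
    unit = adj / norms
    return unit @ unit.T             # entry (i, j) is the SE similarity of nodes i and j

# Toy 3-node adjacency matrix with illustrative weights (not the Figure 18 values).
A = [[0, 5, 2],
     [4, 0, 3],
     [1, 6, 0]]
print(np.round(se_similarity(A), 2))
```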

For classifying whether a pair of companies are competitors, we use the above-described attributes. As noted earlier, some of the attributes have one value for a pair of

nodes (DWID, DWOD, DWIOD, and three different SE similarities) while others have a

value for each node (NWID, NWOD, NWIOD, pagerank, hits, and betweenness) in the


pair. Hence, we use a total of 18 attributes for classifying the competitor relationship for a

company pair. Table 16 summarizes these attributes by type and range of network

covered.

Table 16.

Four Types of Network Attributes

Attribute Type | Attributes | Range of Network Covered
Dyadic degree-based | DWID, DWOD, DWIOD | A given node and only one directly connected node
Node degree-based | NWID, NWOD, NWIOD | A given node and all directly connected nodes
Node centrality-based | pagerank, hits, betweenness | Whole network
SE-based | DWID-, DWOD-, DWIOD-based SE similarity | Any two nodes and their directly connected nodes in the whole network

7.4 Raw Data

Now we describe the source and nature of the raw data (news stories) and the

process by which we constructed the intercompany network from them. The first data set

consists of eight months (July 2005–February 2006) of business news for all companies

on Yahoo finance [Yahoo]. Both Chapter 8 (predicting CRR) and Chapter 9 (Discovering

Competitor Relationships) use this dataset. In addition, Chapter 8 uses three more

months’ (March–May 2006) news stories from the same data source as a second data set

to validate the major results we obtain from the first. We include all companies across all

nine sectors (Basic Materials, Conglomerates, Consumer Goods, Financial, Healthcare,

Industrial Goods, Services, Technology, and Utilities) in Yahoo finance whose annual

revenue records appeared in the company statistics section in Yahoo finance as of early


April 2006 for the first data set and mid-June 2006 for the second data set. The revenue

values represent total revenues over the previous four quarters, so in Chapter 8 we predict revenue relations using news collected before the revenue records became available.

7.5 Preliminary Data Processing

Yahoo finance organizes business news stories by company and date. The news

stories are not limited to those available from yahoo.com but also include those from

other news sources, such as forbes.com, thestreet.com, and businessweek.com. In other

words, URL links corresponding to news titles that have been organized under a company

in Yahoo finance may point to Web pages located at several domains. Taking advantage

of this organizing mechanism provided by Yahoo, we identify all news pertaining to a

given company within a period of time. For example, for news belonging to Google and

dated February 28, 2006, a page containing all news titles and their URLs linking to the news content is at http://finance.yahoo.com/q/h?s=GOOG&t=2006-02-28, where GOOG

is the stock ticker of Google Inc. We automatically construct similar URLs to gather links

of news stories for each company in Yahoo finance across the eight- and three-month

periods that constitute our two data sets. We then programmatically fetch news stories

corresponding to the links. Yahoo may organize the same piece of news under different

companies; we treat such a news story as belonging to each of the companies that Yahoo

identifies.


7.6 Node and Link Identification

A news story identifies a company according to its stock ticker on NYSE,

NASDAQ or AMEX. If a piece of news pertaining to a company ni mentions another

company nj, we consider there to be a directed link from ni to nj, denoted as <ni, nj>. If

company nj is cited several times in the same piece of news, each citation adds to the

accumulated weight for the directed link. We aggregate citation frequency across all

news stories in a data set. Furthermore, we do not count self-references; therefore, we

ignore citations to company ni if they appear in a news story belonging to ni. For example, if a news story pertaining to company n1 mentions the companies in the sequence [n2, n1, n3, n4, n4, n2, n5], we derive the set of links and the weight vector as (<n1, n2>, <n1, n3>, <n1, n4>, <n1, n5>) and (2, 1, 2, 1), respectively.
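The counting step in this example can be reproduced with a few lines of Python; this is only a sketch of how one story's citations become weighted outlinks, under the assumption that company names have already been extracted.

```python
from collections import Counter

def citation_links(source, cited_sequence):
    """Weighted outlinks of one news story: self-references are dropped and each
    remaining mention of a company adds one unit of weight."""
    return dict(Counter(c for c in cited_sequence if c != source))

# The example from the text: a story of n1 mentioning [n2, n1, n3, n4, n4, n2, n5].
print(citation_links("n1", ["n2", "n1", "n3", "n4", "n4", "n2", "n5"]))
# {'n2': 2, 'n3': 1, 'n4': 2, 'n5': 1}
```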

that do not mention any other company. After we collected the annual revenues and news

stories for all companies across all nine sectors in Yahoo finance, we emerged with a

total of 6,428 companies and 60,532 news stories for the first data set and 6,246

companies and 36,781news stories for the second data set. For the first data set, we note

that the early months (i.e., July–September 2005) included fewer news stories than later

months, because Yahoo does not archive as many historical news stories as recent ones.

In Table 17, we provide company and news distribution across the nine sectors in the first

data set.

Table 17.

Company and News Distribution across Sectors

Sector | Number of Companies | Percentage of Companies | Number of News Stories | Percentage of News Stories
Basic materials | 522 | 8.12% | 4398 | 7.27%
Conglomerates | 30 | 0.47% | 1004 | 1.66%
Consumer goods | 496 | 7.72% | 4947 | 8.17%
Financial | 1402 | 21.81% | 5512 | 9.11%
Healthcare | 706 | 10.98% | 7481 | 12.36%
Industrial goods | 423 | 6.58% | 2677 | 4.42%
Services | 1334 | 20.75% | 13144 | 21.71%
Technology | 1386 | 21.56% | 20723 | 34.23%
Utilities | 129 | 2.00% | 646 | 1.07%
Total | 6428 | 100% | 60532 | 100%

7.7 Attribute Distributions

Several variables derived from social phenomena and networks, such as the distribution of wealth (the Pareto distribution) and the frequency of word usage in the English language [Adamic 2002], follow a power law distribution. Recent research shows that several aspects of

digital networks such as the Internet follow power law distributions as well. For example,

the rank and frequency of the outdegrees of Internet domains [Faloutsos et al. 1999] and

the indegree and outdegree of Web page links [Barabási et al. 2000, Broder et al. 2000,

Kumar et al. 1999] reflect power law distributions. With the directed, weighted

intercompany network, we observe similar power law distributions for various node

degree measurements (NID, NOD, NWID, and NWOD) and link weight. These results

refer to the first data set; the second data set provides very similar results that we do not

report. All logarithms used in the distributions are base 10.

7.7.1 Node Indegree Distribution

Figure 19 shows that the distribution of node indegree (NID) follows a power law, with a Pearson correlation of 0.945 (negative sign ignored). The distribution


indicates a few nodes (companies) attract most of the citations, similar to social

phenomena such as the distribution of wealth (Pareto distribution) [Adamic 2002]. We

observe similar power law distributions for other node degree measurements, such as

NOD, NWID, and NWOD. For brevity, we do not show their distribution plots herein.
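The power-law check used here amounts to correlating the logged variable with its logged frequency; the sketch below illustrates that computation on a synthetic indegree sample (the 0.945 value reported above comes from the actual data).

```python
import numpy as np
from collections import Counter

def loglog_pearson(values):
    """Pearson correlation between log10(value) and log10(frequency of value)."""
    freq = Counter(v for v in values if v > 0)
    xs = np.log10(list(freq.keys()))
    ys = np.log10([freq[v] for v in freq])
    return np.corrcoef(xs, ys)[0, 1]

# Synthetic indegrees with a heavy tail; |r| close to 1 suggests a power law.
indegrees = [1] * 400 + [2] * 120 + [3] * 60 + [5] * 20 + [10] * 6 + [30] * 2 + [100]
print(round(abs(loglog_pearson(indegrees)), 3))   # negative sign ignored, as in the text
```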

Figure 19. Node Indegree (NID) Distribution

7.7.2 Link Weight Distribution

Figure 20 shows the link weight distribution in our intercompany network. The

link weight also follows a power law distribution, with a Pearson correlation of 0.944.

The power law distribution of link weights indicates there are a few very strong links and

many weak ones.


[Plot: Log(count) versus Log(Weight)]

Figure 20. Link Weight Distribution

7.7.3 Revenue Distribution

We choose one million dollars as the unit in which to record the revenue for each company, group companies with similar logged revenues, and obtain the histogram in

Figure 21, which shows that the (logged) revenues across the 6,428 companies

approximately follow a normal distribution.


[Histogram: Count versus Log(Revenue)]

Figure 21. Revenue Distribution

7.7.4 Revenue Node Weighted Indegree Distribution

[Scatter plot: Log(NWID) versus Log(Revenue)]

Figure 22. Scatter plot of revenue and node NWID


Figure 22 represents a plot of the logged revenues and logged node NWID of all

nodes, with a Pearson correlation of 0.534. Unlike in the prior three subsections, we find no clear pattern relating the two variables. In addition, we observe similar patterns for logged revenue versus NID, NOD, and NWOD.


8 PREDICTING COMPANY REVENUE RELATIONS

As explained in Section 7.6, in our approach the nodes of the intercompany network

consist of companies mentioned in business news stories. When determining a link

between two nodes, unlike traditional SNA that uses explicit social relationships (e.g.,

common directorship [Levine 1972], cooperative business relationships [Walker et al.

1997]), we assume a directed link from company A to company B if a news story

pertaining to company A mentions (cites) company B. Moreover, a link from

company A to company B carries a weight that equals the total number of citations for

company B in a set of news stories belonging to company A. The direction and weight

should provide additional information about the flow and strength of business

relationships in the constructed network. Also, by noting the direction, we can examine

the effects of links coming into a node and those going away from it separately. The

weights in our network reflect the accumulated citations between a pair of companies and

enable us to quantitatively identify a relationship between two companies over time. We

identify a “netdegree” measurement (DWND) that combines the direction and weights to

provide an overall view of the relationship between a pair of companies. Hence, this

approach is more comprehensive than prior related literature on several dimensions,

including a richer network (with weights and direction), a new degree-based metric,

larger data sets, and various analyses related to business relationship prediction.

To illustrate business relationship prediction, in this chapter we focus on

predicting a (positive or negative) CRR between any pair of linked companies and further


estimate whether a company’s revenue places it among the top-N companies (where N varies from 100 to 1000) on the basis of the network structure. Before we present our research

questions in detail, we first describe how we measure CRR.

8.1 Measurements for CRR

As we mentioned in the introduction, a positive or negative revenue relation exists

between a pair of companies. However, when the two companies come from different

sectors, their (absolute) revenue values may not be comparable. Therefore, we derive the

following three metrics to determine a positive or negative CRR by taking the size of a

sector into consideration:

Revenue rank, the rank of the company’s revenue in its sector, namely revenue rank(ni) ∈ [1, |sector(ni)|], where revenue rank(ni) is company ni’s rank order in its sector by revenue and |sector(ni)| is the total number of companies in the sector to which company ni belongs.

Normalized revenue rank(ni) = revenue rank(ni) / |sector(ni)| (6)

Revenue share(ni) = revenue(ni) / Σnj∈sector(ni) revenue(nj) (7)

where revenue(ni) is company ni’s revenue value (in dollars).
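A pandas sketch of these three metrics follows; the input data frame is purely illustrative, and the formulas for Eqs. (6) and (7) are the reconstructions shown above (rank divided by sector size, and revenue divided by total sector revenue).

```python
import pandas as pd

def revenue_metrics(df):
    """Add revenue rank, normalized revenue rank, and revenue share per sector.

    `df` must have 'company', 'sector', and 'revenue' columns.
    """
    df = df.copy()
    df["revenue_rank"] = (df.groupby("sector")["revenue"]
                            .rank(ascending=False, method="first").astype(int))
    sector_size = df.groupby("sector")["company"].transform("count")
    sector_total = df.groupby("sector")["revenue"].transform("sum")
    df["normalized_revenue_rank"] = df["revenue_rank"] / sector_size   # Eq. (6)
    df["revenue_share"] = df["revenue"] / sector_total                 # Eq. (7)
    return df

# Tiny illustrative input; revenue is in millions of dollars, as in Section 7.7.3.
data = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "sector": ["Technology", "Technology", "Financial", "Financial"],
    "revenue": [500.0, 100.0, 800.0, 200.0],
})
print(revenue_metrics(data))
```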


In Chapter x, we report the detailed results measured by revenue ranks and briefly

mention some results generated by the other two metrics; the results measured by those

two metrics are very similar to those measured by revenue ranks.

8.2 Research Questions

We want to explore the broad hypothesis that attributes derived from a network

constructed from news stories can indicate meaningful business relationships (in

particular, CRR and top-N by revenue). Therefore, we identify attributes that capture the

pairwise relationships between companies (dyadic degree-based) or estimate the

individual importance of each company (node degree-based and node centrality-based).

In each case, the attributes are computed purely from weighted and directed links formed

by citations in news stories. In turn, based on the problem described previously and the

identified network-based attributes, we ask the following specific research questions:

1. Is DWND, which captures the net flow of citations between a pair of companies,

an effective indicator of positive CRR?

2. How well can the attributes derived purely from network structure, as shown in

Table 16, predict CRR for a pair of companies in the network?

3. How does CRR prediction performance differ among the three groups of

attributes (Table 16), which represent different amounts of network covered?

4. How well can individual importance measures of each company, such as node

degree- and centrality-based attributes, predict top-N revenue companies?


5. Which of the network structure-based attributes (when combined linearly) are

significant in distinguishing positive and negative CRR?

8.3 Research Methods

Figure 23. Diagram of Methodology and Analysis Approaches

As we discuss in subsequent sections, our methodology, depicted in Figure 23,

generates a directed and weighted intercompany network from business news and uses

the network to address the research questions. For our analysis with pairs of companies,

we use DWND to identify the source and target and ensure each pair is selected only

once: If (ni, nj) is identified as a pair, (nj, ni) cannot be selected. We sort all the links by

their DWND values in descending order and consider only those links whose DWND

values are greater than or equal to 0. For any link <ni, nj> in the network with a DWND

value of 0, we ignore the opposite link <nj, ni>. For the two data sets, we identify 87,340

and 46,725 company pairs, respectively, and use these to predict CRR; we also predict


the top-N companies by revenue and note that the ranges of netdegree values are 0–49

and 0–101 for the first and second data sets, respectively.
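The pair-selection rule just described can be sketched as follows; the weights dictionary is a stand-in for the full intercompany network, and each unordered pair is kept exactly once, oriented so that its DWND is non-negative.

```python
def select_pairs(weights):
    """Select each linked company pair once, oriented along its non-negative DWND."""
    pairs, seen = [], set()
    for (ni, nj) in weights:
        key = frozenset((ni, nj))
        if key in seen:
            continue
        seen.add(key)
        dwnd = weights.get((ni, nj), 0) - weights.get((nj, ni), 0)
        if dwnd < 0:                       # flip so the net citation flow points to nj
            ni, nj, dwnd = nj, ni, -dwnd
        pairs.append((ni, nj, dwnd))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

print(select_pairs({("YHOO", "GOOG"): 512, ("GOOG", "YHOO"): 478, ("JPM", "YHOO"): 3}))
# [('YHOO', 'GOOG', 34), ('JPM', 'YHOO', 3)]
```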

8.3.1 Classification Methods

Using Weka [Witten and Frank 2005] as a data analysis tool, we employ two

classification methods to evaluate the CRR prediction performance for company pairs

and top-N by revenue. For our classification methods, we select logistic regression and

C4.5 [Quinlan 1993] decision tree (i.e., J48 classifier in Weka). Logistic regression is

frequently used in business research for problems with a binary class label (as for our

CRR and top-N prediction problems); decision tree is one of the commonly used

classifiers in data mining, because it is highly accurate for binary classification problems,

it does not impose assumptions about the distribution of data, and its results are well

suited for human interpretation [Padmanabhan et al. 2006]. We use two different methods

so we may compare their performances for our applications. We also employ artificial

neural network (ANN) as a third classification method and find it offers similar results to

those provided by the decision tree. Therefore, we do not include the results obtained

using ANN. When using each of the classification methods, we employ 10-fold cross-

validation for performance measurements. In a 10-fold cross-validation the data is split

into ten disjoint and equal-size subsets, and nine of the subsets are used for training while

the remaining one is used as a holdout for validation. This process is repeated ten times to

find a robust performance measurement of predictive models. Our performance results

are the average of the ten validations [Michael 1997]. In line with standard metrics used


in data mining and information retrieval, we report the average precision, recall, and

accuracy to evaluate the performance of the predictive models:

Precision = TP / (TP + FP) (8)

Recall = TP / (TP + FN) (9)

Accuracy = (TP + TN) / (TP + TN + FP + FN) (10)

where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively.
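For readers who want to reproduce this evaluation protocol outside Weka, the scikit-learn sketch below runs both classifiers with 10-fold cross-validation and averages the three metrics; the attribute matrix and CRR labels here are synthetic placeholders, not the dissertation data.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for the 12 network attributes and the binary CRR label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

scoring = ["precision", "recall", "accuracy"]
for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("decision tree", DecisionTreeClassifier())]:
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    print(name, {m: round(scores[f"test_{m}"].mean(), 3) for m in scoring})
```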

8.3.2 Discriminant Analysis with Logistic Regression

The main purpose of this paper is to explore the power of structural attributes of

the intercompany network, obtained from news, in predicting CRR. However, we would

also like to investigate the significance (if any) of individual attributes (independent

variables or IVs) in discriminating between positive and negative CRR. Therefore, we perform a discriminant analysis using logistic regression. The linear manner in which attributes are combined in logistic regression allows for a simple interpretation of their individual significance. In particular, from the 87,340 pairs we randomly select

1000 pairs such that each company in the chosen pairs is distinct. As a result, there are

2000 unique companies in the 1000 pairs and hence these pairs are considered

independent. With 12 IVs (DWID and DWOD; NWID and NWOD for the source and the target; and pagerank, hits, and betweenness scores for the source and the target) and CRR as the

dependent variable (DV), we employ binary logistic regression in SPSS (version 12.0) to


find the discriminant variables. In particular, we start with a base model that uses the

mean of the DV and does not include any IVs. Then, from a list of candidate IVs that have statistically significant differences between the two DV groups, we add one IV at each step by choosing the IV with the largest score statistic (method “Forward: LR” in SPSS) until the stepwise estimation procedure stops (e.g., when no remaining IV is significant) [Hair et al. 2006].
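The stepwise procedure can be approximated outside SPSS. The statsmodels sketch below (synthetic data, hypothetical column names) adds one IV at a time using likelihood-ratio tests, which is only an approximation of the score-statistic-based "Forward: LR" method used in the study.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

def forward_lr(X, y, alpha=0.05):
    """Forward stepwise binary logistic regression via likelihood-ratio tests."""
    selected = []
    llf = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0).llf   # intercept-only base model
    while True:
        best = None
        for col in X.columns.difference(selected):
            fit = sm.Logit(y, sm.add_constant(X[selected + [col]])).fit(disp=0)
            lr = 2 * (fit.llf - llf)                          # likelihood-ratio statistic
            if stats.chi2.sf(lr, df=1) < alpha and (best is None or lr > best[1]):
                best = (col, lr, fit.llf)
        if best is None:                                      # no remaining IV is significant
            return selected
        selected.append(best[0])
        llf = best[2]

# Hypothetical attributes: two informative columns and one pure-noise column.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["nwid_src", "nwid_tgt", "noise"])
y = (X["nwid_src"] - X["nwid_tgt"] + rng.normal(scale=1.0, size=500) > 0).astype(int)
print(forward_lr(X, y))   # expected to select the two informative attributes
```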

8.4 Results and Analyses

In this section, we first explore how DWND is associated with CRR by

determining whether the net flow of news citations between a pair of companies indicates

the relative size of their revenue. We analyze this attribute since it seems to capture the

overall flow of citations (and importance) from one company to another. Therefore, we

explore whether this importance based on citations reflects the revenue relation between a

pair of companies. We then examine how well the various attributes derived from network

structure predict CRRs for company pairs. To tease out the effects of the three different

sets of attributes (dyadic degree-based, node degree-based, and node centrality-based), we repeat the

prediction experiment with each set of attributes separately. We also predict whether a

given company falls into the set of top-N companies by revenue, for which the

explanatory variables are based on companies’ node-level (node degree and node

centrality) attributes. Finally, we report what IVs are significant in distinguishing CRR.


8.4.1 Positive CRR and Top Links by DWND

We sort all of the links in the network according to their DWND attribute values

(in descending order). Using the set of the top few links from the sorted list, we compute the percentage of links that correctly reflect a positive CRR. We then successively increase the

number of top links (T); in Table 18, we provide the number and percentage of the top

links (where T varies from 20 to a few hundred) that follow the positive CRR. We

measure the significance of the percentages in Table 18 through a binomial test. Finally,

we note that if the DWND were independent of CRR, the percentages in Table 18 would

be close to 50%.

Table 18.

Positive CRR in Top-N links

Top Links (T) | DWND Range | Number of Links Following Positive CRR | Percentage of Links Following Positive CRR
20 | [24, 49] | 16 | 80.0% *
37 | [19, 49] | 31 | 83.8% ***
64 | [16, 49] | 50 | 78.1% ***
79 | [14, 49] | 58 | 73.4% ***
114 | [12, 49] | 80 | 70.2% ***
135 | [11, 49] | 92 | 68.2% ***
175 | [10, 49] | 115 | 65.7% ***
217 | [9, 49] | 134 | 61.8% ***
289 | [8, 49] | 172 | 59.5% ***

* p < 0.05, *** p < 0.001 (two-tailed).
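As an illustration of the kind of two-tailed binomial test reported in Table 18 (the dissertation does not name the tool it used), the scipy sketch below checks the row with T = 37.

```python
from scipy.stats import binomtest

# Row T = 37 of Table 18: 31 of 37 top links follow the positive CRR.
result = binomtest(k=31, n=37, p=0.5, alternative="two-sided")
print(result.pvalue)   # well below 0.001, consistent with the *** marking
```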

When the DWND values are relatively high, DWND seems to be a good indicator

of positive revenue relations. We observe a similar result for top links in the second data

set.


8.4.2 Positive CRR by DWND

As the DWND value decreases, so does the signal indicating the positive CRR

between a pair of companies. To examine this observation further, we segment the links

in the intercompany network into baskets, such that links in each basket have the same

DWND, and combine links with different DWND values into one basket only if the

basket contains fewer than 20 links. In Table 19, we provide the percentages of links

following positive CRR in each basket.

Table 19.

Positive CRR for Links with the Same or Similar DWND

Basket No. | DWND | Percentage of Links Following Positive CRR
1 | 1 | 46.5%
2 | 2 | 48.8%
3 | 3 | 46.8%
4 | 4 | 51.9%
5 | 5 | 51.8%
6 | 6 | 57.1%
7 | 7 | 56.3%
8 | 8 | 52.8%
9 | 9 | 45.2%
10 | 10 | 57.5%
11 | [11, 12] | 55.6%
12 | [13, 17] | 62.5%
13 | [18, 23] | 86.7% ***
14 | [24, 49] | 80.0% *

* p < 0.05, *** p < 0.001 (two-tailed, binomial test).

When DWND values are small (e.g., less than 10), links in the same baskets do

not display a clear trend toward a positive CRR. In other words, for company pairs in

those baskets, pointing to a company with the same or higher revenue rank is about as


likely as pointing to one with lower revenue rank. However, as the DWND values

increase, positive CRR becomes more salient.

In summary, DWND can be an indicator of positive CRR for top links, i.e., links with large DWND values. Overall, 48% of the 87,340 pairs whose DWND is non-negative follow a positive CRR, suggesting that the indicative power of DWND disappears when all pairs are considered.

8.4.3 Predicting CRR

We now attempt to predict positive or negative CRR between a pair of companies

using various attributes derived from the intercompany network. The predicted class label

therefore is a binary number whose values correspond to positive (1) and negative (0)

CRR. We first predict CRR using the attributes identified in Chapter 7, then split these attributes into three subsets (Table 16) and observe their predictive power.

8.4.3.1 All Three Attribute Groups

To predict the CRR for each pair of companies, we use a total of 12 attributes (2

dyadic degree-based, 4 node degree-based, and 6 node centrality-based). For the node

degree-based and node centrality-based measures, we employ a pair of attributes for the

source and target companies of each link. Of the dyadic degree-based attributes, we do

not use DWID because it can be derived directly from DWND and DWOD (see Section 7.2.1). Table 20-1 shows the results of the two classification methods for the first data set

(87,340 company pairs).

Table 20-1.

Classification Results of CRR with 12 Attributes (First Data Set)

Classification Method | Class Label (CRR) | Number (Percentage) of Pairs | Precision | Recall | Accuracy
Logistic regression | 0 | 45398 (52.0%) | 72.9% | 76.2% | 72.9%
Logistic regression | 1 | 41942 (48.0%) | 72.9% | 69.4% |
Decision tree | 0 | 45398 (52.0%) | 78.4% | 79.6% | 78.0%
Decision tree | 1 | 41942 (48.0%) | 76.3% | 76.9% |
Notes: Attributes are DWND, DWOD, source NWID, source NWOD, target NWID, target NWOD, source pagerank, source hits, source betweenness, target pagerank, target hits, and target betweenness. Accuracy is reported once per classification method.

From Table 20-1 we observe that using attributes derived from a network

constructed from news stories, without resorting to any information about a company’s

sector or revenue, we achieve reasonable precision, recall, and accuracy of approximately

70–80% in predicting the CRR between companies. Our data set consists of an almost

equal number of positive and negative CRR instances (see the third column in Table 20-

1), so the prior probability for a link being positive or negative CRR is approximately

50%. In addition, when we use revenue value, normalized revenue rank, or revenue share (instead of revenue rank) to determine CRR, we achieve very similar results to those in Table 20-1 in terms of precision, recall, and accuracy. Finally, we divide the 87,340 pairs into two subsets: (1) all pairs in which both

companies in the pair belong to the same sector and (2) the remaining pairs (different

sectors). We examine the prediction performance for each subset separately using


revenue rank, normalized revenue rank, and revenue share to determine CRR, and again,

the precision, recall, and accuracy fall around the 70–80% range, similar to those in Table

20-1.

Using the ten accuracy values generated through the 10-fold cross-validation, we find that the average accuracies of the logistic regression and the decision tree differ significantly (two-tailed t-test, p < 0.001), with the decision tree proving to be the superior method.
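As an illustration of this experimental setup (not the original implementation), the sketch below assembles the 12 network-derived attributes for each company pair, attaches the binary CRR label, and compares logistic regression and a decision tree under 10-fold cross-validation; the file name and column names are hypothetical.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

ATTRIBUTES = [
    "DWND", "DWOD",
    "source_NWID", "source_NWOD", "target_NWID", "target_NWOD",
    "source_pagerank", "source_hits", "source_betweenness",
    "target_pagerank", "target_hits", "target_betweenness",
]

# Hypothetical input: one row per company pair with the 12 attributes and a 0/1 CRR label.
pairs = pd.read_csv("company_pairs.csv")
X, y = pairs[ATTRIBUTES], pairs["CRR"]

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier())]:
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} over 10 folds")

The ten per-fold accuracies produced for each method can then be compared with a paired t-test, analogous to the comparison reported above.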

To check the robustness of our results, we use the same 12 attributes to predict the CRR, now with the second data set of 46,725 company pairs. Table 20-2 shows the performance of the two classification methods on this data set. We note that precision, recall, and accuracy are very close to those in Table 20-1.

Table 20-2.

Classification Results of CRR with 12 Attributes (Second Data Set)

Classification Method   Class Label (CRR)   Number (Percentage) of Pairs   Precision   Recall   Accuracy
Logistic regression     0                   23,861 (51.1%)                 73.6%       77.2%    74.2%
                        1                   22,864 (48.9%)                 74.9%       71.0%
Decision tree           0                   23,861 (51.1%)                 77.8%       78.9%    77.7%
                        1                   22,864 (48.9%)                 77.6%       76.4%

8.4.3.2 Each Separate Attribute Group


We are also interested in comparing the performances with different groups of

attributes separately; in Tables 21, 22, and 23, we provide the associated results for the

first data set.

Table 21.

Classification Results of CRR Using DWND and DWOD

Classification Method   Revenue Relation   Precision   Recall   Accuracy
Logistic regression     0                  52.1%       98.7%    52.1%
                        1                  54.3%       1.6%
Decision tree           0                  52.2%       91.0%    52.0%
                        1                  50.1%       9.8%

Table 22.

Classification Results of CRR Using Source NWID, Source NWOD, Target NWID, and

Target NWOD

Classification Method   Revenue Relation   Precision   Recall   Accuracy
Logistic regression     0                  69.3%       82.8%    72.0%
                        1                  76.4%       60.3%
Decision tree           0                  78.7%       77.8%    77.5%
                        1                  76.3%       77.2%

Table 23.

Classification Results of CRR Using Source Pagerank, Source Hits, Source Betweenness,

Target Pagerank, Target Hits, and Target Betweenness

Classification Method   Revenue Relation   Precision   Recall   Accuracy
Logistic regression     0                  73.2%       75.2%    72.8%
                        1                  72.4%       70.1%
Decision tree           0                  77.6%       78.3%    77.0%
                        1                  76.3%       75.5%

The two dyadic degree-based attributes, DWND and DWOD, fail to predict revenue relations well, whereas the four node degree-based and six node centrality-based attributes produce results nearly as good as those we obtain from using all 12 attributes together. When we apply the three groups of attributes separately to the second data set, we obtain very similar results, except that with the decision tree method, the two dyadic attributes provide higher recalls for both positives and negatives (51%).

The poor performance of dyadic degree-based attributes may be due to their

reliance on the local (pairwise) flow of citations between the two companies. This

localized property of the dyadic attributes may fail to capture the relative importance of

the two companies, which is formed by all the citations they receive from or provide to

many other nodes in the network. The more global node degree- and node centrality-

based measures therefore better predict CRR.

8.4.4 Predicting Top-N Companies by Revenue

We now consider the related problem of predicting whether a company will fall

within the set of top-N companies by revenue (in dollars). Because we are no longer

interested in the direct relation between a pair of companies, we do not use the dyadic

attributes in these predictive methods. We employ five node-level attributes for each


company in the network (listed in the caption of Figure 24). The class label to be

predicted takes a value of 1 if the company is a top-N company by revenue and 0

otherwise. Again, we base all performance measurements on 10-fold cross-validation.

Figures 24 and 25 show the performances of the two classification methods as N varies

from 100 to 1000 with a step size of 100.
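A hedged sketch of this experiment (file and column names are assumptions, not the original code) labels each company as top-N or not by sorting on revenue and evaluates one classifier on the five node-level attributes as N is swept from 100 to 1000:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score

NODE_ATTRIBUTES = ["NWID", "NWOD", "pagerank", "hits", "betweenness"]

# Hypothetical input: one row per company with the attributes and its annual revenue.
companies = pd.read_csv("companies.csv")
ranked = companies.sort_values("revenue", ascending=False).reset_index(drop=True)

for n in range(100, 1001, 100):
    y = (ranked.index < n).astype(int)          # 1 = top-N company by revenue
    X = ranked[NODE_ATTRIBUTES]
    pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=10)
    print(f"N={n}: precision_1={precision_score(y, pred, zero_division=0):.2f} "
          f"recall_1={recall_score(y, pred, zero_division=0):.2f}")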

Figure 24. Precision and recall for logistic regression in predicting top-N companies (x-axis: top-N from 100 to 1000; y-axis: percent; series: Precision 0, Recall 0, Precision 1, Recall 1). Notes: NWID, NWOD, pagerank, hits, betweenness.

Figure 25. Precision and Recall for Decision Tree in Predicting Top-N Companies (x-axis: top-N from 100 to 1000; y-axis: percent; series: Precision 0, Recall 0, Precision 1, Recall 1)

The two classification methods produce similar results. Performance for

predicting the negatives (i.e., a company is not in the set of top-N companies) is high,

with precision and recall (for both methods) in the range of 89–99%. However, precision

for predicting the positives is in the range of 57–75%, and recall is substantially lower (24–

36%). We observe similar results with the second data set; for the negatives, both

precision and recall are between 88% and 99%, whereas for the positives, precision is

65–76% and recall is 22–35%. Although these positive prediction performances may

seem rather low, they should be judged with the knowledge that the top-N companies,

where N varies from 100 to 1000, constitute only 1.6–16% of the total number of

companies in the two data sets. That is, the problem of correctly identifying a company in

the set of top-N companies by revenue is particularly hard, whereas identifying a

company that is not in the top-N is easier because most companies fall into this category.

Given the high prior probability of negatives, our results for this problem are

encouraging.

8.4.5 Discriminant Variate

At the first step of the discriminant analysis, before adding the first IV into the

model, we find that ten IVs (node degree-based and centrality-based) are significant (with

significance equal to or less than 0.05) and the (two) dyadic degree-based IVs are not.


The result for dyadic degree-based IVs is consistent with what we see in Table 21: those

IVs produce very poor prediction results. The first IV included in the discriminant model is the source_hits score, as it has the largest score statistic. After including source_hits and repeating the evaluation procedure, the second IV to be added is target_hits. At this step, the remaining eight IVs that were significant before the first IV was included become insignificant, owing to high multicollinearity among the IVs (i.e., hits, pagerank, betweenness, NWID, and NWOD). This high multicollinearity also explains the similar performance of the different sets of IVs in Tables 22 and 23. The coefficient β for source_hits is negative (-1863.7) and that for target_hits is positive (1627.5), indicating that an increase in source_hits decreases the likelihood of a positive CRR, whereas an increase in target_hits increases it. In other words, the global (hub-like) centrality of the target company is indicative of its higher revenue, and the reverse holds for the source company. Hence, the global centrality-based hits metrics for the source and target companies constitute a discriminant variate that can significantly discriminate between positive and negative CRR. The prediction results obtained using the discriminant model (with a constant and the two IVs, source_hits and target_hits) are as follows:

Table 24.

Prediction Results for Discriminant Model with Two IVs

Classification Method                      Revenue Relation   Precision   Recall   Accuracy
Discriminant model (logistic regression)   0                  64.6%       47.3%    65.5%
                                           1                  56.0%       70.5%


Compared with Tables 20-1, 22, and 23, Table 24 shows inferior results, indicating that adding more IVs can improve prediction performance (the main focus of this study).
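The stepwise variable selection described above was performed with standard discriminant analysis procedures; the sketch below is only a loose analogue, selecting variables for a linear discriminant model by forward search on cross-validated accuracy rather than by the score statistic, with hypothetical data loading.

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Hypothetical input, as in the earlier sketch: 12 IV columns plus a 0/1 CRR label.
pairs = pd.read_csv("company_pairs.csv")
candidates = {"DWND", "DWOD", "source_NWID", "source_NWOD", "target_NWID",
              "target_NWOD", "source_pagerank", "source_hits", "source_betweenness",
              "target_pagerank", "target_hits", "target_betweenness"}
selected, best_score = [], 0.0

while candidates:
    step_scores = {attr: cross_val_score(LinearDiscriminantAnalysis(),
                                         pairs[selected + [attr]], pairs["CRR"],
                                         cv=10).mean()
                   for attr in candidates}
    attr, score = max(step_scores.items(), key=lambda kv: kv[1])
    if score <= best_score:      # stop when no remaining IV improves the model
        break
    selected.append(attr)
    candidates.remove(attr)
    best_score = score

print("selected IVs:", selected)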

8.5 Discussions

We propose a news-driven, SNA-based business relationship discovery approach

to explore the predictive value of business news in discerning relationships between

companies. Our approach uses citations in news stories to understand the direction and

strength of the relative importance between a pair of companies. In our intercompany

network, nodes are companies, and links are directed and weighted on the basis of the

direction and frequency of citations in news stories. We identify and quantify various

attributes of the network using standard network analysis metrics and suggest modified or

new metrics as needed (e.g., DWND). We then use these attributes to predict the (future)

relative revenue relation between a pair of companies as an example of business

relationships the approach might predict. We also investigate whether we can predict if a

given company falls into the set of top-N companies by revenue. We process and employ

two sets of multi-month data from the online business news available at Yahoo! Finance. Both data sets reaffirm the robustness of our findings. Applying discriminant analysis, we identify a set of significant IVs.

Attributes derived purely from the constructed network predict CRR well, which

validates our broad hypothesis that news stories and the citations contained within them


provide cues about real-world relationships. Moreover, our approach is intrinsically

language independent and can be extended to news in various languages.

Similar to many other networks constructed from the Internet, we find that

various attributes of our network, such as NID, NOD, NWID, NWOD, and link weight,

follow a power-law distribution. By exploring the relation between DWND and positive

CRR, we find that company pairs with large DWND tend to be associated with positive

CRR. Hence, as expected, the DWND metric (at least for large values) captures the

overall flow of revenue (importance) between a pair of companies.

We study the CRR prediction problem by using all 12 attributes at the same time,

as well as subgroups individually. The subgroups reflect the different nature of the

attributes, which vary in the range of the network covered for their computations. More

global measures, such as node degree- and node centrality-based attributes, are better

predictors of CRR than are the dyadic degree-based attributes that concentrate only on

pairwise relationships and ignore the rest of the network.

With regard to predicting whether a company’s revenue falls among the top-N,

the precision for predicting the positives (top-N) is much higher than the recall. These

results may seem humble until we consider them in the context of the prior distributions

in the data sets. Considering that only a small percentage of companies fall into the set of

top-N companies by revenue, a precision value in the range of 57–75%, as we achieve, is

encouraging. If our predictive models randomly assign companies to the top-N, the

precision for predicting positives should not exceed 16%. With discriminant analysis on the 12 IVs, we identify two global centrality-based IVs (the hits scores for the source and target companies) that are significant in distinguishing positive from negative CRR.


Our approach thus can not only serve as a data filtering step for analysts but also

be useful for tracing and monitoring the dynamics of revenue relations for many

companies over time. After validating the value of news in discerning meaningful, real-

world relationships, we continue to explore richer business relationships such as

competitors with the same network construction approach. Our preliminary results with

this new relationship prediction problem (i.e., predicting competitors) have been very

encouraging, and we plan to further validate our approach with a variety of business

relationships, news from different languages (and countries), various types of companies

(e.g., private versus public), and over time. Further research might also attempt to derive and evaluate additional graph attributes that synthesize the global and dyadic measures and may serve as more effective predictors of business relationships between a pair of companies.


9 DISCOVERING COMPETITOR RELATIONSHIPS

9.1 Approach Outline and Research Questions

Figure 26 outlines the five main steps of our approach to competitor discovery. The first two steps have been explained in Section 6.1. In step 3, as a preliminary investigation, we first examine the citation-based intercompany network for both its competitor coverage (coverage of known competitors) and competitor density (the likelihood of finding competitors among the linked company pairs in the network). We benchmark this preliminary investigation against an exhaustive as well as a random search to provide a comparative analysis of the citation-based intercompany network in terms of search cost. We find that competitor relationship discovery is especially challenging in portions of our data set where the number of non-competitor pairs overwhelms the number of competitor pairs. We use a combination of data from Hoover's and Mergent as our gold standards for evaluation purposes.

Figure 26. Process View of the Approach


This study focuses on the following two research questions:

1. How well can we discover competitor relationships between companies using four

types of attributes derived from the intercompany network? In particular, using

special classification techniques suited to imbalanced datasets, we report the classification performance on the imbalanced portion of the data set.

2. To what extent can a gold standard cover the set of all competitors and to what

extent does our approach extend the knowledge (i.e. competitors) covered by a

gold standard? We use Hoover’s and Mergent as gold standards for identifying

competitors. However, we are keenly aware that these data sets are not complete

or consistent, as illustrated by the examples earlier. Hence, we try to estimate their coverage of all competitor pairs and also propose metrics to estimate how much our approach extends the knowledge available in each of the two gold-standard data sources.

9.2 Datasets

In the following two subsections, we introduce two datasets that will be used to

evaluate competitor classification performance. The first dataset represents the whole set of

pairs in the network, and the second is created to represent the imbalanced part of the

whole dataset.

9.2.1 Dataset I (Instance Selection and Labeling of Dataset I with 840 Company Pairs)


We first use DWND (net flow of citations between a pair of companies) to

identify all distinct (linked) company pairs in the network by including only pairs with

non-negative DWND values, and for any link <ni, nj> with a DWND value of 0, we

ignore the opposite link <nj, ni>. For example, in this way we identify a total of eight

links in Figure 18. For the whole intercompany network we identify a total of 87,340

company pairs. Next, we sort the pairs by their DWIOD values in descending order. We

find that the range of DWIOD is between 1 and 990. The reason to choose DWIOD for

ordering the company pairs is that DWIOD captures the total volume of citations between

two companies in the news. We expect that the larger the number of citations in news

stories between two companies, the higher the likelihood of a business relationship

between them. We find that, in terms of DWIOD values, the data set is skewed in that

most of the company pairs have small DWIOD values. In order to examine the competitor relationship, we drill down and group company pairs with the same or similar DWIOD values. In particular, we divide all company pairs into baskets based on their DWIOD values such that links with different DWIOD values do not appear in the same basket unless the basket contains fewer than 200 pairs. This procedure results in 21 baskets associated with different DWIOD values. We randomly choose 40 pairs from each basket to create 21 sample baskets. The 840 pairs (40 × 21) constitute our dataset I, which we will use to examine the classification performance for individual baskets in Section 9.4.
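A minimal sketch of this basketing and sampling step, assuming a pandas DataFrame with a DWIOD column (the grouping rule is approximated, not reproduced exactly):

import pandas as pd

pairs = pd.read_csv("company_pairs.csv")        # hypothetical input: one row per linked pair
groups = [g for _, g in pairs.sort_values("DWIOD", ascending=False)
                              .groupby("DWIOD", sort=False)]

baskets, current = [], []
for group in groups:
    current.append(group)
    if sum(len(g) for g in current) >= 200:     # close the basket once it holds 200+ pairs
        baskets.append(pd.concat(current))
        current = []
if current:                                     # leftover pairs join the last basket
    baskets[-1] = pd.concat([baskets[-1]] + current)

sample_baskets = [b.sample(n=40, random_state=0) for b in baskets]   # 40 pairs per basket
dataset_I = pd.concat(sample_baskets)
print(len(baskets), "baskets;", len(dataset_I), "pairs in dataset I")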

We manually identify whether each of the 840 company pairs in the 21 sample

baskets is a competitor pair using Hoover's and Mergent, respectively. A class label of 1 is

assigned to a pair (positive instance) if we find a competitor relationship between the two


companies by either Hoover’s or Mergent; otherwise, a class label of 0 is assigned

(negative instance). Table 25 shows the DWIOD range and the size of each basket, the

number and percentage of competitor pairs in 21 sample baskets. The table illustrates a

general trend that a higher DWIOD value tends to be associated with a higher percentage

of competitor pairs in a sample basket. This matches our previously mentioned intuition

that as the overall volume of citations between a pair of companies increases we may

expect the companies to have a business relationship (such as being competitors) with

greater likelihood.

Table 25.

Distribution of Competitor Pairs in 21 Sample Baskets

Basket   DWIOD range   Basket size   Positives by Hoover's   Positives by Mergent   Positives by union   Positives by intersection
1        [69, 990]     200           26 (65.0%)              11 (27.5%)             26 (65.0%)           11 (27.5%)
2        [44, 68]      209           19 (47.5%)              9 (22.5%)              19 (47.5%)           9 (22.5%)
3        [32, 43]      224           17 (42.5%)              6 (15.0%)              17 (42.5%)           6 (15.0%)
4        [26, 31]      239           14 (35.0%)              4 (10.0%)              15 (37.5%)           3 (7.5%)
5        [22, 25]      212           14 (35.0%)              8 (20.0%)              15 (37.5%)           7 (17.5%)
6        [19, 21]      235           17 (42.0%)              6 (15.0%)              18 (45.0%)           5 (12.5%)
7        [17, 18]      224           8 (20.0%)               5 (12.5%)              11 (27.5%)           2 (5.0%)
8        [15, 16]      281           13 (32.5%)              6 (15.0%)              13 (32.5%)           6 (15.0%)
9        [13, 14]      389           10 (25.0%)              4 (10.0%)              10 (25.0%)           4 (10.0%)
10       12            263           16 (40.0%)              3 (7.5%)               17 (42.5%)           2 (5.0%)
11       11            330           8 (20.0%)               4 (10.0%)              9 (22.5%)            3 (7.5%)
12       10            410           8 (20.0%)               2 (5.0%)               8 (20.0%)            2 (5.0%)
13       9             470           8 (20.0%)               3 (7.5%)               8 (20.0%)            3 (7.5%)
14       8             622           13 (32.5%)              6 (15.0%)              13 (32.5%)           6 (15.0%)
15       7             769           10 (25.0%)              3 (7.5%)               11 (27.5%)           2 (5.0%)
16       6             1,390         5 (12.5%)               3 (7.5%)               6 (15.0%)            2 (5.0%)
17       5             1,543         5 (12.5%)               2 (5.0%)               5 (12.5%)            2 (5.0%)
18       4             4,142         4 (10.0%)               0 (0.0%)               4 (10.0%)            0 (0.0%)
19       3             4,972         2 (5.0%)                2 (5.0%)               4 (10.0%)            0 (0.0%)
20       2             29,603        1 (2.5%)                0 (0.0%)               1 (2.5%)             0 (0.0%)
21       1             40,613        0 (0.0%)                0 (0.0%)               0 (0.0%)             0 (0.0%)
Total                  87,340        218                     87                     230                  75

Notes: Positive counts and percentages refer to the 40-pair sample basket drawn from each basket; "union" and "intersection" refer to the union and intersection of Hoover's and Mergent.

9.2.2 Datasets II and III (Instance Selection and Labeling of Dataset II with 2000

Company Pairs)

In an imbalanced dataset a majority of instances are labeled as one class, while

the minority is labeled as the other class, which is typically the more important class

[Kotsiantis, et al. 2006].

Several sample baskets in Table 25 show low percentages of positives and can be considered imbalanced datasets. As prior research [e.g., Weiss and Provost 2003] and our results in Section 9.4 empirically show, typical classification methods fail to detect the minority class in an imbalanced dataset, generating very low precision and recall on the positives (e.g., close to 0%), which here are the competitor pairs and therefore the class of interest in this study. The main reason for the poor performance on positives is that classifiers by default maximize accuracy, which in turn gives more weight to the majority class than to the minority one [Kotsiantis, et al. 2006]. For example, for a dataset with 1% positives, simply assigning every instance a negative label and not detecting any positives achieves an accuracy of 99%. To handle the imbalanced dataset problem, we first create a larger dataset, dataset II, by proportionally (according to their basket sizes) sampling a total of 2000 pairs from the four imbalanced baskets (#18, 19, 20, and 21), because their corresponding sample baskets have the lowest ratios of positives (≤10%).

We manually label the 2000 pairs using Mergent and Hoover's, respectively. The numbers and percentages of competitors according to the different gold standards are displayed in Table 26.


Table 26.

Number (percentage) of Positive Pairs in Dataset II

DWIOD   Sample basket size   Positives by Hoover's   Positives by Mergent   Positives by union   Positives by intersection
1       1024                 22 (2.1%)               15 (1.5%)              29 (2.8%)            8 (0.8%)
2       747                  30 (4.0%)               13 (1.7%)              39 (5.2%)            4 (0.5%)
3       125                  12 (9.6%)               3 (2.4%)               14 (11.2%)           1 (0.8%)
4       104                  15 (14.4%)              7 (6.7%)               18 (17.3%)           4 (3.8%)
Total   2000                 79 (4.0%)               38 (1.9%)              100 (5.0%)           17 (0.9%)

In subsequent analyses, besides datasets I and II, we also use the 17 sample baskets (#1 to #17) from dataset I together with all pairs in dataset II to produce estimated overall performance results. For convenience, hereafter we call this combination of the two datasets dataset III; it contains 18 baskets, with dataset II serving as the 18th sample basket.

9.3 Examining Competitor Coverage & Competitor Density of the Intercompany

Network

In this section we examine two issues: how completely the links of the intercompany network cover competitor pairs (i.e., competitor coverage), and how likely a link of the intercompany network is to be a competitor pair (i.e., competitor density). Hence, we are interested in understanding the extent to which "competitor semantics" is embedded in the links of the constructed network. The higher the competitor coverage and competitor density of the intercompany network, the lower the cost of searching for (and classifying) competitors using the network. We


benchmark the competitor coverage of the intercompany network against that of an

exhaustive network, which is a clique in which all nodes are linked to each other. We also

compare the competitor density of the intercompany network with that of a random

network that has the same numbers of nodes and links as those of the intercompany

network. Table 27 includes notation for examining competitor coverage and competitor

density.

Table 27.

Notation for Competitor Coverage and Competitor Density

Notation                    Interpretation
N                           Number of unique companies in a sample basket with 40 company pairs.
CL                          Citation-based links among the N companies in the intercompany network.
EL                          Exhaustive links among the N companies.
CP(CL)                      Number of competitor pairs (CP) present in CL.
CP(EL)                      Number of competitor pairs present in EL.
Competitor coverage ratio   CP(CL)/CP(EL), the proportion of all known competitor pairs that are present as links in the citation-based intercompany network.
CP40(CL)                    Number of competitor pairs present in the 40 links from a sample basket.
RL                          Randomly generated company links from the N companies.
CP40(RL)                    Number of competitor pairs present in 40 randomly generated links.
CD40(CL)                    CP40(CL)/40, competitor density for a small citation-based network consisting of the 40 links from a sample basket.
CD40(RL)                    CP40(RL)/40, competitor density for a random network consisting of 40 random links.
CD(EL)                      CP(EL)/(N*(N-1)), competitor density for an exhaustive network (clique) consisting of the exhaustive links.

9.3.1 Examining the Competitor Coverage


From the 40 company pairs in each sample basket in dataset I, we identify N and EL.

With the whole intercompany network we further find CL. CP(CL) and CP(EL) are

determined by the union of Hoover’s and Mergent. Figure 27 shows the competitor

coverage ratio for the intercompany network across the 21 sample baskets. We find that

the competitor coverage ratio is always greater than 66% and typically in the range of 85-

100% across the sample baskets. We also note that CL is a fraction of EL, ranging from

15% to 84%, across the sample baskets. Hence, by starting with the intercompany network, our classification models explore only a small subspace of all possible relationships, and this subspace covers most of the competitor pairs.

Figure 27. Competitor Coverage Ratio (x-axis: sample basket 1 to 21; y-axis: coverage ratio)

9.3.2 Examining the Competitor Density


Using the union of data from Hoover's and Mergent, we label the 40 company pairs in each sample basket to find CP40(CL). Given N, we randomly generate 40 links from the N unique companies and find CP40(RL). We repeat the procedures of random link generation and link labeling four times and obtain an average value of CP40(RL). Then we compute the competitor density CD40(CL) and the average CD40(RL) for all sample baskets. Moreover, since we know CP(EL) from subsection 9.3.1, we also calculate CD(EL).
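For one sample basket, the coverage and density quantities in Table 27 could be computed along the following lines; the data structures (sets of unordered company pairs) are illustrative rather than the original implementation, and the table defines CD(EL) over directed links.

import random
from itertools import combinations

def basket_measures(companies, citation_links, competitors, basket_links):
    """companies: set of tickers; citation_links / basket_links: sets of
    frozenset pairs from the intercompany network; competitors: set of
    frozenset pairs from the union of Hoover's and Mergent."""
    n = len(companies)
    el = {frozenset(p) for p in combinations(companies, 2)}   # exhaustive pairs
    cp_el = el & competitors                                   # CP(EL)
    cp_cl = citation_links & competitors                       # CP(CL)
    coverage_ratio = len(cp_cl) / len(cp_el) if cp_el else float("nan")

    cd40_cl = len(basket_links & competitors) / 40             # CD40(CL)
    random_links = set(random.sample(list(el), 40))            # RL
    cd40_rl = len(random_links & competitors) / 40             # CD40(RL)
    cd_el = len(cp_el) / len(el)   # density over unordered pairs (Table 27 uses N*(N-1))
    return coverage_ratio, cd40_cl, cd40_rl, cd_el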

Figure 28 shows the competitor density for the citation-based intercompany network,

random network, and exhaustive network across the 21 sample baskets. The curve for

average CD40(RL) is very close to that of CD(EL), indicating that the probability of

being a competitor pair in the randomly generated 40 pairs is consistent with that in the

exhaustive links. Moreover, CD40(CL) is much higher than average CD40(RL) and

CD(EL) in 20 of the 21 sample baskets. The difference in those probabilities tells us that

pairs in the intercompany network, for most of the baskets, are much more likely to be

competitor pairs than those in random links. The high competitor density in the

intercompany network for most sample baskets would be beneficial for classifiers in

competitor classification.


Figure 28. Probability of Being a Competitor Pair (x-axis: sample basket; y-axis: probability; series: CD40(CL), CD40(RL), CD(EL))

The results in subsections 9.3.1 and 9.3.2 show that the citation-based intercompany network has high competitor coverage and density and hence can be expected to alleviate the problems associated with searching for competitors in an exhaustive or random space of potential relationships. The results also confirm our intuition that links in the citation-based intercompany network contain signals about competitor relationships rather than being random. We are now ready to explore the learning power of models that use the topological attributes of the intercompany network (described in Chapter 7) to discover competitor relationships.

9.4 Competitor Discovery

Our competitor classification models use four types of attributes to classify a

company pair into competitors or non-competitors. The class label (dependent variable)


in the models is binary in nature while all of the classification attributes/variables are

continuous. This setup allows for applying a variety of standard binary classification

models. As is common in machine learning/data mining, we use part of the data set for

training and leave a disjoint testing set to evaluate the discriminating power of the

models. In fact this training-testing process is repeated several times with different splits

of the data (cross-validation) to assure the robustness of observed results. We evaluate the

discriminating power of the models based on several standard metrics that are described

next.

9.4.1 Evaluation Metrics

Table 28 is the confusion matrix containing the actual and classified classes for a

classification problem with two class labels. TP is the number of true positives, TN is the

number of true negatives, FP is the number of false positives, and FN is the number of

false negatives.

Table 28.

Confusion Matrix

                                Classified class label
                                Positive    Negative
Actual class label   Positive   TP          FN
                     Negative   FP          TN


Using the confusion matrix we introduce the common metrics for evaluating and

comparing classification performance as follows:

Precision = TP / (TP + FP) (11)

Recall (TP rate) = TP / (TP + FN) (12)

FP rate = FP / (FP + TN) (13)

Fα = (1 + α) × Precision × Recall / (α × Precision + Recall) (14)

Accuracy = (TP + TN) / (TP + TN + FP + FN) (15)

In most classification problems precision and recall present a trade-off. As a model tries to be conservative while classifying competitors in order to boost precision, it is expected to miss some of the competitors and hence achieve reduced recall. The F measure is based on both precision and recall, and the parameter α denotes the relative importance of recall versus precision. The parameter α is set to 1 to produce F1, the harmonic mean of precision and recall. Note that throughout this study the precision, recall, and F measures are based only on the positive class (competitors), which is the more important class here.


One of the most common metrics for evaluating classifiers on imbalanced datasets is the Receiver Operating Characteristic (ROC) curve [Kotsiantis, et al. 2006]. It is a two-dimensional curve in which the TP rate (recall) is plotted on the y-axis and the FP rate on the x-axis (for specific examples see Figure 31). An ROC curve thus exposes an important tradeoff: the number of correctly identified positives increases at the expense of introducing additional false positives. The area under the ROC curve, called the AUC, is also used as an evaluation metric.
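As a small illustration, the ROC points and AUC reported later (Figure 31 and Table 29) can be derived from a classifier's predicted positive-class probabilities, for example with scikit-learn rather than the Weka tooling used in this study:

from sklearn.metrics import roc_curve, roc_auc_score

def roc_points(y_true, y_score):
    """y_true: 0/1 labels; y_score: predicted probability of the positive
    (competitor) class. Returns FP rates, TP rates, and the AUC."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return fpr, tpr, roc_auc_score(y_true, y_score)

# Example usage with toy values: roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])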

9.4.2 Competitor Classification with Dataset I

Using the publicly available Weka API [Witten and Frank 2005], we employ four

classification methods – Artificial Neural Network (ANN), Bayes Net (BN), C4.5

decision tree (DT), and logistic regression (LR) – to classify whether a pair of companies

are competitors or not. Models based on ANN, BN, and DT are commonly used as

classifiers in data mining. LR is frequently used in business research for problems with a

binary class label (as for our competitor classification problem). For each sample basket,

except for sample basket # 21 which does not contain any competitor pair (we will handle

this basket together with three other baskets as the imbalanced dataset II in the next

subsection), we report the average precision and recall generated by 10-fold cross-

validation for each classification method. We use different classification methods so that

we may compare their performances for our application.


9.4.3 Competitor Classification with Dataset II

9.4.3.1 Background on Handling Imbalanced Dataset

Solutions for handling imbalanced datasets in classification problems exist at both the data and algorithmic levels. Several data-level solutions use different re-sampling

approaches, such as undersampling majority, oversampling minority, or oversampling

minority by creating synthetic minority [Chawla et al. 2002], in order to change the prior

distribution of the original dataset [Kotsiantis, et al. 2006] before learning from the

dataset. Another data-level approach is to segment the whole dataset into disjoint regions such that the data in certain region(s) are not imbalanced [Weiss 2004].

Some of the popular solutions at the algorithmic level include the following:

Decision threshold adjustment (DTA)

Given a (normalized) probability of an example of being positive (or negative),

DTA changes the threshold that is used to decide which class label the instance is

assigned to [Kotsiantis, et al. 2006].

Cost-sensitive learning (CSL)

This method assigns fixed and unequal costs to different misclassifications, for

instance cost(false negative) > cost(false positive), such that the goal of CSL is to

minimize the cost of misclassification [Pazzani et al. 1994].

Recognition-based learning (RBL)

Different from a two-class classification method which learns rules for both

positive and negative classes, RBL is a one-class learning method in that it learns

only rules that classify the minority [Weiss 2004, Kotsiantis, et al. 2006].


In this study we employ several of the techniques discussed above to handle our imbalanced dataset. We use DWIOD to divide the whole dataset into 21 baskets, many of which turn out to be more "balanced" than the entire data set. Hence the basketing approach matches the "segment data" approach [Weiss 2004] for handling imbalanced data sets. For the few imbalanced baskets, we sample more examples to form our imbalanced dataset II. Then we employ two different approaches to attack the imbalanced dataset problem. The first is the simple DTA approach, and the second is an undersampling-ensemble (UE) method (explained in Section 9.4.3.3). We do not choose the CSL approach mainly because we do not know the right ratio for the cost of FN versus the cost of FP in the context of the competitor classification problem. However, we think that DTA and CSL are essentially very similar in that they both create a bias toward positive classifications. For both the DTA and UE approaches we employ the same four classification methods (ANN, BN, C4.5, and LR) from Weka. With dataset II we report various performance metrics suited for imbalanced datasets, including F1, precision, TP rate, FP rate, ROC, AUC, and accuracy. Next we introduce the two approaches in detail.

9.4.3.2 DTA Approach

By this approach we simply adjust the decision threshold which is used by a

classifier to decide whether an instance is classified as positive or negative given its

(normalized) probability of being positive. For example, given that Pr(x is positive) = 0.3,


the instance x is labeled as negative when the decision threshold is at 0.5. However, when

the threshold is adjusted to 0.2, x is classified as a positive.

For training and testing, we follow strict tuning procedures recommended in

[Salzberg 1997]. In particular, we randomly select 1500 instances from the imbalanced data set as the training set and use the remaining 500 as the testing set. Next, for each of the

classification methods we use 10-fold cross validation and tune input parameters to

observe the best performance on F1 measure using just the training set. Finally, we apply

each trained classifier with its respective “best” parameter setting to the testing set for

evaluation purposes. Moreover, for robustness, we randomly divide the 2000 pairs into four disjoint sets of equal size, which form four different pairs of training and testing sets. Then we apply the above training-tuning-testing procedure to the four pairs of training and testing sets and report the average results (see the formulas in subsection 9.4.5). We note that, in each case, training and parameter tuning are based only on the training data set, and evaluation of the trained and tuned classifier is based only on the testing data set. For ANN, we tune the learning rate from 0.1 to 1.0 and the momentum from 0.1 to 0.3; for BN, we choose K2 [Cooper and Herskovitz 1992] and TAN [Friedman et al. 1997] as the algorithms for searching the network structure; for DT, we change the minimum leaf size from 2 to 10; no parameter tuning is needed for LR. All other parameters are kept at their Weka defaults. We apply the same tuning procedures throughout the study whenever parameter tuning is used.
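The essence of the DTA approach can be sketched as follows (a simplified illustration, not the Weka-based procedure above): fit a classifier, scan candidate thresholds on the predicted positive-class probability, keep the one with the best training F1, and apply it to the held-out test set.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_with_threshold(X_train, y_train, X_test, thresholds=np.arange(0.05, 0.95, 0.05)):
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    train_probs = model.predict_proba(X_train)[:, 1]
    # Choose the threshold with the best F1 on the training data
    # (the procedure above instead tunes via 10-fold CV on the training set).
    best = max(thresholds,
               key=lambda t: f1_score(y_train, (train_probs >= t).astype(int)))
    test_probs = model.predict_proba(X_test)[:, 1]
    return (test_probs >= best).astype(int), best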

9.4.3.3 UE Approach


From the original imbalanced dataset II we generate multiple smaller “more-

balanced” sub datasets by duplicating all minority (positive) instances in each of the sub

sets and then evenly splitting the majority into those sub sets as depicted in Figure 29. A

classifier can be built from each sub dataset and an ensemble approach [Estabrooks and

Japkowicz 2001] can be used to generate the final classification result. Chan and Stolfo

[1998] adopt a similar undersampling method. We choose the majority vote as the

ensemble approach, and for the majority vote we use the binary output (0 or 1) of each

classifier and the probability output (between 0 and 1) of each classifier respectively and

denote them as majority vote by count (MVC) and majority vote by probability (MVP).

Figure 29. Generating More Balanced Sub Datasets

During the training phase, with an initial ratio of positives in the subsets we tune

the parameters for each classifier (no parameter tuning needed for LR) and record the

performance of each classifier in an output file. Then we repeat the above procedure each


time with a different ratio of positives, which changes from 0.05 to 0.60 with a step size

of 0.05. From all output files, on the basis of the best performance on F1 measure we

determine a set of “best” parameters for a classifier and a best ratio of positives. Finally,

we apply the trained classifiers with their best parameter sets and the best ratios of

positives to the testing set for evaluation. Similarly, we divide the 2000 pairs into four

disjoint sets of equal size and generate results separately for the four pairs of training and

testing sets, and report the average results.
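A simplified sketch of the UE idea with majority vote by count (MVC), assuming NumPy feature and label arrays, is given below; the parameter tuning and positive-ratio search described above are omitted.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ue_fit_predict(X, y, X_test, n_subsets=5):
    """X, y, X_test: NumPy arrays; returns 0/1 predictions for X_test."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    votes = np.zeros(len(X_test))
    for neg_chunk in np.array_split(np.random.permutation(neg), n_subsets):
        idx = np.concatenate([pos, neg_chunk])     # all positives + one slice of negatives
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])
        votes += clf.predict(X_test)
    return (votes > n_subsets / 2).astype(int)     # majority vote by count (MVC)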

9.4.4 Classification Performance for Dataset I

Figure 30. Precision and Recall of Dataset I by ANN and Prior Distribution (x-axis: sample basket 1 to 20; y-axis: precision, recall, and prior probability of positives)


Figure 30 shows the precision and recall achieved by ANN for individual sample baskets. For comparison purposes, we also include the prior distribution of positives in each sample basket. The precision curve is almost always above the prior probability, except for the last two sample baskets, which have the lowest prior distributions (5% and 2.5%, respectively). Hence, Figure 30 shows that while the classification performance of ANN is reasonably good for most baskets, it is rather strong when the DWIOD values are large (initial baskets) and rather weak when the DWIOD values are very small (last few baskets). This result highlights the inherent challenge of accurately classifying the minority class in imbalanced data sets (the last few baskets). The other three classification methods (BN, DT, and LR) show similar performance patterns, albeit with lower overall performance. We show the results of applying special techniques to the imbalanced parts of the dataset in the next subsection.

9.4.5 Classification Performance for Dataset II

Table 29.

Classification Performance of Dataset II by DTA Approach

                          Without sector information           With sector information**
Data set    Measure       ANN     BN      DT      LR           ANN     BN      DT      LR
Training*   Precision     0.280   0.142   0.119   0.353        0.361   0.277   0.318   0.398
            Recall        0.227   0.277   0.467   0.220        0.443   0.520   0.403   0.410
            FP rate       0.031   0.088   0.182   0.021        0.041   0.071   0.045   0.033
            F1            0.250   0.188   0.190   0.271        0.398   0.362   0.356   0.404
            Accuracy      0.932   0.880   0.801   0.941        0.933   0.908   0.927   0.940
            AUC           0.753   0.703   0.656   0.756        0.870   0.863   0.740   0.865
Test        Precision     0.268   0.125   0.090   0.322        0.372   0.262   0.283   0.380
            Recall        0.220   0.240   0.400   0.190        0.420   0.430   0.360   0.380
            FP rate       0.032   0.088   0.213   0.021        0.037   0.064   0.048   0.033
            F1            0.242   0.164   0.147   0.239        0.394   0.326   0.317   0.380
            Accuracy      0.931   0.878   0.768   0.940        0.936   0.911   0.923   0.938
            AUC           0.736   0.672   0.610   0.723        0.858   0.853   0.741   0.834

* Results for the training set are based on the best performance on F1 with parameter tuning.
** The company's sector as used on Yahoo! Finance is included as an attribute.

Table 29 reports precision, TP rate (recall), FP rate, F1, accuracy, and AUC on

training and testing sets for each classification method using the DTA approach. Each

bold number in the table indicates the best performance for a measurement across the

four classification models for the testing set. Since we have four pairs of training (1500

instances) and testing (500 instances) sets, we generate and report overall performance

with the following equations, which are based on the definitions in equations (11) to (15).

Overall precision = Σi TPi / Σi (TPi + FPi) (16)

Overall TP rate (recall) = Σi TPi / Σi (TPi + FNi) (17)

Overall FP rate = Σi FPi / Σi (FPi + TNi) (18)

Overall F1 = 2 × (overall precision) × (overall recall) / (overall precision + overall recall) (19)

Overall accuracy = Σi (TPi + TNi) / Σi (TPi + TNi + FPi + FNi) (20)

TP, TN, FP, and FN are defined as in subsection 9.4.1, and the subscript i, ranging from 1 to 4, denotes the four disjoint testing sets from dataset II.

The table contains results for the same dataset with and without sector information (sector is encoded as a categorical variable with nine values). Using sector information greatly improves the classification performance for dataset II across the four classifiers; for example, the maximum F1 measures (both produced by ANN) increase by 63%. With sector information we do not observe a significant difference in the F1 measure across the 20 baskets in dataset I (two-tailed t-test, p = 0.827), indicating that sector information is more helpful for the imbalanced dataset than for the more balanced baskets. We find that of all 316 competitor pairs in dataset III (216 in the 17 sample baskets of dataset I and 100 in dataset II), a total of 282 (89.2%) pairs are in the same sector and 34 (10.8%) are not.


The UE approach with MVC and MVP produces results similar to those in Table 29. For instance, for MVC the maximum F1 values are 0.381 and 0.204 with and without sector information, respectively. Although the UE approach is more complex than the simple DTA approach (it undersamples the majority to form multiple smaller datasets and adjusts the positive ratios in those datasets), the latter produces results as good as those of the former in our study. Thus in Section 9.5, when estimating to what extent our approach extends a gold standard, we use the results from the DTA approach.

Table 29 shows that ANN has the largest AUC values. Figure 31 illustrates the ROC curves for the four classifiers using sector information. The curves for ANN, BN, and LR are close to one another, and they all lie above (slightly outperform) the DT curve. The diagonal line is generated by randomly labeling instances with different likelihoods. For example, when a classifier randomly guesses the positive class 10% of the time, it is expected to identify 10% of the positives correctly, giving a TP rate of 0.1. At the same time, it correctly identifies 90% of the negatives, leading to an FP rate of 0.1 (1 − 0.9) as well. So the process of guessing the positive class 10% of the time yields the point (0.1, 0.1) in ROC space, and random guessing with all different likelihoods traces out the diagonal line.


Figure 31. ROC Curves of Dataset II for Four Classification Methods (x-axis: FP rate; y-axis: TP rate; curves: ANN, BN, DT, LR, and the random-guess diagonal)

9.4.6 Estimated Overall Classification Performance on the Basis of Dataset III

All of our classification performance measurements until now have been

computed for each sample basket. Since sample baskets consist of random samples of the

original (larger) baskets, the performance results are representative of the performance on

those original baskets. However, we would now like to estimate the classification

performance for all of the baskets combined, the whole dataset with the 87,340 pairs. The

estimation would require us to extrapolate the performance observed on sample baskets

to the entire original basket. Hence, we adopt the following equations to estimate overall

precision, TP rate (recall), FP rate, F1, and accuracy using dataset III. For the 17 sample

baskets from dataset I, the classification results are produced by 10-fold cross validation.

For the 18th sample basket we use the results generated from four disjoint testing sets


(each with 500 instances). Since the 2000 pairs are proportionally sampled from four of the 21 baskets, we can combine the results of the four disjoint test sets as one

“combined basket” (new sample basket #18) in the following equations.

Estimated precision = Σi (Bi/Si)·TPi / Σi (Bi/Si)·(TPi + FPi) (21)

Estimated TP rate (recall) = Σi (Bi/Si)·TPi / Σi (Bi/Si)·(TPi + FNi) (22)

Estimated FP rate = Σi (Bi/Si)·FPi / Σi (Bi/Si)·(FPi + TNi) (23)

Estimated F1 = 2 × (estimated precision) × (estimated recall) / (estimated precision + estimated recall) (24)

Estimated accuracy = Σi (Bi/Si)·(TPi + TNi) / Σi (Bi/Si)·(TPi + TNi + FPi + FNi) (25)

where Bi is the size of basket i, Si is the size of sample basket i, TPi, FPi, TNi, and FNi are the counts obtained on sample basket i, and the sums run over the 18 baskets in dataset III.
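One way to implement this extrapolation (a sketch consistent with the description above, not the original code) is to scale each sample basket's confusion-matrix counts by its basket-to-sample ratio and pool the scaled counts:

def estimated_overall(baskets):
    """baskets: list of dicts with keys TP, FP, TN, FN, basket_size, sample_size."""
    totals = {"TP": 0.0, "FP": 0.0, "TN": 0.0, "FN": 0.0}
    for b in baskets:
        w = b["basket_size"] / b["sample_size"]     # extrapolation weight Bi / Si
        for k in totals:
            totals[k] += w * b[k]
    precision = totals["TP"] / (totals["TP"] + totals["FP"])
    recall = totals["TP"] / (totals["TP"] + totals["FN"])
    fp_rate = totals["FP"] / (totals["FP"] + totals["TN"])
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (totals["TP"] + totals["TN"]) / sum(totals.values())
    return precision, recall, fp_rate, f1, accuracy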


Hence, the above equations estimate the overall classification performance by extending the performance measurements for a sample basket to the corresponding full basket and then combining the measurements across the 18 baskets in dataset III. For example, if sample basket Si, which represents the original basket Bi, contains k classified positives, the original basket Bi can be expected to contain k·(Bi/Si) classified positives. We note that the above equations estimate the overall classification performance for the whole dataset of 87,340 pairs. The estimation therefore indicates the performance of an ensemble of 18 classifiers (one for each basket) based on a given classification method. The estimated overall prior probability for positives is 11.8% (about 1 in 9 pairs in the original data set is a competitor pair). We note that ANN has the best performance on more metrics than the other three methods. However, unlike ANN, DT, or BN, LR does not require any parameter tuning and produces comparably good results. We highlight the best performance value under each measurement in Table 30.

Table 30.

Estimated Overall Performances

          Without sector information                       With sector information
Method    Precision  Recall  FP rate  F1     Accuracy      Precision  Recall  FP rate  F1     Accuracy
ANN       0.419      0.378   0.046    0.397  0.907         0.450      0.513   0.055    0.479  0.910
BN        0.238      0.354   0.095    0.284  0.863         0.388      0.514   0.071    0.442  0.895
DT        0.167      0.463   0.203    0.245  0.770         0.432      0.457   0.053    0.444  0.907
LR        0.388      0.330   0.046    0.357  0.904         0.382      0.437   0.062    0.407  0.897

9.5 Competitor Extension


In the introduction, we noted with an anecdote that our gold standards are

expected to be incomplete. We now suggest metrics to estimate (1) the coverage of competitors by a gold standard and (2) the extent to which our approach extends each

gold standard.

9.5.1 Estimating the Coverage of a Gold Standard

Figure 32. Competitors Covered by Two Gold Standards

We will need the following notation (from Figure 32) to describe the estimation

procedure:

C: the (unknown) complete set of competitor pairs

H: the set of competitor pairs covered by Hoover’s

M: the set of competitor pairs covered by Mergent

JHM = H ∩ M, the intersection of H and M

Following the idea proposed in the highly cited study [Lawrence and Giles 1998]

to estimate the coverage of search engines, assuming H and M to be independent subsets

145

Page 146: Web mining is the application of data mining techniques to ...zma/research/dissertation proposal chapters... · Web viewThe main findings include (1) PCAT is better than LIST for

of C, we can estimate to what extent H covers C based on how much of H covers M (i.e.,

JHM) and the size of M. We therefore define the coverage of the entire competitor set C by

Hoover’s ( ) and Mergent ( ) as follows:

Cov(H) = (26)

Cov(M) = (27)

If H and M are not completely independent, it is apparent that the value of JHM

(their intersection) would tend to be larger than when they are independent. Hence, we

may overestimate the coverage of a gold standard and the coverage estimation can be

considered an upper bound on true coverage of the gold standard.

We have previously labeled the positive instances using Hoover’s and Mergent

for each of the sample baskets. Hence, we can compute the number of competitor pairs

identified by Hoover’s ( ) and Mergent ( ) separately as well as the intersection of

Hoover’s and Mergent ( ) for the ith sample basket. In a manner similar to defining

the equation 11 in subsection 5.6, we estimate the number of positives (for Hoover’s,

Mergent, and their intersection) in each original basket by multiplying the number of

positives in the sample basket by the ratio of basket size over the sample basket size.

Then, based on equations (26) and (27), we calculate the coverage of Hoover’s and

Mergent as follows.


Cov(H) = Σi (Bi/Si)·|JHM,i| / Σi (Bi/Si)·|Mi| (28)

Cov(M) = Σi (Bi/Si)·|JHM,i| / Σi (Bi/Si)·|Hi| (29)

We find that the estimated coverage of Hoover’s and Mergent is 46.0% and

24.9%, respectively. Our estimation shows that while Hoover’s covers almost twice as

many competitor pairs as Mergent, both data sources individually cover less than 50% of

all competitor pairs.
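A worked toy example of the overlap-based coverage estimate in equations (26) and (27), with hypothetical competitor-pair sets:

hoovers = {("A", "B"), ("A", "C"), ("B", "D")}   # competitor pairs per Hoover's (toy data)
mergent = {("A", "B"), ("C", "D")}               # competitor pairs per Mergent (toy data)
overlap = hoovers & mergent                      # JHM

cov_h = len(overlap) / len(mergent)   # share of Mergent's pairs also found by Hoover's
cov_m = len(overlap) / len(hoovers)   # share of Hoover's pairs also found by Mergent
print(f"Cov(H) = {cov_h:.2f}, Cov(M) = {cov_m:.2f}")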

9.5.2 Estimating the Extension of Our Approach to a Gold Standard


Figure 33. Competitors Covered by Two Gold Standards and Our Approach

We now present a procedure to estimate how much our automated approach can

extend a gold standard. Our estimation procedure uses the following notation:

O: the set of competitor pairs classified by our approach

Ō = C − O

H̄ = C − H

M̄ = C − M

JHMO = H ∩ M ∩ O

JH̄MO = H̄ ∩ M ∩ O

JHM̄O = H ∩ M̄ ∩ O

JH̄M̄O = H̄ ∩ M̄ ∩ O

JH̄MO is the subset of competitor pairs that are classified as positive by our approach and

confirmed to be positive by Mergent. However, these pairs are not identified as

competitors by Hoover’s. Since Mergent is a sample of all competitor pairs, we estimate

the extent to which our approach extends Hoover’s (Ext(O, H)) as follows:

Ext(O, H) = (30)

Similarly, we estimate the extent to which our approach extends Mergent (Ext(O,

M)) as follows:

148

Page 149: Web mining is the application of data mining techniques to ...zma/research/dissertation proposal chapters... · Web viewThe main findings include (1) PCAT is better than LIST for

Ext(O, M) = (31)

Based on equations (30) and (31), we compute the expansion of our approach to

each gold standard using results from dataset III with the following equations.

Ext(O, H) = (32)

Ext(O, M) = (33)

Table 31.

Extensions to a Gold Standard

             Without sector information        With sector information
             ANN     BN      DT      LR        ANN     BN      DT      LR
Ext(O, H)    5.9%    7.3%    15.3%   5.0%      12.1%   11.3%   10.1%   10.5%
Ext(O, M)    28.7%   23.4%   37.2%   24.3%     33.8%   37.1%   35.8%   32.9%

Table 31 shows the estimation of how much our approach extends over the

knowledge available in each of the two gold standards. The results are shown for


different classification methods, and also with and without the use of sector information.

Using the sector information, our approach, with any classification method, extends

Hoover’s and Mergent by over 10% and 32% respectively. We note that the values of our

extension to a gold standard are based on classification results that have been generated

using a set of input parameters and classification methods. Just as the ROC curves in

Figure 31 illustrate, we could achieve higher TP rate (recall) by adjusting some of those

parameters, therefore obtaining higher values for our expansion to a gold standard, but at

the cost of higher FP rate, which leads to lower precision. The expansion results in Table

31 are associated with estimated overall precision, recall, and FP rate at 0.419, 0.378, and

0.046 (without sector information) and 0.450, 0.513, and 0.055 (with sector information)

respectively as shown in Table 30 by ANN classifier.
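
As a small illustration of this recall versus precision and FP-rate trade-off, the sketch below (not the dissertation's code) varies the score threshold of a hypothetical classifier; the labels and scores are simulated.

```python
# Illustrative sketch only: lowering the positive-class score threshold raises
# the TP rate (recall) but also raises the FP rate and lowers precision.
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 500)
scores = np.clip(0.6 * y_true + rng.normal(0.3, 0.2, 500), 0, 1)  # noisy classifier scores

for threshold in (0.7, 0.5, 0.3):
    y_pred = (scores >= threshold).astype(int)
    fp_rate = ((y_pred == 1) & (y_true == 0)).sum() / (y_true == 0).sum()
    print(f"t={threshold}: recall={recall_score(y_true, y_pred):.2f}, "
          f"precision={precision_score(y_true, y_pred):.2f}, FP rate={fp_rate:.2f}")
```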

9.6 Explorations of Competitors vs. Non-competitor Pairs

In the next two subsections, we report further exploratory results: a comparison of structural equivalence (SE) similarity between competitor and non-competitor pairs, and a comparison of company annual revenues between competitor pairs with high and low DWIOD values.

9.6.1 SE Similarity Comparison between Competitor and Non-competitor Pairs

For each sample basket of dataset III, we compute and compare the average SE

similarities for competitor and non-competitor pairs. Figure 34 compares DWID-based

SE similarities of the 18 sample baskets in dataset III. Except for the last basket with the


smallest DWIOD values, the average SE similarities for competitor pairs are greater than

those for non-competitor pairs (two-tailed t-test, p=0.003), which indicates that on

average competitor companies are more structurally equivalent than non-competitors.

Similar patterns are observed for DWOD- and DWIOD-based SE similarities (two-tailed

t-test, p=0.008 and p=0.001 respectively).
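
A comparison of this kind can be run, for example, as a two-tailed t-test on the 18 per-basket averages; pairing the competitor and non-competitor series by basket is one reasonable reading of the test reported above. The sketch below (not the dissertation's code) uses hypothetical basket averages.

```python
# Illustrative sketch only: compare per-basket average DWID-based SE
# similarities of competitor vs. non-competitor pairs with a two-tailed,
# paired t-test. The 18 values per series are hypothetical.
from scipy import stats

competitor_avg = [0.61, 0.55, 0.48, 0.44, 0.40, 0.38, 0.35, 0.33, 0.30,
                  0.28, 0.26, 0.24, 0.22, 0.20, 0.18, 0.16, 0.14, 0.10]
noncompetitor_avg = [0.42, 0.40, 0.36, 0.33, 0.30, 0.28, 0.26, 0.24, 0.22,
                     0.21, 0.19, 0.18, 0.16, 0.15, 0.14, 0.13, 0.12, 0.11]

t_stat, p_value = stats.ttest_rel(competitor_avg, noncompetitor_avg)  # two-tailed by default
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```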

[Chart: average DWID-based SE similarity (y-axis, 0.0 to 0.7) for competitor and non-competitor pairs across sample baskets 1 to 18 (x-axis).]

Figure 34. Average DWID-based SE Similarity Comparison

9.6.2 Comparing Annual Revenues between Competitor Pairs with High and Low

DWIODs

We observe that the average annual revenue of company pairs with low DWIOD values (100 competitor pairs in dataset II) is significantly lower (two-tailed t-test, p<0.001) than the average annual revenue of company pairs with high DWIOD values (92 competitor pairs in the first five sample baskets, #1 to #5, of dataset I).
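
The revenue comparison can likewise be expressed as a two-tailed t-test on two independent groups. The sketch below (not the dissertation's code) uses simulated revenues in place of the 100 low-DWIOD and 92 high-DWIOD competitor pairs.

```python
# Illustrative sketch only: two-tailed t-test comparing annual revenues of
# competitor pairs with low vs. high DWIOD values. The revenues are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
low_dwiod_revenue = rng.lognormal(mean=6.0, sigma=1.0, size=100)    # 100 low-DWIOD pairs
high_dwiod_revenue = rng.lognormal(mean=8.0, sigma=1.0, size=92)    # 92 high-DWIOD pairs

t_stat, p_value = stats.ttest_ind(low_dwiod_revenue, high_dwiod_revenue, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```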

9.7 Discussions

Given that news portals organize news stories by company and a news story

pertaining to a company often cites several other companies, we consider such company

citations as outlinks from the source company to target companies and construct a

directed and weighted intercompany network. On the basis of 60,532 news stories from

an 8-month period collected from Yahoo! Finance, the network consists of 6,428

companies as nodes and 87,340 links (company pairs). Using SNA techniques we

identify four types of attributes from the network structure. One of the attributes, dyadic

weighted inoutdegree (DWIOD), which captures the notion of overall volume of citations

between two companies in news, is used to split our data set of 87,340 company pairs

into 21 baskets. We generate three datasets: dataset I consists of 840 pairs randomly sampled from the 21 baskets; dataset II has 2,000 pairs and represents an imbalanced portion of the whole set, in which the number of competitor pairs is much smaller than the number of non-competitor pairs; and dataset III is generated from the first two. We use two company profile Web sites, Hoover’s and Mergent, as gold standards to identify and label a company pair as competitors or non-competitors.
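
The following Python sketch (not the dissertation's implementation, which is described in Chapter 7) illustrates the construction just summarized: company citations in news stories become weighted, directed links, and a simple pair-level citation volume, used here as a stand-in for DWIOD, can then be used to order or bucket the pairs. The story data and company names are hypothetical.

```python
# Illustrative sketch only. Each news story filed under a source company that
# cites other companies adds weight to directed links from the source to the
# cited companies. The "dwiod" below (total citation weight in both directions
# for a pair) is a simplified stand-in for the DWIOD attribute of Chapter 7.
from collections import defaultdict

stories = [
    ("IBM", ["MSFT", "ORCL"]),       # (source company, companies cited in the story)
    ("MSFT", ["IBM"]),
    ("ORCL", ["IBM", "MSFT"]),
]

weight = defaultdict(int)             # (source, target) -> number of citations
for source, cited in stories:
    for target in cited:
        if target != source:
            weight[(source, target)] += 1

def dwiod(a, b):
    """Simplified pair-level citation volume (both directions)."""
    return weight[(a, b)] + weight[(b, a)]

pairs = {tuple(sorted(p)) for p in weight}
for a, b in sorted(pairs, key=lambda p: -dwiod(*p)):
    print(a, b, dwiod(a, b))          # in the dissertation, pairs are split into 21 baskets by DWIOD
```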

Before conducting classification, we first use dataset I to examine the competitor coverage and competitor density of the intercompany network, comparing them with those of an exhaustive network (clique) and of a random network. We find that the intercompany network covers 66-100% of known competitor pairs while its size (in number of links) is only 15-84% of the clique, and that the competitor density of the intercompany network is several (2-52) times higher than that of the random network.
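
The two comparisons just described, coverage of known competitor pairs relative to network size and competitor density relative to a random network of equal size, can be computed as in the sketch below (not the dissertation's code). The company pairs and link sets are synthetic.

```python
# Illustrative sketch only; all sets and sizes are synthetic.
from itertools import combinations
import random

random.seed(0)
companies = [f"C{i}" for i in range(20)]
clique = set(combinations(companies, 2))                       # every possible pair (exhaustive network)
links = set(random.sample(sorted(clique), 60))                 # links present in the intercompany network
known = (set(random.sample(sorted(links), 10))                 # gold-standard competitor pairs,
         | set(random.sample(sorted(clique - links), 5)))      # some inside and some outside the network

coverage = len(links & known) / len(known)                     # share of known competitors covered
relative_size = len(links) / len(clique)                       # network size relative to the clique

density_network = len(links & known) / len(links)              # competitor density in the network
random_links = set(random.sample(sorted(clique), len(links)))  # random network with the same number of links
density_random = len(random_links & known) / len(random_links)

print(f"coverage = {coverage:.0%}, size vs. clique = {relative_size:.0%}")
print(f"competitor density ratio (network / random) = {density_network / max(density_random, 1e-9):.1f}")
```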

We employ four classification models (Artificial Neural Network, Bayes Net, Decision Tree, and Logistic Regression) to classify competitor relationships and report classification performance on the basis of 10-fold cross-validation. The results on individual sample baskets in dataset I reveal that typical classification methods fail to detect the minority class in imbalanced datasets. We therefore use dataset II to compare two approaches that are capable of handling the imbalanced-dataset problem and report their performance. With dataset III we estimate the overall performance for the whole dataset on the basis of the classification results from datasets I and II. As another aspect of this research, we also estimate to what extent each gold standard covers the whole competitor space. Finally, we present metrics for, and estimates of, the extent to which our automated approach can extend a gold standard.
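
For reference, a 10-fold cross-validation of the four classifier families named above could be set up as in the sketch below (not the dissertation's setup). The features and labels are synthetic, and GaussianNB is used only as a stand-in because scikit-learn offers no Bayes-network classifier.

```python
# Illustrative sketch only: 10-fold cross-validation of ANN, a Bayes stand-in,
# Decision Tree, and Logistic Regression on synthetic network-style features.
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((400, 6))                  # e.g., DWIOD, centrality, SE-similarity attributes
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.2, 400) > 1.0).astype(int)

models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0),
    "BN (naive Bayes stand-in)": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```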

In summary, we present a data mining approach that uses company citations in news to discover competitor relationships. Individual company citations may appear incidental, mere co-occurrences in a story; however, given a large number of news articles, our approach can discover meaningful business relationships from them.
