

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO. 1, JANUARY 2007 1

Fine-Grained Knowledge Sharing in Collaborative Environments

Ziyu Guan, Shengqi Yang, Huan Sun, Mudhakar Srivatsa, Xifeng Yan

Abstract—In collaborative environments, members may try to acquire similar information on the Web in order to gain knowledge in one domain. For example, in a company several departments may successively need to buy business intelligence software, and employees from these departments may have studied online about different business intelligence tools and their features independently. It will be productive to get them connected and share learned knowledge. We investigate fine-grained knowledge sharing in collaborative environments. We propose to analyze members' Web surfing data to summarize the fine-grained knowledge acquired by them. A two-step framework is proposed for mining fine-grained knowledge: (1) Web surfing data is clustered into tasks by a nonparametric generative model; (2) a novel discriminative infinite Hidden Markov Model is developed to mine fine-grained aspects in each task. Finally, the classic expert search method is applied to the mined results to find proper members for knowledge sharing. Experiments on Web surfing data collected from our lab at UCSB and IBM show that the fine-grained aspect mining framework works as expected and outperforms baselines. When it is integrated with expert search, the search accuracy improves significantly, in comparison with applying the classic expert search method directly on Web surfing data.

Index Terms—Advisor search, text mining, Dirichlet processes, graphical models

1 INTRODUCTION

INTERACTING with the Web and with colleagues/friends to acquire information is a daily routine for many people. In a collaborative environment, it is common for members to seek similar information on the Web in order to gain specific knowledge in one domain. For example, in a company, several departments may successively need to buy business intelligence (BI) software, and employees from these departments may have independently studied different BI tools and their features online. In a research lab, members often work on projects that require similar background knowledge. A researcher may want to solve a data mining problem using nonparametric graphical models, which she is not familiar with but which have been studied by another researcher before. In these cases, turning to the right person can be far more efficient than studying by oneself, since people can provide digested information, insights, and live interaction that the Web cannot. In the first scenario, it is more productive for an employee to get advice on the choice of BI tools, and explanations of their features, from experienced colleagues; in the second scenario, the first researcher could get

• Z. Guan is with the College of Information and Technology, Northwest University of China, Xi'an, CN 710127. E-mail: [email protected]

• S. Yang, H. Sun and X. Yan are with the Department of Computer Science, University of California, Santa Barbara, CA 93106. E-mail: {sqyang, huansun, xyan}@cs.ucsb.edu

• M. Srivatsa is with IBM T. J. Watson Research Center, Yorktown Heights, NY 10598. E-mail: [email protected]

suggestions on model design, and good learning materials, from the second researcher. Most people in collaborative environments would be happy to share experiences with, and give suggestions to, others on specific problems. However, finding the right person is challenging due to the variety of information needs. In this paper, we investigate how to enable such a knowledge-sharing mechanism by analyzing user data.

Fig. 1. An illustrative toy example for knowledge sharing in a collaborative environment.

An illustrative toy example is given in Figure 1. One can use "tcpdump" to intercept the sequence of Web surfing activities (IP packets) of each member. The scenario is as follows: Alice starts to surf the Web, wanting to learn how to develop a Java multithreading program, which has already been studied by Bob (red rectangle). In this case, it might be a better idea to consult Bob than to study by herself. We aim to provide such recommendations by analyzing


surfing activities automatically. In this example, Bob is not necessarily an expert in every aspect of Java programming; however, given his significant surfing activity on Java multithreading, it is reasonable to assume that he has gained enough knowledge in this area to help Alice (in practice, we could set a threshold on the amount of related surfing data to test significance). Even if Bob is still learning, he could share his learning experiences and possibly suggest good learning materials to Alice, saving Alice effort and time.

This scenario departs from the traditional expert search problem: expert search aims to find domain experts based on their associated documents in an enterprise repository, while our goal is to find proper "advisors" who most likely possess the desired piece of fine-grained knowledge, based on their Web surfing activities. The semantic structures hidden in Web surfing (as illustrated by Figure 1) reflect people's knowledge acquisition process and make Web surfing data significantly different from enterprise repositories. Traditional expert search methods may not handle Web surfing data well. For example, in the above example John spent a lot of effort on "Java IO", which is only partially relevant to Alice's need. If we simply treat Web surfing data as a collection of documents and apply traditional expert search methods, John would be ranked higher than Bob, since he viewed more content about "Java", though not quite relevant content. We will analyze this issue in detail in Section 6 and demonstrate it empirically in Section 7.2. Therefore, it is necessary to first summarize people's fine-grained knowledge, reflected in their Web surfing activities, by recognizing the semantic structures, and then search over the mined pieces of fine-grained knowledge (e.g., "Java IO"). We call this search scenario "advisor search", to differentiate it from traditional expert search. We use the term advisor search to emphasize that the knowledge of the retrieved people might not be very deep, but is good enough to help others who have not solidly gained the related knowledge yet.

In order to analyze the knowledge acquired by Web users, we propose to log and analyze users' Web surfing data (not only search but also browsing activities, which reveal a user's knowledge gaining process). Users' interactions with the Web can be segmented into different "tasks", e.g., "learning Java" and "shopping". The textual contents of a task are usually cohesive. We define a session (a document in Figure 1) as an aggregation of consecutively browsed Web contents of a user that belong to the same task. Sessions are the atomic units in our analysis. The content of sessions in a task can evolve gradually: people usually learn basic concepts first and then move towards advanced topics. A task can be further decomposed into fine-grained aspects (called micro-aspects). A micro-aspect can be roughly defined as a significantly more cohesive subset of sessions in a task. For example, the task "learning Java" might contain "Java IO" and "Java multithreading" as two micro-aspects. When pursuing a task, a user could spend many sessions on a micro-aspect. Mining these micro-aspects (micro-knowledge) is critical: it provides a detailed description of the knowledge gained by a person, which is the basis for advisor search.

We propose a two-step framework for mining fine-grained knowledge (micro-aspects): (1) In the first step, we formulate tasks from sessions. We design an infinite Gaussian mixture model based on the Dirichlet Process (DP) [12] to cluster sessions. We adopt a nonparametric scheme since the number of tasks is difficult to predict. (2) We then extract micro-aspects from the sessions in each task. The challenges are: the number of micro-aspects in a task is unknown; sessions for different micro-aspects of a task are textually similar; and sessions for a micro-aspect might not be consecutive. To this end, a novel discriminative infinite Hidden Markov Model (d-iHMM) is proposed to mine micro-aspects and evolution patterns (if any) in each task. A background model is introduced to enhance the discriminative power. Finally, we apply a language model based expert search method [1] over the mined micro-aspects for advisor search.
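The two-step pipeline can be sketched end to end. The greedy cosine clusterer below is only a hypothetical stand-in for the GDP and d-iHMM models developed in Sections 4 and 5; it keeps their key property that the number of groups is not fixed in advance, and step 2 mimics the background-modeling idea by subtracting each task's mean content before regrouping.

```python
import numpy as np

def greedy_cluster(vectors, threshold):
    """Assign each vector to the first group whose centroid is cosine-similar
    enough; otherwise open a new group (number of groups not preset)."""
    groups, sums = [], []
    for i, v in enumerate(vectors):
        v = v / (np.linalg.norm(v) + 1e-12)
        for g in range(len(groups)):
            c = sums[g] / (np.linalg.norm(sums[g]) + 1e-12)
            if float(v @ c) > threshold:
                groups[g].append(i)
                sums[g] = sums[g] + v
                break
        else:
            groups.append([i])
            sums.append(v.copy())
    return groups

def mine_micro_aspects(sessions, task_threshold=0.3, aspect_threshold=0.6):
    """Step 1: group sessions into tasks. Step 2: within each task, subtract
    the task's mean ("background") content and regroup the residuals with a
    stricter threshold to obtain micro-aspects."""
    tasks = greedy_cluster(sessions, task_threshold)
    aspects = []
    for task in tasks:
        X = sessions[task]
        resid = X - X.mean(axis=0)          # remove shared task content
        for sub in greedy_cluster(resid, aspect_threshold):
            aspects.append([task[i] for i in sub])
    return tasks, aspects
```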

To our knowledge, there are no existing techniques for micro-aspect mining. Although the hierarchical topic modeling algorithm [5] can discover general-to-specific topic hierarchies, it decomposes sessions into topics rather than grouping them. A person with many sessions containing partially relevant topics would still be ranked unexpectedly high (like the "John vs. Bob" problem mentioned above). Our goal is to detect people's online learning activities (e.g., learning "Java IO") reflected by subsets of sessions in the session data, rather than to discern topics hierarchically in sessions (e.g., "Java" with "IO" as its subtopic). Mining the semantic structures in sessions (Figure 1) is important for advisor search, as will be shown in Section 6.

The proposed two-step framework intrinsically groups sessions into micro-aspects in a coarse-to-fine fashion, which bears some similarity to hierarchical clustering [30]. However, traditional hierarchical clustering methods cannot handle session data well. This is because people usually go to the same Website for different micro-aspects of a task. For example, researchers commonly use Google Scholar1 to look for papers related to clustering. Two micro-aspects, "spectral clustering" and "density-based clustering", can be difficult to separate since there is a lot of background content in their sessions, e.g., navigational texts, template texts, etc. This background content can drastically blur the boundary between the two micro-aspects, which are already similar. Hence, traditional hierarchical clustering methods

1. http://scholar.google.com


could easily mix up the micro-aspects of a task, while the proposed d-iHMM model can better separate different micro-aspects since it models the background content explicitly. The experimental results verify our analysis: d-iHMM does better than iHMM [2] (which can be approximately regarded as clustering sessions into micro-aspects without background modeling) in micro-aspect mining.

The contributions of this work are summarized as follows. (1) We propose the fine-grained knowledge sharing problem in collaborative environments. The goal is not to find domain experts but a person who has the desired specific knowledge. This problem is significant in practice in that learning from an advisor (if she/he is easy to find) might be more efficient than studying on the Web (though not always). A lot of repeated effort could be saved. (2) We propose to solve this problem by first summarizing Web surfing data into fine-grained aspects, and then searching over these aspects. We compare this strategy with searching for advisors directly over sessions (i.e., applying traditional expert search methods on Web surfing data directly), both analytically (Section 6) and empirically (Section 7.2). (3) We propose a novel two-step micro-aspect mining framework consisting of two nonparametric DP models. This framework does not require pre-specified numbers of tasks or of micro-aspects per task, and can correctly summarize sessions into micro-aspects by explicitly modeling the background content in the sessions of a task. (4) We collect real user-generated Web surfing data at our lab and at IBM to test the feasibility of our idea. Experiments on both datasets show that our scheme is effective and outperforms the one using raw session data. We also show that the proposed DP models work as expected and achieve better performance than baseline methods. Certainly, the selection of these algorithms is not unique. Through this study, we demonstrate the possibility of finding the right persons automatically by analyzing their Web surfing data.

The rest of the paper is organized as follows: the next section outlines related work. Section 3 gives the problem formulation. In Section 4 we present the Gaussian DP model for clustering sessions into tasks. Section 5 describes the proposed d-iHMM model for mining fine-grained aspects in each task (i.e., session cluster), followed by a discussion of the advisor search method based on the mined micro-aspects and a detailed analytical comparison to searching directly over sessions in Section 6. Experiments are described in Section 7, and finally, Section 8 concludes our work.

2 RELATED WORK

In this section we review research fields that are related to our work: expert search, analysis of user search tasks, and topic modeling.

2.1 Expert Search

Expert search aims at retrieving people who have expertise on a given query topic. Early approaches involved building a knowledge base containing descriptions of people's skills within an organization [8]. Expert search became a hot research area with the start of the TREC enterprise track [9] in 2005. Balog et al. proposed a language model framework [1] for expert search. Their Model 2 is a document-centric approach which first computes the relevance of documents to a query and then accumulates, for each candidate, the relevance scores of the documents associated with the candidate. This process was formulated as a generative probabilistic model. Balog et al. showed that Model 2 performed better [1], and it became one of the most prominent methods for expert search. Other methods have been proposed for enterprise expert search (e.g., [11], [24]), but these methods still, in essence, accumulate the relevance scores of associated documents for each candidate. Expert retrieval in other scenarios has also been studied, e.g., online question answering communities [19] and academic societies [10].
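The document-centric accumulation can be made concrete. The sketch below follows the spirit of Model 2 (not Balog et al.'s exact formulation): a Jelinek-Mercer smoothed query likelihood per document, summed over each candidate's associated documents with association weights `assoc`; all names and the smoothing parameter are illustrative choices.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, coll_tf, coll_len, lam=0.5):
    """Jelinek-Mercer smoothed log p(q|d): mix the document language
    model with the collection language model."""
    tf, dl = Counter(doc), len(doc)
    score = 0.0
    for t in query:
        p_doc = tf[t] / dl if dl else 0.0
        p_coll = coll_tf[t] / coll_len
        score += math.log(lam * p_doc + (1.0 - lam) * p_coll + 1e-12)
    return score

def model2_scores(query, docs, assoc):
    """Document-centric ("Model 2"-style) expert scoring: for each
    candidate, accumulate p(q|d) * p(d|candidate) over the candidate's
    associated documents. `assoc` maps candidate -> {doc_id: weight}."""
    coll = [t for d in docs for t in d]
    coll_tf, coll_len = Counter(coll), len(coll)
    return {
        cand: sum(w * math.exp(query_log_likelihood(query, docs[d],
                                                    coll_tf, coll_len))
                  for d, w in doc_weights.items())
        for cand, doc_weights in assoc.items()
    }
```

Note how the sum over documents is exactly the accumulation behavior discussed in the text: many weakly relevant documents can outscore a few strongly relevant ones.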

The proposed advisor search problem differs from traditional expert search. (1) Advisor search is dedicated to retrieving people who most likely possess the desired piece of fine-grained knowledge, while traditional expert search does not explicitly take this goal. (2) The critical difference lies in the data: sessions are significantly different from documents in enterprise repositories. A person typically generates multiple sessions for a micro-aspect of a task; e.g., a person could spend many sessions learning Java multithreading skills. In other words, the uniqueness of sessions is that they contain semantic structures which reflect people's knowledge acquisition process. If we treat sessions as documents in an enterprise repository and apply traditional expert search methods (e.g., [1]), we can get incorrect rankings: due to the accumulating nature of traditional methods, a candidate who generated many marginally relevant sessions (same task but other micro-aspects) will be ranked higher than one who generated fewer but highly relevant sessions, e.g., John vs. Bob in Figure 1 for the query "Java multi-thread programming" (Section 6 provides a detailed analysis of this issue). Therefore, it is important to recognize the semantic structures and summarize the session data into micro-aspects so that we can find the desired advisor accurately. In this paper we develop nonparametric generative models to mine micro-aspects and show the superiority of our search scheme over the simple idea of applying traditional expert search methods on session data directly.


2.2 Analysis of Search Tasks

Recently, researchers have focused on detecting, modeling and analyzing user search tasks from query logs. Here we name some representative works. Jones and Klinkner found that search tasks are interleaved and used classifiers to segment the sequence of user queries into tasks [15]. Liu and Belkin combined task stage and task type with dwell time to predict the usefulness of a result document, using a 3-stage and 2-type controlled experiment [18]. Ji et al. used graph regularization to identify search tasks in query logs [14]. Kotov et al. designed classifiers to identify same-task queries for a given query and to predict whether a user will resume a task [16]. Wang et al. formulated the cross-session search task mining problem as a semi-supervised clustering problem in which the dependency structure among queries in a search task is explicitly modeled and a set of automatic annotation rules serves as weak supervision [28].

This line of research tries to recover tasks from people's search behaviors and bears some similarity to our work. Nevertheless, our work differs from theirs in the following aspects. First, we consider general Web surfing contents (including search), rather than search engine query logs. Query logs do not record the subsequent surfing activity after the user clicks a relevant search result. Moreover, it has been found that 50% of a user's online pageviews are content browsing [17]. Web surfing data thus provides more comprehensive information about the knowledge gaining activities of users. Although various methods have been proposed for extracting search tasks from query logs, these methods cannot be applied in our setting since they exploit query-log-specific properties. Second, none of the above works tried to mine fine-grained aspects of each task. When studying, people could spend effort on one fine-grained aspect of a task and generate multiple contents. Summarizing fine-grained aspects can provide a fine-grained description of the knowledge gained by a person. Finally, none of the existing works which analyze user online behaviors (not limited to search behaviors, e.g., [29]) tried to address advisor search by exploiting the data generated from users' past online behaviors.

2.3 Topic Modeling

Topic modeling is a popular tool for analyzing the topics in a document collection. The most prevalent topic modeling method is Latent Dirichlet Allocation (LDA) [7]. Based on LDA, various topic modeling methods have been proposed, e.g., the dynamic topic model for sequential data [6] and the hierarchical topic model for building topic hierarchies [5]. The Hierarchical DP (HDP) model can also be instantiated as a nonparametric version of LDA [25]. However, our problem is not a topic modeling problem. Our goal is to recover the semantic structures of people's online learning activities from their Web surfing data, i.e., identifying groups of sessions representing tasks (e.g., learning "Java") and micro-aspects (e.g., learning "Java multithreading"), whereas topic modeling decomposes a document into topics. After applying topic modeling methods on session data, it is still difficult to find the right advisor using the mined topics. This is because a person with many sessions containing partially relevant topics would still be ranked unexpectedly high, due to the accumulation of relevance among sessions. Grouping sessions into micro-aspects is important for advisor search.

3 PROBLEM FORMULATION

We refer to a member in a collaborative environment as a "candidate". In our problem, we have a group of h candidates e_1, …, e_h, where each candidate e_i generates a sequence of sessions W^(i) = {w_1^(i), …, w_{N_i}^(i)}. For clarity, we group the sessions of different candidates together using a uniform subscript index: W = {w_1, …, w_N}, with N sessions in total. The fine-grained knowledge sharing problem consists of three subproblems:

1) Partition W into a set of clusters C = {C_1, …, C_t}, where each cluster represents a task;

2) Partition the sessions in each C_i into a set of micro-aspects S_i = {s_{i1}, …, s_{it_i}}, where each micro-aspect s_ij is a significantly more cohesive subset of sessions from C_i;

3) Compute the association weight between e_i and s_jk as |W^(i) ∩ s_jk|. Given a query q, produce a ranking of the e_i according to their relevance to q, assessed via the relevance of the associated micro-aspects and the corresponding association weights.
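Subproblem 3 can be sketched directly from the definition. Here the per-aspect query relevance is assumed to be given (the actual relevance model is developed in Section 6); the function and variable names are illustrative.

```python
def rank_advisors(candidate_sessions, micro_aspects, aspect_relevance):
    """Rank candidates: accumulate each micro-aspect's relevance to the
    query, weighted by the association weight |W(i) ∩ s_jk|, i.e., the
    number of the candidate's sessions that fall in that micro-aspect."""
    scores = {}
    for cand, sessions in candidate_sessions.items():
        scores[cand] = sum(
            len(set(sessions) & set(members)) * relevance
            for members, relevance in zip(micro_aspects, aspect_relevance))
    return sorted(scores, key=scores.get, reverse=True)
```

Because relevance is attached to whole micro-aspects rather than individual sessions, a candidate with many sessions in a weakly relevant aspect no longer outranks one with fewer sessions in the highly relevant aspect.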

4 SESSION CLUSTERING

The input of this step is W, where each w_i is a D_0 × 1 word frequency vector, with D_0 the vocabulary size. The intuition is that the contents generated for the same task are textually similar, while those for different tasks are dissimilar. Hence, clustering is a natural choice for recovering tasks from sessions. In our case, it is difficult to preset the number of tasks given a collection of sessions. Therefore, we need to automatically determine the number of clusters (k), which is also one of the most difficult problems in clustering research. Most methods for automatically determining k run the clustering algorithm with different values of k and choose the best one according to a predefined criterion [13], which can be costly. In spectral clustering, a heuristic for determining k is to search for a significant rise in the magnitude of the eigenvalues [27]. However, this does not work in our context, since the Web contents are so noisy that



Fig. 2. The graphical representation of the Gaussian Dirichlet Process mixture model.

the eigenvalues start to rise gradually from the second smallest eigenvalue. In this work, we advocate using a generative model with a Dirichlet Process (DP) prior [12] for clustering. DPs provide nonparametric priors for k, and the most likely k is learned automatically. A DP, written as G ∼ DP(α, G_0), can be interpreted as drawing components (clusters here) from an infinite component pool, with α called the scaling parameter and G_0 the prior for a random component. An intuitive interpretation of the DP is the stick-breaking construction:

π_i(v) = v_i ∏_{j=1}^{i−1} (1 − v_j),    G = ∑_{i=1}^{∞} π_i δ_{ψ_i},

where v = {v_1, v_2, …} with each v_i drawn from the Beta distribution Beta(1, α), ψ_i is a component drawn from G_0, and δ_{ψ_i} is an atom at ψ_i. π_i is the mixture weight of ψ_i, obtained by breaking the current length of the "stick" (i.e., ∏_{j=1}^{i−1}(1 − v_j)) by the fraction v_i. The generation of π is often written as π ∼ GEM(α). π defines a prior mixing distribution over the infinitely many components. The posterior mixing distribution and the actual number of components drawn from the DP are then learned from the data. Readers are referred to [25] for a detailed description of DPs.
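As a concrete illustration, the GEM(α) weights can be sampled by truncating the infinite stick at a finite number of pieces:

```python
import numpy as np

def stick_breaking(alpha, num_sticks, rng):
    """pi_i = v_i * prod_{j<i} (1 - v_j) with v_i ~ Beta(1, alpha);
    truncated at num_sticks pieces of the unit stick."""
    v = rng.beta(1.0, alpha, size=num_sticks)
    # length of stick remaining before each break
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

rng = np.random.default_rng(0)
pi = stick_breaking(2.0, 1000, rng)
```

Smaller α concentrates the mass on the first few sticks, i.e., fewer effective clusters; with 1000 sticks the truncated weights sum to 1 up to numerical precision.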

4.1 Clustering by GDP Mixture Model

When using probabilistic models for clustering, the Gaussian mixture model is a common choice and can be viewed as a probabilistic version of k-means [13]. However, the data dimensionality D_0 is too high to apply Gaussian distributions in our case (often above 10K). Therefore, we first apply the well-known Laplacian Eigenmap (LE) technique [3] to reduce the dimensionality from D_0 to D, where D_0 ≫ D. We choose LE since it can also capture the nonlinear manifold structure of a task; e.g., the topics of a task could evolve and drift, which can be described by the "half-moon" structure [3].
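A minimal LE sketch is given below, assuming a kNN cosine affinity and the unnormalized graph Laplacian; the paper itself uses the formulation of [3], so treat the specific affinity and Laplacian choices here as illustrative assumptions.

```python
import numpy as np

def laplacian_eigenmap(X, n_neighbors=10, dim=5):
    """Minimal Laplacian Eigenmap: kNN cosine affinity, unnormalized
    graph Laplacian L = D - W, bottom nonzero eigenvectors as the
    low-dimensional embedding."""
    n = X.shape[0]
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S = Xn @ Xn.T                       # cosine similarities
    W = np.zeros((n, n))
    for i in range(n):                  # keep each row's nearest neighbors
        idx = np.argsort(S[i])[-(n_neighbors + 1):]
        W[i, idx] = S[i, idx]
    W = np.maximum(W, W.T)              # symmetrize the kNN graph
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    return vecs[:, 1:dim + 1]           # skip the constant eigenvector
```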

Let Y = {y_1, …, y_N} denote the session vectors in the subspace learned by LE. The graphical representation of GDP is depicted in Figure 2. The DP prior is represented by the stick-breaking construction process, with {v_k}_{k=1}^∞ and α defined above. z_i is an assignment variable for the mixture component (i.e., cluster) with which y_i is associated. μ_k and λ_k are D × 1 vectors denoting the mean vector and the diagonal of the diagonal precision matrix of the k-th Gaussian component. This means each dimension of the data is independent. m_0, β_0, a_0, b_0 are the hyperparameters for the Gaussian-Gamma conjugate prior of the Gaussian components. The generative process of GDP is as follows:

1) Draw v_k | α ∼ Beta(1, α), k = 1, 2, …

2) Draw λ_kj | a_0, b_0 ∼ Gamma(a_0, b_0), k = 1, 2, …; j = 1, …, D

3) Draw μ_k | m_0, β_0, λ_k ∼ N(m_0, (β_0 D_{λ_k})^{−1}), k = 1, 2, …

4) For the i-th session:

   a) Draw z_i | v ∼ Mult(π(v))

   b) Draw y_i | z_i, μ, λ ∼ N(μ_{z_i}, (D_{λ_{z_i}})^{−1})

Here Mult(·) denotes the multinomial distribution, and D_{λ_k} is a diagonal matrix with the elements of λ_k on its diagonal.

The infinite Gaussian mixture model has already been proposed and used for clustering. The overall design of our model follows previous work, but with a different design choice for the prior of λ_k. Previous models treat λ_k either as the full precision matrix with a Wishart prior [23] or as a scalar with a Gamma prior [21]. The former design choice is able to model complicated cluster structures but its time cost is high, while the modeling power of the latter is very limited. In our model, λ_k is a vector, meaning that different dimensions are independent and have different precisions. The reasons are: (1) the output of LE consists of the orthogonal eigenvectors of a real symmetric matrix, which means different dimensions are independent; (2) different dimensions may have different degrees of variation. In this way, the model is relatively more efficient and, meanwhile, retains certain expressive power.
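The generative process can be simulated forward by truncating the stick at a finite number of components; the hyperparameter values below are arbitrary illustrations, not the ones used in the experiments.

```python
import numpy as np

def sample_gdp(N, D, alpha=1.0, m0=0.0, beta0=1.0, a0=2.0, b0=2.0,
               trunc=50, seed=0):
    """Forward-sample N points from a truncated GDP mixture:
    stick-breaking weights, then per-component diagonal-precision
    Gaussians with the Gaussian-Gamma prior of Section 4.1."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=trunc)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi = pi / pi.sum()                               # renormalize lost tail mass
    lam = rng.gamma(a0, 1.0 / b0, size=(trunc, D))   # Gamma(a0, b0): rate -> scale
    mu = rng.normal(m0, 1.0 / np.sqrt(beta0 * lam))  # mu_k | lambda_k
    z = rng.choice(trunc, size=N, p=pi)              # cluster assignments
    y = rng.normal(mu[z], 1.0 / np.sqrt(lam[z]))     # observations
    return y, z
```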

Either Gibbs sampling or variational inference can be used to solve the GDP model. Although Gibbs sampling provides theoretical guarantees of accuracy, variational inference converges much faster and can also provide a reasonable approximation to the true posteriors [4]. Hence, we choose variational inference in this work. The derivation is simply an application of the general derivation in [4]. Readers can refer to the appendix (supplemental material) for a detailed description of the variational inference for the GDP model.

5 MINING FINE-GRAINED KNOWLEDGE

The major challenge in mining micro-aspects is that the micro-aspects in a task are already similar to one another. If we model each component (i.e., micro-aspect) independently (as most traditional models do), it is likely that we mix up sessions from different micro-aspects, i.e., we get bad discrimination. Therefore, we should model the different micro-aspects in a task jointly, separating the common content characteristics of the task from the distinctive characteristics of each micro-aspect. To this end, we extend the infinite Hidden Markov Model (iHMM) [2] and propose a novel discriminative infinite Hidden Markov Model (d-iHMM) to mine micro-aspects and possible evolution patterns in a task. The graphical representation is shown in Figure 3(a).

Fig. 3. The graphical representations of (a) the d-iHMM model, and (b) the original iHMM model.

The observed variables of the model are a sequence of T sessions {w_t}_{t=1}^T belonging to a cluster output by the GDP. In practice, a cluster can contain multiple sequences from different people; however, we only discuss the single-sequence case for clarity. The extension to the multi-sequence case is straightforward since the sequences do not depend on each other. A background unigram model, θ_b, estimated by aggregating all of the task's sessions, is shared among states (i.e., micro-aspects). The s_t are state assignment variables. z_ti is an indicator variable controlling whether word w_ti is generated by θ_{s_t} (z_ti = 1) or by θ_b (z_ti = 0). η_t ∈ (0, 1) encodes the fraction of words in w_t which are not generated by θ_b, with δ_0 as a Beta prior. In this way, the common background contents in the task's sessions are explained by θ_b, increasing the discriminative power of the model. d-iHMM is fundamentally based on the Hierarchical DP model [25], where the infinite component pools of all states (corresponding to π_k) share the same base infinite component pool (corresponding to β). The generative process is summarized as follows:

1) Draw β | γ0 ∼ GEM(γ0).
2) For k = 1, 2, . . .: draw πk ∼ DP(α, β) and θk ∼ Dir(h0, . . . , h0).
3) For t = 1 to T:
   a) Draw st | st−1 ∼ Mult(πst−1) and ηt | δ0 ∼ Beta(1, δ0).
   b) For each word wti in the t-th session:
      i) Draw zti | ηt ∼ Bernoulli(ηt).
      ii) If zti = 1, draw wti ∼ Mult(θst); else draw wti ∼ Mult(θb).

Here Dir(h0, . . . , h0) is a symmetric Dirichlet distribution where the dimension is the vocabulary size, D0. p(s1|s0) is given by a uniform multinomial distribution. As in [25], we place Gamma priors on the hyperparameters α and γ0: α ∼ Gamma(aα, bα) and γ0 ∼ Gamma(aγ0, bγ0). The evolution patterns learned (if statistically strong) could be used for content recommendation. Furthermore, d-iHMM could be iteratively applied on the learned micro-aspects to find more fine-grained aspects.
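The generative process above can be simulated end to end with a truncated stick-breaking approximation. The following sketch is illustrative only: the truncation level `K_MAX`, the session sizes, and all hyperparameter values are assumptions for the demo, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions for this demo).
K_MAX = 20              # truncation of the "infinite" state pool
V = 50                  # vocabulary size D0
T = 10                  # number of sessions
N_WORDS = 30            # words per session
alpha, gamma0, h0, delta0 = 1.0, 1.0, 0.5, 0.25

def gem(gamma, k_max):
    """Truncated stick-breaking draw beta | gamma ~ GEM(gamma)."""
    v = rng.beta(1.0, gamma, size=k_max)
    beta = np.empty(k_max)
    rest = 1.0
    for k in range(k_max):
        beta[k] = v[k] * rest
        rest *= 1.0 - v[k]
    return beta / beta.sum()    # fold the truncated leftover mass back in

beta = gem(gamma0, K_MAX)
# pi_k ~ DP(alpha, beta), truncated; theta_k are per-state unigram models.
pi = np.vstack([rng.dirichlet(alpha * beta) for _ in range(K_MAX)])
theta = np.vstack([rng.dirichlet(np.full(V, h0)) for _ in range(K_MAX)])
theta_b = rng.dirichlet(np.full(V, h0))     # shared background unigram model

sessions, states = [], []
s = int(rng.integers(K_MAX))    # s1 via a uniform p(s1|s0)
for t in range(T):
    eta = rng.beta(1.0, delta0)             # eta_t | delta0 ~ Beta(1, delta0)
    z = rng.random(N_WORDS) < eta           # z_ti ~ Bernoulli(eta_t)
    w = np.where(z,
                 rng.choice(V, N_WORDS, p=theta[s]),   # state words
                 rng.choice(V, N_WORDS, p=theta_b))    # background words
    sessions.append(w)
    states.append(s)
    s = int(rng.choice(K_MAX, p=pi[s]))     # s_{t+1} | s_t ~ Mult(pi_{s_t})
```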

The key difference between d-iHMM and iHMM is the background model part, i.e., θb, δ0, the ηt's and the zti's are all new nodes compared to the graphical model of iHMM (Figure 3(b)). Modeling the common background contents makes d-iHMM significantly different from iHMM and enhances the discriminative power, so that we can identify relatively more cohesive subsets of documents from a set of cohesive documents with background contents.

5.1 Solving d-iHMM by Beam Sampling

The beam sampling method for iHMM is proposed in [26], where it is shown to converge to the true posterior much faster than a classical Gibbs sampler. Therefore, we develop a beam sampler for our d-iHMM model.

Beam sampling adopts the slice sampling [22] idea to limit the number of states considered at each time step to a finite number, so that dynamic programming can be used to sample whole state trajectories efficiently. Specifically, we first sample an auxiliary variable ut for each time step t in the sequence with conditional distribution ut ∼ Uniform(0, π_{st−1,st}). Given the ut's, the sequence {st} is re-sampled, considering only the values of st−1 that satisfy π_{st−1,st} > ut. Hence, the ut's act as a truncation of the infinite number of states and make the number of trajectories with positive probability finite, so that the whole sequence can be sampled holistically. {st} is sampled in one run by a forward filtering, backward sampling algorithm.
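As a minimal sketch of the slice idea (the 3-state transition matrix and state trajectory below are made-up toy inputs, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up 3-state transition matrix (rows sum to 1) and a trajectory;
# s[0] plays the role of s0.
pi = np.array([[0.6, 0.3, 0.1],
               [0.2, 0.5, 0.3],
               [0.3, 0.3, 0.4]])
s = [0, 1, 1, 2]

# Slice variables: u_t ~ Uniform(0, pi[s_{t-1}, s_t]).
u = [rng.uniform(0.0, pi[s[t - 1], s[t]]) for t in range(1, len(s))]

# Only transitions with pi[i, j] > u_t survive, so each step considers a
# finite set of states even when the state pool is conceptually infinite.
allowed = [np.argwhere(pi > u_t) for u_t in u]
```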


Each sampling iteration samples the ut's, {st}, the zti's, θk, πk and β in turn. We first sample the ut's as described above and create more states if the maximum unassigned probability in the πk's (the last element of each πk) is greater than the minimum of the ut's. Then we perform a forward sweep over {st}, where for the t-th step we compute

p(st | w1:t, u1:t, z1:t)   (1)
   ∝ p(st, ut, wt | w1:t−1, u1:t−1, z1:t)
   = p(wt | st, zt) × Σ_{st−1} I(ut < π_{st−1,st}) p(st−1 | w1:t−1, u1:t−1, z1:t−1),

where p(wt | st, zt) means we generate a word wti by θst only if zti = 1, and I(C) = 1 if condition C is true and 0 otherwise. sT is sampled from p(sT | w1:T, u1:T, z1:T). Then we perform a backward sweep to sample each st given st+1 by

p(st | st+1, w1:T, u1:T, z1:t) = p(st | st+1, w1:t, u1:t+1, z1:t)   (2)
   ∝ p(st | w1:t, u1:t+1, z1:t) p(st+1 | st, ut+1)
   ∝ p(ut+1 | st) p(st | w1:t, u1:t, z1:t) × [ p(ut+1 | st, st+1) p(st+1 | st) / p(ut+1 | st) ]
   = p(st | w1:t, u1:t, z1:t) I(ut+1 < π_{st,st+1})
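The truncated forward filtering, backward sampling pass can be sketched as below. The function name `ffbs` and its interface are our own illustration; `log_lik[t, k]` stands for log p(wt | st = k, zt), and the boolean mask implements the indicator I(ut < π_{st−1,st}):

```python
import numpy as np

rng = np.random.default_rng(2)

def ffbs(log_lik, pi, u):
    """Forward filtering, backward sampling over the states that survive
    the slice. pi is the (truncated) transition matrix; u[t] is the slice
    variable for the transition into step t (u[0] is unused)."""
    T, K = log_lik.shape
    fwd = np.zeros((T, K))
    # Uniform p(s1|s0), so the first filter is just the likelihood.
    fwd[0] = np.exp(log_lik[0] - log_lik[0].max())
    fwd[0] /= fwd[0].sum()
    for t in range(1, T):
        mask = (pi > u[t]).astype(float)      # I(u_t < pi_{s_{t-1}, s_t})
        p = np.exp(log_lik[t] - log_lik[t].max()) * (fwd[t - 1] @ mask)
        fwd[t] = p / p.sum()
    s = np.empty(T, dtype=int)
    s[T - 1] = rng.choice(K, p=fwd[T - 1])
    for t in range(T - 2, -1, -1):            # backward sweep, cf. Eq. (2)
        p = fwd[t] * (pi[:, s[t + 1]] > u[t + 1])
        s[t] = rng.choice(K, p=p / p.sum())
    return s
```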

In order to efficiently sample zti, we integrate out ηt. This makes all the zti's belonging to wt dependent on one another. Let z¬ti be the set of z variables for wt except zti. We have

p(zti | z¬ti, δ0, wt, θ, θb, st)   (3)
   ∝ p(zti, z¬ti, wt | δ0, θ, θb, st) = p(zt | δ0) p(wt | zt, st, θ, θb)
   ∝ [ B(Σj ztj + 1, |wt| − Σj ztj + δ0) / B(1, δ0) ] p(wti | zti, st, θ, θb),

where |wt| is the number of words in wt. The final sampling probability ratio is (omitting the variables in the condition for clarity):

p(zti = 1 | · · · ) / p(zti = 0 | · · · )
   = [ p(wti | θst) (Σ_{j≠i} ztj + 1) ] / [ p(wti | θb) (|wt| − Σ_{j≠i} ztj + δ0 − 1) ],   (4)

where p(w | θk) means the probability of generating word w by θk.
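Equation (4) translates directly into a collapsed Gibbs step for zti. A hedged sketch (the function names and the way the counts are passed in are our own choices):

```python
def z_odds(p_w_state, p_w_bg, z_others, n_words, delta0=0.25):
    """Odds p(z_ti = 1 | ...) / p(z_ti = 0 | ...) from Eq. (4), with
    eta_t integrated out. p_w_state = p(w_ti | theta_{s_t}),
    p_w_bg = p(w_ti | theta_b), z_others = sum of the other z indicators
    in session t, and n_words = |w_t|."""
    num = p_w_state * (z_others + 1.0)
    den = p_w_bg * (n_words - z_others + delta0 - 1.0)
    return num / den

def sample_z(rng, p_w_state, p_w_bg, z_others, n_words, delta0=0.25):
    """Draw z_ti ~ Bernoulli(odds / (1 + odds))."""
    odds = z_odds(p_w_state, p_w_bg, z_others, n_words, delta0)
    return int(rng.random() < odds / (1.0 + odds))
```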

For θk, since the prior Dirichlet distribution is conjugate to the multinomial distribution, the posterior sampling distribution of θk is

Dir(h0 + w(k, 1), h0 + w(k, 2), . . . , h0 + w(k, D0)),   (5)

where w(k, i) is the total number of occurrences of the i-th word in the vocabulary which are generated by θk.
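Sampling θk from Eq. (5) is a single Dirichlet draw per state once the word counts w(k, i) are tallied. A minimal sketch with toy counts (the interface is our own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_theta(word_counts, h0=0.5):
    """Draw each theta_k ~ Dir(h0 + w(k, 1), ..., h0 + w(k, D0)), Eq. (5).
    word_counts[k, i] counts occurrences of vocabulary word i generated
    by state k (i.e. with z_ti = 1 while s_t = k)."""
    counts = np.asarray(word_counts, float)
    return np.vstack([rng.dirichlet(h0 + row) for row in counts])

counts = np.array([[10, 0, 2],      # toy counts: 2 states x 3 vocabulary words
                   [1, 5, 5]])
theta = sample_theta(counts)
```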

The sampling distributions of πk and β follow directly from [25], but we briefly describe them for completeness. Let nij be the number of times we jump from state i to state j. Let M be the number of

Fig. 4. A toy example for comparing session-based and micro-aspect-based advisor search.

distinct states in the sequence, relabeled from 1 to M. Merging the infinitely many states not represented in the sequence into one state, the sampling distribution of (πk1, . . . , πkM, Σ_{k′=M+1}^{∞} πkk′) is

Dir(αβ1 + nk1, . . . , αβM + nkM, α Σ_{i=M+1}^{∞} βi).   (6)

To sample β, a set of auxiliary variables mij is used, with independent conditional distributions

p(mij = m | s, β, α) ∝ S(nij, m)(αβj)^m,   (7)

where S(·, ·) denotes Stirling numbers of the first kind. (β1, . . . , βM, Σ_{k′=M+1}^{∞} βk′) is then sampled from Dir(m·1, . . . , m·M, γ0), with m·k = Σ_{k′=1}^{M} mk′k.
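Resampling one transition row per Eq. (6) is likewise a single Dirichlet draw. In this sketch the represented states' base weights and the leftover mass are passed in explicitly; the interface is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_pi_row(beta_rep, beta_rest, n_k, alpha=1.0):
    """Resample one transition row via Eq. (6). beta_rep holds the base
    weights beta_1..beta_M of the represented states, beta_rest the
    leftover mass sum_{i>M} beta_i, and n_k[j] the count of k -> j jumps.
    Returns (pi_k1, ..., pi_kM, unrepresented mass)."""
    params = np.append(alpha * np.asarray(beta_rep) + np.asarray(n_k),
                       alpha * beta_rest)
    return rng.dirichlet(params)

row = sample_pi_row(np.array([0.5, 0.3]), 0.2, np.array([3, 1]), alpha=2.0)
```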

6 ADVISOR SEARCH

After we obtain the mined micro-aspects of each task, advisor search can be implemented on the collection of learned micro-aspects. We employ the traditional language-model-based expert search method [1]. Let d be a document (i.e., a micro-aspect). Given a query q, the method uses p(e|q) to rank advisor candidates. By assuming uniform prior distributions p(e) and p(d) and applying Bayes' rule, this is equivalent to ranking candidates by p(q|e) = Σd p(q|d) p(d|e) or p(q|e) ∝ Σd p(q|d) p(e|d) [1]. p(q|d) is the probability of generating q by d's unigram model, with proper smoothing [1]. Intrinsically, the method can be viewed as a weighted accumulation of the p(q|d)'s from the associated documents of e. Recall that the weight between e and d is the number of sessions of e which fall in d. p(d|e) and p(e|d) encode the normalized association weights between candidates and documents from a candidate's perspective (candidate scheme) and a document's perspective (document scheme), respectively. The candidate scheme is not intuitive in our context. Consider two candidates e1 and e2: e1 viewed 100 sessions in total, of which 10 fall in d, while for e2 the two numbers are 10 and 2. Hence, p(d|e1) = 0.1 < p(d|e2) = 0.2. However, e1 viewed more sessions in d than e2 and should have a stronger association. Therefore, the document scheme is used for ranking.

Compared to applying traditional expert search methods directly on session data, searching over micro-aspects has the advantage that the associations between candidates and "documents" are correctly normalized. A toy example is shown in Figure 4. Suppose a user wants to query about "Java multithread programming", and suppose the relevance scores of the orange (with black lines) sessions are 0.9 while those of the green sessions are 0.1. Note that the association weights between candidates and sessions are all 1. If we apply the language model method on sessions directly (which means sessions are "documents"), the ranking scores of Bob and John are 1.8 and 2, respectively.2 In contrast, by summarizing the two micro-aspects in Figure 4 and employing micro-aspects as "documents", Bob's score becomes 0.9 while that of John becomes 0.1, which matches our expectation.

Time Complexity. The time cost of GDP can be regarded roughly as O(NKDI), where I is the number of iterations and K is the truncation level of the variational distribution for the infinite Gaussian mixtures (see appendix). If the other variables are fixed, the time cost is approximately linear in N. The time cost of d-iHMM is O((M^2(T + 1) + 2(M + T) + Nw)I), where I is the number of sampling iterations, M is the number of states (which could vary in different iterations) and Nw is the total number of words in the T sessions. d-iHMM could be more costly than GDP given T = N, since Nw can be much larger than T. Fortunately, the runs of d-iHMM on different tasks are independent and consequently can be parallelized. It is possible to further speed up GDP and d-iHMM by incorporating recent parallel computing techniques for machine learning algorithms (e.g., [20]). Nevertheless, parallel computing is beyond the focus of this work, which is to demonstrate the feasibility of summarizing fine-grained aspects from Web surfing data for advisor search. The advisor search step is efficient since the main cost is only one matrix-vector multiplication (O(Sah), where Sa is the total number of micro-aspects and h is the number of candidates).

7 EXPERIMENTS

We validate our methods for fine-grained knowledge mining and advisor search on real Web surfing data. First, we show the results of advisor search using the full pipeline. Then we evaluate the individual models: GDP for session clustering and d-iHMM for fine-grained knowledge mining.

7.1 Datasets

We collected Web surfing data from two real collaborative environments: (1) the data mining lab at UCSB,

2. We also use the document scheme here. The reason is similar: if we use the candidate scheme, e1, who viewed 10 sessions about "Java IO" among 100 sessions in total, may receive a lower score than e2, who viewed 2 among 10 sessions in total, for a query q related to "Java IO". Assuming the relevance scores of the "Java IO" sessions are all 0.9, it is easy to verify that, using the candidate scheme, p(q|e1) = Σ_{i=1}^{10} 0.9 × 0.01 = 0.09 and p(q|e2) = Σ_{i=1}^{2} 0.9 × 0.1 = 0.18.

and (2) the "networking" research group at IBM T. J. Watson Research Center. The first dataset, Surf-Lab, consists of surfing data from 8 students in the data mining lab at UCSB. We ran the "tcpdump" program ("windump" on Windows) on each student's PC in order to record their surfing activities. Afterwards, HTTP packets (only those generated by Web browsers) and the corresponding textual contents were extracted from the dump files. The second dataset, Surf-IBM, was collected from the gateway at IBM T. J. Watson Research Center: Web surfing packets generated by 20 research scientists in the "networking" research group were captured, and the dataset was processed in a similar way. Both datasets span nearly two months. The sequence of HTTP contents of each person was segmented into sessions according to the following rule: we place a session boundary between two consecutive contents if their timestamps are at least 10 minutes apart, or their cosine similarity is below a threshold (0.5 in this work). Surf-Lab has 686 sessions and Surf-IBM has 3,925 sessions.
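The session segmentation rule can be sketched as below; the input format (a time-ordered list of (timestamp, token-list) pairs) is our own illustration:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two token lists via term-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def segment(contents, gap_secs=600, sim_thresh=0.5):
    """Split a time-ordered list of (timestamp, tokens) pairs into sessions:
    a boundary goes between consecutive contents when the time gap is at
    least 10 minutes or their cosine similarity drops below 0.5."""
    sessions = [[contents[0]]]
    for prev, cur in zip(contents, contents[1:]):
        if cur[0] - prev[0] >= gap_secs or cosine(prev[1], cur[1]) < sim_thresh:
            sessions.append([cur])      # new session starts here
        else:
            sessions[-1].append(cur)
    return sessions
```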

TABLE 1

Queries for Surf-IBM (delimited by “;”).

Disruption tolerant network protocols; DoS attack; Intrusion detection vulnerability scanning; Cloud computing bigtable; Database data mirroring; Big data Big Blue; IBM websphere; Network topology ring; Network collapsed backbone; Mobile location network protocols; Virus malware; 802.11b protocol; Python programming; Machine learning jeopardy; Gateway antivirus program; Network traffic congestion; Security information management authorization; DB2 database; Database data integrity; Network virtualization

7.2 Advisor Search

We first show the results of advisor search. Three schemes are compared: session-based, micro-aspect-based and task-based, in order of increasing granularity. The language-model-based expert search method described in Section 6 is used as the retrieval method. We have tried other traditional expert search methods, but the results are very similar, since they all intrinsically accumulate relevance scores of associated "documents" to candidates. For each scheme, a language model is constructed for each "document", i.e., a session, a micro-aspect, or a task, by aggregating all the texts belonging to it. Note that the session-based scheme is intrinsically applying the traditional language-model-based expert search method on Web surfing data directly.

The evaluation methodology is as follows: we generate 20 queries for each dataset. These queries represent the fine-grained knowledge required by the specific projects on which the candidates were working


during the data collection period. The queries for Surf-IBM are shown in Table 1 as examples. The ground truth labels of a query are obtained by showing candidates their top relevant sessions assessed by the language model method and asking them to assign a relevance score to themselves for the data collection period on a scale from 0 to 2: (1) score = 0 means "irrelevant"; (2) score = 1 means "partially relevant", i.e., he/she has background knowledge for the query; (3) score = 2 means "very relevant". Normalized Discounted Cumulative Gain (NDCG) is used as the evaluation metric:

NDCG@n = Zn Σ_{i=1}^{n} (2^{ri} − 1) / log2(i + 1),   (8)

where ri is the relevance score of the candidate at rank i and Zn is a normalization term that lets the perfect ranking have an NDCG value of 1. We focus on n = 1, 2, 3, since it is costly for a user to verify many suggested advisors; it is important to place the right people at the highest positions.
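Eq. (8) translates to a few lines of code (a straightforward sketch; `rels` is the list of relevance scores in ranked order):

```python
import math

def ndcg_at_n(rels, n):
    """NDCG@n per Eq. (8): Z_n * sum_i (2^{r_i} - 1) / log2(i + 1),
    with Z_n computed from the ideal (descending) ranking."""
    def dcg(rs):
        # enumerate starts at i = 0, so the rank-i discount is log2(i + 2)
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rs[:n]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0
```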

The results are shown in Tables 2 and 3. We can see that the micro-aspect-based scheme outperforms the other two schemes. A one-tailed Wilcoxon test is performed over all 40 queries to evaluate the significance of the superiority. In particular, the micro-aspect-based scheme is significantly better than the session-based scheme at significance level 0.05, and significantly better than the task-based scheme at significance level 0.01. The task-based scheme is the worst (significantly worse than the session-based scheme at significance level 0.05). This is because a task can contain multiple micro-aspects which might not match a user's need very well.

TABLE 2
Comparison of three advisor search schemes on the Surf-Lab dataset. The session-based scheme is intrinsically applying the traditional language-model-based expert search method on Web surfing data directly.

Method              NDCG@1  NDCG@2  NDCG@3
Micro-aspect-based  .883    .918    .942
Session-based       .783    .839    .885
Task-based          .783    .825    .841

Specifically, we find that for one fifth of the queries the micro-aspect-based scheme outperforms the session-based scheme; for the other queries, they generate very similar rankings. We examined that one fifth of queries and found that the situations conform to our analysis in Section 6. Detailed results for two queries are shown in Tables 4 and 5 (candidates are anonymized). For the two queries, the micro-aspect-based scheme ranks c1 and c3 at the top, while the session-based scheme ranks c2 and c4 at the top, respectively. In Surf-Lab, we found c2 was reading

TABLE 3
Comparison of three advisor search schemes on the Surf-IBM dataset. The session-based scheme is intrinsically applying the traditional language-model-based expert search method on Web surfing data directly.

Method              NDCG@1  NDCG@2  NDCG@3
Micro-aspect-based  .900    .926    .928
Session-based       .817    .885    .894
Task-based          .633    .666    .669

<Thinking in Java> and generated 45 sessions, none of which was focused on the "multithreading" aspect. c1 generated 3 truly relevant sessions and should be regarded as the better candidate. A similar situation is observed for c3 and c4 with respect to the query "Mobile Location Network Protocols".

TABLE 4
An example query from Surf-Lab where the micro-aspect-based scheme shows better performance. "True rel / Generally rel" means the number of sessions which are indeed relevant to the query divided by the number of sessions which are relevant to the general domain "Java".

Q: Java multithreading
Candidate  True rel / Generally rel
c1         3 / 4
c2         0 / 45

TABLE 5
An example query from Surf-IBM where the micro-aspect-based scheme shows better performance. "True rel / Generally rel" means the number of sessions which are indeed relevant to the query divided by the number of sessions which are relevant to the general domain "Network Protocols".

Q: Mobile Location Network Protocols
Candidate  True rel / Generally rel
c3         8 / 13
c4         1 / 66

7.3 Session Clustering

Instead of providing a comprehensive comparison of different clustering methods, which is not the focus of this work, the following experiment shows that the proposed GDP model with LE preprocessing (named LEGDP) achieves its design goal. We implement two baseline methods for comparison: (1) Spectral Clustering (SC), a popular clustering method that requires users to specify the number of clusters; (2) GDP with Principal Component Analysis (PCAGDP), which first uses PCA to reduce the data dimensionality


and then applies GDP on the subspace. We use Surf-Lab in this experiment.

The ground truth task labels were obtained manually: each student involved in Surf-Lab was given his/her session data and was asked to assign a task label to each session. Three popular evaluation metrics for clustering are employed: Purity, F-measure and Normalized Mutual Information (NMI). Purity maps each cluster to the class in the ground truth which is most frequent in the cluster, and is defined as the accuracy of this mapping:

Purity(Ω, Ξ) = (1/n) Σ_{k=1}^{K} max_{j∈{1,...,J}} |ωk ∩ ξj|,   (9)

where Ω = {ω1, . . . , ωK} is the set of clusters and Ξ = {ξ1, . . . , ξJ} is the set of ground truth classes. The F-measure operates on pairs of objects (i.e., sessions): the clustering result is treated as n(n − 1)/2 decisions, each of which corresponds to a pair of objects and decides whether they belong to the same class. Based on this, precision and recall scores are computed, and the F-measure is the harmonic mean of precision and recall. NMI is defined as

NMI(Ω, Ξ) = I(Ω, Ξ) / [ (H(Ω) + H(Ξ)) / 2 ],   (10)

where I(·, ·) and H(·) represent mutual information and entropy, respectively.
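The three clustering metrics can be sketched as follows. Each function takes parallel lists of cluster assignments and ground-truth class labels; these are minimal illustrations of Eq. (9), Eq. (10) and the pairwise F-measure described above:

```python
import math
from collections import Counter
from itertools import combinations

def purity(clusters, classes):
    """Eq. (9): map each cluster to its majority class, report accuracy."""
    n = len(classes)
    hit = 0
    for k in set(clusters):
        members = [classes[i] for i in range(n) if clusters[i] == k]
        hit += Counter(members).most_common(1)[0][1]
    return hit / n

def pairwise_f1(clusters, classes):
    """F-measure over the n(n-1)/2 same-cluster decisions."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_c = clusters[i] == clusters[j]
        same_g = classes[i] == classes[j]
        tp += same_c and same_g
        fp += same_c and not same_g
        fn += same_g and not same_c
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def nmi(clusters, classes):
    """Eq. (10): I(Omega, Xi) / ((H(Omega) + H(Xi)) / 2)."""
    n = len(classes)
    pc, pg = Counter(clusters), Counter(classes)
    joint = Counter(zip(clusters, classes))
    mi = sum(c / n * math.log((c / n) / (pc[k] / n * pg[j] / n))
             for (k, j), c in joint.items())
    h = lambda cnt: -sum(c / n * math.log(c / n) for c in cnt.values())
    denom = (h(pc) + h(pg)) / 2
    return mi / denom if denom else 0.0
```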

TABLE 6
Comparison of the numbers of tasks discovered by LEGDP and PCAGDP.

Method        3p-Lab  5p-Lab  Surf-Lab
Ground Truth  33      47      61
LEGDP         37      45      59
PCAGDP        19      38      44

7.3.1 Performance Comparison

First, we show that LEGDP can roughly capture the number of tasks. To this end, we derive two small datasets from Surf-Lab: (1) 3p-Lab, which contains 3 people and 169 sessions; (2) 5p-Lab, which comprises 5 people and 411 sessions. We tune the model parameters on 3p-Lab (details in Section 7.3.2) and continue to use the same set of parameters on 5p-Lab and Surf-Lab. The results are shown in Table 6. As can be seen, LEGDP approximately captures the number of tasks. We also show the number of clusters learned by PCAGDP, which cannot capture the number of tasks well. The reason could be that PCA is a linear dimensionality reduction method and may not capture the complex topical variations of a task.

Second, we investigate the clustering performance measured by Purity, F1 and NMI. Table 7 shows the results. For LEGDP and SC, we use the same

session affinity matrix W [27]. Therefore, the intrinsic difference between LEGDP and SC is due to GDP versus k-means. We set the parameter k for SC to the number of clusters learned by LEGDP for the sake of fairness. We can see from Table 7 that both LEGDP and SC outperform PCAGDP. This indicates that nonlinear manifold structures do exist in tasks; PCA can only capture linear manifold structures and therefore does not perform well. LEGDP and SC achieve comparable performance, but SC needs the user to pre-specify the parameter k.

7.3.2 Parameter Selection for LEGDP

The parameters of LEGDP include the number of selected eigenvectors in LE (i.e., the dimensionality D), the truncation level K of the variational distribution for the infinite Gaussian mixtures (see appendix) and the hyperparameters a0 and b0 of GDP, which encode the prior knowledge for the variance (variance = 1/precision) of each Gaussian component. In this section we explore how to set these parameters in practice.

We vary these parameters and investigate their impact on the number of learned clusters and the clustering performance (F1 and NMI). The results are shown in Figures 5 and 6. For clarity, we omit the results for 5p-Lab; its trends are very similar to those shown in the figures. Figures 5(a) and 6(a) give the results for D. As can be seen, too small a D indicates a subspace of low discriminative power, so the number of clusters is too small and the performance is poor. On the contrary, too large a D leads to the sparsity problem (i.e., the curse of dimensionality): the algorithm learns too many clusters and the performance also starts to decrease. We find D ∈ [25, 35] generally results in good performance. a0 and b0 together give a prior for the cluster precision λk, i.e., E(λkj) = a0/b0. Here we fix b0 = 1 and vary a0/b0, taking the average of the sample precisions of all dimensions as the unit. We find there is a wide range of safe values, i.e., a0/b0 between 80× and 150× the average sample precision, in which LEGDP achieves reasonable numbers of clusters (Figure 5(b)) and good performance (Figure 6(b)). This implies that the model behavior is not very sensitive to the prior hyperparameters. Finally, as shown in Figures 5(c) and 6(c), the truncation level K shows little influence on the modeling results, as long as it is assigned a value significantly higher than the real number of tasks.

7.4 Fine-grained Knowledge Mining

In this section, we show the results of fine-grained knowledge mining by the d-iHMM model. We only show the results for Surf-Lab; the observations on Surf-IBM are very similar. The basic iHMM model is employed for comparison. Since it is difficult and


TABLE 7
Performance comparison of different clustering methods on the Surf-Lab dataset.

          3p-Lab                5p-Lab                Surf-Lab
Method    Purity  F1    NMI     Purity  F1    NMI     Purity  F1    NMI
LEGDP     .768    .956  .772    .751    .953  .764    .667    .947  .698
SC        .762    .948  .753    .721    .948  .733    .658    .933  .675
PCAGDP    .414    .562  .448    .457    .667  .484    .448    .685  .467


Fig. 5. Number of learned clusters by LEGDP when varying (a) D, (b) a0 / b0 and (c) K.


Fig. 6. Performance of LEGDP when varying (a) D, (b) a0 / b0 and (c) K.

laborious to perform quantitative evaluation by manual labeling, the models are evaluated by an intuitive measure and case studies. The hyperparameters of d-iHMM and iHMM are set as follows: h0 = 0.5, aα = 4, bα = 2, aγ0 = 3 and bγ0 = 6. The additional hyperparameter δ0 of d-iHMM is set to 0.25.

Our goal is to learn different states (micro-aspects) in a textually cohesive task. The intuition is that the learned states should have low intra-entropies and high inter-distances between them. Therefore, we use the average state entropy divided by the average pairwise distance between states as an intuitive measure to evaluate the results:

[ (1/S) Σp ( −Σi pi log(pi) ) ] / [ (2 / (S(S−1))) Σp Σ_{q≠p} (1/2) (KL(p, q) + KL(q, p)) ],   (11)

where S is the total number of states, KL(·, ·) represents the KL-divergence function, and p, q represent the empirical unigram distributions for two learned states, estimated by aggregating all the associated sessions. The unigram distributions are defined on the 100 most frequent words in the task. A good set of states

Fig. 7. Comparison of d-iHMM and iHMM with respect to the average state entropy divided by the average pairwise KL-Divergence between states.

should have a low value of this measure. The results are shown in Figure 7. For some small-size tasks, either d-iHMM or iHMM outputs only one state; the results of these tasks are omitted since the measure is undefined. We can see from Figure 7 that d-iHMM can learn better states for almost all the tasks.
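The measure of Eq. (11) can be sketched in a few lines. Here `state_dists` holds the empirical unigram distributions of the learned states (at least two), and the small epsilon guarding the logarithms against zeros is our own numerical choice:

```python
import numpy as np

def entropy_over_divergence(state_dists, eps=1e-12):
    """Eq. (11): average state entropy divided by the average pairwise
    symmetrized KL-divergence. state_dists is an (S, V) array of
    empirical unigram distributions with S >= 2."""
    p = np.asarray(state_dists, float) + eps
    p /= p.sum(axis=1, keepdims=True)       # renormalize after smoothing
    ent = -(p * np.log(p)).sum(axis=1).mean()
    kl = lambda a, b: float((a * np.log(a / b)).sum())
    S = len(p)
    div = [0.5 * (kl(p[i], p[j]) + kl(p[j], p[i]))
           for i in range(S) for j in range(S) if i != j]
    return ent / float(np.mean(div))

m = entropy_over_divergence([[0.9, 0.1], [0.1, 0.9]])
```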

The top five words of each learned state’s unigram


TABLE 8
Micro-aspects mined by d-iHMM for the task "clustering study."

            Top five words of the unigram model
Background  clustering, endnote, pdf, uc, data
state 1     spectral, clustering, tutorial, algorithms, matrix
state 2     sequence, protein, sequences, oxford, press
state 3     mean, nonparametric, shift, vision, blurring
state 4     detection, evolving, online, news, topic
state 5     pass, single, clustering, new, cluster
state 6     segment, audio, segments, broadcast, segmentation
state 7     algorithm, spatial, density, noise, fuzzy
state 8     smooth, time, gene, function, expression

TABLE 9
Micro-aspects mined by iHMM for the task "clustering study."

            Top five words of the unigram model
state 1     clustering, spectral, pdf, endnote, tutorial
state 2     clustering, algorithm, fuzzy, uc, evolving
state 3     clustering, segment, endnote, line, google
state 4     clustering, segment, news, endnote, music
state 5     clustering, nonparametric, mean, endnote, shift
state 6     endnote, clustering, pdf, algorithm, evolving
state 7     clustering, endnote, single, pass, sequence
state 8     clustering, cluster, pass, single, pdf
state 9     clustering, pdf, endnote, spectral, graph
state 10    clustering, smooth, endnote, uc, time

model for the task "clustering study" are shown in Tables 8 (d-iHMM) and 9 (iHMM). Table 8 shows that d-iHMM successfully summarizes different pieces of fine-grained knowledge into different states. These include different clustering methods, such as spectral clustering (state 1), single-pass clustering (state 5) and density-based methods (state 7), and different application settings, e.g., protein sequence clustering (state 2) and news event detection (state 4). The iHMM model generates 10 states for "clustering study" (Table 9). We find iHMM tends to mix up different micro-aspects. For example, both state 3 and state 4 contain "segment clustering", and both state 7 and state 8 have "single-pass clustering". It sometimes mixes different micro-aspects, e.g., state 7 also contains "sequence clustering". As another example, we show the micro-aspects learned for the task "Programming contests" by d-iHMM and iHMM in Tables 10 and 11, respectively. This task is due to a user who was participating in Topcoder3 contests. Both models learn four states for this task. We can see from Table 10 that this user was learning some classic algorithms and data structures: sorting, maximum flow and red-black trees. The iHMM model again fails to separate different micro-aspects (Table 11). The explanation is

3. http://www.topcoder.com/

that different micro-aspects under the same task share common background textual characteristics. iHMM tries to model the background content in each state independently, which leads to low discriminative power. On the contrary, d-iHMM has higher discriminative power because it models the background words of every state with a common background unigram model.

TABLE 10
Micro-aspects mined by d-iHMM for the task "Programming contests".

            Top five words of the unigram model
Background  contests, sort, review, black, tree
state 1     black, tree, node, red, nodes
state 2     flow, maximum, search, contests, capacity
state 3     sort, merge, list, insertion, sorting
state 4     class, review, contests, match, track

TABLE 11
Micro-aspects mined by iHMM for the task "Programming contests".

            Top five words of the unigram model
state 1     sort, black, tree, node, data
state 2     flow, contests, review, search, overview
state 3     virtual, contests, review, class, algorithm
state 4     contests, sort, review, black, tree

7.5 Running time

We test the efficiency of the whole framework. We collect more Web surfing packets from IBM's gateway to form datasets of various sizes. The running time of LEGDP and d-iHMM is shown in Figure 8. The tests were run in MATLAB on a PC with an Intel Core i7 and 8 GB of memory. The running time of LEGDP appears linear w.r.t. the number of sessions, while that of d-iHMM shows slightly accelerated growth (due to M). These observations conform to the analysis in Section 6. We set the number of sampling iterations to 500 for d-iHMM. d-iHMM is more costly; fortunately, the computation for different tasks can be parallelized. The advisor search phase only requires a few milliseconds, since its main cost is one matrix-vector multiplication.


Fig. 8. Running time of LEGDP and d-iHMM.


8 CONCLUSIONS

We introduced a novel problem, fine-grained knowledge sharing in collaborative environments, which is desirable in practice. We identified mining the fine-grained knowledge reflected by people's interactions with the outside world as the key to solving this problem. We proposed a two-step framework to mine fine-grained knowledge and integrated it with the classic expert search method for finding the right advisors. Experiments on real Web surfing data showed encouraging results.

There are open issues for this problem. (1) The fine-grained knowledge could have a hierarchical structure. For example, "Java IO" can contain "File IO" and "Network IO" as sub-knowledge. We could iteratively apply d-iHMM on the learned micro-aspects to derive a hierarchy, but how to search over this hierarchy is not a trivial problem. (2) The basic search model can be refined, e.g., by incorporating the time factor, since people gradually forget as time passes. (3) Privacy is also an issue. In this work, we demonstrated the feasibility of mining task micro-aspects for solving this knowledge sharing problem. We leave these possible improvements to future work.

ACKNOWLEDGMENTS

This research was sponsored in part by the National Natural Science Foundation of China under Grant No. 61373118, the Army Research Laboratory under cooperative agreement W911NF-09-2-0053, and NSF IIS-0954125. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein. The authors would like to thank the editor and reviewers for their constructive comments that helped improve this paper.



Ziyu Guan received the BS and PhD degrees in Computer Science from Zhejiang University, China, in 2004 and 2010, respectively. He worked as a research scientist at the University of California, Santa Barbara from 2010 to 2012. He is currently a full professor in the College of Information and Technology of China's Northwest University. His research interests include attributed graph mining and search, machine learning, expertise modeling and retrieval, and recommender systems.

Shengqi Yang is a PhD candidate in the Computer Science Department, UC Santa Barbara, under the supervision of Prof. Xifeng Yan. Prior to joining UCSB, he earned his BS and MS degrees from Beijing University of Posts and Telecommunications in 2007 and 2010, respectively. His research interests lie in the broad area of data management, with emphasis on managing and querying large-scale heterogeneous graphs and unstructured data.

Huan Sun received the BS degree in Electronic Engineering and Information Science from the University of Science and Technology of China in 2010. She is currently working toward the PhD degree in computer science at the University of California, Santa Barbara. Her research interests include statistical machine learning, deep learning, and data mining.

Mudhakar Srivatsa is a research staff member in the Network Technologies Department at the Thomas J. Watson Research Center. He received his PhD in computer science from Georgia Tech. His research interests primarily include network analytics and secure information flow. He serves as a principal investigator for Information Network Research in the Network Science Collaborative Technology Alliance, where he works on data mining techniques for co-evolving networks (social, information, and communication).

Xifeng Yan is an associate professor at the University of California, Santa Barbara. He holds the Venkatesh Narayanamurti Chair in Computer Science. He received his PhD degree in Computer Science from the University of Illinois at Urbana-Champaign in 2006. He was a research staff member at the IBM T. J. Watson Research Center between 2006 and 2008. He has been working on modeling, managing, and mining graphs in bioinformatics, social networks, information networks, and computer systems. His work has been extensively referenced, with over 7,000 citations per Google Scholar. He received the NSF CAREER Award, IBM Invention Achievement Award, ACM-SIGMOD Dissertation Runner-Up Award, and IEEE ICDM 10-year Highest Impact Paper Award.