

Submitted to the Annals of Applied Statistics
arXiv: stat.ME/0904.xxxx

RANKING RELATIONS USING ANALOGIES

IN BIOLOGICAL AND INFORMATION NETWORKS*

By Ricardo Silva†, Katherine Heller‡,

Zoubin Ghahramani‡ and Edoardo M. Airoldi§

University College London†, University of Cambridge‡ and Harvard University§

Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects S = {A(1):B(1), A(2):B(2), . . . , A(N):B(N)}, measures how well other pairs A:B fit in with the set S. Our work addresses the question: is the relation between objects A and B analogous to those relations found in S? Such questions are particularly relevant in information retrieval, where an investigator might want to search for analogous pairs of objects that match the query set of interest. There are many ways in which objects can be related, making the task of measuring analogies very challenging. Our approach combines a similarity measure on function spaces with Bayesian analysis to produce a ranking. It requires data containing features of the objects of interest and a link matrix specifying which relationships exist; no further attributes of such relationships are necessary. We illustrate the potential of our method on text analysis and information networks. An application on discovering functional interactions between pairs of proteins is discussed in detail, where we show that our approach can work in practice even if a small set of protein pairs is provided.

1. Contribution. Many university admission exams, such as the American Scholastic Assessment Test (SAT) and Graduate Record Exam (GRE), have historically included a section on analogical reasoning. A prototypical analogical reasoning question is as follows:

doctor : hospital ::

A) sports fan : stadium

B) cow : farm

C) professor : college

D) criminal : jail

* This work was supported in part by NSF grant no. DMS-0907009.
AMS 2000 subject classifications: Primary 62H99; secondary 62P10.
Keywords and phrases: Network analysis, Bayesian inference, variational approximation, ranking, information retrieval, data integration, Saccharomyces cerevisiae.


E) food : grocery store

The examinee has to answer which of the five pairs best matches the relation implicit in doctor:hospital. Although all candidate pairs have some type of relation, the pair professor:college seems to best fit the notion of (profession, place of work), or the "works in" relation implicit between doctor and hospital.

This problem is non-trivial because measuring the similarity between objects directly is not an appropriate way of discovering analogies, as extensively discussed in the cognitive science literature. For instance, the analogy between an electron spinning around the nucleus of an atom and a planet orbiting around the Sun is not justified by isolated, non-relational comparisons of an electron to a planet, and of an atomic nucleus to the Sun (Gentner, 1983). Discovering the underlying relationship between the elements of each pair is key in determining analogies.

1.1. Applications. This paper concerns practical problems of data analysis where analogies, implicitly or not, play a role. One of our motivations comes from the bioPIXIE1 project (Myers et al., 2005). bioPIXIE is a tool for exploratory analysis of protein-protein interactions. Proteins have multiple functional roles in the cell, e.g., regulating metabolism, regulating cell cycle, among others. A protein often assumes different functional roles while interacting with different proteins. When a molecular biologist experimentally observes an interaction between two proteins, e.g., a binding event of {Pi, Pj}, it might not be clear which function that particular interaction is contributing to. The bioPIXIE system allows a molecular biologist to input a set S of proteins that are believed to have a particular functional role in common, and generates a list of other proteins that are deduced to play the same role. Evidence for such predictions is provided by a variety of sources, such as the expression levels for the genes that encode the proteins of interest and their cellular localization. Another important source of information bioPIXIE takes advantage of is a matrix of relationships, indicating which proteins interact according to some biological criterion. However, we do not necessarily know which interactions correspond to which functional roles.

The application to protein interaction networks that we develop in Section 5 shares some of the features and motivations of bioPIXIE. However, we aim at providing more detailed information. Our input set S is a small set of pairs of proteins that are postulated to all play a common role, and we want to rank other pairs Pi:Pj according to how similar they are with respect to

1http://pixie.princeton.edu/pixie/


S. The goal is to automatically return pairs that correspond to analogous interactions.

To use an analogy itself to explain our procedure, recall the SAT example that opened this section. The pair of words doctor:hospital presented in the SAT question plays the role of a protein-protein interaction and is the smallest possible case of S, i.e., a single pair. The five choices A–E in the SAT question correspond to other observed protein-protein interactions we want to match with S, i.e., other possible pairs. Since multiple valid answers are possible, we rank them according to a similarity metric. In the application to protein interactions, in Section 5, we perform thousands of queries and evaluate the goodness of the resulting rankings according to multiple gold standards widely accepted by molecular and cellular biologists (Ashburner et al., 2000; Kanehisa and Goto, 2000; Mewes et al., 2004).

The general problem of interest in this paper is a practical problem of information retrieval (Manning et al., 2008) for exploratory data analysis: given a query set S of linked pairs, which other pairs of objects in my relational database are linked in a similar way? We apply this analysis to cases where it is not known how to explicitly describe the different classes of relations, but good models to predict the existence of relationships are available. In Section 4, we consider an application to information retrieval in text documents for illustrative purposes. Given a set of pairs of webpages which are related by some hyperlink, we would like to find other pairs of pages that are linked in a similar way. In information network settings, the proposed method could be useful, for instance, to answer queries for encyclopedia pages relating scientists and their major discoveries, to search for analogous concepts, or to identify the absence of analogous concepts, in Wikipedia. From an evaluation perspective, this application domain provides an example where large scale evaluation is more straightforward than in the biological setting.

In this paper, we introduce a method for ranking relations based on the Bayesian similarity criterion underlying Bayesian sets, a method originally proposed by Ghahramani and Heller (2005) and reviewed in Section 2. In contrast to Bayesian sets, however, our method is tailored to drawing analogies between pairs of objects.

1.2. Related work. To give an idea of the type of data which our method is useful for analyzing, consider the methods of Turney and Littman (2005) for automatically solving SAT problems. Their analysis is based on a large corpus of documents extracted from the World Wide Web. Relations between two words Wi and Wj are characterized by their joint co-occurrence


with other relevant words (such as particular prepositions) within a small window of text. This defines a set of features for each Wi:Wj relationship, which can then be compared to other pairs of words using some notion of similarity. Unlike in this application, however, there are often no (or very few) explicit features for the relationships of interest. Instead we need a method for defining similarities using features of the objects in each relationship, while at the same time avoiding the mistake of directly comparing objects instead of relations.

One of the earliest approaches for determining analogical similarity was introduced by Rumelhart and Abrahamson (1973). In their paper, one is initially given a set of pairwise distances between objects (say, by the subjective judgement of a group of people). Such distances are used to embed the given objects in a latent space via a multidimensional scaling approach. A related pair A:B is then represented as a vector connecting A and B in the latent space. Its similarity with respect to another pair C:D is defined by comparing the direction and magnitude of the corresponding vectors. Our approach is probabilistic instead of geometrical, and operates directly on the object features instead of pairwise distances.
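For intuition, this geometric approach can be sketched in a few lines. The embeddings below are made-up illustrative coordinates, not data from Rumelhart and Abrahamson: a pair is represented by its difference vector in the latent space, and candidate pairs are ranked by the Euclidean distance between their relation vector and the query's.

```python
import numpy as np

# Hypothetical 2-D embeddings, standing in for the output of
# multidimensional scaling on subjective pairwise distances.
emb = {
    "doctor":    np.array([0.0, 0.0]),
    "hospital":  np.array([1.0, 2.0]),
    "professor": np.array([3.0, 0.5]),
    "college":   np.array([4.0, 2.4]),
    "cow":       np.array([-2.0, 1.0]),
    "farm":      np.array([-2.5, 0.0]),
}

def pair_vector(a, b):
    """Represent the relation a:b as the vector from a to b in latent space."""
    return emb[b] - emb[a]

def dissimilarity(query, candidate):
    """Distance between relation vectors: small means analogous."""
    return float(np.linalg.norm(pair_vector(*query) - pair_vector(*candidate)))

query = ("doctor", "hospital")
candidates = [("professor", "college"), ("cow", "farm")]
ranked = sorted(candidates, key=lambda c: dissimilarity(query, c))
```

With these toy coordinates, professor:college has nearly the same difference vector as doctor:hospital and so ranks first, even though neither of its objects is individually close to doctor or hospital.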

We will focus solely on ranking pairwise relations. The idea can be extended to more complex relations, but we will not pursue this here. Our approach is described in detail in Section 3.

Finally, the probabilistic, geometrical and logical approaches applied to analogical reasoning problems can be seen as a type of relational data analysis (Dzeroski and Lavrac, 2001; Getoor and Taskar, 2007). In particular, analogical reasoning is a part of the more general problem of generating latent relationships from relational data. Several approaches for this problem are discussed in Section 6. To the best of our knowledge, however, most analogical reasoning applications are interesting proofs of concept that tackle ambitious problems such as planning (Veloso and Carbonell, 1993), or are motivated as models of cognition (Gentner, 1983). Our goal is to create an off-the-shelf method for practical exploratory data analysis.

2. A review of probabilistic information retrieval and the Bayesian sets method. The goal of information retrieval is to provide data points (e.g., text documents, images, medical records) that are judged to be relevant to a particular query. Queries can be defined in a variety of ways, and in general they do not specify exactly which records should be presented. In practice, retrieval methods rank data points according to some measure of similarity with respect to the query (Manning et al., 2008). Although queries can, in practice, consist of any piece of information, for the purposes of this


paper we will assume that queries are sets of objects of the same type we want to retrieve.

Probabilities can be exploited as a measure of similarity. We will briefly review one standard probabilistic framework for information retrieval (Manning et al., 2008, Chapter 11). Let R be a binary random variable representing whether an arbitrary data point X is "relevant" for a given query set S (R = 1) or not (R = 0). Let P(· | ·) be a generic probability mass function or density function, with its meaning given by the context. Points are ranked in decreasing order by the following criterion

P(R = 1 | X, S) / P(R = 0 | X, S) = [P(R = 1 | S) / P(R = 0 | S)] × [P(X | R = 1, S) / P(X | R = 0, S)],

which is equivalent to ranking points by the expression

(2.1) log P(X | R = 1, S) − log P(X | R = 0, S)

The challenge is to define what form P(X | R = r, S) should assume. It is not practical to collect labeled data in advance which, for every possible class of queries, will give an estimate for P(R = 1 | X, S): in general, one cannot anticipate which classes of queries will exist. Instead, a variety of approaches has been developed in the literature in order to define a suitable instantiation of (2.1). These include a method that builds a classifier on-the-fly using S as elements of the positive class R = 1, and a random subset of data points as the negative class R = 0 (e.g., Turney, 2008b).
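A minimal sketch of the classifier-on-the-fly idea, under illustrative assumptions (synthetic Gaussian features and plain gradient ascent; the cited work differs in its details): treat the query set as the positive class, a random background sample as the negative class, and rank by the fitted log-odds of relevance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: a query set S (positive class R = 1) clustered in
# feature space, and a random background sample (negative class R = 0).
S = rng.normal(loc=2.0, scale=0.5, size=(20, 3))
background = rng.normal(loc=0.0, scale=1.0, size=(200, 3))

X = np.vstack([S, background])
y = np.concatenate([np.ones(len(S)), np.zeros(len(background))])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit a logistic classifier on the fly by gradient ascent on the
# log-likelihood of the positive/negative labels.
w = np.zeros(X.shape[1])
for _ in range(500):
    w += 0.01 * X.T @ (y - sigmoid(X @ w)) / len(y)

def relevance_score(x):
    """Log-odds of R = 1 given x; ranking by this is monotone in (2.1)."""
    return float(x @ w)
```

Points resembling the query set receive high log-odds and rise to the top of the ranking; everything else falls toward the background.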

The Bayesian sets method of Ghahramani and Heller (2005) is a state-of-the-art probabilistic method for ranking objects, partially inspired by Bayesian psychological models of generalization in human cognition (Tenenbaum and Griffiths, 2001). In this setup, the event "R = 1" is equated with the event that X and the elements of S are i.i.d. points generated by the same model. The event "R = 0" is the event by which X and S are generated by two independent models: one for X and another for S. The parameters of all models are random variables that have been integrated out, with fixed (and common) hyperparameters. The result is the instantiation of (2.1) as

(2.2) log P(X | S) − log P(X) = log [P(X, S) / (P(X) P(S))],

the Bayesian sets score function by which we rank points X given a query S. The right-hand side was rearranged to provide a more intuitive graphical model, shown in Figure 1. From this graphical model interpretation we can see that the score function is a Bayes factor comparing two models (Kass and Raftery, 1995).
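For concreteness, the score (2.2) has a closed form when items are binary feature vectors modeled with independent Bernoulli likelihoods and conjugate Beta priors, the setting used by Ghahramani and Heller (2005). A sketch (the hyperparameter values in the test below are illustrative):

```python
import numpy as np

def bayesian_sets_score(x, S, alpha, beta):
    """
    log P(x | S) - log P(x) for a binary feature vector x under independent
    Bernoulli likelihoods with per-feature Beta(alpha_j, beta_j) priors.
    S is an (N, K) binary matrix holding the query set; alpha, beta are (K,).
    """
    N = S.shape[0]
    s = S.sum(axis=0)                  # per-feature counts of 1s in the query set
    a_post = alpha + s                 # Beta posterior parameters given S
    b_post = beta + N - s
    # Posterior predictive log-probability of x, feature by feature.
    log_post = (x * np.log(a_post / (a_post + b_post))
                + (1 - x) * np.log(b_post / (a_post + b_post)))
    # Prior predictive log-probability of x.
    log_prior = (x * np.log(alpha / (alpha + beta))
                 + (1 - x) * np.log(beta / (alpha + beta)))
    return float(np.sum(log_post - log_prior))
```

An item sharing the features that dominate the query set scores above zero (it is better explained by the shared-parameter model); an item with the opposite features scores below zero.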


Fig 1. In order to score how well an arbitrary element X fits in with query set S = {X1, X2, . . . , Xq}, the Bayesian sets methodology compares the marginal likelihood of the model in (a), P(X, S), against the model in (b), P(X)P(S). In (a), the random parameter vector Θ is given a prior defined by the (fixed) hyperparameter α. The same (latent) parameter vector is shared by the query set and the new point. In (b), the parameter vector Θ that generates X is different from the one that generates the query set.

In the next section, we describe how the Bayesian sets method can be adapted to define analogical similarity in the biological and information network settings we consider, and why such modifications are necessary.

3. A model of Bayesian analogical similarity for relations. To define an analogy is to define a measure of similarity between structures of related objects. In our setting, we need to measure the similarity between pairs of objects. The key aspect that distinguishes our approach from others is that we focus on the similarity between functions that map pairs to links, rather than focusing on the similarity between the features of objects in a candidate pair and the features of objects in the query pairs.

As an illustration, consider an analogical reasoning question from a SAT-like exam where for a given pair (say, water:river) we have to choose, out of 5 pairs, the one that best matches the type of relation implicit in such a "query." In this case, it is reasonable to say car:highway would be a better match than (the somewhat nonsensical) soda:ocean, since cars flow on a highway, and so does water in a river. Notice that if we were to measure the similarity between objects instead of relations, soda:ocean would be a much closer pair, since soda is similar to water, and ocean is similar to river.

Nevertheless, it is legitimate to infer relational similarity from individual object features, as summarized by Gentner and Medina (1998) in their "kind world hypothesis." What is needed is a mechanism by which object features should be weighted in a particular relational similarity problem. We postulate that, in analogical reasoning, similarity between features of


objects is only meaningful to the extent by which such features are useful to predict the existence of the relationships.

Our approach can be described as follows. Let A and B represent object spaces. To say that an interaction A:B is analogous to S = {A(1):B(1), A(2):B(2), . . . , A(N):B(N)} amounts to implicitly defining a measure of similarity between the pair A:B and the set of pairs S, where each query item A(k):B(k) corresponds to some pair Ai:Bj. However, this similarity is not directly derived from the similarity of the information contained in the distribution of the objects themselves, {Ai} ⊆ A, {Bi} ⊆ B. Rather, the similarity between A:B and the set S is defined in terms of the similarity of the functions mapping the pairs as being linked. Each possible function captures a different possible relationship between the objects in the pair.

Bayesian analogical reasoning formulation: Consider a space of latent functions in A × B → {0, 1}. Assume that A and B are two objects classified as linked by some unknown function f(A, B), i.e., f(A, B) = 1. We want to quantify how similar the function f(A, B) is to the function g(·, ·), which classifies all pairs (Ai, Bj) ∈ S as being linked, i.e., where g(Ai, Bj) = 1. The similarity should depend on the observations {S, A, B} and our prior distribution over f(·, ·) and g(·, ·).

Functions f(·) and g(·) are unobserved, hence the need for a prior that will be used to integrate over the function space. Our similarity metric will be defined using Bayes factors, as explained next.

3.1. Analogy in function spaces via logistic regression. For simplicity, we will consider a family of latent functions that is parameterized by a finite-dimensional vector: the logistic regression function with multivariate Gaussian priors for its parameters.

For a particular pair (Ai ∈ A, Bj ∈ B), let Xij = [φ1(Ai, Bj) φ2(Ai, Bj) . . . φK(Ai, Bj)]T be a point on a feature space defined by the mapping φ : A × B → ℝ^K. This feature space mapping computes a K-dimensional vector of attributes of the pair that may be potentially relevant to predicting the relation between the objects in the pair. Let Lij ∈ {0, 1} be an indicator of the existence of a link or relation between Ai and Bj in the database. Let Θ = [θ1, . . . , θK]T be the parameter vector for our logistic regression model such that

(3.1) P(Lij = 1 | Xij, Θ) = logistic(ΘTXij)

where logistic(x) = 1/(1 + exp(−x)).
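As a sketch, (3.1) translates directly into code; the feature map below (concatenation of the two objects' features plus their elementwise product) is a hypothetical choice for illustration, not the one used in the paper's experiments.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi(a, b):
    """Hypothetical feature map phi(a, b): concatenation of the two object
    feature vectors plus their elementwise product, giving X_ij."""
    return np.concatenate([a, b, a * b])

def link_probability(a, b, theta):
    """Equation (3.1): P(L_ij = 1 | X_ij, Theta) = logistic(Theta^T X_ij)."""
    return float(logistic(theta @ phi(a, b)))
```

With Θ = 0 every pair is linked with probability 1/2; positive weights on features shared by linked pairs push the probability toward 1.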


Fig 2. The score of a new data point {Ai, Bj} is given by the Bayes factor that compares models (a) and (b). Node α represents the hyperparameters for Θ. In (a), the generative model is the same for both the new point and the query set represented in the rectangle. Notice that our conditioning set S of pairs might contain repeated instances of a same point, i.e., some A or B might appear multiple times in different relations, as illustrated by nodes with multiple outgoing edges. In (b), the new point and the query set do not share the same parameters.

We now apply the same score function underlying the Bayesian sets methodology explained in Section 2. However, instead of comparing objects by marginalizing over the parameters of their feature distributions, we compare functions for link indicators by marginalizing over the parameters of the functions.

Let LS be the vector of link indicators for S: in fact each L ∈ LS has the value L = 1, indicating that every pair of objects in S is linked. Consider the following Bayes factor:

(3.2) P(Lij = 1, LS = 1 | Xij, S) / [P(Lij = 1 | Xij) P(LS = 1 | S)]

This is an adaptation of Equation (2.2), where relevance is now defined by whether Lij and LS were generated by the same model, for fixed {Xij, S}. In one sense, this is a discriminative Bayesian sets model, where we predict links instead of modeling joint object features. Since we are integrating out Θ, a prior for this parameter vector is needed. The graphical models corresponding to this Bayes factor are illustrated in Figure 2.

Thus, each pair (Ai, Bj) is evaluated with respect to a query set S by the score function given in (3.2), rewritten after taking a logarithm and dropping constants as:

(3.3) score(Ai, Bj) = log P(Lij = 1 | Xij, S, LS = 1) − log P(Lij = 1 | Xij)

The exact details of our procedure are as follows. We are given a relational database (DA, DB, LAB). Dataset DA (DB) is a sample of objects of type A


(B). Relationship table LAB is a binary matrix modeled as generated from a logistic regression model of link existence. A query proceeds according to the following steps:

1. the user selects a set of pairs S that are linked in the database, where the pairs in S are assumed to have some relation of interest;

2. the system performs Bayesian inference to obtain the corresponding posterior distribution for Θ, P(Θ | S, LS), given a Gaussian prior P(Θ);

3. the system iterates through all linked pairs, computing the following for each pair

P(Lij = 1 | Xij, S, LS = 1) = ∫ P(Lij = 1 | Xij, Θ) P(Θ | S, LS = 1) dΘ

P(Lij = 1 | Xij) is similarly computed by integrating over P(Θ). All pairs are presented in decreasing order according to the score in Equation (3.3).

The integral presented above does not have a closed-form solution. Because computing the integrals by a Monte Carlo method for a large number of pairs would be unreasonable, we use a variational approximation (Jordan et al., 1999; Airoldi, 2007). Figure 3 presents a summary of the approach.
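For intuition only, the two predictive probabilities in the score can be estimated at toy scale by naive Monte Carlo over the prior, here taken to be a standard normal N(0, I) purely for illustration. This is exactly the brute-force route that is impractical for large collections of pairs:

```python
import numpy as np

rng = np.random.default_rng(1)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mc_score(x, X_S, n_samples=20000):
    """
    Monte Carlo estimate of score = log P(L=1 | x, S, L_S=1) - log P(L=1 | x),
    integrating Theta out under an illustrative N(0, I) prior.
    X_S is an (N, K) matrix of query-pair features, all with L = 1.
    """
    theta = rng.normal(size=(n_samples, x.shape[0]))   # draws from the prior
    lik_S = logistic(theta @ X_S.T).prod(axis=1)       # P(L_S = 1 | Theta)
    p_link = logistic(theta @ x)                       # P(L = 1 | x, Theta)
    posterior_pred = np.sum(lik_S * p_link) / np.sum(lik_S)  # self-normalized IS
    prior_pred = np.mean(p_link)
    return float(np.log(posterior_pred) - np.log(prior_pred))
```

A candidate pair whose features align with the query set's evidence gets a positive score; one whose features point the opposite way gets a negative score.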

The suggested setup scales as O(K³) with the feature space dimension, due to the matrix inversions necessary for (variational) Bayesian logistic regression (Jaakkola and Jordan, 2000). A less precise approximation to P(Θ | S, LS) can be imposed if the dimensionality of Θ is too high. However, it is important to point out that once the initial integral P(Θ | S, LS) is approximated, each score function can be computed at a cost of O(K²).
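The O(K²) per-pair cost can be illustrated with a standard probit-style moderation formula (MacKay, 1992): once a Gaussian approximation N(m, V) to P(Θ | S, LS) is in hand, each predictive probability needs only an inner product and one quadratic form. The m and V below are placeholders, not the output of the paper's variational fit.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def approx_predictive(x, m, V):
    """
    Approximate the integral of logistic(theta^T x) under N(m, V) by
    logistic(mu / sqrt(1 + pi * s2 / 8)), with mu = m^T x and s2 = x^T V x.
    The quadratic form x^T V x dominates the cost: O(K^2) per pair.
    """
    mu = m @ x
    s2 = x @ V @ x
    return float(logistic(mu / np.sqrt(1.0 + np.pi * s2 / 8.0)))
```

With V = 0 this reduces to the plug-in logistic probability; growing posterior uncertainty moderates the prediction toward 1/2, as the full integral would.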

Our analogical reasoning formulation is a relational model in that it models the presence and absence of interactions between objects. By conditioning on the link indicators, the similarity score between A:B and C:D is always a function of pairs (A, B) and (C, D) that is not in general decomposable as similarities between A and C, and B and D.

3.2. Comparison with Bayesian sets and stochastic block models. The model presented in Figure 2 is a conditional independence model for relationship indicators, i.e., given object features and parameters, the entries of LD are independent. However, the entries in LD are in general marginally dependent. Since this is a model of relationships given object attributes, we call the model introduced here the relational Bayesian sets model.

Our approach has some similarity to the so-called stochastic blockmodels. These models were developed four decades ago in the network literature to quantify the notion of "structural equivalence" by means of blocks


Fig 3. General framework of the procedure: first, a "prior" over parameters Θ for a link classifier is defined empirically using linked and unlinked pairs of points (the dashed edges indicate that creating a prior empirically is optional, but in practice we rely on this method). Given a query set S of linked pairs of interest, the system computes the predictive likelihood of each linked pair D(i) ∈ D+ and compares it to the conditional predictive likelihood, given the query. This defines a measure of similarity with respect to S by which all pairs in D+ are sorted.

of nodes that instantiate similar connectivity patterns (Lorrain and White, 1971; Holland and Leinhardt, 1975). Modern stochastic blockmodel approaches, in statistics and machine learning, build on these seminal works by introducing the discovery of the block structure as part of the model search strategy (Fienberg et al., 1985; Nowicki and Snijders, 2001; Kemp et al., 2006; Xu et al., 2006; Airoldi et al., 2005, 2008; Hoff, 2008). The observed features in our approach, Xij, effectively play the same role as the latent indicators in stochastic block models.² Since Xij is observed, there is no need to integrate over the feature space to obtain the posterior distribution of Θ. This computational efficiency is particularly relevant in information retrieval and exploratory data analysis, where users expect a relatively short response time.

As an alternative to our relational Bayesian sets approach, consider the following direct modification of the standard Bayesian sets formulation to this problem: merge the datasets DA and DB into a single dataset, creating for each pair (Ai, Bj) a row in the database with an extra binary indicator of relationship existence. Create a joint model for pairs by using the marginal models for A and B and treating different rows as being independent. This ignores the fact that the resulting merged data points are not really i.i.d. under such a model, because the same object might appear in multiple relations (Dzeroski and Lavrac, 2001). The model also fails to capture the dependency between Ai and Bj that arises from conditioning on Lij, even if Ai and Bj are marginally independent. Nevertheless, heuristically this approach can sometimes produce good results, and for several types of probability families it is very computationally efficient. We evaluate it in Section 4.

² In a stochastic block model, typically each object has a single latent feature indicating membership to some class. For a pair Ai, Bj, the corresponding feature vector Xij would be the pair of class-membership indicators for Ai and Bj.

3.3. Choice of features and relational discrimination. Our setup assumes that the feature space φ provides a reasonable classifier to predict the existence of links. Useful predictive features can also be generated automatically with a variety of algorithms (e.g., the "structural logistic regression" of Popescul and Ungar, 2003). See also Dzeroski and Lavrac (2001). Jensen and Neville (2002) discuss shortcomings of methods for automated feature selection in relational classification.

We also assume feature spaces are the same for all possible combinations of objects. This allows for comparisons between, e.g., cells from different species, or webpages from different web domains, as long as features are generated by the same function φ(·, ·). In general, we would like to relax this requirement, but for the problem to be well-defined, features from the different spaces must be related somehow. A hierarchical Bayesian formulation for linking different feature spaces is one possibility which might be treated in future work.

3.4. Priors. The choice of prior is based on the observed data, in a way that is equivalent to the choice of priors used in the original formulation of Bayesian sets (Ghahramani and Heller, 2005). Let Θ̂ be the maximum likelihood estimator of Θ using the relational database (DA, DB, LAB). Since the number of possible pairs grows at a quadratic rate with the number of objects, we do not use the whole database for maximum likelihood estimation. Instead, to get Θ̂ we use all linked pairs as members of the "positive" class (L = 1), and subsample unlinked pairs as members of the "negative" class (L = 0). We subsample by sampling each object uniformly at random from the respective datasets DA and DB to get a new pair. Since link matrices LAB are usually very sparse, in practice this will almost always provide an unlinked pair. Sections 4 and 5 provide more details.
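The subsampling scheme can be sketched as follows; the data structures (a set of linked index pairs) are our own illustrative choice. Uniformly drawn (i, j) pairs are kept as negatives, with the rare draws that happen to be linked rejected explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_training_pairs(n_A, n_B, linked, n_negative):
    """
    Build index pairs for the maximum likelihood fit: all linked pairs form
    the positive class (L = 1); uniformly sampled (i, j) pairs form the
    negative class (L = 0), rejecting the occasional hit that is linked.
    `linked` is a set of (i, j) tuples with L_ij = 1.
    """
    positives = list(linked)
    negatives = []
    while len(negatives) < n_negative:
        i = int(rng.integers(n_A))
        j = int(rng.integers(n_B))
        if (i, j) not in linked:      # sparse L_AB: almost always true
            negatives.append((i, j))
    return positives, negatives
```

Because LAB is sparse, the rejection step almost never fires, so the cost is essentially linear in the number of negatives requested.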

We use the prior P(Θ) = N(Θ̂, (cT̂)⁻¹), where N(m, V) is a normal with mean m and variance V. Matrix T̂ is the empirical second moment


matrix of the linked object features, although a different choice might be adequate for different applications. Constant c is a smoothing parameter set by the user. In all of our experiments, we set c equal to the number of positive pairs. A good choice of c might be important to obtain maximum performance, but we leave this issue as future work. Wang et al. (2009) present some sensitivity analysis results for a particular application in text analysis.

Empirical priors are a sensible choice, since this is a retrieval, not a predictive, task. Basically, the entire data set is the population, from which prior information is obtained on possible query sets. A data-dependent prior based on the population is important for an approach such as Bayesian sets, since deviances from the “average” behaviour in the data are useful to discriminate between subpopulations.

3.5. On continuous and multivariate relations. Although we focus on measuring similarity of qualitative relationships, the same idea could be extended to continuous (or ordinal) measures of relationship, or relationships where each Lij is a vector. For instance, Turney and Littman (2005) measure relations between words by their co-occurrences in the neighborhood of specific keywords, such as the frequency of two words being connected by a specific preposition in a large body of text documents. Several similarity metrics can be defined on this vector of continuous relationships. However, given data on word features, one can easily modify our approach by substituting the logistic regression component with some multiple regression model.

4. Ranking hyperlinks on the web. In the following application, we consider a collection of webpages from several universities: the WebKB collection, where relations are given by hyperlinks (Craven et al., 1998). Webpages are classified as being of type course, department, faculty, project, staff, student or other. Documents come from four universities (cornell, texas, washington and wisconsin). We are interested in recovering pairs of webpages {A, B} where webpage A has a link to webpage B. Notice that the relationship is asymmetric. Different types of webpages imply different types of links. For instance, a faculty webpage linking to a project webpage constitutes a type of link. The analogical reasoning task here is simplified if we assume each webpage object has a single role (i.e., exactly one out of the pre-defined types {course, department, faculty, project, staff, student, other}), and therefore a pair of webpages implies a unique type of relationship. The webpage types are for evaluation purposes only, as we explain later: we will not provide this information to the model.



[Figure 4 appears here: four precision/recall panels (recall on the x-axis, precision on the y-axis), one per university (cornell, texas, washington, wisconsin), comparing RBSets, SBSets1, SBSets2, Cosine1 and Cosine2 on student → course queries.]

Fig 4. Results for student → course relationships.

Our main standard of comparison is a “flattened Bayesian sets” algorithm (which we will call “standard Bayesian sets,” SBSets, in contrast to the relational model, RBSets). Using a multivariate independent Bernoulli model as in the original paper (Ghahramani and Heller, 2005), we merge linked webpage pairs into single rows, and then apply the original algorithm directly to the merged data. It is clear that datapoints are no longer independent, but the SBSets algorithm assumes this is the case. Evaluating this algorithm serves the purpose of measuring both the loss from not treating relational data as such, and the limitations of evaluating the similarity of pairs through models for the marginal probabilities of A and B instead of models for the predictive function P(Lij | Xij).

Binary data was extracted from this database using the same methodology as in Ghahramani and Heller (2005). A total of 19,450 binary variables per object are generated, where each variable indicates whether a word from a fixed dictionary appears in a given document more frequently than the average. To avoid introducing extra approximations into RBSets, we reduced the dimensionality of the original representation using singular value decomposition, obtaining 25 measures per object.
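The SVD reduction step can be sketched as follows. This is our illustration, not the authors' code; it assumes the binary word indicators are stored as the rows of a document-by-word matrix:

```python
import numpy as np

def svd_reduce(B, k=25):
    """Reduce a binary document-by-word matrix to k latent measures per object
    via truncated SVD, as in the 19,450 -> 25 reduction described above."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U[:, :k] * s[:k]      # k-dimensional representation of each document

rng = np.random.default_rng(0)
B = (rng.random((300, 1000)) > 0.9).astype(float)   # sparse binary word indicators
X = svd_reduce(B, k=25)
assert X.shape == (300, 25)
```

For matrices as wide as 19,450 columns, a sparse truncated solver would be preferable in practice; the dense call above is only for clarity.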


[Figure 5 appears here: four precision/recall panels (recall on the x-axis, precision on the y-axis), one per university (cornell, texas, washington, wisconsin), comparing RBSets, SBSets1, SBSets2, Cosine1 and Cosine2 on faculty → project queries.]

Fig 5. Results for faculty → project relationships.

In this experiment, objects are of the same type and, therefore, of the same dimensionality. The feature vector Xij for each pair of objects {Ai, Bj} consists of the V features of object Ai, the V features of object Bj, and measures Z = {Z1, . . . , ZV}, where Zv = (Aiv × Bjv)/(‖Ai‖ × ‖Bj‖), with ‖Ai‖ being the Euclidean norm of the V-dimensional representation of Ai. We also add a constant value (1) to the feature set as an intercept term for the logistic regression. Feature set Z is exactly the one used in the cosine distance measure³, a common and practical measure widely used in information retrieval (Manning et al., 2008). This feature space also has the important advantage of scaling well (linearly) with the number of variables in the database. Moreover, adopting such features will make our comparisons fairer, since we evaluate how well cosine distance itself performs in our task. Notice that our choice of Xij is suitable for asymmetric relationships, as naturally occur in the domain of webpage links. For symmetric relationships, features such as |Aiv − Bjv| could be used instead.
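Assembling Xij from two object representations can be sketched as follows; this is a minimal sketch under the assumption that each object is a V-dimensional NumPy vector, and the helper name is ours:

```python
import numpy as np

def pair_features(a, b):
    """Feature vector X_ij for an ordered pair (A_i, B_j): the V features of
    each object, the V cosine-style measures Z_v, and an intercept term."""
    z = (a * b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.concatenate([a, b, z, [1.0]])    # length 3V + 1

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])
x = pair_features(a, b)
# sum of the Z block recovers the cosine similarity between a and b
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(x[6:9].sum(), cosine)
```

The ordering (features of A first, then B) is what makes the representation suitable for asymmetric relations: swapping the arguments produces a different feature vector.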

In order to set the empirical prior, we sample 10 “negative” pairs for

³The cosine similarity measure between two items corresponds to the sum of the features in Z.


                 student → course                   faculty → project
              C1    C2    RB    SB1   SB2       C1    C2    RB    SB1   SB2
cornell      0.87  0.82  0.87  0.82  0.80      0.19  0.18  0.24  0.18  0.18
texas        0.62  0.32  0.77  0.55  0.54      0.24  0.21  0.29  0.12  0.12
washington   0.69  0.31  0.76  0.67  0.64      0.40  0.42  0.47  0.40  0.40
wisconsin    0.77  0.72  0.88  0.75  0.73      0.28  0.30  0.26  0.19  0.21

Table 1
Area under the precision/recall curve for each algorithm and query.

each “positive” one, and weight them to reflect the proportion of linked to unlinked pairs in the database. That is, in the WebKB study we use 10 negatives for each positive, and we count each negative case as if it were replicated 350 times. We perform subsampling and reweighting in order to be able to fit the database in the memory of a desktop computer.

Evaluation of the significance of retrieved items often relies on subjective assessments (Ghahramani and Heller, 2005). To simplify our study, we will focus on particular setups where objective measures of success are defined.

To evaluate the gain of our model over competitors, we will use the following setup. In the first query, we are given all pairs of webpages of the type student → course from three of the labeled universities, and evaluate how relations are ranked in the fourth university. Because we know class labels for the webpages (while the algorithm does not), we can use the classes of the returned pairs to label a hit as being “relevant” or “irrelevant.” We label a pair (Ai, Bj) as relevant if and only if Ai is of type student and Bj is of type course, and Ai links to Bj.

This is a very stringent criterion, since other types of relations could also be valid (e.g., staff → course appears to be a reasonable match). However, this facilitates objective comparisons of algorithms. Also, the other class contains many types of pages, which allows for possibilities such as a student → “hobby” pair. Such pairs might be hard to evaluate (e.g., is that particular hobby demanding or challenging in a similar way to coursework?). As a compromise, we omit all pages from the category other in order to better clarify differences between algorithms⁴.

Precision/recall curves (Manning et al., 2008) for the student → course queries are shown in Figure 4. There are four queries, each corresponding to a search over a specific university given all valid student → course pairs from the other three. There are five algorithms in each evaluation: the standard Bayesian sets with the original 19,450 binary variables for each object, plus

⁴As an extreme example, querying student → course pairs from the wisconsin university returned student → other pairs at the top four. However, these other pages were for some reason course pages, such as http://www.cs.wisc.edu/~markhill/cs752.html

Page 16: Cambridge Machine Learning Group | Homepage - RANKING …mlg.eng.cam.ac.uk/pub/pdf/SilHelGhaetal10.pdf · comes from the bioPIXIE1 project (Myers et al., 2005). bioPIXIE is a tool

16 SILVA ET AL.

another 19,450 binary variables, each corresponding to the product of the respective variables in the original pair of objects (SBSets1); the standard Bayesian sets with the original binary variables only (SBSets2); a standard cosine distance measure over the 25-dimensional representation of each page (Cosine 1), with pairs being given by the combined vector of 50 features; a cosine distance measure using the raw 19,450-dimensional binary representation of each document (Cosine 2); and our approach, RBSets.

In Figure 4, RBSets demonstrates consistently superior or equal precision-recall. Although SBSets performs well when asked to retrieve only student items or only course items, it falls short of detecting which features of student and course are relevant to predict a link. The discriminative model within RBSets conveys this information through the link parameters.

We also ran an experiment with a query of type faculty → project, shown in Figure 5. This time, results for the different algorithms were closer to each other. To make differences more evident, we adopt a slightly different measure of success: we count a retrieved pair as a full hit (1) if it is a faculty → project pair, and as half a hit (0.5) for pairs of type student → project and staff → project. Notice this is a much harder query. For instance, the structure of the project webpages in the texas group was quite distinct from the other universities: they are mostly very short, basically containing links to members of the project and to other project webpages.

Although the precision/recall curves convey a global picture of the performance of each algorithm, they might not be a completely clear way of ranking approaches in cases where curves intersect at several points. In order to summarize algorithm performances with a single statistic, we computed the area under each precision/recall curve (with linear interpolation between points). Results are given in Table 1. Numbers in bold indicate the largest area under the curve. The dominance of RBSets should be clear.
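The area statistic described above can be computed with a simple trapezoidal rule, which is exactly linear interpolation between consecutive precision/recall points. This is our illustration, not the authors' code:

```python
def auc_pr(precision, recall):
    """Area under a precision/recall curve with linear interpolation
    between points; recall is assumed sorted in increasing order."""
    area = 0.0
    for k in range(1, len(recall)):
        # trapezoid between consecutive (recall, precision) points
        area += (recall[k] - recall[k - 1]) * (precision[k] + precision[k - 1]) / 2.0
    return area

# a perfect ranking keeps precision 1 at every recall level -> area 1
assert auc_pr([1.0, 1.0, 1.0], [0.0, 0.5, 1.0]) == 1.0
```

Note that linear interpolation in precision/recall space is a convention choice; some retrieval evaluations instead use step-wise (interpolated precision) averaging, which gives slightly different numbers.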

5. Ranking protein interactions. The budding yeast is a unicellular organism that has become a de-facto model organism for the study of molecular and cellular biology (Botstein et al., 1997). There are about 6,000 proteins in the budding yeast, which interact in a number of ways (Cherry et al., 1997). For instance, proteins bind together to form protein complexes, the physical units that carry out most functions in the cell (Krogan et al., 2006). In recent years, significant resources have been directed to collecting experimental evidence of physical protein binding, in an effort to infer and catalogue protein complexes and their multifaceted functional roles (e.g., Fields and Song, 1989; Ito et al., 2000; Uetz et al., 2000; Gavin et al., 2002; Ho et al., 2002). Currently, there are four main sources of interactions between pairs of proteins that target proteins localized in different cellular compartments with variable degrees of success: (i) literature-curated interactions (Reguly et al., 2006); (ii) yeast two-hybrid (Y2H) interaction assays (Yu et al., 2008); (iii) protein fragment complementation (PCA) interaction assays (Tarassov et al., 2008); and (iv) tandem affinity purification (TAP) interaction assays (Gavin et al., 2006; Krogan et al., 2006). These collections include a total of about 12,292 protein interactions (Jensen and Bork, 2008), although the number of such interactions is estimated to be between 18,000 (Yu et al., 2008) and 30,000 (von Mering et al., 2002).

Statistical methods have been developed for analyzing many aspects of this large protein interaction network, including de-noising (Bernard et al., 2007; Airoldi et al., 2008), function prediction (Nabieva et al., 2005), and identification of binding motifs (Banks et al., 2008).

5.1. Overview of the analysis. We consider multiple functional categorization systems for the proteins in budding yeast. For evaluation purposes, we use individual proteins’ functional annotations curated by the Munich Institute for Protein Sequencing (MIPS; Mewes et al., 2004), those by the Kyoto Encyclopedia of Genes and Genomes (KEGG; Kanehisa and Goto, 2000), and those by the Gene Ontology consortium (GO; Ashburner et al., 2000). We consider multiple collections of physical protein interactions that encode alternative semantics. Physical protein-to-protein interactions in the MIPS curated collection measure physical binding events observed experimentally in Y2H and TAP experiments, whereas physical protein-to-protein interactions in the KEGG curated collection measure a number of different modes of interaction, including phosphorylation, methylation and physical binding, all taking place in the context of a specific signaling pathway. So we have three possible functional annotation databases (MIPS, KEGG and GO) and two possible link matrices (MIPS and KEGG), which can be combined.

Our experimental pipeline is as follows. (i) Pick a database of functional annotations, say MIPS, and a collection of interactions, say MIPS (again). (ii) Pick a pair of categories, M1 and M2. For instance, take M1 to be cytoplasm (MIPS 40.03) and M2 to be cytoplasmic and nuclear degradation (MIPS 06.13.01). (iii) Sample, uniformly at random and without replacement, a set S of 15 interactions in the chosen collection. (iv) Rank other interacting pairs⁵ according to the score in Equation 3.3 and, for comparison purposes, according to three other approaches to be described in Section 5.1.4. (v) The process is repeated for a large number of pairs M1 × M2, and

⁵The portion of the ranked list that is relevant for evaluation purposes is limited to a subset of the protein-protein interactions. More details are given in Section 5.1.3.


No.  Measurements description               Data sources
1.   Expression microarrays                 Gasch et al. (2000); Brem et al. (2005); Primig et al. (2000); Yvert et al. (2003)
2.   Synthetic genetic interactions         Breitkreutz et al. (2003); SGD
3.   Cellular localization                  Huh et al. (2003)
4.   Transcription factor binding sites     Harbison et al. (2004); TRANSFAC
5.   Sequence similarities                  Altschul et al. (1990); Zhu and Zhang (1999)

Table 2
Collection of data sets used to generate protein-specific features.

different query sets S are generated for each pair of categories. (vi) Calculate an evaluation metric for each query and each of the four scores, and report a comparative summary of the results.

5.1.1. Protein-specific features. The protein-specific features were generated using the data sets summarized in Table 2 and an additional data set (Qi et al., 2006). 20 gene expression attributes were obtained from the dataset processed by Qi et al. (2006). Each gene expression attribute for a protein pair Pi:Pj corresponds to the correlation coefficient between the expression levels of the corresponding genes. The 20 different attributes are obtained from 20 different experimental conditions as measured by microarrays. We did not use pairs of proteins from Qi et al. for which we did not have data in the data sets listed in Table 2. This resulted in approximately 6,000 positively linked data points for the MIPS network and 39,000 for KEGG.
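Each such attribute reduces to a correlation coefficient between two expression profiles under one experimental condition; a minimal sketch of ours (the function name and toy profiles are illustrative only):

```python
import numpy as np

def expression_feature(expr_i, expr_j):
    """One gene-expression attribute for a pair P_i:P_j: the correlation
    coefficient between the expression profiles of the corresponding genes."""
    return float(np.corrcoef(expr_i, expr_j)[0, 1])

x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.isclose(expression_feature(x, 2 * x + 1), 1.0)  # perfectly correlated profiles
```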

We generated another 25 protein-protein gene expression features from the data in Table 2 using the same procedure based on correlation coefficients. This gives a total of 45 attributes, corresponding to the main dataset used in our relational Bayesian sets runs.

Another dataset was generated using the remaining (i.e., non-microarray) features of Table 2. Such features are binary and highly sparse, with most entries being 0 for the majority of linked pairs. We removed attributes for which we had fewer than 20 linked pairs with positive values according to the MIPS network. The total number of extra binary attributes was 16.

Several measurements were missing. We imputed missing values for each variable in a particular datapoint by using its empirical average among the observed values.

Given the 45 or 61 attributes of a given pair {Pi, Pj}, we applied a non-linear transformation where we normalize the vector by its Euclidean norm in order to obtain our feature table X.

5.1.2. Calibrating the prior for θ. We initially fit a logistic regression classifier using maximum likelihood estimation (MLE) and our data, obtaining the estimate θ̂. Our choice of covariance matrix Σ̂ for θ is defined to be a rescaling of a squared norm of the data:

(5.1)    (Σ̂)⁻¹ = X_POSᵀ X_POS

where X_POS is the matrix containing the protein-protein features of only the linked pairs used in the MLE computation.

5.1.3. Evaluation metrics. As in the WebKB experiment, we propose an objective measure of evaluation that is used to compare different algorithms. Consider a query set S, and a ranked response list R = {R1, R2, R3, . . . , RN} of protein-protein pairs. Every element of S is a pair of proteins Pi:Pj such that Pi is of class Mi and Pj is of class Mj, where Mi and Mj are classes from either MIPS, KEGG or Gene Ontology. In general, proteins belong to multiple classes. This is in contrast with the WebKB experiment, where according to our webpage categorization there was only one possible type of relationship for each pair of webpages. The retrieval algorithm that generates R does not receive any information concerning the MIPS, KEGG or GO taxonomy. R starts with the linked protein pair that is judged most similar to S, followed by the other protein pairs in the population, in decreasing order of similarity. Each algorithm has its own measure of similarity.

The evaluation criterion for each algorithm is as follows: as before, we generate a precision-recall curve and calculate the area under the curve (AUC). We also calculate the proportion (TOP10), among the top 10 elements in each ranking, of pairs that match the original {M1, M2} selection (i.e., a “correct” Pi:Pj is one where Pi is of class M1 and Pj of class M2, or vice-versa; notice that each protein belongs to multiple classes, so both conditions might be satisfied). Since a researcher is only likely to look at the top ranked pairs, it makes sense to define a measure that uses only a subset of the ranking. AUC and TOP10 are our two evaluation measures.
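The TOP10 measure can be sketched as follows; this is our illustration, where `classes_of` is a hypothetical mapping from a protein to its set of classes (proteins may belong to several):

```python
def top10_score(ranking, query_classes, classes_of):
    """Proportion of the top 10 ranked pairs matching the query categories
    {M1, M2} in either order."""
    m1, m2 = query_classes
    hits = 0
    for (p, q) in ranking[:10]:
        ok = (m1 in classes_of[p] and m2 in classes_of[q]) or \
             (m2 in classes_of[p] and m1 in classes_of[q])
        hits += ok
    return hits / 10.0

classes_of = {"P1": {"M1"}, "P2": {"M2", "M3"}, "P3": {"M3"}}
ranking = [("P1", "P2"), ("P2", "P1"), ("P1", "P3")] + [("P3", "P3")] * 7
assert top10_score(ranking, ("M1", "M2"), classes_of) == 0.2
```

Checking both orders of (M1, M2) matches the "or vice-versa" clause in the definition above.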

The original classes {M1, M2} are known to the experimenter but not to the algorithms. As in the WebKB experiment, our criterion is rather stringent, in the sense that it requires a perfect match of each RI with the MIPS, KEGG or GO categorization. There are several ways in which a pair RI might be analogous to the relation implicit in S, and they do not need to agree with MIPS, GO or KEGG. Still, if we are willing to believe that these standard categorization systems capture the functional organization of proteins at some level, this must lead to associations between the categories given to S and relevant sub-populations of protein-protein interactions similar to S. Therefore, the corresponding AUC and TOP10 are useful tools for comparing different algorithms even if the actual measures are likely to be pessimistic for a fixed algorithm.


5.1.4. Competing algorithms. We compare our method against a variantof it and two similarity metrics widely used for information retrieval.

1. The cosine score (Manning et al., 2008), denoted by cos.
2. The nearest neighbor score, denoted by nns.
3. The relational maximum likelihood sets score, denoted by mls.

The nearest neighbor score measures the minimum Euclidean distance between RI and any individual point in S, for a given query set S and a given candidate point RI. The relational maximum likelihood sets score is a variation of RBSets where we initially sample a subset of the unlinked pairs (10,000 points in our setup) and, for each query S, we fit a logistic regression model to obtain the parameter estimate θ^MLE_S. We also use a logistic regression model fit to the whole dataset (the same one used to generate the prior for RBSets), giving the estimate θ^MLE. A new score, analogous to (3.3), is given by log P(Lij = 1 | Xij, θ^MLE_S) − log P(Lij = 1 | Xij, θ^MLE); i.e., we do not integrate out the parameters or use a prior, but instead the models are fixed at their respective estimates.
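The mls score is thus a difference of log-probabilities under two fixed logistic regression fits; a sketch of ours (the function names are illustrative, not the paper's):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mls_score(x_ij, theta_query, theta_global):
    """mls score: log P(L=1 | x, theta_S^MLE) - log P(L=1 | x, theta^MLE),
    with both logistic models fixed at their point estimates."""
    return np.log(sigmoid(x_ij @ theta_query)) - np.log(sigmoid(x_ij @ theta_global))

x = np.array([1.0, 2.0])
# identical estimates -> a score of exactly zero
assert mls_score(x, np.array([0.5, -0.5]), np.array([0.5, -0.5])) == 0.0
```

A positive score means the query-specific model finds the link more probable than the population-wide model does, which is the same "relative to the background" logic that the Bayesian score uses, minus the integration over parameters.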

Neither cos nor nns can be interpreted as a measure of analogical similarity, in the sense that they do not take into account how the protein pair features X contribute to the interaction⁶. It is true that a direct measure of analogical similarity is not theoretically required to perform well according to our (non-analogical) evaluation metric. However, we will see that there are practical advantages in doing so.

5.2. Results on the MIPS collection of physical interactions. For this batch of experiments, we use the MIPS network of protein-protein interactions to define the relationships. In the initial experiment, we selected queries from all combinations of MIPS classes for which there were at least 50 linked pairs Pi:Pj in the network that satisfied the choice of classes. Each query set contained 15 pairs. After removing the MIPS-categorized proteins for which we had no feature data, we ended up with a total of 6,125 proteins and 7,788 positive interactions. We set the prior for RBSets using a sample of 225,842 pairs labeled as having no interaction, as selected by Qi et al. (2006).

For each tentative query set S of categories {M1, M2}, we scored and ranked pairs P′i:P′j such that both P′i and P′j were connected to some protein appearing in S by a path of no more than two steps, according to the MIPS network. The reasons for the filtering are two-fold: to increase the

⁶As a consequence, neither uses negative data. Another consequence is the necessity of modeling the input space that generates X, a difficult task given the dimensionality and the continuous nature of the features.


(a)
Method   #AUC   #TOP10   #AUC.S   #TOP10.S
COS       240     294      219      277
NNS        42     122       28       75
MLS       105     270       52      198
RBSets    542     556      578      587

(b)
Method   #AUC   #TOP10   #AUC.S   #TOP10.S
COS       314     356      306      340
NNS        75     146       62      111
MLS       273     329      246      272
RBSets    267     402      245      387

Table 3
Number of times each method wins when querying pairs of MIPS classes using the MIPS protein-protein interaction network. The first two columns, #AUC and #TOP10, count the number of times the respective method obtains the best score according to the AUC and TOP10 measures, respectively, among the 4 approaches. This is divided by the number of replications of each query type (5). The last two columns, #AUC.S and #TOP10.S, are “smoothed” versions of this statistic: a method is declared the winner of a round of 5 replications if it obtains the best score in at least 3 out of the 5 replications. The top table (a) shows the results when only the continuous variables are used by RBSets, and the bottom table (b) when the discrete variables are also given to RBSets.

computational performance of the ranking, since fewer pairs are scored; and to minimize the chance that undesirable pairs would appear in the top 10 ranked pairs. Tentative queries were not performed if, after filtering, we obtained fewer than 50 possible correct matches. Trivial queries, where filtering resulted only in pairs of the same class as the query, were also discarded. The resulting number of unique pairs of categories {M1, M2} was 931 classes of interactions. For each pair of categories, we sampled our query set S 5 times, generating a total of 4,655 rankings per algorithm.
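The two-step neighborhood filter can be sketched as a depth-limited breadth-first expansion over the interaction network. This is our illustration; the adjacency representation and names are assumptions:

```python
def within_two_steps(adjacency, seeds):
    """Proteins reachable from any seed by a path of at most two edges
    in the interaction network (depth-limited breadth-first search)."""
    reachable = set(seeds)
    frontier = set(seeds)
    for _ in range(2):                       # expand two steps
        frontier = {n for p in frontier for n in adjacency.get(p, ())} - reachable
        reachable |= frontier
    return reachable

adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
# from A: one step reaches B, two steps reach C; D stays out
assert within_two_steps(adj, {"A"}) == {"A", "B", "C"}
```

Candidate pairs P′i:P′j would then be kept only when both endpoints fall inside the returned set.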

We ran two types of experiments. In one version, we give RBSets the data containing only the 45 (continuous) microarray measurements. In the second variation, we provide RBSets with all 61 variables, including the 16 sparse binary indicators. However, we noticed that the addition of the 16 binary variables hurts RBSets considerably. We conjecture that one reason might be the degradation of the variational approximation. Including the binary variables hardly changed the other three methods, so we chose to use the 61-variable dataset for the other methods⁷.

⁷We also performed an experiment (not included) where only the continuous attributes were used by the other methods. The advantage of RBSets still increased, slightly (by a 2% margin against the cosine distance method). For this reason, we analyze the most


Table 3 summarizes the results of this experiment. We show the number of times each method wins according to both the AUC and TOP10 criteria. The number of wins is presented as divided by 5, the number of random sets generated for each query type {M1, M2} (notice these numbers do not need to add up to 931, since ties are possible). Moreover, we also present “smoothed” versions of this statistic, where we count a method as the winner for any given {M1, M2} category if, for the group of 5 queries, the method obtains the best result in at least 3 of the sets. The motivation is to smooth out the extra variability added by the particular set of 15 protein pairs for a fixed {M1, M2}. The proposed relational Bayesian sets method is the clear winner according to all measures when we select only the continuous variables. For this reason, for the rest of this section all analysis and experiments will consider only this case.
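The smoothed win statistic can be sketched as follows; this is our illustration, assuming each method carries a list of scores for the 5 replications of one query type:

```python
def smoothed_wins(scores_by_method):
    """AUC.S / TOP10.S statistic: a method 'wins' a query type if it obtains
    the best score in at least 3 of the 5 replications."""
    methods = list(scores_by_method)
    counts = {m: 0 for m in methods}
    for rep in range(5):
        best = max(scores_by_method[m][rep] for m in methods)
        for m in methods:
            counts[m] += scores_by_method[m][rep] == best
    return {m for m in methods if counts[m] >= 3}

scores = {"RBSets": [0.9, 0.8, 0.7, 0.9, 0.2],
          "COS":    [0.5, 0.6, 0.6, 0.5, 0.9]}
assert smoothed_wins(scores) == {"RBSets"}
```

One replication where a method loses badly no longer flips the verdict for the whole query type, which is the smoothing the text describes.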

Table 4 displays a pairwise comparison of the methods. In this table, we show how often the row method performs better than the column method, among those trials where there was no tie. Again, RBSets dominates.

Another useful summary is the distribution of correct hits in the top 10 ranked elements across queries. This provides a measure of the difficulty of the problem, besides the relative performance of each algorithm. In Table 5, we show the proportion of correct hits among the top 10 for each algorithm for our queries using the MIPS categorization and also the GO categorization, as explained in the next section. About 14% of the time, all pairs in the top 10 ranked by RBSets were of the intended type, compared to 8% for the second best approach.

5.2.1. Changing the categorization system. A variation of this experiment was performed where the protein categorizations do not come from the same family as the link network, i.e., where we used the MIPS net-

pessimistic case.

                    AUC                               TOP10
         COS    NNS    MLS    RBSets       COS    NNS    MLS    RBSets
COS       -     0.67   0.43    0.30         -     0.70   0.46    0.30
NNS      0.32    -     0.18    0.06        0.29    -     0.25    0.11
MLS      0.56   0.81    -      0.25        0.53   0.74    -      0.28
RBSets   0.69   0.93   0.74     -          0.69   0.88   0.71     -

Table 4
Pairwise comparison of methods according to the AUC and TOP10 criteria. Each cell shows the proportion of the trials where the method in the respective row wins over the method in the column, according to both criteria. In each cell, the proportion is calculated with respect to the 4,655 rankings where no tie happened.


Proportion of top hits using MIPS categories and links specified by the MIPS database
           0     1     2     3     4     5     6     7     8     9     10
COS      0.12  0.15  0.12  0.10  0.08  0.07  0.06  0.05  0.04  0.07  0.08
NNS      0.29  0.16  0.14  0.10  0.06  0.05  0.03  0.03  0.03  0.03  0.02
MLS      0.12  0.12  0.12  0.10  0.09  0.08  0.07  0.06  0.07  0.06  0.07
RBSets   0.04  0.08  0.09  0.09  0.09  0.08  0.09  0.07  0.09  0.08  0.14

Proportion of top hits using GO categories and links specified by the MIPS database
           0     1     2     3     4     5     6     7     8     9     10
COS      0.12  0.13  0.11  0.10  0.11  0.09  0.06  0.06  0.04  0.06  0.06
NNS      0.53  0.23  0.07  0.02  0.02  0.02  0.04  0.01  0.00  0.00  0.01
MLS      0.16  0.11  0.12  0.10  0.08  0.08  0.08  0.06  0.05  0.06  0.05
RBSets   0.09  0.09  0.10  0.10  0.08  0.08  0.06  0.08  0.08  0.07  0.12

Table 5
Distribution across all queries of the number of hits in the top 10 pairs, as ranked by each algorithm. The more skewed to the right, the better. Notice that using GO categories doubles the number of zero hits for RBSets.

Method   #AUC   #TOP10   #AUC.S   #TOP10.S
COS        58      73       58       72
NNS         1      10        0        4
MLS        26      55       13       38
RBSets     93     105      101      110

Table 6
Number of times each method wins when querying pairs of GO classes using the MIPS protein-protein interaction network. Columns #AUC, #TOP10, #AUC.S and #TOP10.S are defined as in Table 3.

work but not the MIPS categorization. Instead, we performed queries according to Gene Ontology categories. Starting from 150 pre-selected GO categories (Myers et al., 2006), we once again generated unordered category pairs {M1, M2}. A total of 179 queries, with 5 replications each (a total of 895 rankings), were generated, and the results are summarized in Table 6.

This is a more challenging scenario for our approach, which is optimized with respect to MIPS. Still, we are able to outperform the other approaches. Differences are less dramatic, but consistent. In the pairwise comparison of RBSets against the second best method, cos, our method wins 62% of the time by the TOP10 criterion.

5.2.2. The role of filtering. In both experiments with the MIPS network, we filtered candidates by examining only a subset of the proteins linked to the elements in the query set by a path of no more than two proteins. It is relevant to evaluate how much coverage of each category pair {M1, M2} we obtain by this neighborhood selection.

For each query S, we calculate the proportion of pairs Pi:Pj of the same


0

0.2

0.4

0.6

0.8

1

[0,60] (60,70] (70,80] (80,90] (90,100] 5Recall

Distribution of Average Proportion of Reachable Links(Links Specified by the MIPS Database)

MIPSGO

Fig 6. Distribution of the coverage of valid pairs in the MIPS network, according to ourgenerated query sets. Results are broken into the two categorization systems (MIPS andGO) used in this experiment.

categorization {M1,M2} such that both Pi and Pj are included in the neigh-borhood. Figure 6 shows the resulting distributions of such proportions(from 0 to 100%): a histogram for the MIPS search and a histogram forthe GO search. Despite the small neighborhood, coverage is large. For theMIPS categorization, 93% of the queries resulted in a coverage of at least75% (with 24% of the queries resulting in perfect coverage). Although fil-tering implies that some valid pairs will never be ranked, the gain obtainedby reducing false positives in the top 10 ranked pairs is considerable (resultsnot shown) across all methods, and the computational gain of reducing thesearch space is particularly relevant in exploratory data analysis.
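The neighborhood filter and the coverage statistic just described can be sketched as a breadth-first search of bounded depth. This is one plain reading of the filtering rule, not the authors' code; the toy network, protein names, and category pairs below are invented for illustration.

```python
from collections import deque

def k_hop_neighborhood(adj, seeds, k=2):
    """All nodes reachable from any seed by a path of at most k edges."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:  # do not expand past the depth limit
            continue
        for nbr in adj.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen

def coverage(adj, seeds, valid_pairs, k=2):
    """Proportion of valid pairs with BOTH endpoints in the neighborhood."""
    hood = k_hop_neighborhood(adj, seeds, k)
    inside = sum(1 for a, b in valid_pairs if a in hood and b in hood)
    return inside / len(valid_pairs)

# Toy chain network P1-P2-P3-P4-P5 (symmetric adjacency lists):
adj = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P2", "P4"],
       "P4": ["P3", "P5"], "P5": ["P4"]}
# Invented "same category" pairs to be covered:
valid = [("P2", "P3"), ("P4", "P5"), ("P3", "P5")]
print(round(coverage(adj, seeds=["P1"], valid_pairs=valid, k=2), 3))  # → 0.333
```

With depth 2 only {P1, P2, P3} is reachable from P1, so just one of the three valid pairs is covered; widening the neighborhood trades computation for coverage.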

5.3. Results on the KEGG collection of signaling pathways. We repeat the same experimental setup, now using the KEGG network to define the protein-protein interactions. We selected proteins from the KEGG categorization system for which we had data available. A total of 6,125 proteins was selected. The KEGG network is much denser than MIPS. A total of 38,961 positive pairs and 226,188 negative links were used to generate our empirical prior.

However, since the KEGG network is much denser than MIPS, we filtered our candidate pairs by allowing only proteins that are directly linked to the proteins in the query set S. Even under this restriction, we are able to obtain high coverage: the neighborhood of 90% of the queries included all valid pairs of the same category, and essentially all queries included at least 75% of the pairs falling in the same category as the query set. A total of 1,523 possible pairs of categories (7,615 queries, considering the 5 replications) were generated.

Results are summarized in Table 7. Again, it is evident that RBSets dominates other methods. In the pairwise comparison against cos, RBSets wins 76% of the time according to the TOP10 criterion. However, the ranking problem in the KEGG network was much harder than in the MIPS network (according to our automated non-analogical criterion). We believe that the reason is that, in KEGG, the simple filtering scheme has much less influence, as reflected by the high coverage. The distribution of the number of hits in the top 10 ranked items is shown in Table 8. Despite the success of RBSets relative to the other algorithms, there is room for improvement.
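Each row of the hit-distribution tables (Tables 5 and 8) is a per-query histogram of top-10 hit counts. A minimal sketch of how such a row could be computed, using made-up hit counts:

```python
from collections import Counter

def hit_distribution(hit_counts):
    """Proportion of queries with 0, 1, ..., 10 hits in the top 10.

    hit_counts: one integer per query, the number of correct pairs
    among that query's top 10 ranked pairs.
    """
    n = len(hit_counts)
    counts = Counter(hit_counts)
    return [round(counts.get(k, 0) / n, 2) for k in range(11)]

# Hypothetical hit counts for ten queries:
hits = [0, 3, 10, 0, 7, 1, 1, 2, 0, 5]
print(hit_distribution(hits))
# → [0.3, 0.2, 0.1, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0, 0.0, 0.1]
```

A right-skewed row (mass on the high hit counts) indicates a method that frequently fills its top 10 with valid pairs, which is the reading applied to RBSets above.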

6. More related work. There is a large literature on analogical reasoning in artificial intelligence and psychology. We refer to French (2002) for a survey, and to more recent papers on clustering (Marx et al., 2002), prediction (Turney and Littman, 2005; Turney, 2008a) and dimensionality reduction (Memisevic and Hinton, 2005) as examples of other applications. Classical approaches for planning have also exploited analogical similarities (Veloso and Carbonell, 1993).

Non-probabilistic similarity functions between relational structures have also been developed for the purpose of deriving kernel matrices, such as those required by support vector machines. Borgwardt (2007) provides a comprehensive survey and state-of-the-art methods. It would be interesting to adapt such methods to problems of analogical reasoning.

The graphical model formulation of Getoor et al. (2002) incorporates models of link existence in relational databases, an idea used explicitly in Section 3 as the first step of our problem formulation. In the clustering literature, the probabilistic approach of Kemp et al. (2006) is motivated by principles similar to those in our formulation: the idea is that there is an infinite mixture of subpopulations that generates the observed relations. Our problem, however, is to retrieve other elements of a subpopulation described by elements of a query set, a goal that is closer to the classical paradigm of analogical reasoning.

Method    #AUC  #TOP10  #AUC.S  #TOP10.S
COS        159     575     134       507
NNS         30     305      17       227
MLS        290     506     199       431
RBSets    1042    1091    1107      1212

Table 7
Number of times each method wins when querying pairs of KEGG classes using the KEGG protein-protein interaction network. Columns #AUC, #TOP10, #AUC.S and #TOP10.S are defined as in Table 3.

Proportion of top hits using KEGG categories and links specified by the KEGG database
           0     1     2     3     4     5     6     7     8     9    10
COS     0.56  0.21  0.08  0.03  0.02  0.01  0.01  0.01  0.01  0.01  0.01
NNS     0.89  0.03  0.04  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00
MLS     0.57  0.21  0.08  0.04  0.02  0.01  0.01  0.00  0.00  0.00  0.00
RBSets  0.29  0.24  0.16  0.09  0.06  0.03  0.02  0.01  0.03  0.02  0.01

Table 8
Distribution across all queries of the number of hits in the top 10 pairs, as ranked by each algorithm. The more skewed to the right, the better.

As discussed in Section 3.2, our model can be interpreted as a type of block-model (Kemp et al., 2006; Xu et al., 2006; Airoldi et al., 2008) with observable features. Link indicators are independent given the object features, which might not actually be the case for particular choices of feature space. In theory, block-models sidestep this issue by learning all the necessary latent features that account for link dependence. An important future extension of our work would consist of tractably modeling the residual link association that is not accounted for by our observed features.

Discovering analogies is a specific task within the general problem of generating latent relationships from relational data. Some of the first formal methods for discovering latent relationships from multiple datasets were introduced in the literature of inductive logic programming, such as the inverse resolution method (Muggleton, 1981). A more recent probabilistic method is discussed by Kok and Domingos (2007). Dzeroski and Lavrac (2001) and Getoor and Taskar (2007) provide an overview of relational learning methods from a data mining and machine learning perspective.

A particularly active subfield on latent relationship generation lies within text analysis research. For instance, Stephens et al. (2001) describe an approach for discovering relations between genes given MEDLINE abstracts. In the context of information retrieval, Cafarella et al. (2006) describe an application of recent unsupervised information extraction methods: relations generated from unstructured text documents are used as a pre-processing step to build an index of webpages. In analogical reasoning applications, our method has been used by others for question-answering analysis (Wang et al., 2009).

The idea of measuring the similarity of two data points based on a predictive function has appeared in the literature on matching for causal inference. Suppose we are given a model for predicting an outcome Y given a treatment Z and a set of potential confounders X. For simplicity, assume Z ∈ {0, 1}. The goal of matching is to find, for each data point (Xi, Yi, Zi), the closest match (Xj, Yj, Zj) according to the confounding variables X. In principle, any clustering criterion could be used in this task (Gelman and Hill, 2007). The propensity score criterion (Rosenbaum, 2002) measures the similarity of two feature vectors Xi and Xj by comparing the predictions P(Zi = 1 | Xi) and P(Zj = 1 | Xj). If the conditional P(Z = 1 | X) is given by a logistic regression model with parameter vector θ, Gelman and Hill (2007) suggest measuring the difference between Xi^T θ and Xj^T θ. While this is not the same as comparing two predictive functions as in our framework, the core idea of using predictive functions to define similarity remains.
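The propensity-score criterion above can be sketched in a few lines. This is a rough illustration, not the authors' implementation: in practice the logistic coefficients θ would be fitted by maximum likelihood, whereas here they are fixed by hand, and the feature vectors are invented.

```python
import math

def propensity(x, theta):
    """P(Z = 1 | X = x) under a logistic regression with parameters theta."""
    linear = sum(xi * ti for xi, ti in zip(x, theta))
    return 1.0 / (1.0 + math.exp(-linear))

def linear_predictor_distance(xi, xj, theta):
    """Gelman and Hill's suggestion: compare x_i^T theta with x_j^T theta."""
    di = sum(a * t for a, t in zip(xi, theta))
    dj = sum(a * t for a, t in zip(xj, theta))
    return abs(di - dj)

# Hand-picked coefficients and units (illustrative, not fitted):
theta = [0.8, -0.5]
xi, xj = [1.0, 2.0], [1.5, 1.2]

# Two units are good matches when their linear predictors are close,
# i.e., when they are (nearly) equally likely to receive the treatment.
print(round(linear_predictor_distance(xi, xj, theta), 6))  # → 0.8
print(abs(propensity(xi, theta) - propensity(xj, theta)))
```

Comparing the scalar summaries Xi^T θ and Xj^T θ, rather than the full predictive functions, is exactly the simplification the paragraph contrasts with the paper's framework.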

A preliminary version of this paper appeared in the proceedings of the 11th International Conference on Artificial Intelligence and Statistics (Silva et al., 2007).

7. Conclusion. We have presented a framework for performing analogical reasoning within a Bayesian data analysis formulation. There is of course much more to analogical reasoning than calculating the similarity of related pairs. As future work, we will consider hierarchical models that could in principle compare relational structures (such as protein complexes) of different sizes. In particular, the literature on graph kernels (Borgwardt, 2007) could provide insights on developing efficient similarity metrics within our probabilistic framework.

Also, we would like to combine the properties of the mixed-membership stochastic block model of Airoldi et al. (2008), where objects are clustered into multiple roles according to the relationship matrix LAB, with our framework where relationship indicators are conditionally independent given observed features.

Finally, we would like to consider the case where multiple relationship matrices are available, allowing for the comparison of relational structures with multiple types of objects.

Much remains to be done to create a complete analogical reasoning system, but the described approach has immediate applications to information retrieval and exploratory data analysis.

Acknowledgements. We would like to thank the anonymous reviewers and the editor for several suggestions that improved the presentation of this paper, and for additional relevant references.

References.

E. M. Airoldi. Getting started in probabilistic graphical models. PLoS Computational Biology, 3(12):e252, 2007.
E. M. Airoldi, D. M. Blei, E. P. Xing, and S. E. Fienberg. A latent mixed-membership model for relational data. In Workshop on Link Discovery: Issues, Approaches and Applications, in conjunction with the 11th International ACM SIGKDD Conference, 2005.
E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.
M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics, 25(1):25–29, 2000.
E. Banks, E. Nabieva, R. Peterson, and M. Singh. NetGrep: fast network schema searches in interactomes. Genome Biology, 24:1473–1480, 2008.
A. Bernard, D. S. Vaughn, and A. J. Hartemink. Reconstructing the topology of protein complexes. In T. Speed and H. Huang, editors, Research in Computational Molecular Biology 2007 (RECOMB07), volume 4453 of Lecture Notes in Bioinformatics, pages 32–46. Springer, 2007.
K. Borgwardt. Graph Kernels. PhD thesis, Ludwig-Maximilians-University Munich, 2007.
D. Botstein, S. A. Chervitz, and J. M. Cherry. Yeast as a model organism. Science, 277(5330):1259–1260, 1997.
B. J. Breitkreutz, C. Stark, and M. Tyers. The GRID: the General Repository for Interaction Datasets. Genome Biology, 4(R23), 2003.
R. B. Brem, J. D. Storey, J. Whittle, and L. Kruglyak. Genetic interactions between polymorphisms that affect gene expression in yeast. Nature, 436(7051):701–703, 2005.
M. Cafarella, M. Banko, and O. Etzioni. Relational web search. University of Washington, Department of Computer Science and Engineering, Technical Report 2006-04-02, 2006.
J. M. Cherry, C. Ball, S. Weng, G. Juvik, R. Schmidt, C. Adler, B. Dunn, S. Dwight, L. Riles, R. K. Mortimer, and D. Botstein. Genetic and physical maps of Saccharomyces cerevisiae. Nature, 387(6632 Suppl.):67–73, 1997.
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of AAAI'98, pages 509–516, 1998.
S. Dzeroski and N. Lavrac. Relational Data Mining. Springer, 2001.
S. Fields and O. Song. A novel genetic system to detect protein-protein interactions. Nature, 340(6230):245–246, 1989.
S. E. Fienberg, M. M. Meyer, and S. Wasserman. Statistical analysis of multiple sociometric relations. Journal of the American Statistical Association, 80:51–67, 1985.
R. French. The computational modeling of analogy-making. Trends in Cognitive Sciences, 6:200–205, 2002.
A. P. Gasch, P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein, and P. O. Brown. Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell, 11:4241–4257, 2000.
A.-C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A.-M. Michon, C.-M. Cruciat, M. Remor, C. Hofert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M.-A. Heurtier, R. R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, and G. Superti-Furga. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415:141–147, 2002.
A.-C. Gavin, P. Aloy, P. Grandi, R. Krause, M. Boesche, M. Marzioch, C. Rau, L. J. Jensen, S. Bastuck, B. Dumpelfeld, A. Edelmann, M. Heurtier, V. Hoffman, C. Hoefert, K. Klein, M. Hudak, A. Michon, M. Schelder, M. Schirle, M. Remor, T. Rudi, S. Hooper, A. Bauer, T. Bouwmeester, G. Casari, G. Drewes, G. Neubauer, J. M. Rick, B. Kuster, P. Bork, R. B. Russell, and G. Superti-Furga. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440:631–636, 2006.
A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
D. Gentner. Structure-mapping: a theoretical framework for analogy. Cognitive Science, 7:155–170, 1983.
D. Gentner and J. Medina. Similarity and the development of rules. Cognition, 65(2-3):263–297, 1998.
L. Getoor and B. Taskar. Introduction to Statistical Relational Learning. MIT Press, 2007.
L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3:679–707, 2002.
Z. Ghahramani and K. A. Heller. Bayesian sets. Advances in Neural Information Processing Systems, 18:435–442, 2005.
C. T. Harbison, D. B. Gordon, T. I. Lee, N. J. Rinaldi, K. D. Macisaac, T. W. Danford, N. M. Hannett, J. B. Tagne, D. B. Reynolds, J. Yoo, E. G. Jennings, J. Zeitlinger, D. K. Pokholok, M. Kellis, P. A. Rolfe, K. T. Takusagawa, E. S. Lander, D. K. Gifford, E. Fraenkel, and R. A. Young. Transcriptional regulatory code of a eukaryotic genome. Nature, 431:99–104, 2004.
Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S.-L. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A. R. Willems, H. Sassi, P. A. Nielsen, K. J. Rasmussen, J. R. Andersen, L. E. Johansen, L. H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B. D. Sørensen, J. Matthiesen, R. C. Hendrickson, F. Gleeson, T. Pawson, M. F. Moran, D. Durocher, M. Mann, C. W. V. Hogue, D. Figeys, and M. Tyers. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415:180–183, 2002.
P. D. Hoff. Modeling homophily and stochastic equivalence in symmetric relational data. In Advances in Neural Information Processing Systems, volume 20, pages 657–664, 2008.
P. W. Holland and S. Leinhardt. Local structure in social networks. In D. Heise, editor, Sociological Methodology, pages 1–45. Jossey-Bass, 1975.
W. K. Huh, J. V. Falvo, L. C. Gerke, A. S. Carroll, R. W. Howson, J. S. Weissman, and E. K. O'Shea. Global analysis of protein localization in budding yeast. Nature, 425:686–691, 2003.
T. Ito, K. Tashiro, S. Muta, R. Ozawa, T. Chiba, M. Nishizawa, K. Yamamoto, S. Kuhara, and Y. Sakaki. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proceedings of the National Academy of Sciences, 97(3):1143–1147, 2000.
T. Jaakkola and M. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, 2000.
D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational learning. 19th International Conference on Machine Learning, 2002.
L. J. Jensen and P. Bork. Biochemistry: Not comparable, but complementary. Science, 322(5898):56–57, 2008.
M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.
M. Kanehisa and S. Goto. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28:27–30, 2000.
R. Kass and A. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.
C. Kemp, J. Tenenbaum, T. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. Proceedings of AAAI'06, 2006.
S. Kok and P. Domingos. Statistical predicate invention. 24th International Conference on Machine Learning, 2007.
N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, J. Li, S. Pu, N. Datta, A. P. Tikuisis, T. Punna, J. M. Peregrin-Alvarez, M. Shales, X. Zhang, M. Davey, M. D. Robinson, A. Paccanaro, J. E. Bray, A. Sheung, B. Beattie, D. P. Richards, V. Canadien, A. Lalev, F. Mena, P. Wong, A. Starostine, M. M. Canete, J. Vlasblom, S. Wu, C. Orsi, S. R. Collins, S. Chandran, R. Haw, J. J. Rilstone, K. Gandi, N. J. Thompson, G. Musso, P. St Onge, S. Ghanny, M. H. Y. Lam, G. Butland, A. M. Altaf-Ul, S. Kanaya, A. Shilatifard, E. O'Shea, J. S. Weissman, C. J. Ingles, T. R. Hughes, J. Parkinson, M. Gerstein, S. J. Wodak, A. Emili, and J. F. Greenblatt. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature, 440(7084):637–643, 2006.
F. Lorrain and H. C. White. Structural equivalence of individuals in social networks. Journal of Mathematical Sociology, 1:49–80, 1971.
C. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Z. Marx, I. Dagan, J. Buhmann, and E. Shamir. Coupled clustering: a method for detecting structural correspondence. Journal of Machine Learning Research, 3:747–780, 2002.
R. Memisevic and G. Hinton. Multiple relational embedding. Advances in Neural Information Processing Systems, 18, 2005.
H. Mewes et al. MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Research, 32, 2004.
S. Muggleton. Inverting the resolution principle. Machine Intelligence, 12:93–104, 1981.
C. Myers, D. Robson, A. Wible, M. Hibbs, C. Chiriac, C. Theesfeld, K. Dolinski, and O. Troyanskaya. Discovery of biological networks from diverse functional genomic data. Genome Biology, 6, 2005.
C. L. Myers, D. A. Barret, M. A. Hibbs, C. Huttenhower, and O. G. Troyanskaya. Finding function: An evaluation framework for functional genomics. BMC Genomics, 7(187), 2006.
E. Nabieva, K. Jim, A. Agarwal, B. Chazelle, and M. Singh. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21(Suppl. 1):i302–i310, 2005.
K. Nowicki and T. A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96:1077–1087, 2001.
A. Popescul and L. H. Ungar. Structural logistic regression for link analysis. Multi-Relational Data Mining Workshop at KDD-2003, 2003.
M. Primig, R. M. Williams, E. A. Winzeler, G. G. Tevzadze, A. R. Conway, S. Y. Hwang, R. W. Davis, and R. E. Esposito. The core meiotic transcriptome in budding yeasts. Nature Genetics, 26(4):415–423, 2000.
Y. Qi, Z. Bar-Joseph, and J. Klein-Seetharaman. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Structure, Function, and Bioinformatics, 63:490–500, 2006.
T. Reguly, A. Breitkreutz, L. Boucher, B.-J. Breitkreutz, G. Hon, C. Myers, A. Parsons, H. Friesen, R. Oughtred, A. Tong, C. Stark, Y. Ho, D. Botstein, B. Andrews, C. Boone, O. Troyanskya, T. Ideker, K. Dolinski, N. Batada, and M. Tyers. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. Journal of Biology, 5(4):11, 2006.
P. Rosenbaum. Observational Studies. Springer-Verlag, 2002.
D. Rumelhart and A. Abrahamson. A model for analogical reasoning. Cognitive Psychology, 5:1–28, 1973.
SGD. Saccharomyces Genome Database. URL ftp://ftp.yeastgenome.org/yeast/.
R. Silva, K. A. Heller, and Z. Ghahramani. Analogical reasoning with relational Bayesian sets. 11th International Conference on Artificial Intelligence and Statistics, AISTATS, 2007.
M. Stephens, M. Palakal, S. Mukhopadhyay, R. Raje, and J. Mostafa. Detecting gene relations from MEDLINE abstracts. Proceedings of the Sixth Annual Pacific Symposium on Biocomputing, pages 483–496, 2001.
K. Tarassov, V. Messier, C. R. Landry, S. Radinovic, M. M. Serna Molina, I. Shames, Y. Malitskaya, J. Vogel, H. Bussey, and S. W. Michnick. An in vivo map of the yeast protein interactome. Science, 320(5882):1465–1470, 2008.
J. Tenenbaum and T. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–641, 2001.
TRANSFAC. Transcription factor database. URL http://www.gene-regulation.com/.
P. Turney. The latent relation mapping engine: Algorithm and experiments. Journal of Artificial Intelligence Research, 33:615–655, 2008a.
P. Turney. A uniform approach to analogies, synonyms, antonyms, and associations. Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08), pages 905–912, 2008b.
P. Turney and M. Littman. Corpus-based learning of analogies and semantic relations. Machine Learning, 60:251–278, 2005.
P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. Yang, M. Johnston, S. Fields, and J. M. Rothberg. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403(6770):623–627, 2000.
M. Veloso and J. Carbonell. Derivational analogy in PRODIGY: Automating case acquisition, storage and utilization. Machine Learning, 10, 1993.
C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417(6887):399–403, 2002.
X.-J. Wang, X. Tu, D. Feng, and L. Zhang. Ranking community answers by modeling question-answer relationships via analogical reasoning. Proceedings of the 32nd Annual ACM SIGIR Conference on Research & Development on Information Retrieval, 2009.
Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Uncertainty in Artificial Intelligence, 2006.
H. Yu, P. Braun, M. A. Yildirim, I. Lemmens, K. Venkatesan, J. Sahalie, T. Hirozane-Kishikawa, F. Gebreab, N. Li, N. Simonis, T. Hao, J.-F. Rual, A. Dricot, A. Vazquez, R. R. Murray, C. Simon, L. Tardivo, S. Tam, N. Svrzikapa, C. Fan, A.-S. de Smet, A. Motyl, M. E. Hudson, J. Park, X. Xin, M. E. Cusick, T. Moore, C. Boone, M. Snyder, F. P. Roth, A.-L. Barabasi, J. Tavernier, D. E. Hill, and M. Vidal. High-quality binary protein interaction map of the yeast interactome network. Science, 322(5898):104–110, 2008.
G. Yvert, R. B. Brem, J. Whittle, J. M. Akey, E. Foss, E. N. Smith, R. Mackelprang, and L. Kruglyak. Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nature Genetics, 35(1):57–64, 2003.
J. Zhu and M. Q. Zhang. SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics, 15:607–611, 1999.

Gower Street
University College London
London, WC1E 6BT, UK
E-mail: [email protected]

Trumpington Street
University of Cambridge
Cambridge, CB2 1PZ, UK
E-mail: [email protected]
        [email protected]

1 Oxford Street
Harvard University
Cambridge, MA 02138, USA
E-mail: [email protected]