
Latent Factor Models with Additive and Hierarchically-smoothed User Preferences

Amr Ahmed*
Google Inc.
[email protected]

Bhargav Kanagal*
Google Inc.
[email protected]

Sandeep Pandey*
Twitter
[email protected]

Vanja Josifovski*
Google Inc.
[email protected]

Lluis Garcia Pueyo*
Google Inc.
[email protected]

Jeff Yuan
Yahoo! Research
[email protected]

ABSTRACT
Items in recommender systems are usually associated with annotated attributes, such as brand and price for products, or agency for news articles. These attributes are highly informative and must be exploited for accurate recommendation. While learning a user preference model over these attributes can result in an interpretable recommender system and can handle the cold-start problem, it suffers from two major drawbacks: data sparsity and the inability to model random effects. On the other hand, latent-factor collaborative filtering models have shown great promise in recommender systems; however, their performance on rare items is poor. In this paper we propose a novel model, LFUM, which provides the advantages of both of the above models. We learn user preferences (over the attributes) using a personalized Bayesian hierarchical model that uses a combination (additive model) of a globally learned preference model along with user-specific preferences. To combat data sparsity, we smooth these preferences over the item taxonomy using an efficient forward-filtering and backward-smoothing inference algorithm. Our inference algorithms can handle both discrete attributes (e.g., item brands) and continuous attributes (e.g., item prices). We combine the user preferences with the latent-factor models and train the resulting collaborative filtering system end-to-end using the successful BPR ranking algorithm. In our extensive experimental analysis, we show that our proposed model outperforms several commonly used baselines, and we carry out an ablation study showing the benefits of each component of our model.

Categories and Subject Descriptors
G.3 [Probability And Statistics]: Statistical Computing; I.2.6 [Computing Methodologies]: Artificial Intelligence-Learning

General Terms
Algorithms, Experimentation, Performance

∗The work was performed at Yahoo! Research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WSDM'13, February 4-8, 2013, Rome, Italy.
Copyright 2013 ACM 978-1-4503-1869-3/13/02 ...$15.00.

Keywords
Recommendation, Latent Variable Models, Collaborative Filtering, Inference, Factor Models

1. INTRODUCTION
Personalized recommender systems are ubiquitous, with applications in computational advertising, content suggestion, e-commerce, and search engines. These systems use the past behavior of users to recommend items that are likely to be of interest. Significant advances have been made in recent years to improve the accuracy of such personalized recommendation systems and to scale the algorithms to large amounts of data.

One of the most extensively used techniques in these recommendation systems is the latent factor model and its variants [2, 13, 14, 16] (see Koren et al. [15] for an excellent survey). The key idea behind latent factor models is to project the users and items into a lower-dimensional space (such lower-dimensional projections are called factors), thereby clustering similar users and items. Subsequently, the interest (similarity) of a user in an unrated item is measured, and the most similar item(s) are recommended to the user. While factor models have been very successful, with the top performers in the Netflix contest [5, 13] and tag recommendation [19] being some of the success stories, these models have certain shortcomings. Most of these techniques do not take into account the auxiliary information that is typically associated with items. In most domains, the items being recommended have a rich set of attributes which are highly indicative of users' preferences, i.e., they determine the users' intent to purchase the item. For instance, in the movie recommendation application, there is information about the director, lead actor, duration of the movie, and time of release, all of which play a significant role in the user selecting the movie. Movies that have reputed directors are usually chosen despite lack of user interest as modeled by a latent factor model. In retail product recommendation, we have attributes such as the brand and price that determine a user's interest in the product. The other disadvantage of this approach is sparsity, commonly referred to as the cold-start problem.

In order to leverage such auxiliary information, there have been a number of efforts to build content-based recommender systems [6]. While content-based approaches work well in the presence of cold start, they are outperformed by latent factor models in warm-start scenarios, i.e., when we have enough training data. Hybrid recommendation models [9, 8, 6] that combine collaborative and content information have been proposed to overcome this issue. However, as we will show in our experiments, these hybrid methods are expensive to train and do not perform well when the


Figure 1: Illustrating the role of the hierarchy and the additive model in learning user preferences. Each internal node gives the user's preferred mean price as an addition over the global mean price at the same node. The user bought the items shaded at the leaves of the tree, and our model inferred the number inside each internal node of the user's tree. See Section 3.1 for more details.

attributes of the items live in a very high-dimensional space. Furthermore, it is not easy to augment these hybrid models with additional background information, such as a taxonomy over items.

In this paper we propose LFUM (Latent Factor augmented with User preferences Model), which combines latent factor models with user preference models to improve personalization in recommendation. A key feature of our model is that we capture user preferences over each of the item attributes using an addition of a global preference model and a per-user personalized preference model. We smooth these user preferences over a taxonomy of the items to alleviate the aforementioned sparsity problem. Taxonomies are commonly available for many domains [1, 12]. For example, consider the price attribute: suppose that Alice bought an expensive laptop. In this case, our model can learn that Alice buys electronics items at a price $200 more than the global average. The preferences in our model are learned using a hierarchical additive model. We develop an efficient forward-filtering, backward-smoothing algorithm over the taxonomy. Continuing with the above example, if Alice buys expensive electronics, that increases the likelihood that she would purchase an expensive phone (even though she has not purchased any phone). We support both discrete-valued (e.g., brand) and continuous-valued (e.g., price) attributes. For continuous variables, we extend the well-known Kalman filtering/smoothing [11] algorithm to multiple interacting hierarchies using an additive model. Furthermore, we combine these user preferences with a latent factor model and optimize an efficient lower bound on the joint objective function using the Bayesian personalized ranking algorithm of Rendle et al. [18]. We apply LFUM to the real-world problem of smart ad selection for display advertising and demonstrate significant improvements in recommendation accuracy using our approaches.

Outline: The rest of this paper is organized as follows. First, in Section 2, we establish notation, formalize the problem, and review relevant background. In Section 3 we describe our model in detail. In Section 4 we give an efficient inference algorithm, and in Sections 5-6 we present a comprehensive evaluation of our model against several baselines. Section 7 covers related work, and we conclude in Section 8.

2. PRELIMINARIES
In this section we provide background for the various concepts in the paper. We begin with the notation used in the paper and then explain commonly used recommender systems from the literature, namely latent factor models and hybrid recommender systems.

2.1 Notations
Let I be a set of items and U be a set of users. For each u ∈ U, we let Iu denote the set of items the user interacted with (e.g., purchased before). Each item I ∈ I is described by a set of continuous (such as price) and discrete (such as brand) attributes. We let I.x denote a continuous attribute of I, where x ∈ X and X is the set of continuous attributes. Similarly, we let I.y denote a discrete attribute of I for y ∈ Y, where Y is the set of discrete attributes. In addition, each item is associated with a category. We use π(I) to denote the category of item I. The categories themselves are arranged in a tree-like taxonomy (or hierarchy). For a given node in the tree, say n, we abuse notation a bit and let π(n) denote the parent of node n in the tree, and In denote the children of node n. If n is a leaf category, then In is a set of items; if n is an internal node, then In is a set of subcategories. We let L denote the depth of the tree, where the root is at level 0 and the leaf categories are at level L − 1. We also use T(n) to denote the subtree rooted at node n. Our goal is to learn user preferences over attributes for each category in the tree and, subsequently, to use these preferences to augment a factor model, thereby improving the recommendation of items to users.
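The notation above can be made concrete with a small sketch (our own illustrative code, not the paper's): each node n stores its parent π(n), its children In, and its level, with the root at level 0.

```python
# Illustrative sketch (ours, not the paper's) of the Section 2.1 notation:
# each node n stores its parent pi(n), its children I_n, and its level,
# with the root at level 0 and leaf categories at level L - 1.

class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent            # pi(n); None for the root
        self.children = []              # I_n: subcategories, or items at a leaf
        self.level = 0 if parent is None else parent.level + 1
        if parent is not None:
            parent.children.append(self)

root = Node("products")                      # level 0
electronics = Node("electronics", root)      # level 1
phones = Node("smart phones", electronics)   # leaf category at level L - 1 = 2
```

Walking `parent` pointers upward recovers π(·) chains, and the subtree T(n) is reachable through `children`.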

2.2 Latent Factor Models
The input to a recommender system is a sparse (partially populated) user-item matrix whose entries correspond to interactions between users and items, either ratings or purchases. The goal of the recommender system is to predict, for each user u, a ranked list of items. We assume that each user u and item i can be represented by latent factors vu and vi respectively, which are vectors of size 1 × K. User u's affinity/interest in item i (denoted by zui) is assumed to follow this model:

zui = 〈vu,vi〉

The learning problem here is to determine the best values for vu and vi (for all users u and all items i) based on the given rating matrix; we denote these parameters by Γ. While traditional approaches to matrix factorization try to regress over the known entries of the matrix, a more successful approach is the recently proposed Bayesian personalized ranking (BPR) [18]. Here, the trick is to regress directly over the ranks of the items rather than the actual ratings, since the goal is to construct ranked lists. Also, we only have implicit feedback from the users (i.e., we will not have a rating between 1 and 5, but only know that the user made a purchase). In this case, regression over the actual number of purchases is not meaningful. The main objective in BPR is to discriminate between items bought by the user and those that were not bought. In other words, we need to learn a ranking function Ru for each user u that ranks u's interesting items higher than the non-interesting items: if item i appears in user u's purchase list Bu and item j does not appear in Bu, then we must have Ru(i) > Ru(j). For this, we need zui > zuj. Based on the above arguments, our likelihood function p(Ru|Γ) is:

p(Ru|Γ) = ∏_{u∈U} ∏_{i∈Bu} ∏_{j∉Bu} σ(zui − zuj)

Following Rendle et al. [18], we approximate the non-smooth, non-differentiable expression zui > zuj using the logistic sigmoid function σ(zui − zuj), where σ(z) = 1/(1 + e^{−z}). We use a Gaussian prior N(0, σ)


over all the factors in Γ and compute the MAP (maximum a pos-teriori) estimate of Γ. The posterior over Γ (which needs to bemaximized) is given by:

p(Γ|Ru) ∝ p(Γ) p(Ru|Γ) = p(Γ) ∏_{u∈U} ∏_{i∈Bu} ∏_{j∉Bu} σ(zui − zuj)
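In practice, this MAP estimate is found by stochastic gradient ascent over sampled triples (u, i, j) with i ∈ Bu and j ∉ Bu. A minimal sketch of one such update (our own illustration; the learning rate and regularizer values are made up, not from the paper):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bpr_step(v_u, v_i, v_j, lr=0.05, reg=0.01):
    """One stochastic gradient ascent step on log sigma(z_ui - z_uj)
    minus an L2 penalty, for a sampled triple (u, i, j) where the user
    bought item i but not item j. Returns z_ui - z_uj before the update."""
    z = sum(u * (i - j) for u, i, j in zip(v_u, v_i, v_j))  # z_ui - z_uj
    g = 1.0 - sigmoid(z)  # derivative of log sigma(z) with respect to z
    for k in range(len(v_u)):
        du = g * (v_i[k] - v_j[k])                   # gradient for v_u[k]
        v_i[k] += lr * (g * v_u[k] - reg * v_i[k])   # push bought item up
        v_j[k] += lr * (-g * v_u[k] - reg * v_j[k])  # push unbought item down
        v_u[k] += lr * (du - reg * v_u[k])
    return z
```

Repeating this step over sampled triples drives σ(zui − zuj) towards 1, i.e., the bought item is ranked above the unbought one.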

2.3 Hybrid Recommender Systems
As described in Section 1, hybrid recommender systems [6] combine latent factor models with content-based approaches based on user and item features. Most such approaches work by first fitting a regression function over the item features against the target feedback and subsequently modeling the residual with a latent factor model. We can represent hybrid recommender systems in our context using the following affinity expression:

zij = ⟨vu, vi⟩ + ∑_{x∈X} Wx i.x + ∑_{y∈Y} Wy i.y

The first term in the expression corresponds to the traditional latent factor model. In the second term, Wx denotes the weight for each attribute in the continuous attribute set X (note that we abuse notation slightly: i.x corresponds to the value of attribute x of item i). Similarly, the third term corresponds to the weights learned over the discrete attributes. In practice, we can extend hybrid recommender systems to learn both the weights and the user/item factors simultaneously. However, such an approach is problematic in two respects. First, the weights are learned over all the users, so we cannot model user personalization; furthermore, we cannot learn per-user weights owing to the sparsity of the data. In addition, it is unclear how to learn weights for the continuous attributes without discretizing them and thereby introducing additional noise. In the next section we introduce LFUM, which addresses these problems in a principled way.
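The hybrid affinity above can be sketched as follows (a toy illustration with hypothetical weight tables, not the paper's implementation); discrete attributes contribute through a per-value weight, i.e., a one-hot encoding:

```python
def hybrid_affinity(v_u, v_i, item, Wx, Wy):
    """z_ij = <v_u, v_i> + sum_x Wx[x] * item[x] + sum_y Wy[y][item[y]].
    `Wx` maps a continuous attribute name to a scalar weight; `Wy` maps a
    discrete attribute name to a per-value weight table (one-hot encoding)."""
    z = sum(a * b for a, b in zip(v_u, v_i))  # latent factor term
    for x, w in Wx.items():                   # continuous attributes
        z += w * item[x]
    for y, w in Wy.items():                   # discrete attributes
        z += w.get(item[y], 0.0)
    return z
```

Note that `Wx` and `Wy` are shared across all users, which is exactly the lack of personalization criticized above.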

3. AUGMENTED LATENT FACTOR MODELS: LFUM
This section describes our proposed approach, a Latent Factor model augmented with a User preference Model (LFUM). We begin by describing the intuition behind the model and then give a formal description. Our model is a hybrid model that combines the observed item attributes with a latent factor model. However, our model is novel in that it does not learn a regression function over item attributes but rather learns a user-specific probability distribution over item attributes. Let us define qx(i.x|u) to be the probability that user u likes item i based on item i's value of attribute x. We abuse notation slightly and write qux(i.x) to denote qx(i.x|u). One could imagine combining these probabilities for each attribute and learning a linear function that predicts the user's affinity to each item. However, the observed attributes might not be enough to explain why the user would buy item i instead of item j. For example, the user might prefer item i over item j based on an attribute that is not observed (for instance, the shape or the color of the item). Therefore, we define P(Ru|Γ, q) to denote the ranking probabilities for user u. As we made explicit in this probability, the ranking depends on q and is parameterized by a set of parameters Γ to be defined later. Our goal thus is to solve the following optimization problem:

maxΓ ∏_u P(Ru|Γ) = maxΓ ∫_q P(q) ∏_u P(Ru|Γ, q) dq    (1)

where q = {qx(·)}x∈X ∪ {qy(·)}y∈Y is the product probability over all item attributes. In essence, Eq. (1) treats q as a hidden variable and integrates over it to define the marginal observed rankings for all users. There are several advantages to treating q as a hidden variable: (1) it makes it easier to combine continuous and discrete attributes (with varying cardinalities), as we map each attribute's contribution into a number in [0, 1]; (2) treating q as a probability distribution opens the door to several hierarchical smoothing techniques that combat data sparsity and allow us to introduce regularization in a principled way. To fully define our model we need to answer the following three questions:

1. How do we define q for both discrete and continuous attributes?
2. How do we define P(Ru|Γ, q)?
3. How do we optimize the seemingly intractable objective function in Eq. (1)?

In the following two subsections, we answer the first two questions; in Section 4, we tackle the third.

3.1 User Preference Models

3.1.1 The Intuition: Background Subtraction
Before delving into the technical details of the generative process, we begin with an overview of our model. Let us assume that items represent products, and that we have only one continuous attribute (the price of the item) and one discrete attribute (the brand of the item). Moreover, let the nodes in the taxonomy represent product categories. Let us first consider preferences over item prices. Given the items purchased in the dataset, one could postulate that there is a global (across-users) average price for each category; this is not the average price of items under the category, but rather the average price of the actual purchases of items under the category. For example, under the category smart phones, the average price of phones purchased by the users might be $200 even though the average price of the distinct phones might be much higher (or lower). However, this modeling assumption is rather inadequate, as it ignores each user's personalized preferences over the category. To remedy that, we hypothesize that a user's personal preferences can be modeled as an addition over the global preference. For example, a user might be more inclined to buy expensive smart phones, and as such her mean price over this category is above the category's global mean price. Similarly, another user might have a mean price that is much smaller than the global mean price, i.e., she prefers to buy the cheapest smart phones. Therefore, we can view the global price of a given category as the average across all users, and model each user's preference as an addition over it (either positive or negative).

However, there is a problem with this approach: it ignores the taxonomy. It is quite natural to assume that both the global and personalized preferences are smooth over the taxonomy, which has the advantage of combating data sparsity. To see this point, we refer the reader to Figure 1, which depicts the preferences of a given user over the prices of different categories. In this figure, we observed two purchases by this user under category c1, and we deduced from them that under category c1 the user's average price is above the global average price. Moreover, we observed a single purchase by this user under category c4 and concluded that the user prefers inexpensive items under this category (i.e., the user's average price is below the global average price). But what can we say about the user's preferences over categories c2 and c3? Intuitively, we expect c2 to follow the same pattern as c1 and c3 to follow the same pattern as c4. However, the exact deviations from the global mean prices depend on two factors: first, how confident


Figure 2: The graphical model; see the text for more details, Section 2 for notation, and Algorithm 1 for the full generative process.

we are about our estimates for c1 and c4, and second, the variance of prices over the tree (we formalize these notions in the generative process shortly). The main message of this figure is that, given a few observations about each user, we can still form a hypothesis about that user's preferences over all the categories in the tree. Informally, this is possible because we can pool data across users to create hypotheses about the global preferences and then use these hypotheses to draw conclusions about each user (a process called smoothing; see Section 4).

The same intuition also applies to discrete attributes such as the brand of the item. Given all purchases, one could form a global brand preference over each category and then use it as a prior over the preferences of each user. This process is known in the literature as smoothing with a background model. However, this process alone does not achieve smoothness over nearby nodes in the taxonomy. For example, suppose a user bought an iPhone, i.e., a smart phone with "Apple" as the brand. Since smart phone is a subcategory of "electronic gadgets", we could infer that this user prefers to buy other "Apple" products under other subcategories of "electronic gadgets", such as "laptops". However, if we do not observe any purchase from the user under the "electronic gadgets" category, we might shrink our estimate of the user's preferences over "electronic gadgets" to the global distribution. We will see that our model achieves this property using an additive model, as we detail below.

3.1.2 The Generative Model
Armed with the above intuitions, we are now ready to detail the generative process with reference to Figure 2 and Algorithm 1. For a continuous attribute x, we let µgx,n denote the global mean value of this attribute at node n in the taxonomy. We connect these means over the tree using a Gaussian-cascade process, where each node's value acts as the mean of a normal distribution that is used to generate the children of node n. Put another way, the mean of node n is generated from a normal distribution centered at the mean value of its parent in the tree. The variance of this normal distribution depends on the level of node n, and we denote it κLevel(n). We expect the variance to be high for levels close to the root and small for levels close to the leaf categories. Similarly, for each user, we let µux,n denote user u's deviation at node n from the global mean at the same node. Again, we connect these user-specific deviations over the tree using a Gaussian-cascade process with the same variances used in the global process; we use the same variances since estimating a user-specific variance is prone to over-fitting.

Algorithm 1 Description of the generative model
Require: set of users U, set of items I
1:  Generate global preferences:
2:  for all x ∈ X // continuous attributes do
3:    for all categories n, starting from the root downward do
4:      Draw µgx,n ∼ N(µgx,π(n), κLevel(n))
5:    end for
6:  end for
7:  for all y ∈ Y // discrete attributes do
8:    for all categories n, starting from the root downward do
9:      Draw θgy,n ∼ Dir(γLevel(n) θgy,π(n))
10:   end for
11: end for
12: Generate user preferences:
13: for all u ∈ U do
14:   for all x ∈ X // continuous attributes do
15:     for all categories n, starting from the root downward do
16:       Draw µux,n ∼ N(µux,π(n), κLevel(n))
17:     end for
18:   end for
19:   for all y ∈ Y // discrete attributes do
20:     for all categories n, starting from the root downward do
21:       Draw θuy,n ∼ Dir(γLevel(n) θuy,π(n))
22:     end for
23:   end for
24:   Generate attributes of items the user interacted with:
25:   for all i ∈ Iu do
26:     for all x ∈ X do
27:       i.x ∼ N(µgx,π(i) + µux,π(i), κL)
28:     end for
29:     for all y ∈ Y do
30:       i.y ∼ Multinomial(θuy,π(i) + γL θgy,π(i))
31:     end for
32:   end for
33: end for

Moreover, those variances act as priors, and during inference we compute a posterior value for each user based on that user's observations (see Section 4.1). Finally, for each user, we observe a set of purchased items under each leaf category (depicted as shaded nodes in Figure 2 and as Iun in the generative process, i.e., the items purchased by


the user under leaf category n). The final step in the generative process is to generate the attributes of those items from a normal distribution centered at the sum of the global mean and the user-specific deviation under each category (step 27). The variance of this distribution is κL, where L is the depth of the tree (recall that the leaf categories, which are the parents of the items, appear at level L − 1 in the tree). We should note here that the item attributes are not user specific; however, we still generate them for each user, each time from that user's own distribution. This is the key step that allows the model to connect the global means with each user's local means, as we will see during posterior inference. Put another way, the item attributes act as constraints that force the model, during inference, to factor the observed prices into global and user-specific components.
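The two Gaussian cascades (steps 2-6 and 14-18 of Algorithm 1) can be sketched as below. This is our own toy illustration: the taxonomy, the per-level standard deviations, and the treatment of the root are made up for the example.

```python
import random

def sample_cascade(tree, mu_root, kappa):
    """Sample a value for every node top-down: the root is drawn around
    mu_root with std kappa[0], and each child is drawn from a normal
    centered at its parent's value with the level-dependent std kappa[level]."""
    mu = {"root": random.gauss(mu_root, kappa[0])}
    stack = [("root", 1)]
    while stack:
        node, level = stack.pop()
        for child in tree.get(node, []):
            mu[child] = random.gauss(mu[node], kappa[level])
            stack.append((child, level + 1))
    return mu

random.seed(7)
tree = {"root": ["electronics", "clothing"], "electronics": ["phones", "laptops"]}
kappa = [50.0, 20.0, 5.0]                             # made-up per-level stds
g = sample_cascade(tree, mu_root=100.0, kappa=kappa)  # global means (steps 2-6)
u = sample_cascade(tree, mu_root=0.0, kappa=kappa)    # user deviations (14-18)
price_mean = g["phones"] + u["phones"]  # item price ~ N(g + u, kappa_L), step 27
```

Because each node is centered at its parent, nearby categories receive correlated means, which is exactly the smoothness over the taxonomy described above.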

We model the discrete attributes of the items in a similar fashion, but with a Dirichlet-Multinomial cascade instead (steps 7-11 for the global distributions and 19-23 for the user-specific distributions). Recall that the Dirichlet is the conjugate prior of the Multinomial distribution. Usually a Dirichlet distribution is specified as Dir(ρ) for a given parameter vector ρ; thus the mean and variance of the sampled multinomials depend on ρ. A more expressive way of specifying a Dirichlet distribution is as Dir(γθ), where γ controls the variance and θ is the mean: the higher the value of γ, the smaller the variance of the distribution. We use this representation to connect nodes over the taxonomy, as in the Gaussian case, with θ playing the role of µ and γ playing the role of κ (see steps 9 and 21). We expect γ to be small for nodes close to the root (to give high variance) and large for nodes close to the leaves (to give small variance). Finally, to generate the discrete attributes of items purchased by a user, we use an additive model that combines the global and local distributions of the item's category (step 30). In this case, we weight the global preference by a scalar γL, which specifies how much the user preference can deviate from the global distribution (this plays the same role as κL in step 27).

3.2 Item Ranking: P(R_u|Γ, q)

How can we use the learned user preferences to perform recommendation? In other words, how can we define the item preference ranking probabilities, P(q) ∏_u P(R_u|Γ, q)? Following the principle of BPR, we first let z_ui denote user u's affinity towards item i and define it as follows:

z_ui = Σ_{x∈X} α_x q^u_x(i.x) + Σ_{y∈Y} β_y q^u_y(i.y) + ε_ui    (2)

where α and β are attribute-specific weights that determine the importance of each attribute in computing the affinity. The preference probabilities are computed as follows:

q^u_x(i.x) = N(µ^g_{x,π(i)} + µ^u_{x,π(i)}, κ_L)    (3)

q^u_y(i.y) = Multinomial(θ^g_{y,π(i)} + γ_L θ^u_{y,π(i)})    (4)

The final component in computing z_ui is ε_ui, a random noise term that accounts for interactions not explained by the observed attributes (this is termed a random effects model in the statistics literature [10]). We model ε_ui using a latent factor model as follows:

ε_ui = ⟨v_u, v_i⟩    (5)

where ⟨·, ·⟩ denotes the dot product, v_u ∈ R^K is the user factor, and v_i ∈ R^K is the item factor. The parameters of this system are: Γ = {α_x}_{x∈X}, {β_y}_{y∈Y}, {v_u}_{u∈U} and {v_i}_{i∈I}.
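To make Eqs. 2 and 5 concrete, the affinity computation can be sketched as below. This is a minimal illustration, not the authors' implementation; the attribute names, stand-in preference functions, and factor values are hypothetical.

```python
import numpy as np

def affinity(user, item, alpha, beta, q_cont, q_disc, v_user, v_item):
    """Eq. 2: z_ui = sum_x alpha_x q^u_x(i.x) + sum_y beta_y q^u_y(i.y) + <v_u, v_i>."""
    z = 0.0
    for x, weight in alpha.items():           # continuous attributes, e.g. price
        z += weight * q_cont[x](user, item)   # Gaussian preference score (Eq. 3)
    for y, weight in beta.items():            # discrete attributes, e.g. brand
        z += weight * q_disc[y](user, item)   # multinomial preference score (Eq. 4)
    return z + float(np.dot(v_user, v_item))  # latent-factor residual eps_ui (Eq. 5)

# Hypothetical toy usage: one continuous and one discrete attribute.
alpha = {"price": 0.5}
beta = {"brand": 2.0}
q_cont = {"price": lambda u, i: 0.4}  # stand-in for the learned Gaussian score
q_disc = {"brand": lambda u, i: 0.1}  # stand-in for the smoothed brand probability
z = affinity("u1", "i1", alpha, beta, q_cont, q_disc,
             np.array([1.0, 0.0]), np.array([0.5, 0.0]))
```

Here z sums 0.5·0.4 + 2.0·0.1 plus the dot product 0.5, so the attribute preferences and the latent residual contribute additively to the affinity.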

4. PARAMETER LEARNING

We seek to optimize the following objective function:

max_Γ ∫_q P(q) ∏_u P(R_u|Γ, q) dq

However, this objective integrates over the space of all probability distributions, which makes the optimization intractable. To remedy this, we instead optimize the following variational lower bound:

max_Γ ∫_q P(q) ∏_u P(R_u|Γ, q) dq    (6)

≥ max_Γ ∏_u P(R_u|Γ, q̂)    (7)

where q̂ is the MAP estimate of q, i.e., the single distribution q that best explains the user preferences over the item attributes. The above lower bound gives rise to the following simple algorithm. First, learn the MAP estimate of q using the observed attributes of the items bought by each user; then use q̂ to learn the value of Γ that best explains the item rankings of each user, R_u. We tackle each of these problems in the following sections.

4.1 Posterior Inference of Preferences

Given a set of user-item interactions, the goal is to learn a posterior distribution over q, denoted q̂. This is equivalent to learning the posterior values of the global parameters µ^g, θ^g and the posterior values of the user-specific parameters µ^u, θ^u. We address the continuous case in Section 4.1.1 and the discrete case in Section 4.1.2.

4.1.1 The Continuous Case

Let I_u be the set of items purchased by a given user, and let I_g be the set of all items purchased by all users. Our goal is to compute P(µ^g_x|I_g) and P(µ^u_x|I_g, I_u) for all users u and all attributes x, where µ^g_x = (µ^g_0, µ^g_1, · · · , µ^g_N) and µ^u_x = (µ^u_0, µ^u_1, · · · , µ^u_N), and N is the total number of nodes in the taxonomy. We give below the inference algorithm for a given attribute x; the algorithm is repeated for each attribute separately.

The difficulty of this posterior computation is the dependency between µ^g_x and µ^u_x for all u ∈ U, which arises from the additive form of the observation model (step 27 in Algorithm 1). We therefore resort to an alternating algorithm, similar to loopy belief propagation, that computes each of these quantities conditioned on the other. Let us first assume that we know µ^g_x. Conditioned on µ^g_x, the inference problem over the users' means decouples into a set of |U| independent problems. For user u, we just need to compute P(µ^u_x|I_u, µ^g_x). Once we compute this quantity for all users, we alternate and compute P(µ^g_x|I_g, {µ^u_x}_{u∈U}). We repeat this alternating procedure until convergence, where convergence is assessed by measuring changes in the value of µ^g_x. The algorithm is summarized below:

1. Repeat

   (a) ∀u ∈ U: compute P(µ^u_x|I_u, µ^g_x).

   (b) Compute P(µ^g_x|I_g, {µ^u_x}_{u∈U}).

2. Until convergence.

We focus first on P(µ^u_x|I_u, µ^g_x). This inference problem is known in the literature as the multi-scale Kalman filter [11], albeit with shifted observations here. Recall that the value of attribute x of every item I in I_u is generated from a normal distribution with


mean µ^g_{x,π(I)} + µ^u_{x,π(I)}. Since we hold µ^g_x fixed, this is equivalent to generating the values I.x − µ^g_{x,π(I)} for all I ∈ I_u from the leaf categories of the user tree µ^u_{x,π(I)}, which is exactly the posterior inference problem addressed by the multi-scale KF algorithm [11].
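Before giving the filtering equations, the alternating loop itself can be sketched on a scalar toy problem. The update functions below are illustrative stand-ins, not the paper's actual conditional posteriors.

```python
def alternate(update_user, update_global, users, mu_g, tol=1e-6, max_iter=100):
    """Alternate between per-user and global updates until mu_g stabilizes."""
    mu_u = {}
    for _ in range(max_iter):
        mu_u = {u: update_user(u, mu_g) for u in users}  # step (a): decoupled per user
        new_g = update_global(mu_u)                      # step (b): global given users
        if abs(new_g - mu_g) < tol:                      # convergence check on mu_g
            break
        mu_g = new_g
    return mu_g, mu_u

# Toy stand-ins: each user's deviation is a shrunk residual from the global mean,
# and the global mean explains whatever the user deviations do not.
obs = {"u1": 2.0, "u2": 4.0}
mu_g, mu_u = alternate(lambda u, g: 0.5 * (obs[u] - g),
                       lambda dev: sum(obs[u] - dev[u] for u in obs) / len(obs),
                       obs, mu_g=0.0)
```

With these stand-ins the global mean converges geometrically to the average observation, and each user keeps a small personal deviation around it.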

The multi-scale KF algorithm [11] proceeds in two stages: an upward phase (from the leaves to the root) and a downward phase (from the root to the leaves). The upward phase is known as the filtering phase. In this phase we compute the posterior probability of each node conditioned on its immediate children. Once we reach the root of the tree, the root has received information from all nodes in the tree, and we move to the second, downward phase, known as the smoothing phase. In this phase, we propagate information down the tree to get the posterior mean of each node conditioned on the whole tree (not only its children). The recursive equations for both phases are given in the following two subsections; interested readers can refer to [11] for more details. We first define the following quantities used by both phases:

Ψ_n = Σ_{i=1}^{Level(n)} κ_i    (8)

F_n = Ψ_{Level(n)−1} [Ψ_{Level(n)}]^{−1}

where Ψ_n is the prior variance of node n and F_n is the prior covariance between the mean of node n and that of its parent.

Filtering: upward phase

We begin at a node n at level L−1 of the tree. The children of this node are the shifted attributes of the items purchased by the user under category n, I_{u,n}. We compute the probability of this node based on those children, P(µ^u_{x,n}|µ^g_{x,n}, I_{u,n}) = N(φ^u_{x,n}, σ^u_{x,n}), where φ, σ are given as follows:

σ^u_{x,n} = Ψ_n κ_L / (κ_L + |I_{u,n}| Ψ_n)    (9)

φ^u_{x,n} = (σ^u_{x,n} / κ_L) Σ_{I∈I_{u,n}} (I.x − µ̂^g_{x,n})    (10)

We repeat the above computations for all nodes n at level L−1. We then move one level at a time upward, computing P(µ^u_{x,n}|µ^g_{x,n}, I_{u,T(n)}) = N(φ^u_{x,n}, σ^u_{x,n}) for each node n at levels L−2, L−3, · · · , 0 as follows, where T(n) is the tree rooted at node n as defined in Section 2:

• For all nodes m ∈ I_n, i.e. the children of the node n under consideration, compute the following:

φ^u_{x,n|m} = F_m φ^u_{x,m}    (11)

σ^u_{x,n|m} = F_m² σ^u_{x,m} + F_m κ_{Level(m)}    (12)

where the above computes the filtering probability of node n conditioned on child m only (denoted by n|m above).

• Now combine those estimates to find the filtering probability of node n based on all of its children as follows:

σ^u_{x,n} = [Ψ_n^{−1} + Σ_{m∈I_n} ([σ^u_{x,n|m}]^{−1} − Ψ_n^{−1})]^{−1}    (13)

φ^u_{x,n} = σ^u_{x,n} Σ_{m∈I_n} φ^u_{x,n|m} [σ^u_{x,n|m}]^{−1}    (14)

Smoothing: downward phase

The recurrence in the upward phase ends at the root of the tree, and we now have the posterior probability of the root itself from the filtering phase, that is, µ̂^u_{x,0} = φ^u_{x,0} and σ̂^u_{x,0} = σ^u_{x,0}, where we note that the posterior probability of each node n in the tree is distributed as N(µ̂^u_{x,n}, σ̂^u_{x,n}). Now we propagate this information down the tree, computing the posterior probability of each node n by combining the filtering probabilities from its children with the smoothed probability from its parent, as follows:

µ̂^u_{x,n} = φ^u_{x,n} + σ^u_{x,n} F_n [µ̂^u_{x,π(n)} − φ^u_{x,π(n)|n}] [σ^u_{x,π(n)|n}]^{−1}    (15)

σ̂^u_{x,n} = σ^u_{x,n} + (σ^u_{x,n})² F_n² [σ̂^u_{x,π(n)} − σ^u_{x,π(n)|n}] [σ^u_{x,π(n)|n}]^{−2}    (16)

where the quantities indexed by π(n)|n are those computed in the upward phase (Eqs. 11, 12).
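The upward-downward recursion of Eqs. 8-16 can be sketched on a toy taxonomy as follows. Everything here is an illustrative assumption rather than the paper's configuration: the tree, the κ values, the shifted observations, and in particular an extra root prior variance κ_0, which we include so that Ψ is positive at the root.

```python
# Toy 3-level taxonomy: node 0 is the root, {1, 2} are mid-level categories,
# {3, 4, 5} are leaf categories at level L-1 = 2; items hang below the leaves.
parent   = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
children = {0: [1, 2], 1: [3, 4], 2: [5]}
level    = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}
L        = 3
kappa    = {0: 1.0, 1: 0.5, 2: 0.25, 3: 0.125}  # kappa[0]: assumed root prior variance
kL       = kappa[L]
obs      = {3: [0.3, 0.5], 4: [], 5: [-0.2]}    # shifted values I.x - global mean

# Eq. 8 (with the assumed root prior included so Psi > 0 everywhere)
Psi = {n: sum(kappa[i] for i in range(level[n] + 1)) for n in level}
F   = {n: Psi[parent[n]] / Psi[n] for n in parent}

phi, sig, cond = {}, {}, {}
# ---- Upward (filtering) pass ----
for n in (3, 4, 5):                             # leaf categories: Eqs. 9-10
    sig[n] = Psi[n] * kL / (kL + len(obs[n]) * Psi[n])
    phi[n] = sig[n] / kL * sum(obs[n])
for n in (1, 2, 0):                             # internal nodes, bottom-up
    inv_sum, mean_sum = 0.0, 0.0
    for c in children[n]:                       # Eqs. 11-12: predict n from child c
        p_nc = F[c] * phi[c]
        s_nc = F[c] ** 2 * sig[c] + F[c] * kappa[level[c]]
        cond[(n, c)] = (p_nc, s_nc)
        inv_sum += 1.0 / s_nc - 1.0 / Psi[n]
        mean_sum += p_nc / s_nc
    sig[n] = 1.0 / (1.0 / Psi[n] + inv_sum)     # Eq. 13
    phi[n] = sig[n] * mean_sum                  # Eq. 14

# ---- Downward (smoothing) pass ----
mu_s, sig_s = {0: phi[0]}, {0: sig[0]}
for n in (1, 2, 3, 4, 5):                       # top-down: Eqs. 15-16
    p_nc, s_nc = cond[(parent[n], n)]
    gain = sig[n] * F[n] / s_nc
    mu_s[n] = phi[n] + gain * (mu_s[parent[n]] - p_nc)
    sig_s[n] = sig[n] + gain ** 2 * (sig_s[parent[n]] - s_nc)
```

After the downward pass, leaf 4 (which has no observations of its own) inherits its smoothed mean from its parent, while the observed leaves keep means pulled toward their own data, which is the borrowing-of-strength behavior the taxonomy smoothing is meant to provide.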

This concludes the multi-scale Kalman filtering algorithm. It scales linearly with the number of nodes in the tree and is therefore efficient. Once we have computed the smoothed posterior values for each user, we repeat the same algorithm over the global tree. The equations remain exactly the same, except that Eqs. 9 and 10 become:

σ^g_{x,n} = Ψ_n κ_L / (κ_L + |I_{g,n}| Ψ_n)    (17)

φ^g_{x,n} = (σ^g_{x,n} / κ_L) Σ_{u∈U} Σ_{I∈I_{u,n}} (I.x − µ̂^u_{x,n})    (18)

which means that the observations for the global tree are shifted by each user's smoothed posterior mean. The rest of the upward and downward recurrences remains the same.

4.1.2 The Discrete Case

In this subsection we address the same posterior inference problem as in the previous subsection, but over discrete attributes. Our goal is to compute P(θ^g_{y,n}|I_g) and P(θ^u_{y,n}|I_g, I_u). We could use an alternating algorithm similar to the one described for the continuous case, running at each step a Dirichlet-Multinomial algorithm called forward filtering, backward sampling [20] and iterating until convergence. However, this approach is expensive, as it is slow to converge. Hence, we resort to an approximation algorithm due to [7] known as the maximum-path algorithm over trees. This algorithm also proceeds in two phases: an upward phase and a downward phase. In the upward phase, we generate pseudo observations for internal nodes in the tree given the observations at their children. In the downward phase, we smooth each node's distribution with its parent's. We give the algorithm below for the user-specific distribution and then for the global distribution. Let C^u_{y,n} be the counts of observed values of attribute y at node n. C is a vector whose components give the observed counts for each possible value of y. For example, if y is the brand, then each component of C contains the number of items purchased from that brand at node n. If n is a leaf category, then C is directly observed.

• Upward phase: we compute C^u_{y,n} for all nodes n above the leaf-category level (at the leaves C is observed):

C^u_{y,n} = Σ_{m∈I_n} C^u_{y,m}    (19)

which has the intuitive interpretation that C^u_{y,n} is just the sum of the observed counts in the subtree rooted at node n.

• Downward phase: once we reach the root in the upward phase, we compute the smoothed distribution of the root node as follows:

θ̂^u_{y,0} = Normalize(C^u_{y,0} + γ_0)    (20)


where Normalize just makes the input vector sum to one. We then proceed downward, computing the smoothed distribution at all internal nodes as follows:

θ̂^u_{y,n} = Normalize(C^u_{y,n} + γ_{Level(n)} θ̂^u_{y,π(n)})    (21)

The above algorithm computes the posterior over θ^u_y for each user; we then run it again to compute the global distribution θ^g_y. The same equations hold, except that at a leaf category n, θ^g_y is computed using I_{g,n}, the set of observed purchases from all users under category n. Finally, the posterior distribution used to predict how likely user u is to purchase an item under a leaf category n is computed as: Normalize(C^u_{y,n} + γ_{Level(n)} θ̂^u_{y,π(n)} + γ_{Level(n)} θ̂^g_{y,n}).
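The two passes of the maximum-path approximation can be sketched on a toy tree as follows; the taxonomy, the γ schedule, and the brand counts are made-up illustrations of Eqs. 19-21, not the paper's data.

```python
# Toy 2-level taxonomy: node 0 is the root; 1 and 2 are leaf categories.
parent   = {1: 0, 2: 0}
children = {0: [1, 2]}
level    = {0: 0, 1: 1, 2: 1}
gamma    = {0: 0.1, 1: 1.0}              # gamma_Level(n), growing down the tree
counts   = {1: [4.0, 1.0, 0.0],          # observed brand counts at leaf category 1
            2: [0.0, 2.0, 2.0]}          # ... and at leaf category 2

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Upward phase (Eq. 19): an internal node's pseudo-counts are the sum of its
# children's counts, i.e. the counts over the whole subtree.
counts[0] = [sum(col) for col in zip(counts[1], counts[2])]

# Downward phase (Eqs. 20-21): smooth the root, then each node with its parent.
theta = {0: normalize([c + gamma[0] for c in counts[0]])}
for n in (1, 2):
    theta[n] = normalize([c + gamma[level[n]] * t
                          for c, t in zip(counts[n], theta[parent[n]])])
```

Note how the third brand, never seen at leaf category 1, still receives nonzero probability there because the parent's distribution (which saw it at category 2) is mixed in.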

4.1.3 Summary

To summarize, given a set of continuous item attributes X and discrete item attributes Y, our goal is to learn a user preference over each of them. This preference is learned as a distribution over the attribute values. To combat sparsity, we parameterize these distributions as an additive model with two parts: a global component that is shared across all users, and a user-specific component that shifts the global component to personalize the distribution. We learn those two parts for each continuous and discrete attribute using upward-downward algorithms. In the case of continuous attributes, we use an iterative multi-scale Kalman algorithm run until convergence; in the discrete case, we use a non-iterative two-pass approximate algorithm. Each of these algorithms scales linearly with the number of nodes in the tree. We noticed in practice that the iterative Kalman filtering algorithm converges in 5 to 10 iterations. While the output of these algorithms is useful by itself, as we demonstrated in Figure 1, we use it as input to augment a latent-factor model that performs the end-to-end recommendation, as we detail in the next subsection. Finally, we treat the variances over the edges of the trees, κ_i and γ_i, as parameters and estimate them using cross-validation with regard to their effect on the end-to-end recommendation. To reduce the number of parameters we use the following parameterization:

κ_i = κ / i        γ_i = γ · i    (22)

This parameterization implements our intuition that the variances of the parameters diminish as we go down the tree.

4.2 Learning Item Ranking Parameters: Γ

In this subsection, we describe how to find the Γ that maximizes P(R|Γ, q). We exploit recent advances in learning factorization models and rely on the so-called Bayesian personalized ranking (BPR) criterion, whose discriminative training has been shown to significantly outperform generative training [18]. Based on Section 2, we seek to optimize the following objective:

p(Γ|R_u, q) = p(Γ) p(R_u|Γ, q)    (23)

= p(Γ) ∏_{u∈U} ∏_{i∈I_u} ∏_{j∉I_u} p(z_ui > z_uj | Γ, q)

where we approximate the non-smooth, non-differentiable expression z_ui > z_uj using the logistic sigmoid function σ(z_ui − z_uj), with σ(z) = 1/(1 + e^{−z}), which results in maximizing:

Σ_u Σ_{i∈I_u} Σ_{j∉I_u} ln σ(z_ui − z_uj) − λ||Γ||²    (24)

where λ is the regularization constant, which arises from a Gaussian prior over the parameters, and ||Γ||² is given by the following expression:

||Γ||² = Σ_u ||v_u||² + Σ_i ||v_i||² + Σ_{x∈X} α_x² + Σ_{y∈Y} β_y²

We use stochastic gradient descent to optimize the above function. At each iteration, we pick a triplet (u, i, j), where user u bought item i and did not buy item j, and define its associated loss L(u, i, j) as follows:

L(u, i, j) = ln σ(z_ui − z_uj) − λ||Γ||²

= ln σ[⟨v_u, v_i − v_j⟩ + Σ_{x∈X} α_x (q^u_x(i.x) − q^u_x(j.x)) + Σ_{y∈Y} β_y (q^u_y(i.y) − q^u_y(j.y))] − λ||Γ||²

Now, we need to compute the gradients of L(u, i, j) with respect to the parameters Γ = {v_u, v_i, v_j, α_x, β_y} and perform a gradient step over those parameters. It is straightforward to show that this gives rise to the following update rules:

v_u = v_u + ρ(c_uij (v_i − v_j) − λ v_u)

v_i = v_i + ρ(c_uij v_u − λ v_i)

v_j = v_j + ρ(−c_uij v_u − λ v_j)

α_x = α_x + ρ(c_uij (q^u_x(i.x) − q^u_x(j.x)) − λ α_x)

β_y = β_y + ρ(c_uij (q^u_y(i.y) − q^u_y(j.y)) − λ β_y)    (25)

where ρ is the learning rate and c_uij = 1 − σ(z_ui − z_uj). We iterate sampling a tuple and updating the parameters until convergence (we used 100 epochs, where each epoch is a pass over the training data).
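One stochastic-gradient step of Eq. 25 can be sketched as follows, assuming the attribute preference scores q have already been computed by the inference of Section 4.1. All values are synthetic, and a single continuous attribute is used for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
K, rho, lam = 4, 0.05, 0.01                   # factors, learning rate, regularizer
v_u, v_i, v_j = (rng.normal(size=K) for _ in range(3))
alpha = 0.1                                   # weight of the single attribute
q_i, q_j = 0.7, 0.2                           # precomputed q^u_x(i.x), q^u_x(j.x)

# Affinities (Eq. 2, restricted to one attribute) and the BPR weight c_uij.
z_ui = alpha * q_i + v_u @ v_i
z_uj = alpha * q_j + v_u @ v_j
c = 1.0 - sigmoid(z_ui - z_uj)

# Gradients of L(u, i, j); compute all of them before updating so that the
# step uses consistent (old) parameter values.
g_u = c * (v_i - v_j) - lam * v_u
g_i = c * v_u - lam * v_i
g_j = -c * v_u - lam * v_j
g_a = c * (q_i - q_j) - lam * alpha

v_u, v_i, v_j = v_u + rho * g_u, v_i + rho * g_i, v_j + rho * g_j
alpha = alpha + rho * g_a
```

A full training loop would sample triplets (u, i, j) and repeat this step for the stated number of epochs.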

4.2.1 Hierarchical Extension: hLFUM

Since we have access to a taxonomy over items, we can also constrain the item factors to be smooth over the taxonomy using an additive model, as follows. Recall that items appear at level L in the taxonomy. We define v_i as follows:

v_i = Σ_{l=0}^{L} w_{π_l(i)}    (26)

where π_l(i) is the l-th parent of item i in the taxonomy, and w_n is a latent factor assigned to each internal node in the taxonomy. Thus two sibling items i, i′ share all the nodes from their parent up to the root, and as such differ only in the terms w_i, w_{i′} in the above summation; hence v_i and v_{i′} become close to each other in the latent space. A similar additive model has been proposed before in [17]. The update rules in (25) for v_u, α_x, β_y remain the same, using v_i as defined in (26). A new update rule for w must be added, and it can be derived using the chain rule as follows:

w_{π_l(i)} = w_{π_l(i)} + ρ(c_uij v_u − λ w_{π_l(i)}),    l = 0, 1, · · · , L

w_{π_l(j)} = w_{π_l(j)} + ρ(−c_uij v_u − λ w_{π_l(j)}),    l = 0, 1, · · · , L

which means that we update each weight on the path from the root to the item using the same gradient we used to update v_i in (25) (this is easy to see from the chain rule applied to (26)).
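The path-sum of Eq. 26 can be sketched as below; the taxonomy path and the node vectors are hypothetical.

```python
import numpy as np

# Hypothetical root paths: each node points to its parent (root -> None).
parents = {"ipod": "mp3_players", "zune": "mp3_players",
           "mp3_players": "electronics", "electronics": None}
K = 3
w = {node: np.full(K, 0.1) for node in parents}   # one latent vector per node
w["ipod"] = np.array([0.1, 0.1, 0.4])             # leaf-specific deviation

def item_factor(item):
    """Eq. 26: v_i is the sum of the node vectors on the path from i to the root."""
    v, node = np.zeros(K), item
    while node is not None:
        v += w[node]
        node = parents[node]
    return v

v_ipod, v_zune = item_factor("ipod"), item_factor("zune")
```

Because the two sibling items share every node vector except their own leaf vectors, their factors differ only by the difference of those leaf vectors, which is exactly the smoothness the hierarchical extension is after.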


5. EXPERIMENTAL SETUP

5.1 Dataset

To evaluate our proposed models, we used a log of users' online transactions obtained from a major search engine, email provider and online shopping site, joined with a dataset of user conversions on display advertising where the ads correspond to products. The dataset contains information about the historical purchases of users over a period of 3 months and their responses to display ads shown to them. We fully anonymize the users by dropping the original user identifier and assigning a new, sequential numbering to the records. For anonymity reasons, we report results over a sample of the above data. The sample contains about 40,000 users with an average of 2.3 purchases/ad responses each. The dataset contains 500,000 distinct individual products that are purchased/clicked upon, which are mapped to a publicly available shopping taxonomy [1] that also gives the price and the brand of each item. We use price as an example of a continuous attribute and brand as an example of a discrete attribute. The resulting taxonomy has about 1.5 million individual products at the leaf level, organized 3 levels deep, with around 1500 nodes at the lowest level, 270 at the middle level and 23 top-level categories. For each user, we select a transaction and place all subsequent transactions/ad responses into the test dataset; all previous transactions/ad responses are used for training. The last transaction/ad response in the training dataset is used for cross-validation, and the first transaction/ad response in the test dataset is used for prediction and for reporting the error estimates.

5.2 Metrics

We use the AUC metric described below to compare our model with the baseline systems. AUC is a commonly used metric for testing the quality of rank orderings. Suppose the list of items to rank is X and our test transaction is B. Also suppose r(x) is the numerical rank of item x according to our model (from 1 . . . n). Then, the formula to compute AUC is given by:

AUC = (1 / (|B| · |X \ B|)) Σ_{x∈B, y∈X\B} δ(r(x) < r(y))

While we could have used an alternative metric such as precision/recall at a given rank in the list, that requires selecting a suitable rank value. In contrast, AUC combines the prediction performance over all ranks into a single number and is therefore convenient to use.
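The AUC formula above can be computed directly from its pairwise definition; the item names and ranks below are hypothetical.

```python
def auc(rank, test_items, all_items):
    """Fraction of (test, non-test) pairs where the test item is ranked better."""
    rest = [x for x in all_items if x not in test_items]
    correct = sum(1 for b in test_items for y in rest if rank[b] < rank[y])
    return correct / (len(test_items) * len(rest))

# Hypothetical ranking of four items (rank 1 is best).
rank = {"a": 1, "b": 2, "c": 3, "d": 4}
```

With test set {"a"} the AUC is 1.0 (the test item beats all three others); with {"c"} it is 1/3 (the test item beats only "d").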

6. EVALUATION

In this section, we study the performance of our proposed model LFUM and compare it with several baselines. We use the following systems for comparison:

• LFUM is the model proposed in this paper with NO taxonomy over the item factors.

• hLFUM is the model proposed in this paper with a taxonomy over the item factors.

• LF is a latent factor model, i.e., it uses no item attributes.

• hLF is a latent factor model with a taxonomy over the item factors.

• UM is a user preference model only. That is, no latent factors are used; we just train the smoothed user preferences (i.e., we set ε_ui to zero).

• HybridLF is the hybrid LF model described in Section 2.3.


Figure 3: Comparing LFUM and hLFUM against several baselines for different numbers of factors.


Figure 4: Comparing our approach LFUM with a strengthened version of the standard HybridLF described in Section 2.3. The figure shows that our approach is more effective at utilizing the extra information (brand in this case).

All models are trained using the BPR algorithm and all regular-ization parameters are tuned on a validation set.

6.1 Comparison to Baselines

In Figure 3 we compare LF, hLF, LFUM and hLFUM over different numbers of factors. As evident from the figure, hLFUM and LFUM significantly outperform the models that ignore user preferences. Notice that LF does not use the taxonomy in any way, while hLF uses the taxonomy to constrain the item factors and combat sparsity. Nevertheless, hLF still lags even behind LFUM, which does not impose a taxonomy constraint over the latent factors. It should be noted that both LFUM and hLFUM use the taxonomy to smooth the user preferences, but only hLFUM uses the taxonomy to smooth the item factors as well.

Second, we compare with the standard HybridLF model from Section 2.3. To make the comparison fair, we only use the brand attribute, to avoid introducing any bias in the result due to improper thresholding of the price attribute (note that our model does not need any such thresholding, but we still use only the brand attribute in this experiment). The cardinality of the brand attribute is 17K, thus we have |Y| = 17K, since we need to add a binary variable for each brand. Furthermore, to make HybridLF stronger, we stipulate that W_y = W^u_y + W^0_y + W^c_y, i.e., we use an additive model to define W_y that combines user-specific weights with background



Figure 5: Studying the effect of the method used to learn user preferences on the final performance of our model. The figure shows that each part of our construction adds to the final result.

weights W^0_y and the item's category weights W^c_y. We found that this form performed best, as opposed to using only a global weight vector, W^0_y, or only a user-specific weight vector, W^u_y. The results are shown in Figure 4. As evident, our model outperforms this baseline, which confirms that our approach does not suffer from attributes with high cardinalities and that our model makes better use of the item attributes.

6.2 Ablation Study

In this subsection we study the contribution of each part of our model, to understand the source of our improvement over the baselines. In Figure 5, we first study the effect of smoothing on the user preference models. We parameterize LFUM and hLFUM with the technique used to compute the user preferences over items, as follows:

• G: we ignore the user-specific distributions and the hierarchy, and only compute the global preferences over each leaf category individually and use them for all users.

• G+H: same as G, but in addition we use the tree to smooth the computation of the global preferences over item attributes. No user personalization is used.

• G+H+P: the full model described in Section 3, where the preferences are computed using an additive process smoothed over the taxonomy.

We compare these configurations against an hLF model by varying the number of factors. As is evident from the figure, using only the global preference with no smoothing yields minimal improvement and sometimes even degrades performance for small numbers of factors; as the number of factors increases, the model learns to use the extra factors to counter that detrimental effect. However, smoothing those global preferences over the tree does improve the performance. At first this might be puzzling, but the effect can be attributed to a wisdom-of-the-crowd effect. In other words, if the majority of users prefer a given brand, then recommending this brand under a category that already interests the user (which can be inferred from the LF part of the model) is likely to help the recommendation process. Finally, adding user personalization significantly improves the results.

In Figure 6 we performed another ablation study, to examine whether the improvement brought by LFUM is due only to the item preferences. Here we compare the UM model against LFUM using 40 factors, varying the technique used to smooth the preferences. As shown in the figure, while UM is a strong baseline, LFUM outperforms it, which shows the importance of using a latent


Figure 6: Comparing the performance of the full model against a model that only uses items' attributes.


Figure 7: Studying the effect of different attribute types on the final result

factor model to account for the unobserved sources that draw the user to a given item.

6.3 Study of Attribute Type

In this subsection we study the effect of attribute type on the final result. We compare three models: hLF, hLFUM trained with brand preferences only, and hLFUM trained with both brand and price preferences. Both LFUM and hLFUM use the default preference learning mechanism (G+H+P). As can be seen from the results, both attributes bring improvements, although we notice that in general brand preferences bring larger improvements, due to brand loyalty: a user who likes "Apple" products, for instance, would still buy "Apple" products even if "Apple" increased its prices. Nevertheless, price by itself still provides a strong signal.

6.4 Effect of Item Frequency

We compare the performance of our approach hLFUM with the baseline hLF as the frequency of items changes. As shown in Figure 8, our approach significantly outperforms the baseline, especially when the item frequency is small (i.e., cold-start items). We omit the result of HybridLF, as it was worse than hLF, as we showed in Figure 4.

7. RELATED WORK

There is a wealth of literature on latent factor models for recommendation (see [15] for a survey). Conventional recommendation algorithms do not use user/item attributes, which can limit their performance when the feedback data is sparse. In recent work, several authors propose to augment the latent factor model with item



Figure 8: Illustrating the difference in performance between hLF and hLFUM as a function of item popularity. Each cell on the x-axis represents items in a given frequency percentile range. Note that our approach consistently gives good results across different frequency ranges. Number of factors = 30.

and user attributes. This line of work was pioneered by Agarwal et al. [2]. However, those authors used very high-dimensional item and user attributes and used them to constrain the projection of users and items into the latent space. In contrast, in our work we only use a few attributes (two in our study, brand and price), and therefore this approach is not applicable in our setting. Moreover, our approach is unique in that it first uses an additive model to preprocess the attributes before combining them with the latent factor model. This has the advantage of turning a discrete attribute with a large cardinality into a single informative feature. An alternative approach, for instance, is to add a binary feature for each distinct brand; however, that would produce a lot of noise and significantly slow down learning.

Our work is also related to hybrid recommender systems, which combine two or more techniques, such as content and collaborative filtering, to make recommendations [6]. We make several novel contributions in this work. We derive user preferences (e.g., brand, price) in terms of a global preference model and a per-user personalized model. The former captures the global trend and is very useful for sparse users, while the latter allows us to capture the individual characteristics of dense users. In other words, for new users we leverage the global model, and as we get more and more data about a user, his/her preference model gets increasingly personalized. Moreover, we exploit the taxonomy while learning these preference models (in contrast to Yu et al. [21]). This allows us to deal with sparsity; e.g., preferences over different phones can be propagated to understand preferences over siblings in the hierarchy, such as computers and printers, as well as parents, such as electronics.

8. CONCLUSIONS

In this paper we presented a novel approach to integrating content-based and latent-factor models. We learn a user preference model over item attributes that factors out common parts shared across users via an additive model, thus allowing user preferences to be represented in a discriminative fashion. Moreover, we showed how to integrate these learned preferences with a latent factor model and train the resulting system end-to-end to optimize a Bayesian discriminative ranking loss function. Our approach can handle continuous and discrete attributes, and the complexity of learning and inference is only weakly dependent on the cardinalities of the item attributes. We demonstrated the efficacy of our approach on a smart ad retrieval task, with very promising results over several baselines. In the future we plan to incorporate the dynamics of user interests [4] and to scale our inference to hundreds of millions of users using techniques from [3].

9. REFERENCES

[1] Pricegrabber. http://www.pricegrabber.com/.
[2] D. Agarwal and B.-C. Chen. Regression-based latent factor models. In KDD, pages 19–28, 2009.
[3] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In WSDM, pages 123–132, 2012.
[4] A. Ahmed, Y. Low, M. Aly, V. Josifovski, and A. J. Smola. Scalable distributed inference of dynamic user interests for behavioral targeting. In KDD, pages 114–122, 2011.
[5] J. Bennett and S. Lanning. The Netflix prize. In KDD Cup and Workshop, in conjunction with KDD, 2007.
[6] R. D. Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, 2002.
[7] P. Cowans. Probabilistic document modelling. 2006.
[8] A. Gunawardana and C. Meek. Tied Boltzmann machines for cold start recommendations. In RecSys, pages 19–26, 2008.
[9] A. Gunawardana and C. Meek. A unified approach to building hybrid recommender systems. In RecSys, pages 117–124, 2009.
[10] P. D. Hoff. Bilinear mixed effects models for dyadic data. Journal of the American Statistical Association, 2005.
[11] K. C. Chou, A. S. Willsky, and A. Benveniste. Multiscale recursive estimation, data fusion, and regularization. IEEE Transactions on Automatic Control, 39(3), 1994.
[12] N. Koenigstein, G. Dror, and Y. Koren. Yahoo! music recommendations: Modeling music ratings with temporal dynamics and item taxonomy. In RecSys, 2011.
[13] Y. Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In KDD, pages 426–434, 2008.
[14] Y. Koren. Collaborative filtering with temporal dynamics. In KDD, pages 447–456, 2009.
[15] Y. Koren and R. M. Bell. Advances in collaborative filtering. In Recommender Systems Handbook, pages 145–186. 2011.
[16] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
[17] A. Mnih. Taxonomy-informed latent factor models for implicit feedback. In KDD Cup, 2011.
[18] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, 2009.
[19] S. Rendle and L. Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In WSDM, 2010.
[20] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2006.
[21] K. Yu, J. D. Lafferty, S. Zhu, and Y. Gong. Large-scale collaborative prediction using a nonparametric random effects model. In ICML, page 149, 2009.