
Beyond Collaborative Filtering: The List Recommendation Problem

Oren Sar Shalom∗, Bar Ilan University and IBM Research
Noam Koenigstein, Microsoft
Ulrich Paquet, Microsoft
Hastagiri P. Vanchinathan∗, ETH Zürich

ABSTRACT
Most Collaborative Filtering (CF) algorithms are optimized using a dataset of isolated user-item tuples. However, in commercial applications recommended items are usually served as an ordered list of several items and not as isolated items. In this setting, inter-item interactions have an effect on the list’s Click-Through Rate (CTR) that is unaccounted for using traditional CF approaches. Most CF approaches also ignore additional important factors like click propensity variation, item fatigue, etc. In this work, we introduce the list recommendation problem. We present useful insights gleaned from user behavior and consumption patterns from a large scale real world recommender system. We then propose a novel two-layered framework that builds upon existing CF algorithms to optimize a list’s click probability. Our approach accounts for inter-item interactions as well as additional information such as item fatigue, trendiness patterns, contextual information etc. Finally, we evaluate our approach using a novel adaptation of Inverse Propensity Scoring (IPS) which facilitates off-policy estimation of our method’s CTR and showcases its effectiveness in real-world settings.

Keywords
Collaborative Filtering, Click prediction

1. INTRODUCTION AND MOTIVATION

The Netflix Prize competition [1] was a seminal event in recommender systems research. The original set-up of the competition set the stage for much following research. Over the years, additional directions have been explored such as implicit feedback [9], the cold-start problem [21], rank optimization [19, 23, 30] etc. Yet, the original setting of optimizing for isolated user-item tuples still dominates contemporary recommender systems research.

∗This work was done while at Microsoft R&D Center in Israel.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author’s site if the Material is used in electronic media. WWW 2016, April 11–15, 2016, Montréal, Québec, Canada. ACM 978-1-4503-4143-1/16/04. http://dx.doi.org/10.1145/2872427.2883057.

In this work we try to break away from this paradigm and consider a structured approach to investigate and model the more complex dynamics of the list recommendation problem.

The list recommendation problem can be defined as follows: Given a specific user u at time t, our goal is to produce an ordered personalized list of K items l_ut = {i_1, i_2, ..., i_K} that maximizes the probability that u will click on an item from l_ut. Note that we assume that the same set of recommended items may yield different click probabilities when ordered differently, hence this formulation is different from that of the set recommendation problem (unordered lists). It is also different from learning to rank approaches for collaborative filtering that train on pairwise relations but ignore deeper interactions between items.

In Section 2, we review in detail the main prior work related to the list recommendation problem. The approach presented in this paper has several key advantages over prior work: First, as we show in Section 4, the expected CTR of a list l_ut depends on many complex factors beyond the user-item ratings, such as diversity, item fatigue, popularity trends, contextual factors, the list layout on the screen, and many others. Prior approaches have considered some of these factors individually, but in this work we consider all of them jointly, accounting for possible interactions. Second, many previous studies focused on the diversity factor, assuming that a trade-off exists between accuracy and diversity. In contrast, we propose a structured approach in which diversity is optimized alongside the overall accuracy. Finally, many ranking approaches for collaborative filtering attempt to rank the entire catalog of items, whereas in our formulation we optimize only for a list of K items presented to the user. The number of items (K) is typically small and therefore we can consider the specific position of each item in the list. We do not assume that the list order implies an order on the quality or fitness of the items. Namely, we do not assume that the first item is better than the second item, which in turn is better than the third item, etc. Instead, we learn the optimal ordered combination of items based on the individual characteristics of each item, its position in the list and interactions with other items. For example, in Section 4 we demonstrate a surprising pattern for the interaction between item-ratings (accuracy) and diversity that varies with the item’s position in a list.

In this paper we treat the classical collaborative filtering problem of predicting ratings for user-item tuples as a “solved” problem and look beyond it at the list recommendation problem.


We assume we already have a collaborative filtering algorithm implementation (in our case the Xbox recommender [11, 17]) that produces accurate predictions of user-item ratings. We utilize this existing algorithm and treat its predictions as features for a “second layer” of learning aimed at optimizing the list of items to be presented to the user. Hence, our set-up is a two-layered learning approach in which the collaborative filtering algorithm is trained in the first layer to serve as features (together with additional information) for the second layer. This second layer can potentially use a very different dataset from the first layer. For example, the Xbox recommender system uses a dataset of purchase signals (implicit binary ratings) to learn a matrix factorization model as our first layer [17]. Then, the second layer solves a classification problem based on a dataset of list impressions in which the binary label indicates whether a user clicked on an item from a list.

Our ultimate goal is to optimize the lists’ Click-Through Rate (CTR). Therefore, one may question the need for different datasets in the first and second learning layers. Based on our experience, collaborative filtering algorithms such as matrix factorization are very effective in learning interests or “taste”, yet they are less useful in optimizing for CTR. This two-layered set-up effectively extracts the relevant information from the first dataset in order to improve predictions in the second one. Furthermore, many commercial companies already possess the first collaborative filtering layer and employ different heuristics to generate recommendation lists. We chose a two-layered approach for practical reasons as it builds upon those existing systems already in production while proposing a structured approach to compose the final recommendation lists.

The contributions of this paper are threefold: (1) We introduce the list recommendation problem and investigate several factors that affect lists’ CTR using insights from the Xbox recommender system. (2) We propose a simple yet effective two-layered approach that builds upon existing collaborative filtering solutions (first layer) in order to find the optimal list for each user (second layer). As explained in Section 2, our approach is novel and differs from previous research by considering the entire recommendation list as a unit. (3) In Section 6.2, we propose a novel modification of Inverse Propensity Scoring (IPS) [25] for off-policy evaluation. Using this evaluation technique, we empirically illustrate the benefits of our approach and estimate a CTR lift of up to 3x in Xbox’s main dash and up to 2.7x in Xbox’s recommendations dash.

2. BACKGROUND AND RELATED WORK

Collaborative Filtering (CF) techniques have been used widely in recommender systems. One of the key developments in CF research is Matrix Factorization (MF) [12], whose methods proved their usefulness in competitions such as the Netflix Prize [1], KDD Cup’11 [6] and others. Xbox recommendations also rely upon an MF algorithm [11, 17], which is used in this paper for the “first layer” of our method.

The list recommendation problem is related to the top-K optimization problem [3]. However, most work has been concerned with ranking techniques borrowed from search problems, while work focused specifically on recommending lists has been sparse.

In the context of recommender systems, learning to rank approaches have been applied mostly as extensions to MF models [19, 23, 24]. These approaches generally modify the loss function of the MF model to fit a ranking task. However, they maintain the “traditional” paradigm in which isolated user-item tuples are ranked one against the other rather than considering the entire recommendation list as a target for optimization. Hence, they neglect many relevant factors such as inter-item interactions or contextual information. Furthermore, in recommendation scenarios it is possible that the best item should not be placed in the first position but rather in a different location depending on the specific layout. For example, in horizontal recommendation lists (e.g., Netflix) it is likely that the best items should be placed in the middle of the list rather than at the edges. In this paper we learn to optimize an ordered combination of K items without ranking items one against the other.

This work is also related to previous studies on click prediction. Most such studies focused on either sponsored search results or advertisements [2, 8, 15, 22]. At the heart of this paper is in fact a click prediction model designed specifically for CF recommender systems such as the Xbox recommender system.

Some prior work on list recommendation incorporates fatigue, temporal features and diversity independently [13, 18, 31, 35]. However, these critical building blocks of recommendation list optimization show complex interactions amongst themselves and therefore should be modeled together. For example, in the case of list diversity we provide a real-world account from Xbox showing that the benefits of diversity strongly depend on the user’s preferences (Section 4.1).

A plethora of studies discusses what is known in the literature as the accuracy vs. diversity trade-off [29, 34]. For example, Jambor and Wang [10] stated that “accuracy should not be the only concern” in a recommender system and applied constrained optimization to promote long tail items. Zhang et al. [33] performed a user study showing that introducing novelty and serendipity into music recommendations while limiting the impact on accuracy can increase user satisfaction. Rodriguez et al. [20] used A/B testing to show that a multi-objective optimization approach for LinkedIn’s TalentMatch system can increase reply rate by 42%. In contrast to these studies and others, this paper does not address the problem in terms of balancing a multi-objective trade-off. Instead, we present a structured approach to learn the right amount of diversity from data in a manner that effectively improves our ultimate utility, which is the system’s CTR.

Finally, recommender systems can be viewed as online learning tasks with partial feedback in which multi-arm bandit algorithms can be applied. Previous papers focused mainly on the domains of online advertising and recommending news articles [14, 26, 27], where it is critical to balance between exploration and exploitation. Top-K recommendations via contextual bandits were studied in [28] and shown to be superior to learning-to-rank approaches. However, in this work we present a principled approach that utilizes a traditional collaborative filtering algorithm and is not concerned with the explore-exploit dilemma.

3. DATASET

As discussed in Section 1, our two-layered approach uses different datasets for each learning layer. The first layer is based on a one-class collaborative filtering model which is trained using purchase data from tens of millions of Xbox users.


Figure 1: Xbox 360 Main Dash - a vertical list of 2 items is highlighted on the right (in green).

The model and dataset for the first layer are described in [17]. The predictions of the first layer are used as features for the second learning layer, which is geared towards learning recommendation lists and trained using clicks collected from Xbox.

Figure 1 depicts the Xbox 360 main dash¹. A list of two personalized items is presented on the upper right hand side. We call this list the main dash list. Below the main dash list, the user can find the all recommendations button, which takes the user to the recommendations dash. In the recommendations dash, the user is presented with a horizontal list of 20 items. Both lists are refreshed before each impression by heuristically sampling a new diverse set of items using the predicted item scores from the first layer. In this paper we propose to replace these heuristics using a second layer based on supervised learning.

Our datasets are composed of impressions and clicks from the main dash and recommendations dash collected during one month in Q4, 2014. The total number of data points collected in this period is much larger than the datasets used for this paper. We removed users with more than 100 impressions and further “diluted” both datasets by randomly sampling users. The main dash dataset is composed of a sample of 5,339,456 impressions served to 1,406,470 users. We reserved the last (chronologically later) 10% of impressions for an evaluation test-set. Similarly, for the recommendations dash dataset we used a sample of 6,319,843 impressions served to 1,423,640 users and reserved the last 10% for evaluation. Finally, we created binary supervised datasets by associating a positive label to impressions which resulted in a click (on any item) and a negative label to impressions that did not yield a click.
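For illustration only, a minimal sketch of how such a labeled, chronologically split dataset could be assembled from an impression log is given below. The pandas-based representation and the column names (user, time, list, clicked) are assumptions made for this sketch, not the actual Xbox log schema.

import pandas as pd

def build_list_dataset(impressions: pd.DataFrame, test_fraction: float = 0.1):
    """impressions: one row per served list, with columns
    ['user', 'time', 'list', 'clicked'], where clicked is True if any item was clicked."""
    impressions = impressions.sort_values('time')
    # Binary label: +1 if the impression led to a click on any item, -1 otherwise.
    impressions = impressions.assign(label=impressions['clicked'].map({True: 1, False: -1}))
    # Chronological split: the last 10% of impressions form the evaluation test-set.
    cut = int(len(impressions) * (1.0 - test_fraction))
    return impressions.iloc[:cut], impressions.iloc[cut:]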

The characteristics of these two datasets are very different: The main dash list is a vertical list of just two items. It is an example of a passive scenario in which the user hasn’t pro-actively requested recommendations. Instead, she is presented with the list while she is most likely focused on a different area of the screen. In such passive scenarios, CTR is typically very low and therefore we created a balanced dataset by uniformly sampling the negative impressions for each user. In contrast, the recommendations dash dataset is an example of an active scenario in which the user has actively requested recommendations. This yields much higher CTR levels, hence no dataset balancing was required.

¹The dashboard’s appearance changes with different update versions. This image captures its appearance at the time of data collection.

Figure 2: Empirical and estimated click probability for items in the upper and lower slots on Xbox’s main dash as a function of the predicted rating and similarity to the other item in the list. Panels: (a) Upper slot: Empirical; (b) Lower slot: Empirical; (c) Upper slot: Predicted; (d) Lower slot: Predicted. Axes: Rating, Similarity, Click Probability.

This dataset also differs in its presentation (horizontal instead of vertical) and list length (20 items as opposed to 2). In Section 7 we note the influence of these differences on the algorithm’s results.

4. INSIGHTS FROM THE XBOX RECOMMENDER SYSTEM

In this section we present some insights from the Xbox recommender system that illustrate the possible shortcomings of contemporary collaborative filtering techniques.

4.1 Inter-item Similarity Interactions

Recommending a list of size K is typically very different from ranking top-K items. One key difference is the fact that items’ relevance is not independent, and diversity/similarity plays a significant role in determining the list’s click probability. In contrast to many previous works that consider an “accuracy vs. diversity trade-off”, we use data from the Xbox recommender system to show that the actual relationship between accuracy and diversity is, in fact, more complex.

Figure 2 depicts the click probability of items in Xbox’s main dash as a function of both their rating and the Jaccard similarity to the other item presented in the same recommendation list. Sub-figures (a) and (b) depict the empirical click probability surfaces for the items in the upper and lower slots respectively. Sub-figures (c) and (d) depict the predicted probability surfaces for the upper and lower slots respectively, using a simple logistic regression model with a third degree polynomial kernel. For the upper slot we see a clear positive correlation between the predicted rating coming from the MF model and the click probability, however there seems to be very little relevance to the similarity with the second item below it.


Figure 3: A histogram of the number of impressions before a click event (x-axis: average impressions before click; y-axis: % of items).

This is not the case with the lower item, where a more complex pattern is revealed: For high predicted ratings, it is best not to diversify and recommend “more of the same”, but if the predicted rating is low it is better to “hedge bets” by diversifying the list. We explain the difference between the upper and lower items by considering the specific layout of the main dash recommendation list (Figure 1). The typical user probably notices the upper item before the lower one. Hence, the upper item’s click probability is independent of the lower item, but the lower item’s click probability is highly dependent on what is shown above it.

The Xbox recommender is based on a Bayesian model that models the posterior probability that a user will purchase an item [17]. These probabilities capture both preferences and uncertainty. From Figure 2 we learn that if the system has more certainty in a user’s preferences it is better to present a list with several similar items even if it means lower diversity. However, when the system’s estimation of the user preference has less certainty, more clicks may be obtained by diversifying the items on the list. Based on these insights, the need to diversify varies and depends on the predicted rating, the list’s layout and each item’s position.
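As an illustration of how surfaces like those in sub-figures (c) and (d) of Figure 2 can be estimated, the sketch below fits a logistic regression on an explicit third-degree polynomial expansion of (rating, similarity) pairs instead of a kernelized solver; the input arrays are assumed to hold one entry per logged impression of the relevant slot.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def click_probability_surface(ratings, similarities, clicks, grid_size=50):
    """Fit P(click | rating, similarity) with a degree-3 polynomial feature expansion
    and evaluate it on a regular grid, as in Figure 2 (c) and (d)."""
    X = np.column_stack([ratings, similarities])
    model = make_pipeline(PolynomialFeatures(degree=3), LogisticRegression())
    model.fit(X, clicks)
    # Evaluate the fitted model on a [0, 1] x [0, 1] grid of (rating, similarity) pairs.
    r, s = np.meshgrid(np.linspace(0, 1, grid_size), np.linspace(0, 1, grid_size))
    grid = np.column_stack([r.ravel(), s.ravel()])
    return model.predict_proba(grid)[:, 1].reshape(grid_size, grid_size)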

4.2 Item Fatigue

Item fatigue occurs when a user is recommended the same item multiple times. After several impressions of the same item to the same user, the click probability decreases. In Xbox, we refresh the recommendation list prior to each visit, however some items are repeated multiple times. In the main dash there exists a noticeable effect of item fatigue on the overall CTR. In Figure 3, we demonstrate this effect by investigating the variation in the number of impressions before a click event is observed. For each click event, we counted the number of times the user was presented with that same item prior to clicking on it. We averaged these across all users and items and present the histogram of the number of impressions preceding the first click. The histogram exposes an interesting pattern: The item fatigue is not always inversely correlated with the click probability. The histogram suggests that there is a threshold number of impressions which is required to maximize the click probability on an item. This threshold varies for different recommender systems and may even change across users and items. This insight indicates the need for a principled approach to account for item fatigue when optimizing for list recommendation.
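A rough sketch of the statistic behind Figure 3 is shown below, assuming an impression log with one row per (user, item) impression and a clicked flag; the column names are assumptions for this sketch.

import pandas as pd

def impressions_before_first_click(events: pd.DataFrame) -> pd.Series:
    """events: columns ['user', 'item', 'time', 'clicked'], one row per impression.
    Returns the normalized histogram of impressions shown before the first click."""
    events = events.sort_values('time')
    counts = []
    for _, group in events.groupby(['user', 'item']):
        clicked = group['clicked'].to_numpy()
        if clicked.any():
            # Number of impressions of this item to this user before its first click.
            counts.append(int(clicked.argmax()))
    return pd.Series(counts).value_counts(normalize=True).sort_index()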

Time Slot      1      2      3      4      5      6
% of Users   26.98  25.61  23.62  19.23  18.42  12.51

Table 1: Percentage of users with significantly higher CTR (compared to the average CTR across the whole day) in various time slots. Note that a subset of users could feature under multiple time slots while others might not feature in any time slot.

4.3 CTR Variations by Time of Day

Click probabilities vary throughout the day and peak at different times for different users. Existing work on handling temporal dynamics for collaborative filtering either studies long-term temporal effects or uses heuristics to optimize for session diversity [5, 12, 32]. In Table 1, we study the time-of-day click patterns of Xbox users. We split days into six time slots of four hours and consider the CTR per time slot. We count the number of users in each time slot whose CTR for that time slot is at least 30% higher than average. Note that for some users, there may be more than one time slot in which their CTR is significantly higher than their average CTR. On the other hand, there could be another subset of users for whom there is no significant increase in CTR in any time slot. Table 1 shows that many users have one or more preferred time slots in which their consumption of recommendations is considerably higher than in other time slots. This pattern can and should be exploited by recommender systems to optimize the timing and repetition of specific items.
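The per-user time-slot statistic of Table 1 (and the is_best_timeslot feature used later in Section 5.2) could be computed roughly as sketched below; the 30% threshold comes from the text, while the column names are assumptions.

import pandas as pd

def preferred_timeslots(events: pd.DataFrame, lift_threshold: float = 1.3):
    """events: columns ['user', 'time', 'clicked'], one row per impression.
    Returns, per user, the 4-hour slots whose CTR is at least 30% above the
    user's overall CTR."""
    events = events.assign(slot=pd.to_datetime(events['time']).dt.hour // 4)  # slots 0..5
    overall = events.groupby('user')['clicked'].mean()
    per_slot = events.groupby(['user', 'slot'])['clicked'].mean()
    preferred = {}
    for (user, slot), ctr in per_slot.items():
        if overall[user] > 0 and ctr >= lift_threshold * overall[user]:
            preferred.setdefault(user, set()).add(slot)
    return preferred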

5. METHODOLOGY

The insights described in Section 4 motivate a secondary supervised learning layer aimed specifically at optimizing CTR for recommendation lists. The datasets described in Section 3 form a classification problem consisting of list impressions and their consequent “click” or “no-click” labels. The second layer utilizes predictions from a first collaborative filtering layer together with many other potentially informative features in order to learn the click propensity of a list. Formally, we associate each impression in our dataset with a context feature vector denoted by x_tul, where u, l and t represent the impression’s user, list and time respectively. We denote by y_tul the binary label, where y_tul = 1 means that user u was presented with l at time t and clicked on at least one item from l, and y_tul = −1 means otherwise. Given a dataset D of feature-label tuples (x_tul, y_tul) ∈ D we learn a predictor h(x_tul) = ŷ_tul that generalizes user-list temporal predictions denoted by ŷ_tul ∈ [−1, 1]. In Section 5.1 we explain this supervised model and in Section 5.2 we discuss how our contextual features x_tul are constructed. Finally, in Section 5.3 we describe an effective list recommendation policy, denoted by Φ, which utilizes the ŷ_tul predictions to construct and recommend lists to users.

5.1 Gradient Boosted Trees

Our secondary layer utilizes Gradient Boosted Trees (GBT) [7] to solve the list recommendation problem. GBT is an ensemble method [4] which constructs a classifier

\[
\hat{y}_{tul} = h_M(x_{tul}) = \sum_{m=1}^{M} g_m(x_{tul}) \qquad (1)
\]


by sequentially adding M “weak” base estimators denoted by g_m. Formally, at stage 1 ≤ m ≤ M the full estimator is h_{m−1} = Σ_{i=1}^{m−1} g_i and a new base estimator g_m is added so that the new full estimator becomes h_m = h_{m−1} + g_m. The base estimators are chosen to fit the pseudo-residual

\[
\sum_{(x_{tul},\, y_{tul}) \in \mathcal{D}} \frac{\partial L\left(y_{tul},\, h_{m-1}(x_{tul})\right)}{\partial h_{m-1}(x_{tul})},
\]

where L is the binomial deviance loss function L(y, f(x)) = log(1 + exp(−2 y f(x))) as in [7]. Each base estimator is a tree that partitions the input space into J disjoint regions. The number of base estimators M and the number of leaves J were determined by means of cross validation and set to M = 98 and J = 16.

In Section 4 we showed the existence of complex feature interactions as well as non-linearities with respect to the click probability. For the task of list recommendation, we choose the GBT algorithm because it can easily detect and exploit these patterns without the need to carefully craft a kernel. In Section 7, we illustrate this advantage by comparing GBT to logistic regression and Support Vector Machine (SVM) classifiers.
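For concreteness, the second-layer classifier could be instantiated with a standard GBT implementation. The sketch below uses scikit-learn's GradientBoostingClassifier with a binomial deviance (log-loss) objective and the M = 98, J = 16 settings reported above; the feature matrix X and label vector y stand for the x_tul vectors and click labels and are not constructed here.

from sklearn.ensemble import GradientBoostingClassifier

# M = 98 base trees, each limited to J = 16 leaves, binomial deviance loss.
second_layer = GradientBoostingClassifier(
    loss='log_loss',   # binomial deviance; named 'deviance' in older scikit-learn versions
    n_estimators=98,
    max_leaf_nodes=16,
)
# X: one x_tul feature vector per logged impression; y: 1 = clicked, 0 = no click.
# second_layer.fit(X, y)
# Predicted click propensity of a candidate list:
# y_hat = second_layer.predict_proba(x_tul.reshape(1, -1))[0, 1]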

5.2 Feature Extraction

Recall that given a specific user u at time t our goal is to produce an ordered personalized list of K items l_ut = {i_1, i_2, ..., i_K}. Our second layer learns the click propensity of a list l_ut using a contextual vector x_tul. Here we consider x_tul and the features used to construct it.

“First Layer” or Collaborative Filtering.

Foremost of our features are the predicted ratings of the list’s items as generated by our first layer (the collaborative filtering layer). We denote by r_uj the predicted rating for a specific user u on the item at position j. The first K features in x_tul are the ratings of each item. Namely, x_tul = (r_u1, r_u2, ..., r_uK, ...). Note that in contrast to most ranking algorithms, this formulation doesn’t assume a descending order on the items and allows the model to learn any possible order. For example, for horizontal recommendation lists we may learn that the best items should be placed in the middle of the list rather than at the beginning.

In this work the r_uj values are the probabilities for the user to purchase the item, given by a probabilistic matrix factorization model as explained in [17]. Nevertheless, the approach in this paper is not limited to any specific type of CF model and can even benefit from integrating several different CF models. With these features the contextual vector x_tul grows linearly with K. Generally, the number of items that can be simultaneously presented to a user is limited and the list’s length K is not too long. In our case K = 2 for the main dash list and K = 20 for the recommendations dash list.

Similarity/Diversity Features.
As shown in Section 4.1, CTR depends on interaction effects between predicted item rating, item similarities and list position. We can learn these complex interactions by encoding item-to-item similarity features in x_tul. In this paper we used the Jaccard similarity between item i and item j, given by

\[
\text{sim}_{i,j} = \frac{|U_i \cap U_j|}{|U_i \cup U_j|},
\]

where U_i and U_j are the sets of users that purchased items i and j respectively. The total number of item-to-item similarities grows quadratically in K, but using feature selection techniques we found that for our purposes, similarities between adjacent items in the list are sufficient. Hence, the number of features stays linear in K and the next K−1 features in x_tul are: x_tul = (r_u1, ..., r_uK, sim_{1,2}, sim_{2,3}, ..., sim_{K−1,K}, ...).

Item Fatigue.
The relationship between CTR and item fatigue is discussed in Section 4.2. In x_tul we employ two types of fatigue features: For a user u and item i at time t we (a) count the number of times that u has been exposed to i in the previous week, denoted by c_tui, and (b) measure the time duration (in minutes) since i’s last impression to u, denoted by m_tui. The fatigue features for each item in l are included in x_tul according to the item’s position, in a similar fashion to the previous features.

Trendiness and Temporal Features.

Typically, there is an increased public interest in recently released items. This phenomenon can be attributed to external advertisement campaigns and word-of-mouth “viral” effects. We therefore denote by d_i the number of days since item i’s release date and include these features in x_tul for each item in the list according to its position. Based on our experience, we also know that user behavior tends to change between weekdays and weekends [2]. Hence, we included an additional binary feature is_weekend_t that indicates whether time t is a weekend or not. Finally, as discussed in Section 4.3, many users have a preferred time of day in which they are more likely to consume a recommendation. By dividing the day into six slots, each representing a period of four hours, we find users’ “favorite” time-slots as discussed in Section 4.3 and include a binary is_best_timeslot_tu feature for the user u at time t.

User Features.

We included features that aggregate a user’s historical behavior by looking at the items the user clicked on in the past year and extracting user-specific features. With slight overloading of notation, we denote by d_u the average number of days between an item’s release date and the date on which user u clicked on it. Additionally, for each item i in the list we denote by pd_iu the price difference between item i and the average price of items that u has purchased in the past. We include pd_iu for each item and d_u in x_tul.

Popularity Features.

We also consider items’ aggregate characteristics averaged over the past month. We denote by ctr_i and pc_i item i’s monthly average CTR and daily purchase count, respectively, and include these in x_tul for each item in the list.
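Putting the features of this section together, the context vector x_tul could be assembled roughly as sketched below. Every helper on the hypothetical feats object (predicted_rating, impressions_last_week, and so on) is a placeholder for the corresponding statistic described above, not an actual API.

import numpy as np

def build_context_vector(user, item_list, t, feats):
    """feats bundles the per-item and per-user statistics of Section 5.2 (placeholders)."""
    x = [feats.predicted_rating(user, i) for i in item_list]                   # r_u1, ..., r_uK
    x += [feats.jaccard(i, j) for i, j in zip(item_list, item_list[1:])]       # adjacent similarities
    for i in item_list:                                                        # fatigue: c_tui, m_tui
        x += [feats.impressions_last_week(user, i, t),
              feats.minutes_since_last_impression(user, i, t)]
    x += [feats.days_since_release(i, t) for i in item_list]                   # trendiness d_i
    x += [float(feats.is_weekend(t)), float(feats.is_best_timeslot(user, t))]  # temporal flags
    x += [feats.price_difference(user, i) for i in item_list]                  # pd_iu
    x += [feats.avg_days_to_click(user)]                                       # d_u
    for i in item_list:                                                        # popularity: ctr_i, pc_i
        x += [feats.monthly_ctr(i), feats.daily_purchase_count(i)]
    return np.asarray(x, dtype=float)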

5.3 List Recommendation Policy

So far we have described a method to estimate the click propensity ŷ_tul of a list based on its characterizing feature vector x_tul. We will now describe a recommendation policy which will utilize these predictions in order to choose the actual recommendation lists to be presented to users. A deterministic policy might present each user u with the list that maximizes ŷ_tul in Equation (1). However, this “exploitation” strategy lacks any freshness and may result in very similar lists being repeatedly presented to the user. The fatigue features will prevent the lists from being identical, yet we wish to facilitate more control over diversity and possibly some exploration.


Hence, we define a stochastic list recommendation policy that utilizes the ŷ_tul predictions in order to generate better recommendation lists. We denote our list recommendation policy by Φ, and φ^t_u(l) denotes the probability that user u will be presented a list l at time t according to policy Φ. We denote Xbox’s current production policy by Π, and π_u(l) denotes the probability that user u will be presented a list l according to Π. Note that Π’s list probabilities are not time dependent.

We begin by defining a naive policy Φ using the Boltzmann distribution over the lists as follows: When user u requires a recommendation list at time t, our policy Φ will sample a list l from the set of all possible lists L according to probabilities φ^t_u(l) given by

\[
\varphi^t_u(l) = \frac{e^{\alpha \hat{y}_{tul}}}{\sum_{k \in \mathcal{L}} e^{\alpha \hat{y}_{tuk}}}. \qquad (2)
\]

The non-negative parameter α controls exploitation, with α = 0 for uniform sampling and α → +∞ for maximal exploitation.

According to the definition of Φ in Equation 2, each possible list (i.e., each combination of items) has some probability to be presented to the user, given by φ^t_u(l). Therefore, this initial formulation extrapolates predictions over many unobserved list combinations. On the other hand, the production policy Π employs several heuristics and business rules that filter many items and considers only a subset of lists for each user. For example, by design, Π will never recommend a list containing individual items that we predict the user will dislike. This is one example of many similar business rules heuristically employed by Π and optimized using online experimentation. Reviewing the business rules employed by our production policy Π is sensitive and, regardless, it cannot be covered in the context of this paper due to space limitations.

It is clear that the extrapolation defined in Equation 2 includes many cases of bad recommendation lists. However, our training data is based on the policy Π and doesn’t have support over the entire feature space. Therefore, this extrapolation includes many unreliable predictions that should not be considered at all. Moreover, the extrapolation in Equation 2 prohibits any reasonable estimation of Φ’s performance using the off-policy evaluation techniques that we develop in Section 6.2. Therefore, we constrain Φ to use similar heuristics as Π, which will prevent it from extrapolating predictions over lists that were never observed in the past. We denote by L_u ⊂ L the subset of lists that user u was exposed to by Π, and alter our definition of Φ to consider only lists in L_u as follows:

\[
\varphi^t_u(l) =
\begin{cases}
\dfrac{e^{\alpha \hat{y}_{tul}}}{\sum_{k \in \mathcal{L}_u} e^{\alpha \hat{y}_{tuk}}}, & \text{if } l \in \mathcal{L}_u \\[2ex]
0, & \text{otherwise.}
\end{cases} \qquad (3)
\]

Now, policy Φ is constrained to present each user a list from the same subset of lists as Π would have shown her. This change serves two goals: First, it prevents the system from “drifting” away and suggesting poor recommendation lists due to unreliable extrapolation. Second, the off-policy evaluation technique we employ in Section 6.2 requires that the evaluated policy have the same probability support over lists as the production policy. The definition in Equation 3 enables us to evaluate Φ using logs from Π.
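A minimal sketch of sampling from the constrained policy in Equation 3, assuming we already have the predictions ŷ_tul for every list in L_u (the lists the production policy has exposed to user u):

import numpy as np

def sample_list(candidate_lists, y_hat, alpha, rng=np.random.default_rng()):
    """candidate_lists: the lists in L_u (those the production policy exposed to user u).
    y_hat: predicted click propensities ŷ_tul for those lists.
    alpha >= 0 controls exploitation; alpha = 0 gives uniform sampling over L_u."""
    scores = alpha * np.asarray(y_hat, dtype=float)
    scores -= scores.max()                      # numerical stability for the softmax
    probs = np.exp(scores) / np.exp(scores).sum()
    idx = rng.choice(len(candidate_lists), p=probs)
    return candidate_lists[idx], probs[idx]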

6. OFFLINE VS. OFF-POLICY EVALUATION

We use classical offline evaluation techniques in order to evaluate the ability of the GBT model to generalize the second-layer learning task. Offline evaluation is important in order to quantify and optimize the learning task, however it does not provide a method to estimate the potential effect on the system’s CTR. Therefore, in Section 6.2 we present a novel adaptation of Inverse Propensity Scoring (IPS) [25] in order to estimate the CTR if policy Φ were to replace the production policy Π.

6.1 Offline Evaluation

As explained above, we utilize predictions from our “first layer” MF model together with additional features in order to create a contextual feature vector x_tul. Our “second layer” uses these contextual vectors x_tul in order to predict if a list will be clicked or not. Our offline evaluation is based on comparing the Receiver Operating Characteristic (ROC) curves of different classifiers and measuring the Area Under the Curve (AUC). Results are presented in Section 7.1.

6.2 Off-Policy Evaluation

In order to evaluate a recommendation policy like Φ, one may implement the policy and perform a controlled experiment (A/B testing). However, implementing a new recommender system on Xbox, as on any other real-world large scale system, is expensive and time consuming. Hence, we wish to have an evaluation of Φ using a dataset D generated by the current production policy, or logging policy, Π. We present here a novel adaptation of Inverse Propensity Scoring (IPS) [25] designed to estimate Φ’s improvement over Π using off-policy techniques.

We denote by E^ut_Φ[y_tu] the expected CTR for user u at time t under policy Φ. Note that Φ is a stateful policy, i.e., the probabilities φ^t_u(l) depend on the user and change with time. We therefore wish to compute an expectation of E^ut_Φ[y_tu] over the time interval [0, T]. Let A_u be a user impression event which occurs every time user u appears in the system. We denote by p(A_u|t) the probability of an impression A_u given the time t. It can be shown that the expected CTR for user u over [0, T] is equivalent to the expected number of clicks in [0, T] divided by the expected number of impressions:

\[
E^u_\Phi[y_{tu}] = \frac{\int_{t=0}^{T} p(A_u|t)\, E^{ut}_\Phi[y_{tu}]\, dt}{\int_{t=0}^{T} p(A_u|t)\, dt}. \qquad (4)
\]

Next, we estimate E^u_Φ[y_tu] as follows:

\[
E^u_\Phi[y_{tu}]
= \frac{\int p(A_u|t)\, E^{ut}_\Phi[y_{tu}]\, dt}{\int p(A_u|t)\, dt}
= \frac{\int p(A_u|t)\, E^{ut}_\Pi\!\left[\frac{\varphi^t_u(l)\, y_{tul}}{\pi_u(l)}\right] dt}{\int p(A_u|t)\, dt}
= E^{u}_\Pi\!\left[\frac{\varphi^t_u(l)\, y_{tul}}{\pi_u(l)}\right]
\approx \frac{1}{|\mathcal{D}_u|} \sum_{y_{tul} \in \mathcal{D}_u} \frac{\varphi^t_u(l)\, y_{tul}}{\pi_u(l)}, \qquad (5)
\]

where D_u are user u’s ratings in the dataset D. The second equality in Equation 5 follows from:

\[
E^{ut}_\Phi[y_{tu}]
= \sum_{l \in \mathcal{L}_u} \varphi^t_u(l)\, y_{tul}
= \sum_{l \in \mathcal{L}_u} \pi_u(l)\, \frac{\varphi^t_u(l)\, y_{tul}}{\pi_u(l)}
= E^{ut}_\Pi\!\left[\frac{\varphi^t_u(l)\, y_{tul}}{\pi_u(l)}\right]. \qquad (6)
\]

Equation 6 is key for the off-policy evaluation as it allows replacing an expectation over Φ with an expectation over the logging policy Π.


Figure 4: ROC curves for different classification algorithms; the AUC value of each is noted in the legend. All AUC measures have significance levels (p-values) ≤ 0.0001.
(a) Main Dash AUC: GBT 0.9015, Kernel SVM 0.7823, Kernel LR 0.7696, LR 0.7591, SVM 0.7303.
(b) Recommendations Dash AUC: GBT 0.7212, Kernel SVM 0.6764, Kernel LR 0.6685, LR 0.6659, SVM 0.6503.

This equation is valid only if both policies have the same probability support over lists. By design this is indeed the case.

The third equality in Equation 5 follows from using the policy Π in the definition given by Equation 4. It is true under the reasonable assumption that the probability for an impression event p(A_u|t) is independent of the recommendation policy. Finally, the last equality in Equation 5 is simply an empirical estimate of the expectation E^u_Π[φ^t_u(l) y_tul / π_u(l)].

Relation to Inverse Propensity Scoring.

Our estimate of E^u_Φ[y_tu] bears resemblance to the Inverse Propensity Scoring (IPS) of [25]. In fact, it can be considered a specific case of IPS. However, despite reaching a very similar result, both papers employ different derivations in order to justify it. Moreover, our version differs from [25] in several other key aspects: First, in our case the logging policy Π has sufficient support over all lists by way of design. When compared to the IPS description in [25], this means that we may drop the max operation in [25]’s estimator and use τ = 0 (Equation 2 in [25]). Furthermore, we are familiar with the logging policy and assume accurate estimates of its probabilities π_u(l). Hence, the “regret” term given by Equation 5 in [25] is assumed to be zero and our estimates are guaranteed to be unbiased. In both versions the logging policy is time independent, however in [25] the evaluated policy is deterministic whereas in our case we evaluate a non-deterministic stateful policy. While this extension does not incur any estimation bias, we do not provide any bounds on its accuracy and convergence with respect to the dataset size. In fact, the per-user E^u_Φ[y_tu] estimates are potentially very noisy. However, as we show next, the final measures we report are stable and overcome this hurdle by averaging the E^u_Φ[y_tu] estimates of 1.4 million users.

7. RESULTS

We present here the offline and off-policy evaluation results of our method using the datasets from Xbox’s main dash and recommendations dash.

7.1 Offline Results

As explained in Section 5.1, we predict the click propensity ŷ_tul from a context vector x_tul using a GBT classifier. We compare this classifier to baselines employing Logistic Regression (LR) and Support Vector Machines (SVM). In their basic form, these are linear models which are inferior to GBT in their ability to capture complex interactions in the input space. We therefore add an evaluation using a kernelized version of these baselines utilizing a polynomial kernel with degree d = 4.

Figure 4 depicts the ROC curves of the different algorithms. The corresponding AUC values are noted in the legend. In both the main dash and the recommendations dash, GBT is the most appropriate for the prediction task at hand. The kernelized SVM algorithm has the best performance out of the baselines. In general, the kernelized version of each baseline always outperforms the linear version. This implies the existence of non-linear complex interactions between features in the input space. The prediction task seems somewhat harder in the recommendations dash, where AUC values are lower compared to the main dash. However, as we show next, both the main dash and the recommendations dash can benefit dramatically by employing the method described in this paper, which will yield a significant increase in CTR.

7.2 Off-policy Results
Our goal is to predict the CTR improvement of the Xbox system if we were to use Φ instead of Π. We estimate the overall CTR by aggregating the per-user estimates according to their relative share in the dataset as follows:

\[
\mathrm{CTR}_\Phi = \frac{1}{|\mathcal{D}|} \sum_{u \in U} |\mathcal{D}_u| \cdot E^u_\Phi[y_{tu}], \qquad (7)
\]

where U is the set of all users in our dataset D. The measure CTR_Φ is an estimate of the overall CTR using Φ. Naturally, CTR_Φ emphasizes heavy users who drive more clicks than others. It is common practice in recommender systems research to define measures in which all users are weighted equally in the test-set regardless of their activity level.


Figure 5: CTR lift values for the main dash and the recommendations dash as a function of α.
(a) Main Dash, for α = 0.5, 1, 2, 10, 100: CTR lift 1.03, 1.08, 1.45, 2.67, 3.02; Avg. CTR lift 0.99, 1.03, 1.20, 1.70, 1.79.
(b) Recommendations Dash, for α = 0.5, 1, 2, 10, 100: CTR lift 1.16, 1.27, 1.50, 2.46, 2.71; Avg. CTR lift 1.06, 1.10, 1.19, 1.41, 1.41.

That was the case in many prominent competitions such as the Netflix Prize [1] and KDD Cup’11 [6]. We therefore define a secondary measure which measures the average CTR per user:

\[
\mathrm{Avg\,CTR}_\Phi = \frac{1}{|U|} \sum_{u \in U} E^u_\Phi[y_{tu}]. \qquad (8)
\]

We report results in terms of lift over our current production system: lift = Estimated CTR / Production CTR. As you may recall from Section 3, the main dash dataset is extremely unbalanced (very low CTR) and we balanced it by uniformly sampling the negative examples. In this evaluation we used the unbalanced version of the test-set in order to get a real estimate of the CTR which is also comparable to the current production CTR. We did not change our training-set and all previous results and discussions are still valid.
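The aggregation in Equations 7 and 8 can then be sketched as follows, with the per-user estimate of Equation 5 inlined; per_user_impressions is a hypothetical mapping from each user to that user's logged (y, φ, π) triples.

def ctr_measures(per_user_impressions):
    """per_user_impressions: dict mapping each user to that user's (y, phi, pi) triples,
    i.e. the label y_tul, φ_u^t(l) and π_u(l) of every logged impression."""
    def user_estimate(rows):
        # Per-user estimate of Equation 5.
        return sum(phi * y / pi for y, phi, pi in rows) / len(rows)
    per_user = {u: user_estimate(rows) for u, rows in per_user_impressions.items()}
    total = sum(len(rows) for rows in per_user_impressions.values())
    # Equation (7): overall CTR, weighting each user by its share of impressions.
    ctr_phi = sum(len(rows) * per_user[u] for u, rows in per_user_impressions.items()) / total
    # Equation (8): average per-user CTR, weighting all users equally.
    avg_ctr = sum(per_user.values()) / len(per_user)
    return ctr_phi, avg_ctr

# Reported lift = estimated CTR / production CTR.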

Figure 5 summarizes the CTR lifts on both datasets for different values of α. As expected, the results improve with higher values of α: At α = 0 the lists are chosen uniformly and the lift values are lower than 1, indicating a reduction in CTR with respect to the production policy. At α = 100 the system is in “full exploitation” mode and the best list (in terms of ŷ_tul) is almost certainly chosen. When comparing the main dash to the recommendations dash we see that the lift improvement in the main dash is much better. This is in accordance with the offline results as seen in Figure 4. Finally, we note that the overall CTR values exhibit higher improvement rates than the per-user values. Our training-set does not balance data instances per user, hence users with many ratings will have better rating predictions. This explains the higher values for the overall CTR.

Additional Baselines.

Thus far we showed significant lift values for the approach presented in this paper when compared to Xbox’s current recommender system. Xbox’s system applies many heuristics tightly tuned to optimize its CTR and serves as the first baseline for this paper. Here, we wish to compare our approach with additional prominent approaches taken from previous studies. Due to the lack of relevant datasets, most previous works in the context of collaborative filtering try to heuristically balance between the predicted accuracy and other factors such as diversity [13, 35]. The current Xbox system also employs heuristics which are already finely tuned for CTR optimization. Therefore, it is very hard to show any significant CTR lift using other heuristic approaches.

However, we found that approaches based on multi-objective optimization techniques [10, 20] can in fact show improvement on top of the current system.

The first baseline utilizes constrained multi-objective optimization. It attempts to maximize the predicted ratings subject to constraints on the relevant features. The features are the same as in the context vector x_tul except for the first K rating features, which are used in the objective of the optimization. Formally, for a specific user u at time t, let f^l_i be the i’th feature in x_tul after the first K rating features. Namely, x_tul = (r_u1, r_u2, ..., r_uK, f^l_1, f^l_2, ...).² Our first baseline finds a list l of K items that optimizes the following objective:

\[
\max_{l \in \mathcal{D}_u} \sum_{k=1}^{K} r_{uk} \quad \text{s.t.} \quad \forall i : T^{\min}_i \le f^l_i \le T^{\max}_i, \qquad (9)
\]

where T^min_i and T^max_i are minimal and maximal thresholds for the i’th feature. A second baseline is based on maximizing the predicted ratings as well as a weighted sum of the features as follows:

\[
\max_{l \in \mathcal{D}_u} \sum_{k=1}^{K} r_{uk} + \sum_{i} w_i f^l_i. \qquad (10)
\]

The threshold values T^min_i and T^max_i from Equation 9 and the weights w_i from Equation 10 are parameters that were chosen to optimize the CTR using Nelder-Mead optimization [16].
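A sketch of the weighted-sum baseline of Equation 10 and its Nelder-Mead tuning is given below; evaluate_ctr is a hypothetical function that replays a validation log and returns the CTR obtained with a given weight vector.

import numpy as np
from scipy.optimize import minimize

def pick_list(candidate_lists, ratings, features, weights):
    """Equation 10: choose the list maximizing the sum of its predicted ratings plus a
    weighted sum of its remaining features. Each candidate list is a tuple of item ids;
    ratings[l] holds that list's K predicted ratings and features[l] its feature vector f^l."""
    def score(l):
        return float(sum(ratings[l]) + np.dot(weights, features[l]))
    return max(candidate_lists, key=score)

def tune_weights(evaluate_ctr, num_features):
    # evaluate_ctr(weights) -> CTR obtained on a validation log (assumed helper).
    result = minimize(lambda w: -evaluate_ctr(w),
                      x0=np.zeros(num_features),
                      method='Nelder-Mead')
    return result.x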

Table 2 compares these baselines with the approach presented in this paper. Optimization based on Equation 10 consistently yields better results than Equation 9. Nevertheless, the approach presented in this paper outperforms both baselines. We observe that the main dash lift values are better than those of the recommendations dash and that the overall CTR values exhibit higher improvement rates than the per-user values. Both observations are consistent with those of Figure 5.

Finally, Figure 6 depicts a breakdown of the lift values for users with different numbers of purchased Xbox games (from the dataset used for the first layer). The approach presented in this paper outperforms the baselines in most cases, with a clear trend of improvement as the number of items increases (i.e., for “heavy users”).

²We suppress the indexes u and t from f^l_i for brevity.


Figure 6: A breakdown of CTR lift values for users with different numbers of purchased Xbox games. Panels: (a) Main Dash; (b) Recommendations Dash. x-axis: items per user (0–40); y-axis: CTR lift; series: This Paper, Equation 9, Equation 10.

Main Dash
              This Paper   Equation 9   Equation 10
CTR_Φ            3.02         1.28         1.36
Avg CTR          1.79         1.16         1.21

Recommendations Dash
              This Paper   Equation 9   Equation 10
CTR_Φ            2.71         1.13         1.19
Avg CTR          1.41         1.06         1.08

Table 2: Expected CTR and average CTR lift values using the approach proposed in this paper (α = 100) vs. alternative approaches based on multi-objective optimization.

The matrix factorization model uses purchases as its training signal, hence its predicted ratings are more accurate for users with more purchases. Furthermore, “heavy users” with more games typically tend to spend more time in front of their consoles and click on more recommendations. Therefore, click events from these users are more common in our training-set for the second layer, which translates to better lift values as discussed earlier.

8. CONCLUSIONS

This paper discusses the list recommendation problem and proposes a second learning layer on top of traditional CF approaches. The benefits of our approach are empirically illustrated using a novel modification of Inverse Propensity Scoring (IPS). We believe these results showcase the need for recommendation list optimization and indicate that there is much room for improvement on top of current state-of-the-art CF techniques.

9. REFERENCES
[1] J. Bennett and S. Lanning. The Netflix Prize. In Proc. KDD Cup and Workshop, 2007.
[2] Haibin Cheng and Erick Cantu-Paz. Personalized click prediction in sponsored search. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 351–360, 2010.
[3] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the 4th ACM Conference on Recommender Systems, pages 39–46, 2010.
[4] Thomas G. Dietterich. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, MCS '00, pages 1–15, 2000.
[5] Gideon Dror, Noam Koenigstein, and Yehuda Koren. Yahoo! music recommendations: Modeling music ratings with temporal dynamics and item taxonomy. In Proceedings of the 5th ACM Conference on Recommender Systems, 2011.
[6] Gideon Dror, Noam Koenigstein, Yehuda Koren, and Markus Weimer. The Yahoo! music dataset and KDD-Cup'11. In Proceedings of KDD Cup 2011, 2011.
[7] Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
[8] Thore Graepel, Joaquin Q. Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 13–20, 2010.
[9] Y. F. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In IEEE International Conference on Data Mining, 2008.
[10] Tamas Jambor and Jun Wang. Optimizing multiple objectives in collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys '10, pages 55–62, 2010.
[11] Noam Koenigstein and Ulrich Paquet. Xbox movies recommendations: Variational Bayes matrix factorization with embedded feature selection. In Proceedings of the 7th ACM Conference on Recommender Systems, 2013.
[12] Yehuda Koren. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 447–456, 2009.
[13] Neal Lathia, Stephen Hailes, Licia Capra, and Xavier Amatriain. Temporal diversity in recommender systems. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 210–217, 2010.


[14] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.
[15] H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pages 1222–1230, 2013.
[16] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4), 1965.
[17] Ulrich Paquet and Noam Koenigstein. One-class collaborative filtering with random graphs. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 999–1008, 2013.
[18] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the International Conference on Machine Learning (ICML), 2008.
[19] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI '09: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
[20] Mario Rodriguez, Christian Posse, and Ethan Zhang. Multiple objective optimization in recommender systems. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys '12, 2012.
[21] Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 253–260, 2002.
[22] Benyah Shaparenko, Ozgur Cetin, and Rukmini Iyer. Data-driven text features for sponsored search click prediction. In Proceedings of the Third International Workshop on Data Mining and Audience Intelligence for Advertising, ADKDD '09, pages 46–54, 2009.
[23] Yue Shi, Alexandros Karatzoglou, Linas Baltrunas, Martha Larson, Nuria Oliver, and Alan Hanjalic. CLiMF: Learning to maximize reciprocal rank with collaborative less-is-more filtering. In Proceedings of the 6th ACM Conference on Recommender Systems, RecSys '12, pages 139–146, 2012.
[24] Yue Shi, Martha Larson, and Alan Hanjalic. List-wise learning to rank with matrix factorization for collaborative filtering. In Proceedings of the 4th ACM Conference on Recommender Systems, pages 269–272, 2010.
[25] Alexander L. Strehl, John Langford, Lihong Li, and Sham Kakade. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems 23, pages 2217–2225, 2010.
[26] Liang Tang, Yexi Jiang, Lei Li, and Tao Li. Ensemble contextual bandits for personalized recommendation. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 73–80, 2014.
[27] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pages 1587–1594, 2013.
[28] Hastagiri P. Vanchinathan, Isidor Nikolic, Fabio De Bona, and Andreas Krause. Explore-exploit in top-n recommender systems via Gaussian processes. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 225–232, 2014.
[29] Jun Wang and Jianhan Zhu. Portfolio theory of information retrieval. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–122. ACM, 2009.
[30] Markus Weimer, Alexandros Karatzoglou, Quoc Viet Le, and Alexander J. Smola. CofiRank - maximum margin matrix factorization for collaborative ranking. In Advances in Neural Information Processing Systems 20, pages 1593–1600, 2008.
[31] Liang Xiang, Quan Yuan, Shiwan Zhao, Li Chen, Xiatian Zhang, Qing Yang, and Jimeng Sun. Temporal recommendation on graphs via long- and short-term preference fusion. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 723–732, 2010.
[32] Liang Xiong, Xi Chen, Tzu-Kuo Huang, Jeff G. Schneider, and Jaime G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In Proceedings of the SIAM International Conference on Data Mining, SDM 2010, pages 211–222, 2010.
[33] Yuan Cao Zhang, Diarmuid O Seaghdha, Daniele Quercia, and Tamas Jambor. Auralist: Introducing serendipity into music recommendation. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12, pages 13–22, 2012.
[34] T. Zhou, Z. Kuscsik, J.G. Liu, M. Medo, J.R. Wakeling, and Y.C. Zhang. Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences (PNAS), 107(10):4511–4515, 2010.
[35] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. Improving recommendation lists through topic diversification. In Proceedings of the 14th International Conference on World Wide Web, pages 22–32, 2005.
