

Scalable Realistic Recommendation Datasets through Fractal Expansions

Francois Belletti, Google AI
[email protected]

Karthik Lakshmanan, Google AI
[email protected]

Walid Krichene, Google AI
[email protected]

Yi-Fan Chen, Google AI
[email protected]

John Anderson, Google AI
[email protected]

Abstract—Recommender System research suffers currently from a disconnect between the size of academic data sets and the scale of industrial production systems. In order to bridge that gap we propose to generate more massive user/item interaction data sets by expanding pre-existing public data sets.

User/item incidence matrices record interactions between users and items on a given platform as a large sparse matrix whose rows correspond to users and whose columns correspond to items. Our technique expands such matrices to larger numbers of rows (users), columns (items) and non-zero values (interactions) while preserving key higher order statistical properties.

We adapt the Kronecker Graph Theory to user/item incidence matrices and show that the corresponding fractal expansions preserve the fat-tailed distributions of user engagements, item popularity and singular value spectra of user/item interaction matrices. Preserving such properties is key to building large realistic synthetic data sets which in turn can be employed reliably to benchmark Recommender Systems and the systems employed to train them.

We provide algorithms to produce such expansions and apply them to the MovieLens 20 million data set comprising 20 million ratings of 27K movies by 138K users. The resulting expanded data set has 10 billion ratings, 864K items and 2 million users in its smaller version and can be scaled up or down. A larger version features 655 billion ratings, 7 million items and 17 million users.

Index Terms—Machine Learning, Deep Learning, Recommender Systems, Graph Theory, Simulation

I. INTRODUCTION

Machine Learning (ML) benchmarks compare the capabilities of models, distributed training systems and linear algebra accelerators on realistic problems at scale. For these benchmarks to be effective, results need to be reproducible by many different groups, which implies that publicly shared data sets need to be available.

Unfortunately, while Recommendation Systems constitute a key industrial application of ML at scale, large public data sets recording user/item interactions on online platforms are not yet available. For instance, although the Netflix data set [5] and the MovieLens data set [18] are publicly available, they are orders of magnitude smaller than proprietary data [3], [11], [54].

Proprietary data sets and privacy: While releasing large anonymized proprietary recommendation data sets may seem an acceptable solution from a technical standpoint, it is a non-trivial problem to preserve user privacy while still maintaining useful characteristics of the dataset. For instance, [36] shows a privacy breach of the Netflix prize dataset. More importantly, publishing anonymized industrial data sets runs counter to user expectations that their data may only be used in a restricted manner to improve the quality of their experience on the platform.

TABLE I: Size of MovieLens 20M [18] vs. industrial dataset in [54].

                   MovieLens 20M    Industrial
  #users           138K             Hundreds of Millions
  #items           27K              2M
  #topics          19               600K
  #observations    20M              Hundreds of Billions

Therefore, we decide not to make user data more broadly available in order to preserve the privacy of users. We instead choose to produce synthetic yet realistic data sets whose scale is commensurable with that of our production problems while only consuming already publicly available data.

Producing a realistic MovieLens 10 billion+ dataset: In this work, we focus on the MovieLens dataset, which only entails movie ratings posted publicly by users of the MovieLens platform. The MovieLens data set has now become a standard benchmark for academic research in Recommender Systems; [1], [7], [19], [21], [25], [31], [33], [37], [43], [47], [48], [53], [55], [57] are only a few of the many recent research articles relying on MovieLens, whose latest version [18] has accrued more than 800 citations according to Google Scholar. Unfortunately, the data set comprises only a few observed interactions and, more importantly, a very small catalogue of users and items when compared to industrial proprietary recommendation data.

In order to provide a new data set, more aligned with the needs of production scale Recommender Systems, we aim at expanding publicly available data by creating a realistic surrogate. The following constraints help create a production-size synthetic recommendation problem that is similar to, and at least as hard an ML problem as, the original one for matrix factorization approaches to recommendations [19], [24]:

• orders of magnitude more users and items are present in the synthetic dataset;


Fig. 1: Key first and second order properties of the original MovieLens 20m user/item rating matrix (after centering and re-scaling into [−1, 1]) we aim to preserve while synthetically expanding the data set. Top: item popularity distribution (total ratings of each item). Middle: user engagement distribution (total ratings of each user). Bottom: dominant singular values of the rating matrix (core to the difficulty of matrix factorization tasks). In all log/log plots the small fraction of non-positive row-wise and column-wise sums are removed.

• the synthetic dataset is realistic in that its first and second order statistics match those of the original dataset presented in Figure 1.

Key first and second order statistics of interest we aim to preserve are summarized in Figure 1; the details of their computation are given in Section IV.
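These first and second order statistics are straightforward to compute from a sparse rating matrix. The sketch below is our illustration, not the authors' code; the function name and the use of SciPy's sparse SVD are assumptions. It computes row-wise sums (user engagement), column-wise sums (item popularity) and the dominant singular values:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

def summary_statistics(R, k=16):
    """Row-wise sums, column-wise sums and top-k singular values
    of a sparse user/item rating matrix R, each sorted descending."""
    user_sums = np.asarray(R.sum(axis=1)).ravel()  # user engagement
    item_sums = np.asarray(R.sum(axis=0)).ravel()  # item popularity
    # svds returns the k dominant singular values in ascending order.
    _, s, _ = svds(R.astype(np.float64), k=k)
    return (np.sort(user_sums)[::-1],
            np.sort(item_sums)[::-1],
            np.sort(s)[::-1])

# Toy sparse matrix standing in for the (138K, 27K) MovieLens matrix.
R = sparse.random(100, 40, density=0.1, format="csr", random_state=0)
user_s, item_s, sv = summary_statistics(R, k=5)
```

Plotting the three sorted arrays on log/log axes would reproduce the kind of ranked distributions shown in Figure 1.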

Adapting Kronecker Graph expansions to user/item feedback: We employ the Kronecker Graph Theory introduced in [27] to achieve a suitable fractal expansion of recommendation data to benchmark linear and non-linear user/item factorization approaches for recommendations [19], [24]. Consider a recommendation problem comprising m users and n items. Let R = (R_{i,j}), i = 1...m, j = 1...n, be the sparse matrix of recorded interactions (e.g. the rating left by user i for item j if any, and 0 otherwise). The key insight we develop in the present paper is that a carefully crafted fractal expansion of R can preserve high level statistics of the original data set while scaling its size up by multiple orders of magnitude.

Many different transforms can be applied to the matrix R, which can be considered a standard sparse 2-dimensional image. A recent approach to creating synthetic recommendation data sets consists in making parametric assumptions on user behavior by instantiating a user model interacting with an online platform [10], [45]. Unfortunately, such methods (even calibrated to reproduce empirical facts in actual data sets) do not provide strong guarantees that the resulting interaction data is similar to the original. Therefore, instead of simulating recommendations in a parametric user-centric way as in [10], [45], we choose a non-parametric approach operating directly in the space of user/item affinity. In order to synthesize a large realistic dataset in a principled manner, we adapt the Kronecker expansions which have previously been employed to produce large realistic graphs in [27]. We employ a non-parametric, analytically tractable simulation of the evolution of the user/item bi-partite graph to create a large synthetic data set. Our choice trades off realism for analytic tractability; we emphasize the latter.

While Kronecker Graph Theory is developed in [27], [28] on square adjacency matrices, the Kronecker product operator is well defined on rectangular matrices and therefore we can apply a similar technique to user/item interaction data sets, which was already noted in [28] but not developed extensively. The Kronecker Graph generation paradigm has to be adapted to the present data set in other respects however: we need to decrease the expansion rate to generate data sets at the scale we desire, not orders of magnitude too large. We need to do so while maintaining key conservation properties of the original algorithm [28].

In order to reliably employ Kronecker based fractal expansions on recommender system data, we devise the following contributions:

• we develop a new technique based on linear algebra to adapt fractal Kronecker expansions to recommendation problems;

• we demonstrate that key recommendation system specific properties of the original dataset are preserved by our technique;

• we also show that the resulting algorithm is scalable and easily parallelizable, as we employ it on the actual MovieLens 20 million dataset;

• we produce a synthetic yet realistic MovieLens 655 billion dataset to help recommender system research scale up its computational benchmarks for model training.

The present article is organized as follows: we first recapitulate prior research on ML for recommendations and large synthetic dataset generation; we then develop an adaptation of Kronecker Graphs to user/item interaction matrices and prove key theoretical properties; finally, we apply the resulting algorithm experimentally to the MovieLens 20m data and validate its statistical properties.


II. RELATED WORK

Recommender Systems constitute the workhorse of many e-commerce, social networking and entertainment platforms. In the present paper we focus on the classical setting where the key role of a recommender system is to suggest relevant items to a given user. Although other approaches such as content based recommendations [42] or social recommendations [9] are very popular, collaborative filtering remains a prevalent approach to the recommendation problem [32], [41], [44].

Collaborative filtering: The key insight behind collaborative filtering is to learn affinities between users and items based on previously collected user/item interaction data. Collaborative filtering exists in different flavors. Neighborhood methods group users by inter-user similarity and will recommend items to a given user that have been consumed by neighbors [42]. Latent factor methods such as matrix factorization [24] try to decompose user/item affinity as the result of the interaction of a few underlying representative factors characterizing the user and the item. Although other latent models have been developed and could be used to construct synthetic recommendation data sets (e.g. Principal Component Analysis [22] or Latent Dirichlet Allocation [8]), we focus on insights derived from matrix factorization.

The matrix factorization approach represents the affinity a_{i,j} between a user i and an item j with an inner product x_i^T y_j, where x_i and y_j are two vectors in R^d representing the user and the item respectively. Given a sparse matrix of user/item interactions R = (r_{i,j}), i = 1...m, j = 1...n, user and item factors can therefore be learned by approximating R with a low rank matrix X Y^T where X ∈ R^{m,k} entails the user factors and Y ∈ R^{n,k} contains the item factors. The data set R represents ratings as in the MovieLens dataset [18] or item consumption (r_{i,j} = 1 if and only if user i has consumed item j [5]). The matrix factorization approach is an example of a solution to the rating matrix completion problem, which aims at predicting the rating of an item j by a user i that has not been observed yet and corresponds to a value of 0 in the sparse original rating matrix. Such a factorization method learns an approximation of the data that preserves a few higher order properties of the rating matrix R. In particular, the low rank approximation tries to mimic the singular value spectrum of the original data set. We draw inspiration from matrix factorization to tackle synthetic data generation. The present paper adopts a similar approach to extend collaborative filtering data sets. Besides trying to preserve the spectral properties of the original data, we operate under the constraint of conserving its first and second order statistical properties.

Deep Learning for Recommender Systems: Collaborative filtering has known many recent developments, which motivates our objective of expanding public data sets in a realistic manner. Deep Neural Networks (DNNs) are now becoming common in both non-linear matrix factorization tasks [11], [19], [50] and sequential recommendations [17], [46], [56]. The mapping between user/item pairs and ratings is generally learned by training the neural model to predict user behavior on a large data set of previously observed user/item interactions.

DNNs consume large quantities of data and are computationally expensive to train; therefore they give rise to commonly shared benchmarks aimed at speeding up training. For training, a Stochastic Gradient Descent method is employed [26], which requires forward model computation and back-propagation to be run on many mini-batches of (user, item, score) examples. The matrix completion task still consists in predicting a rating for the interaction of user i and item j although (i, j) has not been observed in the original data set. The model is typically run on billions of examples as the training procedure iterates over the training data set.

Model freshness is generally critical to industrial recommendations [11], which implies that only limited time is available to re-train the model on newly available data. The throughput of the trainer is therefore crucial to providing more engaging recommendation experiences and presenting more novel items. Unfortunately, public recommendation data sets are too small to provide training-time-to-accuracy benchmarks that can be realistically employed for industrial applications. Too few different examples are available in MovieLens 20m for instance, and the number of different available items is orders of magnitude too small. In many industrial settings, millions of items (e.g. products, videos, songs) have to be taken into account by recommendation models. The recommendation model learns an embedding matrix of size (N, d) where d ∼ 10 to 10^3 and N ∼ 10^6 to 10^9 are typical values. As a consequence, the memory footprint of this matrix may dominate that of the rest of the model by several orders of magnitude. During training, the latency and bandwidth of the access to such embedding matrices have a prominent influence on the final throughput in examples/second. Such computational difficulties associated with learning large embedding matrices are worth solving in benchmarks. A higher throughput enables training models with more examples, which enables better statistical regularization and architectural expressiveness. The multi-billion interaction size of the data set used for training is also a major factor that affects modeling choices and infrastructure development in the industry.
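A quick back-of-envelope computation illustrates why the embedding matrix dominates the memory footprint; the concrete N and d below are hypothetical values chosen within the ranges quoted above:

```python
def embedding_gigabytes(N, d, bytes_per_float=4):
    """Memory footprint in GB of an (N, d) float32 embedding matrix."""
    return N * d * bytes_per_float / 1e9

# 100 million items embedded in 256 dimensions: about 102 GB of
# parameters for the embedding table alone.
footprint = embedding_gigabytes(100_000_000, 256)
```

At that size the table no longer fits in a single accelerator's memory, which is what makes access latency and bandwidth the throughput bottleneck described above.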

Our aim is therefore to enable a comparison of modeling approaches, software engineering frameworks and hardware accelerators for ML in the context of industry scale recommendations. In the present paper we focus on enabling a better evaluation of examples/sec throughput and training-time-to-accuracy for Neural Collaborative Filtering approaches [19] and Matrix Factorization approaches [24]. A major issue with this approach is, as we mentioned, the size of publicly available collaborative filtering data sets, which is orders of magnitude smaller than production grade data (see Table I) and has mis-representative orders of magnitude in terms of numbers of distinct users and items. The present paper offers a first solution to this problem by providing a simple and tractable non-parametric fractal approach to scaling up public recommendation data sets by several orders of magnitude.



Fig. 2: Typical user/item interaction patterns in recommendation data sets. Self-similarity appears as a natural key feature of the hierarchical organization of users and items into groups of various granularity.

Such data sets will help build a first set of benchmarks for model training accelerators based on publicly available data. We plan to publish the expanded data. However, MovieLens 20m is publicly available and our method can already be applied to recreate such an expanded data set. Meta-data is also common in industrial applications [3], [6], [11] but we consider its expansion outside the scope of this first development.

III. FRACTAL EXPANSIONS OF USER/ITEM INTERACTION DATA SETS

The present section delineates the insights orienting our design decisions when expanding public recommendation data sets.

1) Self-similarity in user/item interactions: Interactions between users and items follow a natural hierarchy in data sets where items can be organized into topics, genres, categories etc. [54]. There is for instance an item-level fractal structure in MovieLens 20m with a tree-like structure of genres, sub-genres, and directors. If users were clustered according to their demographics and tastes, another hierarchy would be formed [42]. The corresponding structured user/item interaction matrix is illustrated in Figure 2. The hierarchical nature of user/item interactions (topical and demographic) makes the recommendation data set structurally self-similar (i.e. patterns that occur at more granular scales resemble those affecting coarser scales [35]).

One can therefore build a user-group/item-category incidence matrix with user-groups as rows and item-categories as columns: a coarse interaction matrix. As each user group consists of many individuals and each item category comprises multiple movies, the original individual-level user/item interaction matrix may be considered an expanded version of the coarse interaction matrix. We choose to expand the user/item interaction matrix by extrapolating this self-similar structure and simulating its growth to yet another level of granularity: each original item is treated as a synthetic topic in the expanded data set and each actual user is considered a fictional user group.

A key advantage of this fractal procedure is that it may be entirely non-parametric and designed to preserve high level properties of the original dataset. In particular, a fractal expansion re-introduces the patterns originally observed in the entire real dataset within each block of local interactions of the synthetic user/item matrix. By carefully designing the way such blocks are produced and laid out, we can therefore hope to produce a realistic yet much larger rating matrix. In the following, we show how the Kronecker operator enables such a construction.

2) Fractal expansion through Kronecker products: The Kronecker product (denoted ⊗) is a non-standard matrix operator with an intrinsic self-similar structure:

              | a_11 B  ...  a_1n B |
    A ⊗ B  =  |  ...    ...   ...   |        (1)
              | a_m1 B  ...  a_mn B |

where A ∈ R^{m,n}, B ∈ R^{p,q} and A ⊗ B ∈ R^{mp,nq}. In the original presentation of Kronecker Graph Theory [27]

as well as the stochastic extension [34] and the extended theory [28], the Kronecker product is the core operator enabling the synthesis of graphs with exponentially growing adjacency matrices. As in the present work, the insight underlying the use of Kronecker Graph Theory in [28] is to produce large synthetic yet realistic graphs. The fractal nature of the Kronecker operator as it is applied multiple times (see Figure 2 in [28] for an illustration) fits the self-similar statistical properties of real world graphs such as the internet, the web or social networks [27].
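A small numerical sketch (ours, for illustration) shows both the self-similar block structure of equation (1) and a spectral fact that makes Kronecker expansions attractive for the statistics of Figure 1: the singular values of A ⊗ B are exactly the pairwise products of the singular values of A and B.

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.0, 1.0]])
B = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])

K = np.kron(A, B)  # shape (4, 6): block (i, j) of K equals A[i, j] * B

# The spectrum of the product is the outer product of the spectra.
sv_K = np.linalg.svd(K, compute_uv=False)
sv_AB = np.outer(np.linalg.svd(A, compute_uv=False),
                 np.linalg.svd(B, compute_uv=False)).ravel()
```

Because the expanded spectrum is a product of the original spectra, a fat-tailed singular value distribution in the factors yields a fat-tailed distribution in the expansion.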

If A is the adjacency matrix of the original graph, fractal expansions are created in [27] by chaining Kronecker products as follows:

A ⊗ A ⊗ ... ⊗ A.

As adjacency matrices are square, Kronecker Graphs have not been employed on rectangular matrices in pre-existing work, although the operation is well defined. Another slight divergence between present and pre-existing work is that, in their stochastic version, Kronecker Graphs carry Bernoulli probability distribution parameters in [0, 1], while MovieLens ratings are in {0.5, 1.0, ..., 5.0} originally and in [−1, 1] after we center and rescale them. We show that these differences do not prevent Kronecker products from preserving core properties of rating matrices. A more important challenge is the size of the original matrix we deal with: R ∈ R^{138×10^3, 27×10^3}. A naive Kronecker expansion would therefore synthesize a rating matrix with 19 billion users, which is too large.

Thus, although Kronecker products seem like an ideal candidate for the mechanism at the core of the self-similar synthesis


of a larger recommendation dataset, some modifications are needed to the algorithms developed in [28].

3) Reduced Kronecker expansions: We choose to synthesize a user/item rating matrix

    R' = R ⊗ R̄

where R̄ is a matrix derived from R but much smaller (for instance R̄ ∈ R^{128,256}). For reasons that will become apparent as we explore some theoretical properties of Kronecker fractal expansions, we want to construct a smaller derived matrix R̄ that shares similarities with R. In particular, we seek an R̄ with a similar row-wise sum distribution (user engagement distribution), column-wise sum distribution (item engagement distribution) and singular value spectrum (signal to noise ratio distribution in the matrix factorization).

4) Implementation at scale and algorithmic extensions: Computing a Kronecker product between two matrices A and B is an inherently parallel operation. It is sufficient to broadcast B to each element (i, j) of A and then multiply B by a_{i,j}. Such a property implies that scaling can be achieved. Another advantage of the operator is that even a single machine can produce a large output data set by sequentially iterating on the values of A. Only storage space commensurable with the size of the original matrix is needed to compute each block of the Kronecker product. It is noteworthy that generalized fractal expansions can be defined by altering the standard Kronecker product. We consider such extensions here as candidates to engineer more challenging synthetic data sets. One drawback though is that these extensions may not preserve analytic tractability.

A first generalization defines a binary operator ⊗_F with F : R × R^{m,n} × N → R^{p,q} as follows:

                | F(a_11, B, ω_11)  ...  F(a_1n, B, ω_1n) |
    A ⊗_F B  =  |       ...         ...        ...        |        (2)
                | F(a_m1, B, ω_m1)  ...  F(a_mn, B, ω_mn) |

where ω_11, ..., ω_mn is a sequence of pseudo-random numbers. Including randomization and non-linearity in F appears as a simple way to synthesize data sets entailing more varied patterns. The algorithm we employ to compute Kronecker products is presented in Algorithm 1. The implementation we employ is trivially parallelizable. We only create a list of Kronecker blocks to dump entire rows (users) of the output matrix to file. This is not necessary and can be removed to enable as many processes to run simultaneously and independently as there are elements in R (provided pseudo-random numbers are generated in parallel in an appropriate manner).

The only reason why we need a reduced version R̄ of R is to control the size of the expansion. Also, A ⊗ B and B ⊗ A are equal after a row-wise and a column-wise permutation. Therefore, another family of appropriate extensions may be obtained by considering

                | b_11 G(A, ω_11)  ...  b_1n G(A, ω_1n) |
    B ⊗_G A  =  |       ...        ...        ...       |        (3)
                | b_m1 G(A, ω_m1)  ...  b_mn G(A, ω_mn) |

Algorithm 1 Kronecker fractal expansion

    for i = 1 to m do
        kBlocks ← empty list
        for j = 1 to n do
            ω ← next pseudo-random number
            kBlock ← F(R(i, j), R̄, ω)
            kBlocks.append(kBlock)
        end for
        outputToFile(kBlocks)
    end for
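A runnable rendering of Algorithm 1 might look as follows. This is a sketch under our assumptions: the reduced matrix is named R_small, F defaults to the plain Kronecker block a * B, and block-rows are yielded in memory instead of being written to file.

```python
import numpy as np

def default_F(a, B, omega):
    """Plain Kronecker block a * B; omega is unused when F adds no
    randomization."""
    return a * B

def fractal_expansion_rows(R, R_small, F=default_F, seed=0):
    """Yield the expanded matrix one block-row at a time, following
    Algorithm 1: block (i, j) is F(R[i, j], R_small, omega_ij)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    for i in range(m):
        k_blocks = [F(R[i, j], R_small, rng.random()) for j in range(n)]
        yield np.hstack(k_blocks)  # one block-row of the output matrix

R = np.array([[1.0, 2.0],
              [3.0, 4.0]])
R_small = np.eye(2)
expanded = np.vstack(list(fractal_expansion_rows(R, R_small)))
```

With the default F, the concatenated block-rows reproduce the plain Kronecker product; only storage for one block-row is needed at any time, which matches the single-machine argument above.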

where G : R^{m,n} × N → R^{m',n'} is a randomized sketching operation on the matrix A which reduces its size by several orders of magnitude. A trivial scheme consists in sampling a small number of rows and columns from A at random. Other random projections [2], [14], [30] may of course be used. The randomized procedures above produce a user/item interaction matrix where there is no longer a block-wise repetitive structure. Less obvious statistical patterns can give rise to more challenging synthetic large-scale collaborative filtering problems.
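The trivial sampling scheme mentioned above could be sketched as follows; the operator name and signature are our assumptions for illustration:

```python
import numpy as np

def sample_sketch(A, n_rows, n_cols, omega):
    """Trivial randomized sketch G: keep a random subset of rows and
    columns of A, shrinking it by orders of magnitude."""
    rng = np.random.default_rng(omega)
    rows = rng.choice(A.shape[0], size=n_rows, replace=False)
    cols = rng.choice(A.shape[1], size=n_cols, replace=False)
    return A[np.ix_(rows, cols)]

A = np.arange(20.0).reshape(4, 5)
S = sample_sketch(A, n_rows=2, n_cols=3, omega=42)
```

Feeding a different ω to each block (i, j) in equation (3) yields a distinct sketch per block, which is what removes the block-wise repetitive structure.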

IV. STATISTICAL PROPERTIES OF KRONECKER FRACTAL EXPANSIONS

After having introduced Kronecker products to self-similarly expand a recommendation dataset into a much larger one, we now demonstrate how the resulting synthetic user/item interaction matrix shares crucial common properties with the original.

1) Salient empirical facts in MovieLens data: First, we introduce the critical properties we want to preserve. Note that throughout the paper we present results on a centered version of MovieLens 20m. The average rating of 3.53 is subtracted from all ratings (so that the 0 elements of the sparse rating matrix match un-observed scores and not bad movie ratings). Furthermore, we re-scale the centered ratings so that they all lie in the interval [−1, 1]. As a user/item interaction dataset on an online platform, one expects MovieLens to feature common properties of recommendation data sets such as "power-law" or fat-tailed distributions [54].
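This pre-processing can be sketched as follows. The exact rescaling scheme is our assumption: dividing the centered ratings by the largest absolute deviation is one simple way to map them into [−1, 1].

```python
import numpy as np

def center_and_rescale(ratings, mean_rating=3.53):
    """Center observed ratings around the global mean, then rescale so
    that all values lie in [-1, 1] and 0 encodes 'unobserved'."""
    centered = ratings - mean_rating
    return centered / np.abs(centered).max()

ratings = np.array([0.5, 1.0, 3.0, 3.5, 4.5, 5.0])  # MovieLens scale
scaled = center_and_rescale(ratings)
```

After this transform a harsh 0.5-star rating maps near −1, while a 5-star rating stays positive, so the sign of an entry carries the sentiment of the interaction.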

The first important statistical properties for recommendations concern the distribution of interactions across users and across items. It is generally observed that such distributions exhibit a "power-law" behavior [1], [13], [15], [29], [38], [39], [51], [54]. To characterize such a behavior in the MovieLens dataset, we take a look at the distribution of the total ratings along the item axis and the user axis. In other words, we compute row-wise and column-wise sums for the rating matrix R and observe their distributions. The corresponding ranked distributions are exposed in Figure 1 and do exhibit a clear "power-law" behavior for rather popular items. However, we observe that tail items have a higher popularity decay rate. Similarly, the engagement decay rate increases for the group of less engaged users.


The other approximate “power-law” we find in Figure 1 lies in the singular value spectrum of the MovieLens dataset. We compute the top k singular values [20] of the MovieLens rating matrix R by approximate iterative methods (e.g. power iteration) which can scale to its large (138K, 27K) dimension. The method yields the dominant singular values of R and the corresponding singular vectors so that one can classically approximate R by R ≈ UΣV where Σ is diagonal of dimension (k, k), U is column-orthogonal of dimension (m, k) and V is row-orthogonal of dimension (k, n) — which yields the rank-k matrix closest to R in Frobenius norm.
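A scalable rank-k approximation of this kind can be obtained with an iterative sparse solver, for instance scipy's Lanczos-based `svds`. The matrix below is a synthetic stand-in; only the k leading singular triplets are ever materialized.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Iteratively recover only the k dominant singular triplets of a large
# sparse matrix, then form the best rank-k approximation in Frobenius norm.
R = sparse_random(500, 200, density=0.05, random_state=2, format="csr")
k = 16

U, s, Vt = svds(R, k=k)          # U: (500, k), s: (k,), Vt: (k, 200)
order = np.argsort(s)[::-1]      # svds returns singular values in ascending order
U, s, Vt = U[:, order], s[order], Vt[order]

R_approx = U @ np.diag(s) @ Vt   # rank-k matrix closest to R in Frobenius norm
```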

Examining the distribution of the 2048 top singular values of R in the MovieLens dataset (which has at most 27K non-zero singular values) in Figure 1 highlights a clear “power-law” behavior in the highest magnitude part of the spectrum of R. We observe in the spectral distribution an inflection for smaller singular values, whose magnitude decays at a higher rate than that of larger singular values. Such a spectral distribution is a key feature of the original dataset, in particular in that it conditions the difficulty of low-rank approximation approaches to the matrix completion problem. Therefore, we also want the expanded dataset to exhibit a similar behavior in terms of spectral properties.

In all the high level statistics we present, we want to preserve the approximate “power-law” decay as well as its inflection for smaller values. Our requirements for the expanding transform which we apply to R are therefore threefold: we want to preserve the distributions of row-wise sums of R, of column-wise sums of R, and of singular values of R. Additional requirements, beyond first and second order high level statistics, will further increase the confidence in the realism of the expanded synthetic dataset. Nevertheless, we consider that focusing on these first three properties is a good starting point.

It is noteworthy that we do not consider here the temporal structure of the MovieLens data set. We leave the study of sequential user behavior — often found to be Long Range Dependent [4], [12], [40] — and the extension of synthetic data generation to sequential recommendations [3], [47], [52] for further work.

2) Preserving MovieLens data properties while expanding it: We now expose the fractal transform design we rely on to preserve the key statistical properties of the previous section.

Definition 1: Consider A ∈ R^{m,n} = (a_{i,j})_{i=1…m, j=1…n}. We denote by R(A) the set {∑_{j=1}^{n} a_{i,j}, i = 1…m} of row-wise sums of A, by C(A) the set {∑_{i=1}^{m} a_{i,j}, j = 1…n} of column-wise sums of A, and by S(A) the set of non-zero singular values of A.

Definition 2: Consider an integer i and a non-zero positive integer p. We denote the integer part of i − 1 in base p by ⌊i − 1⌋_p = ⌊(i − 1)/p⌋ and the fractional part by {i − 1}_p = i − p⌊i − 1⌋_p.

First we focus on conservation properties in terms of row-wise and column-wise sums, which correspond respectively to marginalized user engagement and item popularity distributions. In the following, × denotes the Minkowski product of two sets, i.e. A × B = {a × b | a ∈ A, b ∈ B}.

Proposition 1: Consider A ∈ R^{m,n} and B ∈ R^{p,q} and their Kronecker product K = A ⊗ B. Then

R(K) = R(A) × R(B) and C(K) = C(A) × C(B).

Proof 1: Consider the i-th row of K. By definition of K the corresponding sum can be rewritten as follows:

∑_{j=1…nq} k_{i,j} = ∑_{j=1…nq} a_{⌊i−1⌋_p + 1, ⌊j−1⌋_q + 1} b_{{i−1}_p, {j−1}_q}

which in turn equals

∑_{j=1…n} ∑_{j′=1…q} a_{⌊i−1⌋_p + 1, j} b_{{i−1}_p, j′}.

Refactoring the two sums into (∑_{j=1…n} a_{⌊i−1⌋_p + 1, j}) (∑_{j′=1…q} b_{{i−1}_p, j′}) concludes the proof for the row-wise sum property. The proof for the column-wise property is identical. □
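Proposition 1 can be spot-checked numerically on small random matrices: the multiset of row-wise (resp. column-wise) sums of A ⊗ B equals the Minkowski product of the corresponding sum sets of A and B.

```python
import numpy as np

# Numerical check of Proposition 1 on small random matrices: the row-wise
# sums of K = A (x) B are exactly the pairwise products of the row-wise
# sums of A and of B (and likewise column-wise). Sorting compares multisets.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((2, 5))
K = np.kron(A, B)

rows_K = np.sort(K.sum(axis=1))
rows_AB = np.sort(np.outer(A.sum(axis=1), B.sum(axis=1)).ravel())

cols_K = np.sort(K.sum(axis=0))
cols_AB = np.sort(np.outer(A.sum(axis=0), B.sum(axis=0)).ravel())
```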

Theorem 1: Consider A ∈ R^{m,n} and B ∈ R^{p,q} and their Kronecker product K = A ⊗ B. Then

S(K) = S(A) × S(B).

Proof 2: One can easily check that (XY) ⊗ (VW) = (X ⊗ V)(Y ⊗ W) for any quadruple of matrices X, Y, V, W for which the notation makes sense, and that (X ⊗ Y)^T = X^T ⊗ Y^T. Let A = U_A Σ_A V_A be the SVD of A and B = U_B Σ_B V_B the SVD of B. Then A ⊗ B = (U_A ⊗ U_B)(Σ_A ⊗ Σ_B)(V_A ⊗ V_B). Now, (U_A ⊗ U_B)^T (U_A ⊗ U_B) = (U_A^T ⊗ U_B^T)(U_A ⊗ U_B) = (U_A^T U_A) ⊗ (U_B^T U_B). Writing the same decomposition for (V_A ⊗ V_B)(V_A^T ⊗ V_B^T) and considering that U_A, U_B are column-orthogonal while V_A, V_B are row-orthogonal concludes the proof. □
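Theorem 1 admits the same kind of numerical spot-check: the singular values of A ⊗ B are the pairwise products of the singular values of A and of B.

```python
import numpy as np

# Numerical check of Theorem 1: S(A (x) B) = S(A) x S(B) as multisets,
# verified by sorting both sides. Small random matrices suffice.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((2, 5))

s_K = np.sort(np.linalg.svd(np.kron(A, B), compute_uv=False))
s_AB = np.sort(np.outer(np.linalg.svd(A, compute_uv=False),
                        np.linalg.svd(B, compute_uv=False)).ravel())
```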

The properties above imply that knowing the row-wise sums, column-wise sums and singular value spectrum of the reduced rating matrix R̄ and of the original rating matrix R is enough to deduce the corresponding properties of the expanded rating matrix R̄ ⊗ R — analytically. As in [28], the Kronecker product enables analytic tractability while expanding data sets in a fractal manner to orders of magnitude more data.

3) Constructing a reduced R̄ matrix with a similar spectrum: Considering that the quasi “power-law” properties of R imply — as in [28] — that S(R̄) × S(R) has a distribution similar to that of S(R), we seek a small matrix R̄ whose high order statistical properties are similar to those of R. As we want to generate a dataset with several billion user/item interactions, millions of distinct users and millions of distinct items, we are looking for a matrix R̄ with a few hundred or thousand rows and columns. The reduced matrix R̄ we seek is therefore orders of magnitude smaller than R. In order to produce a reduced matrix R̄ of dimensions (1000, 1700) one could use the smaller, older MovieLens 100K dataset [18]. Such a dataset can be interpreted as a sub-sampled reduced version of MovieLens 20m with similar properties. However, the two data sets were collected seven years apart, and temporal non-stationarity issues therefore become concerning. Also, we aim to produce an expansion method whose expansion multipliers can be chosen flexibly by practitioners. In our experiments, it is noteworthy that naive uniform user and item


sampling strategies have not yielded smaller matrices R̄ with properties similar to those of R. Different random projections [2], [14], [30] could more generally be employed; however, we rely on a procedure better tailored to our specific statistical requirements.

We now describe the technique we employed to produce a reduced size matrix R̄ with first and second order properties close to those of R, which in turn led to constructing an expanded matrix R̃ = R̄ ⊗ R similar to R. We want the dimensions of R̄ to be (m′, n′) with m′ ≪ m and n′ ≪ n. Consider again the approximate Singular Value Decomposition (SVD) [20] of R with the k = min(m′, n′) principal singular values of R:

R ≈ UΣV (4)

where U ∈ R^{m,k} has orthogonal columns, V ∈ R^{k,n} has orthogonal rows, and Σ ∈ R^{k,k} is diagonal with non-negative terms.

To reduce the number of rows and columns of R while preserving its top k singular values, a trivial solution would consist in replacing U and V by small random orthogonal matrices with few rows and columns respectively. Unfortunately, such a method would only seemingly preserve the spectral properties of R, as the principal singular vectors would be widely changed. Such properties are important: one of the key advantages of employing Kronecker products in [28] is the preservation of the network values, i.e. the distributions of singular vector components of a graph’s adjacency matrix.

To obtain a matrix Ū ∈ R^{m′,k} with fewer rows than U but still column-orthogonal and similar to U in the distribution of its values, we use the following procedure. We re-size U down to m′ rows with m′ < m through an averaging-based down-scaling method that can classically be found in standard image processing libraries (e.g. skimage.transform.resize in the scikit-image library [49]). Let Ũ ∈ R^{m′,k} be the corresponding resized version of U. We then construct Ū as the column-orthogonal matrix in R^{m′,k} closest in Frobenius norm to Ũ. Therefore, as in [16], we compute

Ū = Ũ(Ũ^T Ũ)^{−1/2}. (5)

We apply a similar procedure to V to reduce its number of columns, which yields a row-orthogonal matrix V̄ ∈ R^{k,n′} with n′ < n. The orthogonality of Ū (column-wise) and V̄ (row-wise) guarantees that the singular value spectrum of

R̄ = ŪΣV̄ (6)

consists exactly of the k = min(m′, n′) leading components of the singular value spectrum of R. Like R, R̄ is re-scaled to take values in [−1, 1]. The whole procedure to reduce R down to R̄ is summarized in Algorithm 2.

We verify empirically that the distributions of values of the reduced singular vectors in Ū and V̄ are similar to those of U and V respectively, which preserves the first order properties of R and the value distributions of its singular vectors. Such properties are demonstrated through numerical experiments in the next section.

Algorithm 2 Compute reduced matrix R̄

(U, Σ, V) ← sparseSVD(R, k)
Ũ ← imageResize(U, m′, k)
Ṽ ← imageResize(V, k, n′)
Ū ← Ũ(Ũ^T Ũ)^{−1/2}
V̄ ← (Ṽ Ṽ^T)^{−1/2} Ṽ
R_temp ← Ū Σ V̄
M ← max(R_temp)
m ← min(R_temp)
return R_temp / (M − m)
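The full reduction pipeline of Algorithm 2 can be sketched in a few lines of numpy/scipy. This is a hedged sketch on synthetic data: the block-averaging `average_resize` helper below is a simple stand-in for skimage.transform.resize, and the dimensions are illustrative.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

def orthogonalize_columns(M):
    """Closest column-orthogonal matrix to M, i.e. M (M^T M)^(-1/2) as in Eq. (5)."""
    w, Q = np.linalg.eigh(M.T @ M)            # M^T M is symmetric PSD
    return M @ (Q * (1.0 / np.sqrt(w))) @ Q.T  # multiply by (M^T M)^(-1/2)

def average_resize(M, new_rows):
    """Block-averaging row downscale (stand-in for skimage's resize)."""
    groups = np.array_split(np.arange(M.shape[0]), new_rows)
    return np.stack([M[g].mean(axis=0) for g in groups])

# Sketch of Algorithm 2 on a synthetic stand-in for the rating matrix R.
R = sparse_random(500, 200, density=0.05, random_state=3, format="csr")
m_red, n_red = 16, 32
k = min(m_red, n_red)

U, s, Vt = svds(R, k=k)                                         # sparse SVD
U_small = orthogonalize_columns(average_resize(U, m_red))       # (m', k)
V_small = orthogonalize_columns(average_resize(Vt.T, n_red)).T  # (k, n')

R_temp = U_small @ np.diag(s) @ V_small
R_reduced = R_temp / (R_temp.max() - R_temp.min())              # rescale step
```

By construction, `U_small` is column-orthogonal and `V_small` is row-orthogonal, so the spectrum of `R_reduced` is the rescaled k leading singular values of R, as Eq. (6) requires.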

V. EXPERIMENTATION ON MOVIELENS 20 MILLION DATA

The MovieLens 20m data comprises 20m ratings given by 138 thousand users to 27 thousand items. In the present section, we demonstrate how the fractal Kronecker expansion technique we devised and presented helps scale up this dataset to orders of magnitude more users, items and interactions — all in a parallelizable and analytically tractable manner.

1) Size of expanded data set: In the present experiments we construct a reduced rating matrix R̄ of size (16, 32), which implies the resulting expanded data set comprises 10 billion interactions between 2 million users and 864K items. In appendix, we present the results obtained for a reduced rating matrix R̄ of size (128, 256) and a synthetic data set consisting of 655 billion interactions between 17 million users and 7 million items.

Such a high number of interactions and items enables the training of deep neural collaborative models such as the Neural Collaborative Filtering model [19] at a scale which is now more representative of industrial settings. Moreover, the increased data set size helps construct benchmarks for deep learning software packages and ML accelerators that employ the same orders of magnitude as production settings in terms of user base size, item vocabulary size and number of observations.

2) Empirical properties of the reduced R̄ matrix: The construction technique of R̄ aimed to produce, just like in [28], a matrix sharing the properties of R though much smaller in size. To that end, we constructed a matrix R̄ of dimension (16, 32) with properties close to those of R in terms of column-wise sum, row-wise sum and singular value spectrum distributions.

We now check that the construction procedure we devised does produce a R̄ with the properties we expected. As the impact of the re-sizing step is unclear from an analytic standpoint, we had to resort to numerical experiments to validate our method.

In Figure 3, one can assess that the first and second order properties of R and R̄ match with high fidelity. In particular, the higher magnitude column-wise and row-wise sum distributions follow a “power-law” behavior similar to that of the original matrix. Similar observations can be made about the singular value spectra of R and R̄.


[Figure 3: six log/log panels comparing the original matrix R and the reduced matrix R̄ — item-wise rating sums, user-wise rating sums, and singular value spectra.]

Fig. 3: Properties of the reduced dataset R̄ ∈ R^{16,32} built according to Eq. (4), (5) and (6). We validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between R and R̄. Note here that as R is large we only compute its leading singular values. As we want to preserve statistical “power-laws”, we focus on the preservation of the relative distribution of values and not their absolute magnitude.

There is therefore now a reasonable likelihood that our adapted Kronecker expansion — although somewhat differing from the method originally presented in [28] — will enjoy the same benefits in terms of enabling data set expansion while preserving high order statistical properties.

3) Empirical properties of the expanded data set: We now verify empirically that the expanded rating matrix R̃ = R̄ ⊗ R does share common first and second order properties with the original rating matrix R. The new data set is 2 orders of magnitude larger in terms of numbers of rows and columns and 4 orders of magnitude larger in terms of number of non-zero terms. Notice here that, as R̄ is in general a dense matrix, the level of sparsity of the expanded data set is the same as that of the original.

[Figure 4: six log/log panels comparing the original matrix R and the expanded matrix R̄ ⊗ R — item-wise rating sums, user-wise rating sums, and singular value spectra.]

Fig. 4: High order statistical properties of the expanded dataset R̄ ⊗ R. We validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between R and R̄ ⊗ R. Here we leverage the tractability of Kronecker products as they impact column-wise and row-wise sum distributions as well as singular value spectra. The plots corresponding to the extended dataset are derived analytically based on the corresponding properties of the reduced matrix R̄ and the original matrix R. Note here that as we only computed the leading singular values of R, we only show the leading singular values of R̄ ⊗ R. In all plots we can observe the preservation of the linear log-log correspondence for the higher values in the distributions of interest (row-wise sums, column-wise sums and singular values) as well as the accelerated decay of the smaller values in those distributions.

Another benefit of using a fractal expansion method with analytic tractability is that we can deduce high order statistics of the expanded data set beforehand, without having to instantiate it. In particular, Proposition 1 implies that knowing the column-wise and row-wise sum distributions of R̄ and R is sufficient to determine the corresponding marginals for the expanded data set R̃ = R̄ ⊗ R. Similarly, the leading singular values of the Kronecker product can be computed with Theorem 1 based only on the leading singular values of R̄ and the singular values of R.
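This analytic shortcut can be sketched as follows: the expanded marginals and spectrum are pairwise products of the corresponding quantities of the two factors, so the (huge) Kronecker product never needs to be materialized. Small random matrices stand in for the reduced matrix R̄ and the original R.

```python
import numpy as np

# Derive expanded-dataset statistics analytically, without forming the
# Kronecker product. Proposition 1 gives the row/column sum marginals;
# Theorem 1 gives the singular value spectrum.
rng = np.random.default_rng(4)
R_bar = rng.standard_normal((16, 32))  # stand-in for the reduced matrix
R = rng.standard_normal((100, 60))     # stand-in for the original matrix

# Row-wise sums of R_bar (x) R: all pairwise products of the sum vectors.
expanded_user_sums = np.outer(R_bar.sum(axis=1), R.sum(axis=1)).ravel()
# Singular values of R_bar (x) R: pairwise products of the two spectra.
expanded_spectrum = np.outer(np.linalg.svd(R_bar, compute_uv=False),
                             np.linalg.svd(R, compute_uv=False)).ravel()

print(expanded_user_sums.shape, expanded_spectrum.shape)  # (1600,) (960,)
```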

In Figure 4, one can confirm that the spectral properties of the expanded data set as well as the user engagement (row-wise sums) and item popularity (column-wise sums) are similar to those of the original data set. Such observations indicate that the resulting data set is representative — in its fat-tailed data distribution and quasi “power-law” singular value spectrum — of problems encountered in ML for collaborative filtering. Furthermore, the expanded data set reproduces some irregularities of the original data, in particular the accelerating decay of values in ranked row-wise and column-wise sums as well as in the singular value spectrum.

[Figure 5: sorted rating values for the original and extended data sets.]

Fig. 5: Sorted ratings in the original MovieLens 20m data set and the extended data. We sample 20m ratings from the new data set. Multiplications inherent to Kronecker products create a finer granularity rating scale which somewhat differs from the original rating scale. However, we can check that the new rating scale is still representative of common recommendation problems in that most rating values are close to the average and few user/item interactions are very positive or negative.

4) Limitations: Although it does not condition the difficulty of rating matrix factorization problems — which depends primarily on the interaction distribution, the number of users, the number of items and the sparsity of the rating matrix — the distribution of rating values is still an important statistical property of the MovieLens 20m dataset. Figure 5 shows that there is a certain degree of divergence between the rating scales of the original and synthetic data sets. In particular, many more rating values are present in the expanded data set as a result of the multiplication of terms from R̄ and R. The transformations turning R into R̄ create a smoother scale of values, leading to a Kronecker product with many different possible ratings. Although the non-zero value distribution is a divergence point between the two data sets, ratings in the synthetic data set are dominated by values close to the average, as in the original MovieLens 20m. Therefore the synthetic ratings do have a certain degree of realism, as they represent user/item interactions where strong reactions (positive or negative) from users are much less likely than neutral interactions.

Another limitation of the synthetic data set is the block-wise repetitive structure of Kronecker products. Although the synthetic data set is still hard to factorize as the product of two low rank matrices — because its singular values are still distributed similarly to those of the original data set — it is now easy to factorize with a Kronecker SVD [23] which takes advantage of the block-wise repetitions in Eq (1). The randomized fractal expansions we presented in Eq (2) and Eq (3) address this issue. A simple example of such a randomized variation around the Kronecker product consists in shuffling the rows and columns of each block in Eq (1) independently at random. The shuffles break the block-wise repetitive structure and prevent Kronecker SVD from producing a trivial solution to the factorization problem.
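The per-block shuffling just described can be sketched as follows on small stand-in matrices. Each (p, q) block of the Kronecker product gets its own independent row and column permutations, which destroys the exact block repetition while leaving every block's value multiset (and hence sums) unchanged.

```python
import numpy as np

# Randomized variation on the Kronecker product: independently shuffle the
# rows and columns of each (p, q) block of K = A (x) B. Stand-in matrices.
rng = np.random.default_rng(5)
A = rng.standard_normal((3, 4))  # stand-in for the reduced matrix
B = rng.standard_normal((5, 6))  # stand-in for the original matrix
K = np.kron(A, B)

p, q = B.shape
for bi in range(A.shape[0]):
    for bj in range(A.shape[1]):
        rows = slice(bi * p, (bi + 1) * p)
        cols = slice(bj * q, (bj + 1) * q)
        block = K[rows, cols]
        # Independent row and column permutations for this block only.
        K[rows, cols] = block[rng.permutation(p)][:, rng.permutation(q)]
```

Because each shuffle only permutes entries within a block, the global value distribution (and every block's sum) is preserved; only the repetitive layout exploited by Kronecker SVD is destroyed.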

As a result, the expansion technique we present appears as a reliable first candidate to train linear matrix factorization models [42] and non-linear user/item similarity scoring models [19].

VI. CONCLUSION

In conclusion, this paper presents a first attempt at synthesizing a realistic large-scale recommendation data set without having to make compromises in terms of user privacy. We use a small publicly available data set, MovieLens 20m, and expand it to orders of magnitude more users, items and observed ratings. Our expansion model is rooted in the hierarchical structure of user/item interactions, which naturally suggests a fractal extrapolation model.

We leverage Kronecker products as self-similar operators on user/item rating matrices that impact key properties of row-wise and column-wise sums as well as singular value spectra in an analytically tractable manner. We modify the original Kronecker Graph generation method to enable an expansion of the original data by orders of magnitude, which yields a synthetic data set matching industrial recommendation data sets in scale. Our numerical experiments demonstrate that the data set we create has key first and second order properties similar to those of the original MovieLens 20m rating matrix.

Our next steps consist in making large synthetic data sets publicly available, although any researcher can readily use the techniques we presented to scale up any user/item interaction matrix. Another possible direction is to adapt the present method to recommendation data sets featuring meta-data (e.g. timestamps, topics, device information). The use of meta-data is indeed critical to solve the “cold-start” problem of users and items having no interaction history with the platform. We also plan to benchmark the performance of well established baselines on the new large scale realistic synthetic data we produce.

REFERENCES

[1] ABDOLLAHPOURI, H., BURKE, R., AND MOBASHER, B. Controlling popularity bias in learning-to-rank recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems (2017), ACM, pp. 42–46.

[2] ACHLIOPTAS, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66, 4 (2003), 671–687.

[3] BELLETTI, F., BEUTEL, A., JAIN, S., AND CHI, E. Factorized recurrent neural architectures for longer range dependence. In International Conference on Artificial Intelligence and Statistics (2018), pp. 1522–1530.

[4] BELLETTI, F., SPARKS, E., BAYEN, A., AND GONZALEZ, J. Random projection design for scalable implicit smoothing of randomly observed stochastic processes. In Artificial Intelligence and Statistics (2017), pp. 700–708.

[5] BENNETT, J., LANNING, S., ET AL. The Netflix Prize. In Proceedings of KDD Cup and Workshop (2007), vol. 2007, New York, NY, USA, p. 35.


[6] BEUTEL, A., COVINGTON, P., JAIN, S., XU, C., LI, J., GATTO, V., AND CHI, E. H. Latent cross: Making use of context in recurrent recommender systems. In International Conference on Web Search and Data Mining (2018), ACM, pp. 46–54.

[7] BHARGAVA, A., GANTI, R., AND NOWAK, R. Active positive semidefinite matrix completion: Algorithms, theory and applications. In Artificial Intelligence and Statistics (2017), pp. 1349–1357.

[8] BLEI, D. M., NG, A. Y., AND JORDAN, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.

[9] CHANEY, A. J., BLEI, D. M., AND ELIASSI-RAD, T. A probabilistic model for using social networks in personalized item recommendation. In Proceedings of the 9th ACM Conference on Recommender Systems (2015), ACM, pp. 43–50.

[10] CHANEY, A. J., STEWART, B. M., AND ENGELHARDT, B. E. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. arXiv preprint arXiv:1710.11214 (2017).

[11] COVINGTON, P., ADAMS, J., AND SARGIN, E. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (2016), ACM, pp. 191–198.

[12] CRANE, R., AND SORNETTE, D. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences 105, 41 (2008), 15649–15653.

[13] CREMONESI, P., KOREN, Y., AND TURRIN, R. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems (2010), ACM, pp. 39–46.

[14] FRADKIN, D., AND MADIGAN, D. Experiments with random projections for machine learning. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2003), ACM, pp. 517–522.

[15] GOEL, S., BRODER, A., GABRILOVICH, E., AND PANG, B. Anatomy of the long tail: ordinary people with extraordinary tastes. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (2010), ACM, pp. 201–210.

[16] GOLUB, G. H., AND VAN LOAN, C. F. Matrix Computations, vol. 3. JHU Press, 2012.

[17] HARIRI, N., MOBASHER, B., AND BURKE, R. Context-aware music recommendation based on latent topic sequential patterns. In Proceedings of the Sixth ACM Conference on Recommender Systems (2012), ACM, pp. 131–138.

[18] HARPER, F. M., AND KONSTAN, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2016), 19.

[19] HE, X., LIAO, L., ZHANG, H., NIE, L., HU, X., AND CHUA, T.-S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (2017), International World Wide Web Conferences Steering Committee, pp. 173–182.

[20] HORN, R. A., AND JOHNSON, C. R. Matrix Analysis. Cambridge University Press, 1990.

[21] JAWANPURIA, P., AND MISHRA, B. A unified framework for structured low-rank matrix learning. In International Conference on Machine Learning (2018), pp. 2259–2268.

[22] JOLLIFFE, I. Principal component analysis. In International Encyclopedia of Statistical Science. Springer, 2011, pp. 1094–1096.

[23] KAMM, J., AND NAGY, J. G. Optimal Kronecker product approximation of block Toeplitz matrices. SIAM Journal on Matrix Analysis and Applications 22, 1 (2000), 155–172.

[24] KOREN, Y., BELL, R., AND VOLINSKY, C. Matrix factorization techniques for recommender systems. Computer, 8 (2009), 30–37.

[25] KRICHENE, W., MAYORAZ, N., RENDLE, S., ZHANG, L., YI, X., HONG, L., CHI, E., AND ANDERSON, J. Efficient training on very large corpora via Gramian estimation. arXiv preprint arXiv:1807.07187 (2018).

[26] LECUN, Y., BENGIO, Y., AND HINTON, G. Deep learning. Nature 521, 7553 (2015), 436.

[27] LESKOVEC, J., CHAKRABARTI, D., KLEINBERG, J., AND FALOUTSOS, C. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In European Conference on Principles of Data Mining and Knowledge Discovery (2005), Springer, pp. 133–145.

[28] LESKOVEC, J., CHAKRABARTI, D., KLEINBERG, J., FALOUTSOS, C., AND GHAHRAMANI, Z. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research 11, Feb (2010), 985–1042.

[29] LEVY, M., AND BOSTEELS, K. Music recommendation and the long tail. In 1st Workshop On Music Recommendation And Discovery (WOMRAD), ACM RecSys, 2010, Barcelona, Spain (2010), Citeseer.

[30] LI, P., HASTIE, T. J., AND CHURCH, K. W. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006), ACM, pp. 287–296.

[31] LIANG, D., ALTOSAAR, J., CHARLIN, L., AND BLEI, D. M. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In Proceedings of the 10th ACM Conference on Recommender Systems (2016), ACM, pp. 59–66.

[32] LINDEN, G., SMITH, B., AND YORK, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 1 (2003), 76–80.

[33] LU, J., LIANG, G., SUN, J., AND BI, J. A sparse interactive model for matrix completion with side information. In Advances in Neural Information Processing Systems (2016), pp. 4071–4079.

[34] MAHDIAN, M., AND XU, Y. Stochastic Kronecker graphs. In International Workshop on Algorithms and Models for the Web-Graph (2007), Springer, pp. 179–186.

[35] MANDELBROT, B. B. The Fractal Geometry of Nature, vol. 1. WH Freeman, New York, 1982.

[36] NARAYANAN, A., AND SHMATIKOV, V. How to break anonymity of the Netflix Prize dataset. arXiv preprint cs/0610105 (2006).

[37] NIMISHAKAVI, M., JAWANPURIA, P. K., AND MISHRA, B. A dual framework for low-rank tensor completion. In Advances in Neural Information Processing Systems (2018), pp. 5489–5500.

[38] OESTREICHER-SINGER, G., AND SUNDARARAJAN, A. Recommendation networks and the long tail of electronic commerce. MIS Quarterly (2012), 65–83.

[39] PARK, Y.-J., AND TUZHILIN, A. The long tail of recommender systems and how to leverage it. In Proceedings of the 2008 ACM Conference on Recommender Systems (2008), ACM, pp. 11–18.

[40] PIPIRAS, V., AND TAQQU, M. S. Long-Range Dependence and Self-Similarity, vol. 45. Cambridge University Press, 2017.

[41] RESNICK, P., IACOVOU, N., SUCHAK, M., BERGSTROM, P., AND RIEDL, J. GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work (1994), ACM, pp. 175–186.

[42] RICCI, F., ROKACH, L., AND SHAPIRA, B. Recommender systems: introduction and challenges. In Recommender Systems Handbook. Springer, 2015, pp. 1–34.

[43] RUDOLPH, M., RUIZ, F., MANDT, S., AND BLEI, D. Exponential family embeddings. In Advances in Neural Information Processing Systems (2016), pp. 478–486.

[44] SARWAR, B., KARYPIS, G., KONSTAN, J., AND RIEDL, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web (2001), ACM, pp. 285–295.

[45] SCHMIT, S., AND RIQUELME, C. Human interaction with recommendation systems. arXiv preprint arXiv:1703.00535 (2017).

[46] SHANI, G., HECKERMAN, D., AND BRAFMAN, R. I. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265–1295.

[47] TANG, J., AND WANG, K. Personalized top-n sequential recommendation via convolutional sequence embedding. In International Conference on Web Search and Data Mining (2018), IEEE, pp. 565–573.

[48] TU, K., CUI, P., WANG, X., WANG, F., AND ZHU, W. Structural deep embedding for hyper-networks. In Thirty-Second AAAI Conference on Artificial Intelligence (2018).

[49] VAN DER WALT, S., SCHONBERGER, J. L., NUNEZ-IGLESIAS, J., BOULOGNE, F., WARNER, J. D., YAGER, N., GOUILLART, E., AND YU, T. scikit-image: image processing in Python. PeerJ 2 (2014), e453.

[50] WANG, H., WANG, N., AND YEUNG, D.-Y. Collaborative deep learning for recommender systems. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015), ACM, pp. 1235–1244.

[51] YIN, H., CUI, B., LI, J., YAO, J., AND CHEN, C. Challenging the long tail recommendation. Proceedings of the VLDB Endowment 5, 9 (2012), 896–907.

[52] YU, F., LIU, Q., WU, S., WANG, L., AND TAN, T. A dynamic recurrent model for next basket recommendation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (2016), ACM, pp. 729–732.


[53] ZHANG, J.-D., CHOW, C.-Y., AND XU, J. Enabling kernel-based attribute-aware matrix factorization for rating prediction. IEEE Transactions on Knowledge and Data Engineering 29, 4 (2017), 798–812.

[54] ZHAO, Q., CHEN, J., CHEN, M., JAIN, S., BEUTEL, A., BELLETTI, F., AND CHI, E. H. Categorical-attributes-based item classification for recommender systems. In Proceedings of the 12th ACM Conference on Recommender Systems (2018), ACM, pp. 320–328.

[55] ZHENG, Y., TANG, B., DING, W., AND ZHOU, H. A neural autoregressive approach to collaborative filtering. arXiv preprint arXiv:1605.09477 (2016).

[56] ZHOU, B., HUI, S. C., AND CHANG, K. An intelligent recommender system using sequential web access patterns. In IEEE Conference on Cybernetics and Intelligent Systems (2004), vol. 1, IEEE Singapore, pp. 393–398.

[57] ZHOU, G., ZHU, X., SONG, C., FAN, Y., ZHU, H., MA, X., YAN, Y., JIN, J., LI, H., AND GAI, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), ACM, pp. 1059–1068.

[Figure 6: six log/log panels comparing the original and reduced matrices — item-wise rating sums, user-wise rating sums, and user/item rating singular value spectra.]

Fig. 6: Properties of the reduced matrix of size (128, 256). Once more, we validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between the reduced matrix and the original rating matrix R.

APPENDIX

MovieLens 655 billion

In this section, we present the properties of a larger expansion for MovieLens. The reduced rating matrix R is now of size (128, 256) and the synthetic data consists of 655 billion interactions between 17 million users and 7 million items.
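For intuition, the sizes quoted above follow directly from the dimensions of a Kronecker-style expansion: the row (user) and column (item) counts multiply, as do the non-zero counts. A minimal sketch of the arithmetic, using rounded MovieLens 20m figures and treating the (128, 256) reduced matrix as dense (consistent with 655e9 / 20e6 ≈ 128 × 256):

```python
# Sizes of a Kronecker-style expansion: if A is (m, n) with z nonzeros
# and B is (p, q) with w nonzeros, then A kron B is (m*p, n*q) with z*w nonzeros.
reduced_shape = (128, 256)
reduced_nnz = 128 * 256                                  # dense reduced matrix
ml20m_shape, ml20m_nnz = (138_000, 27_000), 20_000_000   # rounded MovieLens 20m figures

users = reduced_shape[0] * ml20m_shape[0]    # ~17.7 million users
items = reduced_shape[1] * ml20m_shape[1]    # ~6.9 million items
ratings = reduced_nnz * ml20m_nnz            # ~655 billion interactions
print(users, items, ratings)
```

This recovers the 17 million users, 7 million items and 655 billion ratings stated above.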

Empirical properties of the reduced matrix: We assess the scalability of the approach we present to synthesize the reduced matrix. In particular, we check that, with a size of (128, 256) instead of (16, 32), it still shares common statistical properties with the original matrix R. Figure 6 demonstrates that the construction method we devised still preserves key statistical properties of R.
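The three quantities compared in Figures 6 and 7 — row-wise sums, column-wise sums and singular values — are straightforward to compute. A small sketch of such a validation helper (the function name and the random stand-in matrix are illustrative, not from the paper's code):

```python
import numpy as np

def summary_stats(R):
    """Sorted row sums, column sums and singular values of a rating
    matrix: the three quantities compared on the log/log plots."""
    A = np.abs(R)
    row_sums = np.sort(A.sum(axis=1))[::-1]   # user-wise totals, descending
    col_sums = np.sort(A.sum(axis=0))[::-1]   # item-wise totals, descending
    singular_values = np.linalg.svd(R, compute_uv=False)
    return row_sums, col_sums, singular_values

# Toy stand-in for the reduced matrix (the real one is also 128 x 256).
rng = np.random.default_rng(0)
R_reduced = rng.standard_normal((128, 256))
row_sums, col_sums, sv = summary_stats(R_reduced)
```

Plotting each returned array against its rank on log/log axes reproduces the style of the figures in this appendix.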

Numerical validation for the expanded data set: We now verify that, even with different extension factors and a much larger size, the synthetic data set we generate is similar to the original MovieLens 20m. We focus on the distribution of column-wise and row-wise sums in the expanded matrix R ⊗ R as well as its singular value distribution. In Figure 7, we find again that the "power-law" statistical behaviors and their inflections are preserved by the expansion procedure we designed.
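That the spectrum's shape survives the expansion is rooted in a classical property of the Kronecker product: the singular values of A ⊗ B are exactly the pairwise products of the singular values of A and B. This can be checked numerically on small stand-in matrices (sizes and names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))   # stand-in for the reduced matrix
B = rng.standard_normal((5, 7))   # stand-in for the original rating matrix

# Singular values of the Kronecker product...
sv_kron = np.linalg.svd(np.kron(A, B), compute_uv=False)
# ...equal the pairwise products of the factors' singular values.
sv_A = np.linalg.svd(A, compute_uv=False)
sv_B = np.linalg.svd(B, compute_uv=False)
sv_prod = np.sort(np.outer(sv_A, sv_B).ravel())[::-1]

assert np.allclose(sv_kron, sv_prod)
```

In particular, multiplying a fat-tailed spectrum by a fixed set of factors rescales it in log space, which is why the log/log spectra in Figures 6 and 7 keep the same shape.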

Limitations: The previous observations demonstrate the scalability and robustness of our expansion method, even with an expansion factor of (128, 256). However, the same limitations are present as in the smaller case, and Figure 8 shows that


[Figure 7: six log/log panels comparing the original and expanded matrices — item-wise rating sums, user-wise rating sums, and user/item rating singular value spectra.]

Fig. 7: We again validate the construction method numerically by checking that the distributions of row-wise sums, column-wise sums and singular values are similar between R and R ⊗ R. Here as well, we can observe the similarity between key features of the original rating matrix and its synthetic expansion.

[Figure 8: sorted rating values, original MovieLens 20m vs. extended data (20m samples).]

Fig. 8: Sorted ratings in the original MovieLens 20m data set and the extended data. We sample 20m ratings from the new data set. Again we observe a certain divergence between the synthetic data set and the original. The synthetic data remains somewhat realistic, however, in that most interactions are near neutral.

a similar divergence in rating scales exists between the original data set and its expanded synthetic version. As before, the synthetic ratings remain realistic in that most of them are near the average. The present section therefore demonstrates that our method scales up and is able to synthesize very large realistic recommendation data sets.
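Note that sampling ratings from the expanded data set, as done for Figure 8, does not require materializing the 655 billion entries: each entry of a Kronecker product can be evaluated on demand from the two factors. A minimal sketch under the simplifying assumption of a plain Kronecker product expansion (`kron_entries` is an illustrative helper, not part of the paper's code):

```python
import numpy as np

def kron_entries(A, B, rows, cols):
    """Entries of A kron B at the given positions, computed on the fly:
    (A kron B)[i, j] = A[i // p, j // q] * B[i % p, j % q], (p, q) = B.shape."""
    p, q = B.shape
    rows, cols = np.asarray(rows), np.asarray(cols)
    return A[rows // p, cols // q] * B[rows % p, cols % q]

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))   # toy reduced matrix
B = rng.standard_normal((5, 6))   # toy original matrix
i = rng.integers(0, A.shape[0] * B.shape[0], size=1000)
j = rng.integers(0, A.shape[1] * B.shape[1], size=1000)

# Matches the materialized Kronecker product on these toy sizes.
assert np.allclose(kron_entries(A, B, i, j), np.kron(A, B)[i, j])
```

At full scale, only the two factor matrices need to be stored, so drawing 20m samples from the 655-billion-entry expansion is cheap.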