Netflix Recommendations - Beyond the 5 Stars

ACM SF-Bay Area, October 22, 2012
Xavier Amatriain, Personalization Science and Engineering - Netflix

@xamat

Outline

1.  The Netflix Prize & the Recommendation Problem

2.  Anatomy of Netflix Personalization

3.  Data & Models

4.  And…
    a)  Consumer (Data) Science
    b)  Or Software Architectures

What we were interested in:
§  High-quality recommendations

Proxy question:
§  Accuracy in predicted rating
§  Improve by 10% = $1 million!

Results
•  Top 2 algorithms still in production: SVD and RBM

What about the final prize ensembles?
§  Our offline studies showed they were too computationally intensive to scale
§  Expected improvement not worth the engineering effort
§  Plus… focus had already shifted to other issues that had more impact than rating prediction.

Change of focus

[Figure: 2006 and 2012]

Anatomy of Netflix Personalization

Everything is a Recommendation

Everything is personalized


Note: Recommendations are per household, not individual user

Rows

Ranking

Top 10

Personalization awareness

Diversity

[Figure labels: Dad, All, Son, Daughter, Dad & Mom, Mom, All, Daughter, Mom, All?]

Support for Recommendations

Social Support

Social Recommendations

Watch again & Continue Watching

Genres

Genre rows
§  Personalized genre rows focus on user interest
§  Also provide context and “evidence”
§  Important for member satisfaction: moving personalized rows to the top on devices increased retention
§  How are they generated?
    §  Implicit: based on user’s recent plays, ratings, & other interactions
    §  Explicit taste preferences
    §  Hybrid: combine the above
§  Also take into account:
    §  Freshness: has this been shown before?
    §  Diversity: avoid repeating tags and genres, limit number of TV genres, etc.
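To make the freshness and diversity considerations concrete, here is a minimal sketch of a greedy row selector. It is an illustration only, not the method used at Netflix: the function name `select_rows`, the penalty values, and the row names and tags in the usage note are all hypothetical.

```python
def select_rows(candidates, recently_shown, n_rows,
                freshness_penalty=0.5, diversity_penalty=0.3):
    """Greedily pick rows by relevance, penalizing stale and repetitive ones.

    candidates: list of (row_name, relevance, tags), where tags is a set.
    recently_shown: set of row names already shown to this member.
    The two penalty weights are arbitrary, assumed values.
    """
    chosen, used_tags = [], set()
    remaining = list(candidates)
    while remaining and len(chosen) < n_rows:
        def adjusted(candidate):
            name, relevance, tags = candidate
            score = relevance
            if name in recently_shown:          # freshness: shown before?
                score -= freshness_penalty
            # diversity: penalize tags that chosen rows already cover
            score -= diversity_penalty * len(used_tags & set(tags))
            return score
        best = max(remaining, key=adjusted)
        remaining.remove(best)
        chosen.append(best[0])
        used_tags |= set(best[2])
    return chosen
```

With four hypothetical candidate rows where the most relevant one was shown recently, the selector skips it and then avoids a second row that repeats an already used tag.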

Genres - personalization


Genres - explanations

Genres – user involvement


Similars
§  Displayed in many different contexts
§  In response to user actions/context (search, queue add…)
§  More like… rows

Anatomy of a Personalization - Recap
§  Everything is a recommendation: not only rating prediction, but also ranking, row selection, similarity…

§  We strive to make it easy for the user, but…

§  We want the user to be aware and be involved in the recommendation process

§  Deal with implicit/explicit and hybrid feedback

§  Add support/explanations for recommendations

§  Consider issues such as diversity or freshness

Data & Models

Big Data @Netflix
§  Almost 30M subscribers
§  Ratings: 4M/day
§  Searches: 3M/day
§  Plays: 30M/day
§  2B hours streamed in Q4 2011
§  1B hours in June 2012

Smart Models
§  Logistic/linear regression
§  Elastic nets
§  SVD and other MF models
§  Restricted Boltzmann Machines
§  Markov Chains
§  Different clustering approaches
§  LDA
§  Association Rules
§  Gradient Boosted Decision Trees
§  …

SVD
$X_{[m \times n]} = U_{[m \times r]} \, S_{[r \times r]} \, (V_{[n \times r]})^T$
§  X: m x n matrix (e.g., m users, n videos)
§  U: m x r matrix (m users, r concepts)
§  S: r x r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix)
§  V: n x r matrix (n videos, r concepts)

Simon Funk’s SVD
§  One of the most interesting findings during the Netflix Prize came out of a blog post
§  Incremental, iterative, and approximate way to compute the SVD using gradient descent
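The gradient-descent idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Funk's exact procedure: it updates all factors jointly with stochastic gradient descent, and the learning rate, regularization, factor count, and the toy ratings in the usage note are arbitrary choices.

```python
import random

def funk_svd(ratings, n_users, n_items, n_factors=2,
             lr=0.01, reg=0.02, epochs=200, seed=0):
    """Fit user factors P and item factors Q on observed ratings only.

    ratings: list of (user_index, item_index, rating) triples.
    The predicted rating for (u, v) is the dot product of P[u] and Q[v].
    """
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, v, r in ratings:
            err = r - sum(a * b for a, b in zip(P[u], Q[v]))
            for f in range(n_factors):
                pu, qv = P[u][f], Q[v][f]
                # Gradient step on squared error with L2 regularization
                P[u][f] += lr * (err * qv - reg * pu)
                Q[v][f] += lr * (err * pu - reg * qv)
    return P, Q

def rmse(ratings, P, Q):
    """Root mean squared error of the factor model on the given ratings."""
    squared = sum((r - sum(a * b for a, b in zip(P[u], Q[v]))) ** 2
                  for u, v, r in ratings)
    return (squared / len(ratings)) ** 0.5
```

Only observed entries are visited, which is what makes the approach practical on a sparse ratings matrix: calling `funk_svd(ratings, n_users, n_items)` on a handful of triples and comparing `rmse` before and after training shows the error shrinking.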


SVD for Rating Prediction
§  User factor vectors $p_u \in \mathbb{R}^f$ and item factor vectors $q_v \in \mathbb{R}^f$
§  Baseline (user & item deviation from average): $b_{uv} = \mu + b_u + b_v$
§  Predict rating as $r'_{uv} = b_{uv} + p_u^T q_v$
§  SVD++ (Koren et al.) asymmetric variation w. implicit feedback:
    $r'_{uv} = b_{uv} + q_v^T \left( |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj}) \, x_j + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \right)$
    §  Where $q_v, x_v, y_v \in \mathbb{R}^f$ are three item factor vectors
    §  Users are not parametrized, but rather represented by:
        §  R(u): items rated by user u
        §  N(u): items for which the user has given implicit preference (e.g. rated vs. not rated)
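As a worked illustration of the baseline and the basic prediction rule, the sketch below estimates the global mean, user deviations, and item deviations by simple averaging, then adds the factor term. Estimating the deviations by plain averages is an assumption made for brevity (regularized fitting is common in practice), and the function names are hypothetical.

```python
from collections import defaultdict

def fit_baseline(ratings):
    """Estimate mu, b_u and b_v from (user, item, rating) triples by averaging."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user = defaultdict(list)
    for u, _, r in ratings:
        by_user[u].append(r - mu)
    b_user = {u: sum(d) / len(d) for u, d in by_user.items()}
    by_item = defaultdict(list)
    for u, v, r in ratings:
        # Item deviation is measured after removing the user deviation
        by_item[v].append(r - mu - b_user[u])
    b_item = {v: sum(d) / len(d) for v, d in by_item.items()}
    return mu, b_user, b_item

def predict(mu, b_user, b_item, p_u, q_v, u, v):
    """Baseline b_uv = mu + b_u + b_v, plus the dot product of p_u and q_v."""
    b_uv = mu + b_user.get(u, 0.0) + b_item.get(v, 0.0)
    return b_uv + sum(a * b for a, b in zip(p_u, q_v))
```

Unknown users or items fall back to a deviation of zero, so the prediction degrades to the global mean plus whatever is known.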

Artificial Neural Networks – 4 generations
§  1st - Perceptrons (~60s)
    §  Single layer of hand-coded features
    §  Linear activation function
    §  Fundamentally limited in what they can learn to do.
§  2nd - Back-propagation (~80s)
    §  Back-propagate error signal to get derivatives for learning
    §  Non-linear activation function
§  3rd - Belief Networks (~90s)
    §  Directed acyclic graph composed of (visible & hidden) stochastic variables with weighted connections.
    §  Infer the states of the unobserved variables & learn interactions between variables to make network more likely to generate observed data.

Restricted Boltzmann Machines
§  Restrict the connectivity to make learning easier.
    §  Only one layer of hidden units.
        §  Although multiple layers are possible
    §  No connections between hidden units.
§  Hidden units are independent given the visible states.
    §  So we can quickly get an unbiased sample from the posterior distribution over hidden “causes” when given a data-vector
§  RBMs can be stacked to form Deep Belief Nets (DBN) – 4th generation of ANNs

[Figure: RBM graph with a hidden layer and a visible layer; units indexed i and j]
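Because hidden units are conditionally independent given the visible states, each one can be sampled on its own from a logistic probability. The sketch below shows that step for binary units; the indexing (i for visible, j for hidden), the weight layout `weights[i][j]`, and the function names are assumptions for illustration, and it does not model the rating-specific visible units used for the Netflix Prize.

```python
import math
import random

def hidden_probabilities(visible, weights, hidden_bias):
    """p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i * W[i][j]) for each hidden unit j."""
    probs = []
    for j, c_j in enumerate(hidden_bias):
        activation = c_j + sum(v_i * weights[i][j] for i, v_i in enumerate(visible))
        probs.append(1.0 / (1.0 + math.exp(-activation)))
    return probs

def sample_hidden(visible, weights, hidden_bias, rng):
    """Draw each hidden unit independently, which is valid given the visible vector."""
    return [1 if rng.random() < p else 0
            for p in hidden_probabilities(visible, weights, hidden_bias)]
```

A single pass over the hidden units is enough; no iterative inference is needed, which is the practical benefit of removing hidden-to-hidden connections.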

RBM for the Netflix Prize


Ranking
Key algorithm, sorts titles in most contexts

Ranking
§  Ranking = Scoring + Sorting + Filtering bags of movies for presentation to a user
§  Goal: Find the best possible ordering of a set of videos for a user within a specific context in real-time
§  Objective: maximize consumption
§  Aspirations: Played & “enjoyed” titles have best score
§  Akin to CTR forecast for ads/search results
§  Factors
    §  Accuracy
    §  Novelty
    §  Diversity
    §  Freshness
    §  Scalability
    §  …

Ranking
§  Popularity is the obvious baseline
§  Ratings prediction is a clear secondary data input that allows for personalization
§  We have added many other features (and tried many more that have not proved useful)
§  What about the weights?
    §  Based on A/B testing
    §  Machine-learned

Example: Two features, linear model

[Figure: titles 1 to 5 plotted by popularity and predicted rating, with the resulting final ranking]

Linear model: $f_{rank}(u,v) = w_1 \, p(v) + w_2 \, r(u,v) + b$
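The two-feature linear model can be sketched directly: score each title from its popularity and predicted rating, then sort. The function name `rank_titles`, the candidate tuples, and the weights in the usage note are made-up values for illustration.

```python
def rank_titles(candidates, w1, w2, b=0.0):
    """Order titles by f_rank = w1 * popularity + w2 * predicted_rating + b.

    candidates: list of (title, popularity, predicted_rating) tuples.
    Returns titles sorted from highest to lowest score.
    """
    scored = [(w1 * popularity + w2 * rating + b, title)
              for title, popularity, rating in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]
```

Setting `w2 = 0` reproduces the pure popularity baseline, while increasing `w2` lets the personalized rating prediction reorder the list. The bias `b` does not change the ordering within a single list; it is kept only to match the formula.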

Learning to rank
§  Machine learning problem: goal is to construct ranking model from training data
§  Training data can have partial order or binary judgments (relevant/not relevant).
§  Resulting order of the items typically induced from a numerical score
§  Learning to rank is a key element for personalization
§  You can treat the problem as a standard supervised classification problem

Learning to Rank Approaches
1.  Pointwise
    §  Ranking function minimizes loss function defined on individual relevance judgment
    §  Ranking score based on regression or classification
    §  Ordinal regression, Logistic regression, SVM, GBDT, …
2.  Pairwise
    §  Loss function is defined on pair-wise preferences
    §  Goal: minimize number of inversions in ranking
    §  Ranking problem is then transformed into the binary classification problem
    §  RankSVM, RankBoost, RankNet, FRank…
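The pairwise transformation can be illustrated with two small helpers: one turns graded items into binary classification examples over feature differences, the other counts the inversions a scoring produces. This is a generic sketch of the idea, not the procedure of any specific method named above, and the helper names are hypothetical.

```python
from itertools import combinations

def pairwise_examples(items):
    """Turn (feature_vector, relevance) items into binary examples.

    Each pair with different relevance yields (x_i - x_j, label), where
    label is +1 if item i should rank above item j and -1 otherwise.
    """
    examples = []
    for (x_i, rel_i), (x_j, rel_j) in combinations(items, 2):
        if rel_i == rel_j:
            continue  # ties carry no preference
        diff = [a - b for a, b in zip(x_i, x_j)]
        examples.append((diff, 1 if rel_i > rel_j else -1))
    return examples

def count_inversions(scores, relevances):
    """Number of pairs that the scores order opposite to the relevance labels."""
    return sum(1 for i, j in combinations(range(len(scores)), 2)
               if (relevances[i] - relevances[j]) * (scores[i] - scores[j]) < 0)
```

Any binary classifier trained on the output of `pairwise_examples` then implicitly learns a scoring function, and `count_inversions` gives the quantity the pairwise loss is trying to drive down.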

Learning to rank - metrics
§  Quality of ranking measured using metrics such as
    §  Normalized Discounted Cumulative Gain: $NDCG = \frac{DCG}{IDCG}$, where $DCG = relevance_1 + \sum_{i=2}^{n} \frac{relevance_i}{\log_2 i}$
    §  Mean Reciprocal Rank: $MRR = \frac{1}{|H|} \sum_{h_i \in H} \frac{1}{rank(h_i)}$
    §  Fraction of Concordant Pairs: $FCP = \frac{\sum_{i \neq j} CP(x_i, x_j)}{\frac{n(n-1)}{2}}$
    §  Others…
§  But, it is hard to optimize machine-learned models directly on these measures (they are not differentiable)
§  Recent research on models that directly optimize ranking measures
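The three metrics can be computed with a few lines of plain Python. The sketch follows the formulas as given on the slide; two details are assumptions: `mrr` takes the rank of the first relevant hit per query as input, and `fcp` counts each unordered pair once and treats ties as non-concordant.

```python
import math
from itertools import combinations

def dcg(relevances):
    """relevance_1 + sum over i >= 2 of relevance_i / log2(i), in ranked order."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mrr(first_hit_ranks):
    """Mean of 1 / rank over a collection of queries (ranks start at 1)."""
    return sum(1.0 / rank for rank in first_hit_ranks) / len(first_hit_ranks)

def fcp(scores, relevances):
    """Fraction of item pairs whose score order agrees with the relevance order."""
    n = len(scores)
    concordant = sum(1 for i, j in combinations(range(n), 2)
                     if (scores[i] - scores[j]) * (relevances[i] - relevances[j]) > 0)
    return concordant / (n * (n - 1) / 2)
```

All three depend on the ordering only through sorting or comparisons, which is why they are piecewise constant in the model parameters and hard to optimize with gradients.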

Learning to Rank Approaches
3.  Listwise
    a.  Indirect Loss Function
        §  RankCosine: similarity between ranking list and ground truth as loss function
        §  ListNet: KL-divergence as loss function by defining a probability distribution
        §  Problem: optimization of listwise loss function may not optimize IR metrics
    b.  Directly optimizing IR measures (difficult since they are not differentiable)
        §  Directly optimize IR measures through Genetic Programming
        §  Directly optimize measures with Simulated Annealing
        §  Gradient descent on smoothed version of objective function (e.g. CLiMF presented at RecSys 2012 or TFMAP at SIGIR 2012)
        §  SVM-MAP relaxes the MAP metric by adding it to the SVM constraints
        §  AdaRank uses boosting to optimize NDCG

Similars

§  Different similarities computed from different sources: metadata, ratings, viewing data…

§  Similarities can be treated as data/features

§  Machine Learned models improve our concept of “similarity”

Data & Models - Recap
§  All sorts of feedback from the user can help generate better recommendations
§  Need to design systems that capture and take advantage of all this data
§  The right model is as important as the right data
§  It is important to come up with new theoretical models, but also need to think about application to a domain, and practical issues
§  Rating prediction models are only part of the solution to recommendation (think about ranking, similarity…)

More data or better models?

Really?

Anand Rajaraman: Stanford & Senior VP at Walmart Global eCommerce (former Kosmix)

Sometimes, it’s not about more data



[Banko and Brill, 2001]

Norvig: “Google does not have better Algorithms, only more Data”

Many features/ low-bias models


Sometimes, it’s not about more data

[Figure: model performance (axis ticks 0 to 0.09) vs. sample size (axis ticks 0 to 6,000,000), actual Netflix system]


Data without a sound approach = noise

Consumer (Data) Science

Consumer Science

§  Main goal is to effectively innovate for customers
§  Innovation goals
    §  “If you want to increase your success rate, double your failure rate.” – Thomas Watson, Sr., founder of IBM
    §  The only real failure is the failure to innovate
    §  Fail cheaply
    §  Know why you failed/succeeded

Consumer (Data) Science
1.  Start with a hypothesis:
    §  Algorithm/feature/design X will increase member engagement with our service, and ultimately member retention
2.  Design a test
    §  Develop a solution or prototype
    §  Think about dependent & independent variables, control, significance…
3.  Execute the test
4.  Let data speak for itself

Offline/Online testing process

[Diagram: Offline testing (days) → [success] → Online A/B testing (weeks to months) → [success] → Rollout feature to all users, with a [fail] branch]

Offline testing
§  Optimize algorithms offline
§  Measure model performance, using metrics such as:
    §  Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, Fraction of Concordant Pairs, Precision/Recall & F-measures, AUC, RMSE, Diversity…
§  Offline performance used as an indication to make informed decisions on follow-up A/B tests
§  A critical (and unsolved) issue is how well offline metrics correlate with A/B test results.
§  Extremely important to define a coherent offline evaluation framework (e.g. how to create training/testing datasets is not trivial)

Executing A/B tests
§  Many different metrics, but ultimately trust user engagement (e.g. hours of play and customer retention)
§  Think about significance and hypothesis testing
    §  Our tests usually have thousands of members and 2-20 cells
§  A/B tests allow you to try radical ideas or test many approaches at the same time.
    §  We typically have hundreds of customer A/B tests running
§  Decisions on the product always data-driven
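For the significance step, one common choice when comparing a binary outcome such as retention between two cells is a two-proportion z-test. The sketch below is a textbook version of that test, not a description of Netflix's actual analysis, and the cell sizes and counts in the usage note are invented.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in proportions between cells A and B.

    Returns (z, p_value). Uses the pooled proportion for the standard error
    and assumes cells are large enough for the normal approximation.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(800, 1000, 850, 1000)` compares a hypothetical control cell retaining 800 of 1,000 members with a test cell retaining 850 of 1,000. With many cells per test, a multiple-comparison correction would also be worth considering.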


What to measure
§  OEC: Overall Evaluation Criteria
§  In an A/B test framework, the measure of success is key
§  Short-term metrics do not always align with long-term goals
    §  E.g. CTR: generating more clicks might mean that our recommendations are actually worse
§  Use long-term metrics such as LTV (lifetime value) whenever possible
    §  At Netflix, we use member retention

What to measure
§  Short-term metrics can sometimes be informative, and may allow for faster decision-making
    §  At Netflix we use many, such as hours streamed by users or % of hours from a given algorithm
§  But, be aware of several caveats of using early decision mechanisms

Initial effects appear to trend. See “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained” [Kohavi et al., KDD 2012]

Consumer Data Science - Recap
§  Consumer Data Science aims to innovate for the customer by running experiments and letting data speak
§  This is mainly done through online A/B testing
§  However, we can speed up innovation by experimenting offline
§  But, both for online and offline experimentation, it is important to choose the right metric and experimental framework

Architectures

Technology

http://techblog.netflix.com

Event & Data Distribution
•  UI devices should broadcast many different kinds of user events
    •  Clicks
    •  Presentations
    •  Browsing events
    •  …
•  Events vs. data
    •  Some events only need to be propagated and trigger an action (low latency, low information per event)
    •  Others need to be processed and “turned into” data (higher latency, higher information quality).
    •  And… there are many in between
•  Real-time event flow managed through internal tool (Manhattan)
•  Data flow mostly managed through Hadoop.

Offline Jobs
•  Two kinds of offline jobs
    •  Model training
    •  Batch offline computation of recommendations/intermediate results
•  Offline queries either in Hive or Pig
•  Need a publishing mechanism that solves several issues
    •  Notify readers when result of query is ready
    •  Support different repositories (S3, Cassandra…)
    •  Handle errors, monitoring…
    •  We do this through Hermes

Computation
•  Two ways of computing personalized results
    •  Batch/offline
    •  Online
•  Each approach has pros/cons
    •  Offline
        +  Allows more complex computations
        +  Can use more data
        -  Cannot react to quick changes
        -  May result in staleness
    •  Online
        +  Can respond quickly to events
        +  Can use most recent data
        -  May fail because of SLA
        -  Cannot deal with “complex” computations
•  It’s not an either/or decision
    •  Both approaches can be combined

Signals & Models
•  Both offline and online algorithms are based on three different inputs:
    •  Models: previously trained from existing data
    •  (Offline) Data: previously processed and stored information
    •  Signals: fresh data obtained from live services
        •  User-related data
        •  Context data (session, date, time…)

Results
•  Recommendations can be serviced from:
    •  Previously computed lists
    •  Online algorithms
    •  A combination of both
•  The decision on where to service the recommendation from can respond to many factors including context.
•  Also, important to think about the fallbacks (what if plan A fails)
•  Previously computed lists/intermediate results can be stored in a variety of ways
    •  Cache
    •  Cassandra
    •  Relational DB
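One way to picture the fallback chain is a small dispatcher that tries the online algorithm first, then a previously computed list, then a non-personalized default. This is a hypothetical sketch: the function name, the `online_ranker` callable, the dictionary standing in for a cache or Cassandra lookup, and the popularity fallback are all assumptions.

```python
def serve_recommendations(user_id, online_ranker, precomputed, popular_titles):
    """Return (titles, source), falling back when the preferred source fails.

    online_ranker: callable that may raise (for example on a timeout).
    precomputed: mapping of user_id to a stored list (stand-in for a cache or DB).
    popular_titles: non-personalized list used as the last resort.
    """
    try:
        result = online_ranker(user_id)
        if result:
            return result, "online"
    except Exception:
        pass  # plan A failed; continue down the fallback chain
    if user_id in precomputed:
        return precomputed[user_id], "precomputed"
    return popular_titles, "popularity"
```

Returning the source alongside the list makes it possible to track how often each path is used.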


Alerts and Monitoring
§  A non-trivial concern in large-scale recommender systems
§  Monitoring: continuously observe quality of system
§  Alert: fast notification if quality of system goes below a certain pre-defined threshold
§  Questions:
    §  What do we need to monitor?
    §  How do we know something is “bad enough” to alert?

What to monitor
§  Staleness
    §  Monitor time since last data update

[Figure: Did something go wrong here?]

What to monitor
§  Algorithmic quality
    §  Monitor different metrics by comparing what users do and what your algorithm predicted they would do

What to monitor
§  Algorithmic source for users
    §  Monitor how users interact with different algorithms

[Figure labels: Algorithm X, New version]

When to alert
§  Alerting thresholds are hard to tune
    §  Avoid unnecessary alerts (the “learn-to-ignore problem”)
    §  Avoid situations where important issues are noticed before the alert fires
§  Rules of thumb
    §  Alert on anything that will impact user experience significantly
    §  Alert on issues that are actionable
    §  If a noticeable event happens without an alert… add a new alert for next time
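A threshold-based staleness check is perhaps the simplest alert of this kind. The sketch below is illustrative only: the function name and the threshold are assumptions, and choosing a good `max_age` is exactly the tuning problem described above.

```python
from datetime import datetime, timedelta

def staleness_alert(last_update, now, max_age):
    """Return (should_alert, age) given the time of the last data update.

    max_age is the pre-defined threshold: too small causes noisy alerts,
    too large lets users notice the problem before the alert fires.
    """
    age = now - last_update
    return age > max_age, age
```

For instance, with data last updated six hours ago, a four-hour threshold would trigger the alert while an eight-hour threshold would not.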


Conclusions

The Personalization Problem
§  The Netflix Prize simplified the recommendation problem to predicting ratings
§  But…
    §  User ratings are only one of the many data inputs we have
    §  Rating predictions are only part of our solution
        §  Other algorithms such as ranking or similarity are very important
§  We can reformulate the recommendation problem
    §  Function to optimize: probability a user chooses something and enjoys it enough to come back to the service

More data + Better models + More accurate metrics + Better approaches & architectures

Lots of room for improvement!

Thanks!

We’re hiring! Xavier Amatriain (@xamat)

[email protected]