Google News Personalization

Big Data reading group, November 12, 2007

Presented by Babu Pillai

Problem: finding stuff on Internet

• Know what you want:
– content-based filtering
– search

• Don’t know what you want:
– browse

• How to handle: “Don’t know, but show me something interesting!”

Google News

• Top Stories

• Recommendations for registered users

• Based on user click history, community clicks

Problem Scale

• Lots of users (more is good)
– Millions of clicks from millions of users

• Problem: high churn in item set
– Several million items (clusters of news articles about the same story, as identified by GN) per month
– Continuous addition, deletion

• Strict timing (few hundred ms)

• Existing systems not suitable

Memory-based Ratings

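• General form:

$r_{u_a, s_k} = \sum_{i \neq a} I_{(u_i, s_k)} \, w(u_a, u_i)$

where $r_{u_a, s_k}$ is the rating of item $s_k$ for user $u_a$, $I_{(u_i, s_k)}$ indicates whether user $u_i$ clicked on $s_k$, and $w(u_a, u_i)$ is the similarity between users $u_a$ and $u_i$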

• Problem: scalability, even when similarity is computed offline
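To make the scalability problem concrete, here is a minimal Python sketch of a memory-based score. The `overlap` similarity and the toy `clicks` data are assumptions for illustration only; the slides do not fix a particular w.

```python
from typing import Callable, Dict, Set

Similarity = Callable[[Set[str], Set[str]], float]


def overlap(a: Set[str], b: Set[str]) -> float:
    """Hypothetical similarity w(u_a, u_i): number of co-clicked items."""
    return float(len(a & b))


def memory_based_score(target: str, item: str,
                       clicks: Dict[str, Set[str]],
                       similarity: Similarity = overlap) -> float:
    """Sum the similarity to every other user who clicked `item`."""
    score = 0.0
    for other, history in clicks.items():
        if other != target and item in history:
            score += similarity(clicks[target], history)
    return score


# Toy click histories (made up for this example).
clicks = {
    "ua": {"s1", "s2"},
    "ub": {"s1", "s3"},
    "uc": {"s3", "s4"},
}
score_s3 = memory_based_score("ua", "s3", clicks)
```

Every score touches every other user, so the cost per recommendation grows with the number of users even if the similarities are precomputed.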

Model-based techniques

• Clustering / segmentation, e.g. based on interests

• Bayesian models, Markov decision processes, …
– All are computationally expensive

What’s in this paper?

• Investigate 2 different ways to cluster users: MinHash, and PLSI

• Implement both on MapReduce

Google News Rating Model

• 1 click = 1 positive vote

• Noisier than explicit 1-5 ratings (Netflix)

• No explicit negatives

• Why might it work? Partly because a fairly substantial clip of each article is shown, so a user who clicks is likely genuinely interested

Design guidelines for a scalable rating system

• Group users into clusters of similar users (based on prior clicks, offline)

• Users can belong to multiple clusters

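• Generate rating using much smaller sets of user clusters, rather than all users:

$r_{u_a, s_k} \propto \sum_{c_i : u_a \in c_i} w(u_a, c_i) \sum_{u_j \in c_i} I_{(u_j, s_k)}$

where $c_i$ ranges over the clusters containing user $u_a$, $w(u_a, c_i)$ is the user's membership weight in $c_i$, and $I_{(u_j, s_k)}$ indicates whether user $u_j$ clicked on item $s_k$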

Technique 1: MinHash

• Probabilistically assign users to clusters based on click history

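• Use Jaccard coefficient:

$S(u_i, u_j) = \frac{|C_{u_i} \cap C_{u_j}|}{|C_{u_i} \cup C_{u_j}|}$

where $C_u$ is the set of items clicked by user $u$; the distance $D(u_i, u_j) = 1 - S(u_i, u_j)$ is a metric

• Computing this similarity for all pairs of users is computationally expensive, not feasible even offline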
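As a small illustration, the Jaccard coefficient of two click histories can be computed as below. The function name and the example histories are assumptions made for this sketch.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b|; 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Two made-up click histories sharing 2 of 4 distinct items.
history_a = {"s1", "s2", "s3"}
history_b = {"s2", "s3", "s4"}
similarity = jaccard(history_a, history_b)   # 2 / 4
distance = 1.0 - similarity
```

A single pair is cheap; the cost comes from the number of pairs, which grows quadratically with the number of users.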

MinHash as a form of Locality Sensitive Hashing

• Basic idea: assign hash value to each user based on click history

• How: randomly permute set of all items; assign id of first item in this order that appears in the user’s click history as the hash value for the user

• Probability that 2 users have the same hash is equal to the Jaccard coefficient
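The collision property can be checked empirically. The sketch below uses explicit random permutations over a small, made-up item universe; `minhash` and `collision_rate` are illustrative names, not from the paper.

```python
import random
from typing import Dict, Iterable, Set


def minhash(history: Set[str], rank: Dict[str, int]) -> str:
    """Return the clicked item that comes first in the permutation `rank`."""
    return min(history, key=lambda item: rank[item])


def collision_rate(a: Set[str], b: Set[str], items: Iterable[str],
                   trials: int, seed: int = 0) -> float:
    """Fraction of random permutations under which a and b hash alike."""
    rng = random.Random(seed)
    universe = sorted(items)          # fixed order before shuffling
    hits = 0
    for _ in range(trials):
        order = universe[:]
        rng.shuffle(order)            # one random permutation of all items
        rank = {item: pos for pos, item in enumerate(order)}
        if minhash(a, rank) == minhash(b, rank):
            hits += 1
    return hits / trials


items = ["s1", "s2", "s3", "s4", "s5", "s6"]
a = {"s1", "s2", "s3"}
b = {"s2", "s3", "s4"}                # Jaccard coefficient of a and b is 2/4
rate = collision_rate(a, b, items, trials=20000)
```

With enough trials, `rate` should land close to the Jaccard coefficient of the two histories (0.5 here), since the two hashes agree exactly when the first item of the union in permuted order lies in the intersection.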

Using MinHash for clusters

• Concatenate p>1 such hashes as cluster id for increased precision

• Apply q>1 in parallel (users belong to q clusters) to improve recall

• Don’t actually maintain p*q permutations: hash item id with random seed to get proxy for permutation index, for p*q different seeds
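A minimal sketch of the seeded-hash trick, assuming SHA-256 as the hash function and string item ids (both are choices made for this example; the slides do not say which hash is used). Each of the p*q seeds stands in for one permutation.

```python
import hashlib
from typing import List, Set


def seeded_hash(item_id: str, seed: int) -> int:
    """Proxy for the item's position in the permutation chosen by `seed`."""
    digest = hashlib.sha256(f"{seed}:{item_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_value(history: Set[str], seed: int) -> str:
    """Clicked item with the smallest seeded hash."""
    return min(history, key=lambda item: seeded_hash(item, seed))


def cluster_ids(history: Set[str], p: int, q: int) -> List[str]:
    """Return q cluster ids, each the concatenation of p MinHash values."""
    ids = []
    for group in range(q):
        seeds = range(group * p, (group + 1) * p)   # p distinct seeds per group
        parts = [minhash_value(history, s) for s in seeds]
        ids.append(f"{group}:" + "|".join(parts))
    return ids


# Hypothetical click history; item ids must not contain ':' or '|'.
ids = cluster_ids({"s1", "s2", "s3"}, p=2, q=3)
```

Raising p makes two users share a cluster id only if all p hashes agree, which tightens clusters; raising q gives each user more chances to land in a cluster with a similar user.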

MinHash on MapReduce

• Generate p*q hashes for each user based on click history; generate q p-long cluster ids by concatenation

• Map using cluster ids as keys

• Reduce to form membership lists for each cluster id
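The map and reduce steps can be sketched in a single process as below. This is only a simulation under assumptions: the cluster ids per user are taken as already computed, and `map_user`, `reduce_cluster`, and `run_job` are hypothetical names rather than a real MapReduce API.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_user(user_id: str, user_cluster_ids: List[str]) -> Iterator[Tuple[str, str]]:
    """Map: emit one (cluster id, user id) pair per cluster the user is in."""
    for cluster_id in user_cluster_ids:
        yield cluster_id, user_id


def reduce_cluster(cluster_id: str, user_ids: List[str]) -> Tuple[str, List[str]]:
    """Reduce: collect the membership list for one cluster id."""
    return cluster_id, sorted(set(user_ids))


def run_job(users: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Simulate the shuffle by grouping mapper output by key in memory."""
    shuffled: Dict[str, List[str]] = defaultdict(list)
    for user_id, cids in users.items():
        for key, value in map_user(user_id, cids):
            shuffled[key].append(value)
    return dict(reduce_cluster(k, v) for k, v in shuffled.items())


# Made-up cluster assignments (q = 2 cluster ids per user).
memberships = run_job({
    "u1": ["A", "B"],
    "u2": ["A", "C"],
    "u3": ["B", "C"],
})
```

In a real deployment the grouping by key is done by the framework's shuffle, so mappers and reducers never need the full user set in memory.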

Technique 2: PLSI clustering

• Probabilistic Latent Semantic Indexing

• Main idea: hidden state z that correlates users and items

• Generate this clustering from training set based on EM algorithm given by Hofmann04
– Iterative technique, generates new probability estimates based on previous estimates

PLSI as MapReduce

• Q* can be computed independently for each (u,s), given N(z,s), N(z), p(z|u) from the previous iteration: map to R×K machines (R and K partitions for u and s, respectively)

• Reduce is simply addition
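One EM iteration can be sketched in plain Python as follows. This is a single-process illustration, assuming the standard PLSI E-step q*(z; u, s) ∝ p(s|z) p(z|u) with p(s|z) = N(z,s)/N(z); the function name, data layout, and toy values are assumptions, and the sharding across R×K machines is not modeled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def em_iteration(clicks: List[Tuple[str, str]],
                 p_z_given_u: Dict[str, List[float]],
                 p_s_given_z: List[Dict[str, float]],
                 num_z: int):
    """Run one EM step over (user, item) click pairs."""
    n_zs = defaultdict(float)   # N(z, s): sum of q* over users
    n_z = defaultdict(float)    # N(z): sum of q* over users and items
    n_uz = defaultdict(float)   # sum of q* over items, per (u, z)
    n_u = defaultdict(float)    # sum of q* over items and z, per u
    for u, s in clicks:
        # "Map": q* for this click depends only on the previous estimates.
        weights = [p_s_given_z[z].get(s, 0.0) * p_z_given_u[u][z]
                   for z in range(num_z)]
        total = sum(weights)
        if total == 0.0:
            continue
        for z in range(num_z):
            q_star = weights[z] / total
            # "Reduce": the statistics are plain sums of q*.
            n_zs[(z, s)] += q_star
            n_z[z] += q_star
            n_uz[(u, z)] += q_star
            n_u[u] += q_star
    new_p_s_given_z: List[Dict[str, float]] = [{} for _ in range(num_z)]
    for (z, s), value in n_zs.items():
        if n_z[z] > 0.0:
            new_p_s_given_z[z][s] = value / n_z[z]
    new_p_z_given_u = dict(p_z_given_u)   # keep old values for unseen users
    for u, total in n_u.items():
        new_p_z_given_u[u] = [n_uz[(u, z)] / total for z in range(num_z)]
    return new_p_z_given_u, new_p_s_given_z


# Toy data with 2 latent states (values made up for illustration).
clicks = [("u1", "s1"), ("u1", "s2"), ("u2", "s2"), ("u2", "s3")]
p_zu = {"u1": [0.6, 0.4], "u2": [0.3, 0.7]}
p_sz = [{"s1": 0.5, "s2": 0.3, "s3": 0.2}, {"s1": 0.2, "s2": 0.3, "s3": 0.5}]
p_zu, p_sz = em_iteration(clicks, p_zu, p_sz, num_z=2)
```

Because each click contributes independently and the accumulators are sums, the loop body can be distributed across machines and the partial sums added afterwards.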

PLSI in a dynamic environment

• Treat Z as user clusters

• On each click, update p(s|z) for all clusters the user belongs to

• This approximates PLSI, but is updated dynamically as additional items are added

• Does not allow additions of users

Cluster-based recommendation

• For each cluster, maintain number of clicks, decayed by time, for each item visited by a member

• For a candidate item, look up user’s clusters, add up age-discounted visitation counts, normalized by total clicks

• Do this using both MinHash and PLSI clustering
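A minimal sketch of the cluster-based score, under stated assumptions: exponential decay with a configurable half-life, and normalization by each cluster's total decayed clicks. The slides specify neither the decay function nor the exact normalization, and `ClusterStats` is a hypothetical name. Raw timestamps are stored for clarity; a production system would keep running counters instead.

```python
import math
from collections import defaultdict
from typing import Iterable, List


class ClusterStats:
    """Per-cluster click statistics with time decay (illustrative only)."""

    def __init__(self, half_life_s: float) -> None:
        self.rate = math.log(2) / half_life_s
        # clicks[cluster][item] -> list of click timestamps in seconds
        self.clicks = defaultdict(lambda: defaultdict(list))

    def record_click(self, user_clusters: Iterable[str], item: str, t: float) -> None:
        """Credit the click to every cluster the clicking user belongs to."""
        for cluster in user_clusters:
            self.clicks[cluster][item].append(t)

    def _decayed(self, timestamps: List[float], now: float) -> float:
        return sum(math.exp(-self.rate * (now - t)) for t in timestamps)

    def score(self, user_clusters: Iterable[str], item: str, now: float) -> float:
        """Sum, over the user's clusters, the item's share of decayed clicks."""
        total = 0.0
        for cluster in user_clusters:
            per_item = self.clicks.get(cluster)
            if not per_item:
                continue
            cluster_total = sum(self._decayed(ts, now) for ts in per_item.values())
            if cluster_total > 0.0:
                total += self._decayed(per_item.get(item, []), now) / cluster_total
        return total


stats = ClusterStats(half_life_s=3600.0)
stats.record_click(["c1"], "a", t=0.0)       # one hour old at query time
stats.record_click(["c1"], "b", t=3600.0)    # fresh at query time
score_a = stats.score(["c1"], "a", now=3600.0)
```

In this example the older click on `a` counts half as much as the fresh click on `b`, so `a` receives about one third of the cluster's weight.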

One more technique: Covisitation

• Memory-based technique

• Create adjacency matrix between all pairs of items (can be directed)

• Increment corresponding count if one item visited soon after another

• Recommendation: for candidate item j, sum of all counts from i to j for all items i in recent click history of user, normalized appropriately
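The covisitation bookkeeping can be sketched as below. The time window, the directed counting, and normalizing by each source item's total outgoing count are assumptions for this example; the slides only say "soon after" and "normalized appropriately". `Covisitation` is a hypothetical name.

```python
from collections import defaultdict
from typing import Iterable


class Covisitation:
    """Directed item-to-item covisit counts (illustrative sketch)."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        # counts[i][j]: times j was clicked within the window after i
        self.counts = defaultdict(lambda: defaultdict(float))
        # recent[user]: (timestamp, item) pairs still inside the window
        self.recent = defaultdict(list)

    def record_click(self, user: str, item: str, t: float) -> None:
        history = [(ts, it) for ts, it in self.recent[user]
                   if t - ts <= self.window_s]
        for _, previous in history:
            if previous != item:
                self.counts[previous][item] += 1.0
        history.append((t, item))
        self.recent[user] = history

    def score(self, recent_items: Iterable[str], candidate: str) -> float:
        """Sum normalized counts from each recently clicked item to candidate."""
        total = 0.0
        for i in recent_items:
            outgoing = self.counts.get(i)
            if not outgoing:
                continue
            total += outgoing.get(candidate, 0.0) / sum(outgoing.values())
        return total


cv = Covisitation(window_s=600.0)
cv.record_click("u1", "a", 0.0)
cv.record_click("u1", "b", 10.0)       # a -> b counted
cv.record_click("u2", "a", 0.0)
cv.record_click("u2", "c", 20.0)       # a -> c counted
cv.record_click("u3", "a", 0.0)
cv.record_click("u3", "b", 5000.0)     # outside the window, not counted
score_b = cv.score(["a"], "b")
```

Here `a` has two outgoing covisits, one each to `b` and `c`, so each candidate gets half of `a`'s weight.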

Whole System

• Offline clustering

• Online click history update, cluster item stats update, covisitation update

Results

Generally around 30-50% better than popularity based recommendations

Techniques don’t work well together, though

Discussion

• Covisitation appears to work as well as clustering

• Operational details missing: how big are cluster memberships, etc.

• All of the clustering is done offline
