Page 1: Google News Personalization

Google News Personalization

Big Data reading group

November 12, 2007

Presented by Babu Pillai

Page 2: Google News Personalization

Problem: finding stuff on Internet

• Know what you want:
– content-based filtering
– search

• Don’t know what you want:
– browse

• How to handle: “Don’t know, but show me something interesting!”

Page 3: Google News Personalization

Google News

• Top Stories

• Recommendations for registered users

• Based on user click history and community clicks

Page 4: Google News Personalization

Problem Scale

• Lots of users (more is good)
– Millions of clicks from millions of users

• Problem: high churn in item set
– Several million items (clusters of news articles about the same story, as identified by GN) per month
– Continuous addition, deletion

• Strict timing (few hundred ms)

• Existing systems not suitable

Page 5: Google News Personalization

Memory-based Ratings

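• General form: $r_{u_a, s_k} = \sum_{i \neq a} I_{(u_i, s_k)} \, w(u_a, u_i)$

where $r_{u_a, s_k}$ is the rating of item $s_k$ for user $u_a$, $I_{(u_i, s_k)}$ is 1 if user $u_i$ clicked on $s_k$ and 0 otherwise, and $w(u_a, u_i)$ is the similarity between users $u_a$ and $u_i$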

• Problem: scalability, even when similarity is computed offline
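A minimal sketch of the memory-based form above, to show where the cost comes from: scoring a single item for a single user touches every other user's history. The function names, the toy click data, and the `overlap` similarity are assumptions made for illustration; the slides do not specify a similarity function at this point.

```python
from typing import Callable, Dict, Set


def memory_based_score(
    target: str,
    item: str,
    clicks: Dict[str, Set[str]],
    similarity: Callable[[Set[str], Set[str]], float],
) -> float:
    """Sum the similarity to `target` of every other user who clicked `item`."""
    score = 0.0
    for user, history in clicks.items():
        if user == target:
            continue
        if item in history:  # binary rating: a click counts as a positive vote
            score += similarity(clicks[target], history)
    return score


def overlap(a: Set[str], b: Set[str]) -> float:
    """Placeholder similarity (an assumption): number of shared clicks."""
    return float(len(a & b))


# Hypothetical click histories, keyed by user id.
clicks = {
    "alice": {"s1", "s2"},
    "bob": {"s1", "s2", "s3"},
    "carol": {"s3", "s4"},
}
```

Calling `memory_based_score("alice", "s3", clicks, overlap)` loops over all users, which is the scalability problem noted above once there are millions of them.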

Page 6: Google News Personalization

Model-based techniques

• Clustering / segmentation, e.g. based on interests

• Bayesian models, Markov decision processes, …
– All are computationally expensive

Page 7: Google News Personalization

What’s in this paper?

• Investigate two different ways to cluster users: MinHash and PLSI

• Implement both on MapReduce

Page 8: Google News Personalization

Google News Rating Model

• 1 click = 1 positive vote

• Noisier than a 1-5 rating (Netflix)

• No explicit negatives

• Why might it work? Partly because the article snippets shown are fairly substantial, so a user who clicks is likely genuinely interested

Page 9: Google News Personalization

Design guidelines for a scalable rating system

• Group users into clusters of similar users (based on prior clicks, computed offline)

• Users can belong to multiple clusters

• Generate rating using much smaller sets of user clusters, rather than all users:
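One way to write the cluster-based rating, assuming the same notation as the memory-based form. The symbols here are assumptions introduced for illustration: $c_i$ ranges over the clusters that user $u_a$ belongs to, $w(u_a, c_i)$ is the user's (possibly fractional) membership in cluster $c_i$, and $I_{(u_j, s_k)}$ is 1 if user $u_j$ clicked on item $s_k$ and 0 otherwise.

```latex
r_{u_a, s_k} \;\propto\; \sum_{c_i \,:\, u_a \in c_i} w(u_a, c_i) \sum_{u_j \,:\, u_j \in c_i} I_{(u_j, s_k)}
```

The inner sum is a per-cluster click count for the item, so it can be maintained incrementally rather than recomputed over all users at query time.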

Page 10: Google News Personalization

Technique 1: MinHash

• Probabilistically assign users to clusters based on click history

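• Use Jaccard coefficient: $S(u_i, u_j) = |C_{u_i} \cap C_{u_j}| \,/\, |C_{u_i} \cup C_{u_j}|$, where $C_u$ is the set of items clicked by user $u$; the corresponding distance $1 - S(u_i, u_j)$ is a metric

• Computing this similarity between all pairs of users is computationally expensive, not feasible even offline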

Page 11: Google News Personalization

MinHash as a form of Locality Sensitive Hashing

• Basic idea: assign a hash value to each user based on click history

• How: randomly permute set of all items; assign id of first item in this order that appears in the user’s click history as the hash value for the user

• Probability that 2 users have the same hash is equal to the Jaccard coefficient
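The collision property above can be checked empirically. The sketch below uses explicit random permutations, which is only practical for a toy item set; the function names, item ids, and trial count are assumptions for illustration.

```python
import random
from typing import Dict, List, Set


def minhash(history: Set[str], rank: Dict[str, int]) -> str:
    """Return the clicked item that comes first in the permuted item order."""
    return min(history, key=lambda item: rank[item])


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient of two click histories."""
    return len(a & b) / len(a | b)


def collision_rate(
    a: Set[str], b: Set[str], items: List[str], trials: int, seed: int = 0
) -> float:
    """Fraction of random permutations under which both users get the same hash."""
    rng = random.Random(seed)
    order = list(items)
    hits = 0
    for _ in range(trials):
        rng.shuffle(order)  # a fresh random permutation of all items
        rank = {item: pos for pos, item in enumerate(order)}
        if minhash(a, rank) == minhash(b, rank):
            hits += 1
    return hits / trials
```

For two histories that share 2 of 6 distinct items, the Jaccard coefficient is 1/3, and `collision_rate` over a few thousand permutations should land close to that value.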

Page 12: Google News Personalization

Using MinHash for clusters

• Concatenate p>1 such hashes as cluster id for increased precision

• Apply q>1 in parallel (users belong to q clusters) to improve recall

• Don’t actually maintain p*q permutations: hash item id with random seed to get proxy for permutation index, for p*q different seeds
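A minimal sketch of the seeded-hash variant described above. Using SHA-256 as the hash, integer seeds `0 .. p*q - 1`, and the `g<group>:item|item|...` id format are all assumptions chosen for illustration; the slides only say that item ids are hashed with p*q different random seeds.

```python
import hashlib
from typing import List, Set


def seeded_hash(item: str, seed: int) -> int:
    """Stand-in for the item's position under the permutation numbered `seed`."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_seeded(history: Set[str], seed: int) -> str:
    """Clicked item with the smallest seeded hash, i.e. 'first' in that order."""
    return min(history, key=lambda item: seeded_hash(item, seed))


def cluster_ids(history: Set[str], p: int, q: int) -> List[str]:
    """Return q cluster ids, each the concatenation of p MinHash values."""
    ids = []
    for group in range(q):
        parts = [minhash_seeded(history, group * p + k) for k in range(p)]
        ids.append(f"g{group}:" + "|".join(parts))
    return ids
```

Larger p makes two users less likely to share a given cluster id unless their histories are very similar (precision), while larger q gives them more chances to share at least one (recall).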

Page 13: Google News Personalization

MinHash on MapReduce

• Generate p x q hashes for each user based on click history; generate q p-long cluster ids by concatenation

• Map using cluster ids as keys

• Reduce to form membership lists for each cluster id
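The three steps above can be simulated in a single process. This is a sketch, not a real MapReduce job: the shuffle and reduce steps are collapsed into one function, and the seeded SHA-256 hashing and id format are assumptions carried over for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple


def _seeded_hash(item: str, seed: int) -> int:
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def cluster_ids(history: Set[str], p: int, q: int) -> List[str]:
    """q cluster ids per user, each built from p seeded MinHash values."""
    ids = []
    for group in range(q):
        parts = [
            min(history, key=lambda item, s=group * p + k: _seeded_hash(item, s))
            for k in range(p)
        ]
        ids.append(f"g{group}:" + "|".join(parts))
    return ids


def map_phase(user: str, history: Set[str], p: int, q: int) -> Iterator[Tuple[str, str]]:
    """Emit one (cluster id, user id) pair per cluster the user falls into."""
    for cid in cluster_ids(history, p, q):
        yield cid, user


def reduce_phase(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group pairs by key and build the membership list for each cluster id."""
    members: Dict[str, List[str]] = defaultdict(list)
    for cid, user in pairs:
        members[cid].append(user)
    return {cid: sorted(users) for cid, users in members.items()}


def cluster_users(clicks: Dict[str, Set[str]], p: int, q: int) -> Dict[str, List[str]]:
    pairs = [kv for user, hist in clicks.items() for kv in map_phase(user, hist, p, q)]
    return reduce_phase(pairs)
```

Each user appears in exactly q membership lists, and users with identical click histories always land in the same q clusters.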

Page 14: Google News Personalization

Technique 2: PLSI clustering

• Probabilistic Latent Semantic Indexing

• Main idea: hidden state z that correlates users and items

• Generate this clustering from a training set using the EM algorithm given by Hofmann04
– Iterative technique, generates new probability estimates based on previous estimates

Page 15: Google News Personalization

PLSI as MapReduce

• Q* can be independently computed for each (u,s), given prior N(z,s), N(z), p(z|u): map to RxK machines (R, K partitions for u, s respectively)

• Reduce is simply addition
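A single-process sketch of the EM iteration, assuming the standard PLSI model p(s|u) = Σ_z p(s|z) p(z|u). The per-click E-step plays the role of the map, and the accumulation into N(z,s) and the per-user sums plays the role of the reduce. The function names, random initialization, and iteration count are assumptions for illustration; nothing here is partitioned across R×K machines.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def _random_dist(rng: random.Random, n: int) -> List[float]:
    """Random strictly positive probability vector of length n."""
    weights = [rng.random() + 0.1 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def plsi_em(
    clicks: List[Tuple[str, str]], k: int, iters: int = 20, seed: int = 0
) -> Tuple[Dict[str, List[float]], List[Dict[str, float]]]:
    """Fit p(z|u) and p(s|z) to (user, item) click pairs with k latent classes."""
    rng = random.Random(seed)
    users = sorted({u for u, _ in clicks})
    items = sorted({s for _, s in clicks})
    p_z_u = {u: _random_dist(rng, k) for u in users}
    p_s_z = [dict(zip(items, _random_dist(rng, len(items)))) for _ in range(k)]

    for _ in range(iters):
        n_zs = [defaultdict(float) for _ in range(k)]  # N(z, s)
        n_zu = {u: [0.0] * k for u in users}           # sum over s of q*(z; u, s)
        for u, s in clicks:
            # E-step ("map"): q*(z; u, s) is proportional to p(s|z) * p(z|u).
            joint = [p_s_z[z][s] * p_z_u[u][z] for z in range(k)]
            total = sum(joint)
            for z in range(k):
                q = joint[z] / total
                # "Reduce" is plain addition into the shared statistics.
                n_zs[z][s] += q
                n_zu[u][z] += q
        # M-step: renormalize the accumulated counts.
        for z in range(k):
            n_z = sum(n_zs[z].values())                # N(z)
            if n_z > 0.0:  # keep the old estimate if a class received no mass
                p_s_z[z] = {s: n_zs[z].get(s, 0.0) / n_z for s in items}
        for u in users:
            total = sum(n_zu[u])
            p_z_u[u] = [v / total for v in n_zu[u]]
    return p_z_u, p_s_z
```

Because each click's q* depends only on the previous iteration's statistics, the E-step can be split across machines by user and item partitions, with the sums combined afterwards.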

Page 16: Google News Personalization

PLSI in a dynamic environment

• Treat Z as user clusters

• On each click, update p(s|z) for all clusters the user belongs to

• This approximates PLSI, but is updated dynamically as additional items are added

• Does not allow additions of users

Page 17: Google News Personalization

Cluster-based recommendation

• For each cluster, maintain number of clicks, decayed by time, for each item visited by a member

• For a candidate item, look up the user’s clusters and add up the age-discounted visitation counts, each normalized by the cluster’s total clicks

• Do this using both MinHash and PLSI clustering
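A minimal sketch of the cluster-based score. The slides say only that counts are "decayed by time"; the exponential half-life decay, the class and method names, and the choice to store raw timestamps and decay at query time are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class ClusterClickStats:
    """Per-cluster click log, scored with time-decayed, normalized counts."""

    def __init__(self, half_life_s: float) -> None:
        self.half_life_s = half_life_s
        # cluster id -> list of (item id, click time in seconds)
        self.clicks: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

    def record(self, clusters: Iterable[str], item: str, t: float) -> None:
        """Register a click on `item` for every cluster the clicking user is in."""
        for cluster in clusters:
            self.clicks[cluster].append((item, t))

    def _weight(self, t: float, now: float) -> float:
        """A click loses half its weight every `half_life_s` seconds."""
        return 0.5 ** ((now - t) / self.half_life_s)

    def score(self, clusters: Iterable[str], item: str, now: float) -> float:
        """Sum over the user's clusters of decayed item clicks / decayed total clicks."""
        score = 0.0
        for cluster in clusters:
            log = self.clicks.get(cluster, [])
            total = sum(self._weight(t, now) for _, t in log)
            if total == 0.0:
                continue
            hits = sum(self._weight(t, now) for i, t in log if i == item)
            score += hits / total
        return score
```

The same scoring code works for either clustering: only the list of cluster ids passed in changes between MinHash and PLSI.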

Page 18: Google News Personalization

One more technique: Covisitation

• Memory-based technique

• Create adjacency matrix between all pairs of items (can be directed)

• Increment corresponding count if one item visited soon after another

• Recommendation: for candidate item j, sum of all counts from i to j for all items i in recent click history of user, normalized appropriately
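A minimal sketch of covisitation counting and scoring. The fixed time window for "soon after", the directed counts, and normalizing each count by the total outgoing count of item i are assumptions for illustration; the slides leave the normalization unspecified.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class Covisitation:
    """Directed item-to-item counts built from clicks that follow one another."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        # counts[i][j]: number of times j was clicked soon after i
        self.counts: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
        # user id -> recent (item id, click time) pairs
        self.recent: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

    def record_click(self, user: str, item: str, t: float) -> None:
        """Update counts from the user's clicks within the window, then log this one."""
        history = [(i, ti) for i, ti in self.recent[user] if t - ti <= self.window_s]
        for prev, _ in history:
            if prev != item:
                self.counts[prev][item] += 1.0  # directed edge prev -> item
        history.append((item, t))
        self.recent[user] = history

    def score(self, recent_items: Iterable[str], candidate: str) -> float:
        """Sum of normalized counts from each recently clicked item to `candidate`."""
        total = 0.0
        for i in recent_items:
            out = self.counts.get(i)
            if not out:
                continue
            total += out.get(candidate, 0.0) / sum(out.values())
        return total
```

Each click only touches the handful of items in that user's recent history, so the counts can be updated online as clicks arrive.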

Page 19: Google News Personalization

Whole System

• Offline clustering

• Online click history update, cluster item stats update, covisitation update

Page 20: Google News Personalization

Results

Generally around 30-50% better than popularity-based recommendations

Page 21: Google News Personalization

Techniques don’t work well together, though

Page 22: Google News Personalization

Discussion

• Covisitation appears to work as well as clustering

• Operational details missing: how big are cluster memberships, etc.

• All of the clustering is done offline