abhinandan das, mayur datar, ashutosh garg , may 8-12, 2007, banff, alberta, canada presented by,...

Click here to load reader

Upload: devante-teesdale

Post on 29-Mar-2015

217 views

Category:

Documents


3 download

TRANSCRIPT

  • Slide 1

Abhinandan Das, Mayur Datar, Ashutosh Garg WWW2007, May 8-12, 2007, Banff, Alberta, Canada Presented By, Naveenkumar Selvaraj Google News Personalization: Scalable Online Collaborative Filtering Slide 2 Google News A news aggregator website Personalized news- Records search queries and news clicks Makes them accessible online Recommended for you top stories based on the users past click history Slide 3 Slide 4 Collaborative Filtering Learn User Preferences Make recommendations based on User Data Community Data Ex : Amazon.com Slide 5 Contribution of this paper Aims to present recommendations to signed-in users for Google News based on Users click history Click history of the community Generate recommendations using three approaches- collaborative filtering using MinHash Clustering Probabilistic Latent Semantic Indexing (PLSI) Covisitation counts Slide 6 Problem Statement Given click history for N users and M items and given a specific user u with click history C u consisting of stories, recommend K stories to the user, which he/she may be interested in reading. Slide 7 Challenges Scalability has several million users visiting it and several million items Large number of item churns Most systems including Amazon assume amount of churn static or minimal For Google News large amount of churn. Stories of interest are the ones that appeared in the last couple of hours Strict Timing Requirements Strict Response time requirement less than a second Must also take into account time for Generating news story clusters by accessing various indexes storing the content Generating the HTML content as a response for HTTP request Hence only a few milliseconds left Slide 8 Challenges ( Contd..) Treating click as positive vote More noisy than giving 1-5 ratings (like Amazon.com ) Clicks do not say anything about the users negative interest Slide 9 Related Work Recommender Systems Content-based Filtering Items similar to the ones rated highly by the user recommended Collaborative Filtering Two types Memory Based Rating predictions based on users past rating Weighted average of the ratings given by other users ( weight proportional to similarity between users) Challenge Make it more scalable Slide 10 Related Work (Contd..) Model-based Model the users based on past ratings Use the models to predict ratings on unseen items Shortcomings of earlier work in Model-based oCategorizes each user into only one class oUser may have different tastes for different topics Slide 11 Work Done Mix of Memory Based and Model Based algorithms used. Model Based MinHash and PLSI ( Probabilistic Latent Semantic Indexing) Memory Based Covisitation Each algorithm - assigns a numeric score to a story Scores combined obtain a ranked list of stories Slide 12 MinHash Probabilistic Clustering technique. Assigns pair of user to the same cluster with probability proportional to the overlap between the set of items users voted for Similarity between users u i and u j represented as Doing this in real-time not scalable Slide 13 MinHash LSI ( Locality Sensitive Hashing) Randomly permute set of items S. For each user u i, compute hash value h( u i ) index of the first item under the permutation that belongs to users item set. Probability that two users will have same hash function is equal to similarity between users. Each hash bucket corresponds to a cluster, which puts two users in the same cluster with similarity S(u i, u j ) Can concatenate more hash keys- underlying clusters more refined and average similarity of users within clusters more. Has high precision but low recall Slide 14 MinHash Improve recall by repeating above step in parallel multiple times Generating random permutations over millions of items not feasible Generate set of independent random seed values, one for each MinHash function- map each news story to a hash-value computed using Id of news story, seed value MapReduce used Handles large amounts of data in short period of time and makes it more scalable Slide 15 PLSI (Probabilistic Latent Semantic Indexing) Models users and items as random variables taking values from space of all possible users and items Relationship learned by modeling as joint distribution of users and items Hidden variable Z captures this relationship where |Z|=L Slide 16 PLSI Mapreducing EM algorithm EM algorithm used to learn maximum likelihood parameters of the model. When the number of users and items are very large of the order millions, memory requirements for the Conditional Probability Distributions is huge. Solution : MapReduce Slide 17 PLSI Map-reduce Framework Slide 18 PLSI R x K grid Users and items sharded into R and K groups respectively Click data corresponding to the (u,s) pair sent to (i,j) th machine u belongs to ith shard and s belongs to jth shard Each machine has to load only (1/R)th of user CPDs and (1/K)th of item CPDs Reducer shard receives key-value pair and computes the values for the next iteration Problems with PLSI : If new user/items added whole model needs to be retrained Even after parallelizing, it is not real time Slide 19 Covisitation Event in which two stories are clicked by the same user within certain time interval Graph Nodes represent items Weighted edges number of Covisitation instances Graph maintained as an adjacency list in Bigtable keyed by item-id (Bigtable maintains data in lexicographic order by row key) Slide 20 Covisitation Fetch every item in users click history, limited to past few hours For every item s i, Lookup for the pair s i, s in the adjacency list for s i stored in the Bigtable. Add the value stored in the entry to the recommendation score normalized by the sum of all entries. Covisitation scores normalized between 0 and 1 by linear scaling Slide 21 System setup Slide 22 System setup (contd..) Two types of requests handled Recommend Request Update Statistics Request Slide 23 Evaluation Precision-recall curves for the Movie Lens Dataset Slide 24 Evaluation Precision-recall curves for the GoogleNews Dataset Slide 25 Evaluation Live traffic click ratios for different algorithms with baseline as Popular algorithm Slide 26 Evaluation Live traffic click ratios for comparing PLSI and MinHash algorithms Slide 27 Conclusion The paper present the algorithms behind scalable real time recommendation engines Novel approaches to clustering over dynamic datasets using MinHash and PLSI with MapReduce Framework for scalability implemented Evaluation on live traffic was done over a large fraction of Google News traffic over a period of several days Slide 28 Future Work Suitable learning techniques to determine how to combine the scores from different algorithms Exploring cost-benefit tradeoffs of using higher order covisitation statistics Slide 29 Questions?