Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark Evan Casey Taptech - 6/6/2014

Upload: evancasey

Post on 26-Jan-2015


DESCRIPTION

Presentation on scalable collaborative filtering algorithms on Apache Spark, given at the Tapad Taptalk on 6/6/2014

TRANSCRIPT

Page 1: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Evan Casey
Taptech - 6/6/2014

Page 2: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

Overview
● Apache Spark
○ Dataflow model
○ Spark vs Hadoop MapReduce
● Recommender Systems
○ Similarity-based collaborative filtering
○ Distributed implementation on Apache Spark
○ Lessons learned

Page 3: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

Apache Spark
● Distributed data-processing framework, commonly deployed on top of HDFS
● Use cases:
○ Interactive analytics
○ Graph algorithms
○ Stream processing
○ Scalable ML
○ Recommendation engines!

Page 4: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

Spark vs Hadoop MapReduce
● In-memory data flow model optimized for multi-stage jobs
● Novel approach to fault tolerance
● Similar programming style to Scalding/Cascading

Page 5: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

Programming Model
● Resilient Distributed Dataset (RDD)
○ textFile, parallelize
● Parallel Operations
○ Map, GroupBy, Filter, Join, etc.
● Optimizations
○ Caching, shared variables
● Demo
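The shape of the parallel operations listed on this slide can be seen without a cluster. The following is a minimal plain-Python analogue, not Spark code: nothing here is distributed or lazy, and the (user, item, rating) triples are hypothetical sample data. Each step is named after the operation it mirrors.

```python
from collections import defaultdict

# Hypothetical sample data: (user, item, rating) triples.
# In Spark this list would be turned into an RDD (e.g. via parallelize).
ratings = [
    ("shawn", "a", 4), ("shawn", "b", 3),
    ("billy", "a", 2), ("billy", "c", 4),
    ("mary", "b", 1), ("mary", "c", 5),
]

# Filter: keep only ratings of 3 or higher.
liked = [r for r in ratings if r[2] >= 3]

# Map: re-key each record as (item, user).
item_user = [(item, user) for user, item, _ in liked]

# GroupBy: collect the users who liked each item.
users_by_item = defaultdict(list)
for item, user in item_user:
    users_by_item[item].append(user)

print(dict(users_by_item))
```

On a real RDD the same chain would run per partition across the cluster, and an intermediate result that is reused could be cached.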

Page 6: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

What are recommendation algorithms?
● Problem:
○ “Information overload”
○ Diverse user interests
● User-Item Recommendation
○ Recommend content for each user based on a larger training set of user interaction histories

Page 7: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

Motivation
● Large-scale recommender systems
○ Millions of users and items (100M+ ratings)
● Problems:
○ Memory-based approach
○ Scalability/Efficiency
○ User interaction sparsity

Page 8: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

Collaborative Filtering

[Figure: example rating matrix for users Shawn, Billy, and Mary; unknown ratings are marked "?"]

● Similarity-based approach
● Two main variants:
○ User-based
○ Item-based

Page 9: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

User-based Collaborative Filtering

● Step 1: Obtain the user-item matrix, denoted $M_{i,j}$
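Because most users rate only a few items, the matrix is usually stored sparsely. A minimal sketch, assuming the input arrives as (user, item, rating) triples; the function name `build_user_item_matrix` and the sample ratings are hypothetical, not taken from the talk:

```python
from collections import defaultdict

def build_user_item_matrix(triples):
    """Return a sparse matrix as {user: {item: rating}}."""
    matrix = defaultdict(dict)
    for user, item, rating in triples:
        matrix[user][item] = rating
    return dict(matrix)

# Hypothetical sample ratings.
triples = [
    ("shawn", "a", 4), ("shawn", "b", 3),
    ("billy", "a", 2), ("billy", "c", 4),
    ("mary", "b", 1), ("mary", "c", 5),
]

M = build_user_item_matrix(triples)
print(M["shawn"])
```

Missing entries are simply absent from the inner dictionaries, which is what the later steps have to predict.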

Page 10: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

User-based Collaborative Filtering

● Step 2: Calculate the similarity between each pair of users and find each user's top-n nearest neighbors

[Formula: similarity between a pair of users, computed from their rating vectors]
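The slide's formula did not survive in the transcript, so the sketch below assumes plain cosine similarity over sparse rating vectors as one possible choice of measure. The helper names and the sample matrix are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse rating vectors ({item: rating})."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_n_neighbors(matrix, user, n):
    """Return the n most similar other users as (user, similarity) pairs."""
    sims = [
        (other, cosine(matrix[user], matrix[other]))
        for other in matrix
        if other != user
    ]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:n]

# Hypothetical sample matrix.
M = {
    "shawn": {"a": 4, "b": 3},
    "billy": {"a": 2, "c": 4},
    "mary": {"b": 1, "c": 5},
}

print(top_n_neighbors(M, "shawn", 1))
```

Comparing every pair of users is quadratic in the number of users, which is the step a distributed implementation has to spread across the cluster.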

Page 11: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

User-based Collaborative Filtering

● Step 3: Compute the weighted average of the neighbors' ratings and select the top-n items by score

[Formula: recommendation score of an item, computed from the pairwise user similarities, the mean rating, and the co-rated users' ratings]
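The surviving labels (pairwise user similarities, mean rating, co-rated user rating) suggest a mean-centered weighted average. The sketch below assumes that common form: the user's own mean rating plus the similarity-weighted average of each neighbor's deviation from that neighbor's mean. This is an assumption, not the slide's confirmed formula, and the similarity values are hypothetical.

```python
def mean(ratings):
    """Mean of a user's ratings ({item: rating})."""
    return sum(ratings.values()) / len(ratings)

def predict(matrix, sims, user, item):
    """Score `item` for `user`; sims maps neighbor -> similarity to `user`."""
    num = 0.0
    den = 0.0
    for neighbor, s in sims.items():
        if item in matrix[neighbor]:
            num += s * (matrix[neighbor][item] - mean(matrix[neighbor]))
            den += abs(s)
    base = mean(matrix[user])
    return base if den == 0 else base + num / den

def recommend(matrix, sims, user, n):
    """Top-n unseen items for `user`, ranked by predicted score."""
    seen = set(matrix[user])
    candidates = {i for v in sims for i in matrix[v]} - seen
    scored = [(i, predict(matrix, sims, user, i)) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical sample matrix and neighbor similarities for "shawn".
M = {
    "shawn": {"a": 4, "b": 3},
    "billy": {"a": 2, "c": 4},
    "mary": {"b": 1, "c": 5},
}
sims = {"billy": 0.5, "mary": 0.25}

print(recommend(M, sims, "shawn", 1))
```

Subtracting each neighbor's mean keeps a generous rater from inflating every score they contribute to.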

Page 12: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

Results
[Figures: results on a standalone cluster and on an Amazon EC2 cluster]

Page 13: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

Evaluation

Page 14: Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark

Lessons Learned
● Must manually specify the number of tasks
○ Want 2-4 slices for each CPU in your cluster
● Use broadcast variables for shared data and cache for data that will be reused
● Must account for the “power users”
○ Sampling heavy-tailed user-interaction histories
● Need to account for the rating scale of each user!
○ Adjusted cosine similarity and Pearson correlation outperform normal cosine similarity
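Three of these lessons can be illustrated in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, 3 slices per core is an arbitrary midpoint of the 2-4 range, uniform random sampling is only one way to cap a power user's history, and the rating vectors are made-up dense vectors over co-rated items.

```python
import math
import random

def suggested_partitions(total_cores, per_core=3):
    """Rule of thumb from the slide: 2-4 slices per CPU (3 assumed here)."""
    return total_cores * per_core

def cap_history(items, cap, seed=0):
    """Down-sample a power user's history to at most `cap` items."""
    items = list(items)
    if len(items) <= cap:
        return items
    return random.Random(seed).sample(items, cap)

def cosine(xs, ys):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = math.sqrt(sum(x * x for x in xs)) * math.sqrt(sum(y * y for y in ys))
    return 0.0 if norm == 0 else dot / norm

def pearson(xs, ys):
    """Pearson correlation: cosine similarity after mean-centering."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return cosine([x - mx for x in xs], [y - my for y in ys])

# Two users with opposite tastes on different rating scales.
harsh = [1, 2, 3]
generous = [5, 4, 3]

print(suggested_partitions(8))
print(len(cap_history(range(100), 10)))
print(cosine(harsh, generous), pearson(harsh, generous))
```

Here plain cosine reports a high similarity because all ratings are positive, while Pearson correctly reports that the two users disagree, which is the effect of accounting for each user's rating scale.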