Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Evan Casey
Taptech - 6/6/2014

Overview
● Apache Spark
○ Dataflow model
○ Spark vs Hadoop MapReduce
● Recommender Systems
○ Similarity-based collaborative filtering
○ Distributed implementation on Apache Spark
○ Lessons learned

Apache Spark
● Distributed data-processing framework that commonly runs on top of HDFS
● Use cases:
○ Interactive analytics
○ Graph algorithms
○ Stream processing
○ Scalable ML
○ Recommendation engines!

Spark vs Hadoop MapReduce
● In-memory data flow model optimized for multi-stage jobs
● Novel approach to fault tolerance
● Similar programming style to Scalding/Cascading

Programming Model
● Resilient Distributed Dataset (RDD)
○ Textfile, parallelize
● Parallel Operations
○ Map, GroupBy, Filter, Join, etc
● Optimizations
○ Caching, shared variables
● Demo
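
The parallel operations named on this slide can be illustrated without a cluster. The sketch below is plain Python, not Spark code: it mimics what map, filter, groupBy, and join compute on a small in-memory list of (user, item, rating) tuples. The data and the helpers `group_by_key` and `join` are made up for illustration. In Spark the corresponding steps would be RDD methods such as `map`, `filter`, `groupBy`, and `join`, applied to a dataset created with `parallelize` or `textFile`.

```python
from collections import defaultdict

# Hypothetical (user, item, rating) records; in Spark these would come
# from sc.parallelize(...) or sc.textFile(...).
ratings = [
    ("alice", "item1", 4.0),
    ("alice", "item2", 3.0),
    ("bob", "item1", 5.0),
    ("bob", "item3", 1.0),
]
item_titles = [("item1", "Title A"), ("item2", "Title B"), ("item3", "Title C")]


def group_by_key(pairs):
    """Local stand-in for a groupBy on key/value pairs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def join(left, right):
    """Local stand-in for an inner join of two key/value collections."""
    right_groups = group_by_key(right)
    return [(k, (lv, rv)) for k, lv in left for rv in right_groups.get(k, [])]


# map: re-key each record by item.
by_item = [(item, (user, rating)) for user, item, rating in ratings]
# filter: keep only ratings of 3.0 or higher.
liked = [(item, ur) for item, ur in by_item if ur[1] >= 3.0]
# groupBy: collect the users who liked each item.
users_per_item = group_by_key((item, ur[0]) for item, ur in liked)
# join: attach the item title to each remaining rating.
titled = join(liked, item_titles)
```

Each step produces a new collection from the previous one, which is the same chaining style that a multi-stage Spark job uses, only evaluated locally and eagerly here.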

What are recommendation algorithms?
● Problem:
○ “Information overload”
○ Diverse user interests
● User-Item Recommendation
○ Recommend content for each user based on a larger training set of user interaction histories

Motivation
● Large-scale recommender systems
○ Millions of users and items (100m+ ratings)
● Problems:
○ Memory-based approach
○ Scalability/Efficiency
○ User interaction sparsity

Collaborative Filtering

● Similarity based approach
● Two main variants:
○ User-based
○ Item-based

User-based Collaborative Filtering

● Step 1: Obtain the user-item matrix, denoted M_{i,j}

● Step 2: Calculate the pairwise similarity between users and compute the top-n nearest neighbors
[Equation: similarity of pairwise users, computed from their rating vectors]

● Step 3: Compute the weighted average of the neighbors' ratings and find the top-n items with the highest score
[Equation: recommendation score of item, computed from the pairwise user similarities, the mean rating, and the co-rated user rating]
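
The three steps can be sketched on a single machine as follows. The slide's formulas are not preserved in this text, so the sketch assumes a common formulation: cosine similarity over co-rated items for Step 2, and a similarity-weighted average of the neighbors' mean-centered ratings for Step 3. The matrix `M`, the user and item names, and all function names are hypothetical.

```python
import math

# Hypothetical user-item matrix M as nested dicts: M[user][item] = rating.
M = {
    "u1": {"a": 5.0, "b": 3.0, "c": 4.0},
    "u2": {"a": 4.0, "b": 2.0, "c": 5.0, "d": 4.0},
    "u3": {"a": 1.0, "b": 5.0, "d": 2.0},
}


def cosine(u, v):
    """Cosine similarity of two users over the items both have rated."""
    common = set(M[u]) & set(M[v])
    if not common:
        return 0.0
    dot = sum(M[u][i] * M[v][i] for i in common)
    norm_u = math.sqrt(sum(M[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(M[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def top_neighbors(u, n):
    """Step 2: the n users most similar to u, as (similarity, user) pairs."""
    sims = [(cosine(u, v), v) for v in M if v != u]
    return sorted(sims, reverse=True)[:n]


def mean_rating(u):
    """Average of all ratings given by user u."""
    return sum(M[u].values()) / len(M[u])


def recommend(u, n_neighbors=2, n_items=1):
    """Step 3: score unseen items by a similarity-weighted average of the
    neighbors' mean-centered ratings, then return the top-scoring items."""
    neighbors = top_neighbors(u, n_neighbors)
    candidates = {i for _, v in neighbors for i in M[v]} - set(M[u])
    scores = {}
    for item in candidates:
        rated = [(s, v) for s, v in neighbors if item in M[v]]
        weight = sum(abs(s) for s, _ in rated)
        if weight == 0:
            continue
        offset = sum(s * (M[v][item] - mean_rating(v)) for s, v in rated)
        scores[item] = mean_rating(u) + offset / weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n_items]
```

Calling `recommend("u1")` scores item `d`, the only item the neighbors have rated that `u1` has not. A distributed version would need to compute the pairwise similarities in parallel rather than in the nested loops used here.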

Results
[Figure panels: Standalone Cluster; Amazon EC2 Cluster]

Evaluation

Lessons Learned
● Must manually specify number of tasks
○ Want 2-4 slices for each CPU in your cluster
● Use broadcast variables for shared data and cache for data that will be reused
● Must account for the “power users”
○ Sampling heavy tailed user-interaction histories
● Need to account for the rating scale of each user!
○ Adjusted cosine similarity and Pearson correlation outperform normal cosine similarity
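
The last lesson can be made concrete with a small example. The sketch below compares plain cosine similarity with Pearson correlation, which is cosine similarity computed after subtracting each user's own mean rating (adjusted cosine applies the same per-user centering). The three rating vectors are invented for illustration and are not taken from the talk's data.

```python
import math


def cosine(x, y):
    """Plain cosine similarity of two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation: cosine similarity after subtracting each
    user's own mean rating."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])


# Hypothetical ratings of the same four items by three users.
generous = [5.0, 4.0, 5.0, 4.0]   # rates everything high
harsh = [2.0, 1.0, 2.0, 1.0]      # same preferences, lower scale
contrary = [4.0, 5.0, 4.0, 5.0]   # opposite preferences, high scale

cos_harsh, cos_contrary = cosine(generous, harsh), cosine(generous, contrary)
r_harsh, r_contrary = pearson(generous, harsh), pearson(generous, contrary)
```

Plain cosine rates both pairs as nearly identical because all ratings are positive, while the mean-centered measure separates the user with matching preferences from the one with opposite preferences. Note that this `pearson` is undefined for a user whose ratings are all equal, since the centered vector is then all zeros.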
