recommender system with hadoop and spark wu, yongkai zhang, fengli

Recommender System with Hadoop and Spark

Wu, YongkaiZhang, Fengli

Hadoop

• Basically parallel processing framework running on cluster of commodity machines• Stateless functional programming because processing of each row of

data does not depend upon any other row or state• Data is replicated, partitioned and resides in Hadoop Distributed File

System (HDFS)• Each partitions get processed by a separated mapper or reducer task

Item-based Collaborative Filtering

• the similarities between different items in the dataset are calculated according to the items’ properties • these similarity values are used to recommend items to users• One problems? If a lot of users like item A and B, but A and B have no similarity such as beer and baby diapers?

Item-Based Recommender using Co-occurrence similarity

Look at User 1 and 2 who havewatched Movie 101, you will seethat there is 2 cooccurrence of Movie 102

Implement of the recommender system• Built history matrix — contains the interactions between users and

items as a user-by-item matrix• Built co-occurrence matrix — transforms the history matrix into an

item-by-item matrix, recording which items appear together in user histories• Do recommendation

HISTORY MATRIX

Map (key, value)key.set (User ID)Value.set (ItemID+“:” + Rating)

<1, (101: 5.0)><1, (102:3.0)>

Combine(key,list( value))

<1, ((101: 5.0), (102:3.0))>

Reduce

Co-occurrence matrix

• transforms the history matrix into an item-by-item matrix, recording which items appear together in user histories

Indicator Matrix

Validation

• Select 20% users from the data set, and assume 20% movies have not been watched

Mahout

• Mahout is a machine-learning library with tools for clustering, classification, and several types of recommenders, including tools to calculate most-similar items or build item recommendations for users. Mahout employs the Hadoop framework to distribute calculations across a cluster• Command “mahout recommenditembesed --input /movielens.txt

--output recommendations --numRecommendations 10 -- outputPathForSimilarityMatrix similarity-matrix

--similarityClassname SIMILARITY_COOCCURRENCE”

http://mahout.apache.org/users/basics/algorithms.html

12

User-based CF RS with Spark

• Task: to implement a recommendation system using Apache Spark framework• Approach: • user-based collaborative filtering• Similarity metric

• Dataset: MovieLens

13

Spark

14

Spark operators

15

Word Count Example

19

Very easy approach, but…

20

Memory-based Approach

Movielens :1. 10 million ratings applied to 10,000 movies by 72,000 users2. The size of rating matrix: 10 k * 72 k = 720 M

21

memory-based model in distributed platform• Sparse matrix

Alice [(S, 1.5)]

.

.

.

.

Bob [(S, 4.5), (I, 3.5), (C,2) ]

.

.

.

.

Eve [ (I, 1), (C, 2) ]

.

.

.

.

Node 1

Node 2

Node n

22

Similarity metric

loadFile.map().groupByKey()

23

map().groupByKey()

24

map(calcSim()).map().groupByKey()

25

Load history record

loadFile.map().groupByKey()

26

Neighbor recommendation

User [(movie,rating),(movie,rating)]122 [(1,1),(5,1),(22,-1)…..]185 [(1,1),(7,1),(52,-1)…..]

.

.

.

.

User1 [(user2,similarity2),(user3,similarity3)…..]....

Historical data(broadcast)

Similarity model

recommender system with hadoop and spark wu, yongkai zhang, fengli

Documents

groupbykey slide

recommendation slide

cooccurrence slide

fengli slide

indicator matrix slide

user histories slide

item matrix

reducer task slide