AWS, HADOOP AND MAHOUT – VIDEO GAME RECOMMENDER
BEN GOODING
UNIVERSITY OF ARKANSAS – DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
PRESENTED - APRIL 30, 2015
MAHOUT
• Pronounced to rhyme with "trout"
• Open Source Machine Learning platform from Apache
• Used Mahout 0.9
RECOMMENDER TYPES
• Item-Item Based Recommenders
  • Based on how similar items are to one another
• User Based Recommenders
  • Based on some notion of similarity between users
SIMILARITIES
• Euclidean Distance Similarity
  • 1/(1+d), where d is the distance between two users
• Co-occurrence Similarity
  • Covered in previous presentations
• Tanimoto Coefficient
  • Ignores preference values; only cares whether a user has expressed a preference
• Log-likelihood Similarity
  • Based on the number of items two users have in common, expressed as how unlikely it is that the overlap is due to chance
• Pearson Correlation Similarity
  • A number between -1 and 1 that measures the tendency of two series of paired numbers to move together
  • When the two series move together, the similarity is close to 1; when they move in opposite directions, it is close to -1
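To make three of the measures above concrete, here is a minimal Python sketch. It follows the formulas as stated on the slide, not Mahout's actual implementation, which may differ in details. The function names and the sample users (alice, bob, carol) are made up for illustration; each user is a dict mapping a game ID to a rating.

```python
import math

def euclidean_similarity(a, b):
    """1 / (1 + d), where d is the Euclidean distance over co-rated items."""
    common = set(a) & set(b)
    if not common:
        return float("nan")
    d = math.sqrt(sum((a[i] - b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + d)

def tanimoto_similarity(a, b):
    """Size of the overlap divided by size of the union; ratings are ignored."""
    union = set(a) | set(b)
    if not union:
        return float("nan")
    return len(set(a) & set(b)) / len(union)

def pearson_similarity(a, b):
    """Pearson correlation of the ratings on co-rated items, in [-1, 1]."""
    common = sorted(set(a) & set(b))
    n = len(common)
    if n < 2:
        return float("nan")
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one user rates everything the same
    return cov / math.sqrt(vx * vy)

# Hypothetical sample data
alice = {"g1": 5, "g2": 3, "g3": 1}
bob = {"g1": 4, "g2": 3, "g3": 2, "g4": 5}
carol = {"g1": 1, "g2": 3, "g3": 5}
```

With this data, alice and bob share three of four games (Tanimoto 0.75) and their ratings rise and fall together (Pearson 1), while alice and carol rate in opposite directions (Pearson -1).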
THE DATASET
• 228,570 Users
• 21,025 Games
• 463,669 Reviews
• Dataset contained excess information.
• Stanford provided a Python script to parse the data, but it did not strip out enough.
• Modified Python script to parse out everything except User ID, Product ID, and Review Score
• Eliminated unknown user names
• Used gedit to remove some other excess information
• Wrote a C++ program to convert the User and Product IDs into numerical values
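The last step above was done in C++; the following is only a Python sketch of the same idea, not the original program. It assumes the reviews have already been reduced to (user ID, product ID, score) triples, that unknown reviewers carry the literal name "unknown", and that the target is a comma-separated userID,itemID,preference file with numeric IDs, which is the layout Mahout's file-based data model reads. The function names and sample IDs are hypothetical.

```python
import csv
import io

def assign_numeric_ids(reviews, skip_user="unknown"):
    """Replace string user/product IDs with consecutive integers starting at 1."""
    user_ids, item_ids = {}, {}
    rows = []
    for user, product, score in reviews:
        if user == skip_user:  # assumed marker for reviewers without a name
            continue
        u = user_ids.setdefault(user, len(user_ids) + 1)
        p = item_ids.setdefault(product, len(item_ids) + 1)
        rows.append((u, p, float(score)))
    return rows, user_ids, item_ids

def to_csv(rows):
    """Render rows as userID,itemID,preference lines."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

# Hypothetical sample triples
reviews = [
    ("A1X", "B00A", "5.0"),
    ("unknown", "B00A", "1.0"),
    ("A2Y", "B00A", "3.0"),
    ("A1X", "B00B", "4.0"),
]
rows, user_ids, item_ids = assign_numeric_ids(reviews)
```

Keeping the two dictionaries around makes it possible to map numeric IDs back to the original user and product IDs when displaying recommendations.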
USER BASED NEAREST-N RECOMMENDER EVALUATION
Similarity      n=1  n=2    n=4    n=8    n=16   n=32   n=64   n=128
Euclidean       NaN  0.205  0.284  0.361  0.498  0.542  0.604  0.646
Pearson         NaN  0.799  0.868  0.886  0.878  0.904  0.960  0.989
Log-likelihood  NaN  0.526  0.771  0.769  0.766  0.808  0.784  0.718
Tanimoto        NaN  0.723  0.955  0.826  0.792  0.807  0.822  0.755
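For readers unfamiliar with the nearest-N approach, here is a minimal Python sketch of the idea: keep the n most similar users and predict a rating as the similarity-weighted average of their ratings. This illustrates the technique in general, not Mahout's code; the function names, similarity values, and ratings are made up.

```python
def nearest_n(similarities, n):
    """Return the n users with the highest similarity, best first."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

def predict(neighbors, similarities, ratings, item):
    """Similarity-weighted average of the neighbors' ratings for item."""
    num = den = 0.0
    for user in neighbors:
        if item in ratings.get(user, {}):
            num += similarities[user] * ratings[user][item]
            den += similarities[user]
    # No neighbor rated the item: no estimate is possible
    return num / den if den else float("nan")

# Hypothetical similarities of other users to the target user, and their ratings
sims = {"u2": 0.9, "u3": 0.5, "u4": 0.1}
ratings = {"u2": {"g1": 4}, "u3": {"g1": 2, "g2": 5}, "u4": {"g2": 1}}

top2 = nearest_n(sims, 2)
estimate_g1 = predict(top2, sims, ratings, "g1")
estimate_g2_n1 = predict(nearest_n(sims, 1), sims, ratings, "g2")
```

In this sketch a very small neighborhood often cannot produce an estimate at all: with n=1 the single neighbor u2 never rated g2, so the prediction is NaN. The slides do not say whether this is what lies behind the NaN entries in the n=1 column.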
USER BASED NEIGHBOR THRESHOLD RECOMMENDER EVALUATION
Similarity      t=0.95  t=0.9  t=0.85  t=0.8  t=0.75  t=0.7
Euclidean       0.503   0.503  0.503   0.503  0.503   0.504
Pearson         0.689   0.689  0.665   0.639  0.629   0.703
Log-likelihood  0.801   0.779  0.791   0.796  0.790   0.796
Tanimoto        NaN     NaN    NaN     NaN    NaN     NaN
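A threshold neighborhood differs from nearest-N in that it has no fixed size: every user whose similarity reaches the threshold t is kept. A minimal Python sketch of that selection rule, with a hypothetical function name and made-up similarity values, and with the comparison (at least t) being an assumption:

```python
def threshold_neighborhood(similarities, t):
    """Return every user whose similarity is at least t, sorted by user ID."""
    # A NaN similarity compares as False and is therefore left out
    return sorted(user for user, s in similarities.items() if s >= t)

# Hypothetical similarities of other users to the target user
sims = {"u2": 0.97, "u3": 0.88, "u4": 0.72, "u5": float("nan")}

strict = threshold_neighborhood(sims, 0.95)
loose = threshold_neighborhood(sims, 0.7)
```

Lowering t can only grow the neighborhood, and a high t may leave it empty, in which case no recommendation can be made for that user.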
ITEM BASED RECOMMENDER EVALUATION
Similarity      Score
Euclidean       0.786
Pearson         0.944
Log-likelihood  0.789
Tanimoto        0.783
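The slides do not name the metric behind the evaluation scores. Purely as an assumption, one common way to score a recommender is to hold out some known ratings, predict them, and report the mean absolute difference between prediction and truth, where lower is better. A minimal Python sketch of that measure, with a hypothetical function name and made-up numbers:

```python
def mean_absolute_difference(predicted, actual):
    """Average |prediction - actual| over the items that could be estimated."""
    # p == p is False only for NaN, so unestimable items are skipped
    pairs = [(p, a) for p, a in zip(predicted, actual) if p == p]
    if not pairs:
        return float("nan")  # nothing could be estimated at all
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

# Hypothetical held-out ratings and the recommender's estimates for them
predicted = [4.5, 3.0, float("nan"), 2.0]
actual = [5.0, 3.0, 4.0, 1.0]

score = mean_absolute_difference(predicted, actual)
```

Under this assumed metric the whole evaluation comes out as NaN when no held-out rating can be estimated.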
HADOOP
• Distributed File System
• Difficult to set up without an easy-to-understand tutorial
• Got it working on my virtual machine
• Couldn't get Mahout to work with Hadoop as a single-node cluster
  • Failed with a Java class-not-found exception
AMAZON WEB SERVICES
• Provides Elastic MapReduce (EMR) clusters
• Clusters come pre-installed with Mahout and Hadoop
• Used 1 master node and 3 slave nodes
• Utilized the AWS Command Line Interface
AWS RECOMMENDER
• Took roughly 10-20 minutes to produce all of the recommendations.
• Used the item based recommender
  • Mahout has no distributed version of the Generic User Based Recommender
• Generated recommendations for the users
• Utilized a Python based web server to display recommendations
  • Enter a user ID and the server returns that user's recommendations
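The slides do not show the web server itself, so the following is only a sketch of how such a lookup could work using Python's standard library. It assumes the recommendations are stored one user per line as userID, a tab, then [itemID:score,itemID:score,...]; the ?user= query parameter, the port, and all function names are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def parse_recommendations(lines):
    """Parse 'userID<TAB>[itemID:score,...]' lines into {user: [(item, score)]}."""
    recs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, items = line.split("\t")
        pairs = [p.split(":") for p in items.strip("[]").split(",") if p]
        recs[int(user)] = [(int(i), float(s)) for i, s in pairs]
    return recs

def render(recs, user_id):
    """Plain-text listing of one user's recommendations."""
    items = recs.get(user_id)
    if not items:
        return "No recommendations for user %d\n" % user_id
    return "".join("game %d (score %.2f)\n" % (i, s) for i, s in items)

def make_handler(recs):
    """Build a request handler that answers GET /?user=<numeric id>."""
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            query = parse_qs(urlparse(self.path).query)
            try:
                user_id = int(query["user"][0])
            except (KeyError, ValueError):
                self.send_error(400, "expected ?user=<numeric id>")
                return
            body = render(recs, user_id).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
    return Handler

def serve(recs, port=8000):
    """Serve lookups on localhost until interrupted."""
    HTTPServer(("localhost", port), make_handler(recs)).serve_forever()

# Hypothetical sample of recommender output
sample = parse_recommendations(["1\t[12:4.5,7:4.0]", "2\t[3:5.0]"])
```

Calling serve(sample) and opening http://localhost:8000/?user=1 in a browser would list the two games recommended for user 1.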