EDHREC @ Data Science MD
EDHREC, Magic: TG Recommendation Engine
(and data science on games)
Donald Miner @donaldpminer
[email protected] 21st, 2015 - Data Science MD Meetup
Games & Stuff in Glen Burnie, MD
About Don
About Don, Planeswalker
Talk agenda
• Background
• EDHREC Overview
• EDHREC Data Analysis
• EDHREC Architecture
• Data Science Application UX Lessons Learned
• Related Work in Magic and Other Domains
• Virtues of Data Science on Games
Magic: The Gathering
• Trading card game
• First published in 1993
• 20 million players in 2015 (World of Warcraft has 7.1 million subscribers)
• Organized tournaments
• Secondary market
1993$27,000
Elder Dragon Highlander / Commander
• One of the Magic “formats”
• Started independently from WOTC in the late 00’s
• Officially supported starting 2011
• Typically multiplayer
• 100-card singleton deck (instead of 60-card, up to 4x copies)
• Each deck has a single “commander” (unique to this format)
Data Science
• Term coined around 2008
• Represents a shift in data analysis in industry
• A mix of computer science, machine learning, statistics, programming, visualization, and domain knowledge
EDHREC Overview
EDHREC Deck Recommendations
EDHREC Commander Stats
EDHREC Card Stats
EDHREC Recommendation Engine
EDHREC Algorithm 1.0
User-based Collaborative Filtering
Image from http://blog.comsysto.com/2013/04/03/background-of-collaborative-filtering-with-mahout/
Analogy:
• Deck -> User
• Card -> Item
Pros:
• Better at picking up bigger themes in decks
• Easy to implement
Cons:
• Had issues discovering subtle deck themes
• Had issues pointing out combos
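A minimal sketch of the user-based approach under the deck -> user, card -> item analogy. The toy decks, the function names, and the use of Jaccard overlap as the deck-to-deck similarity are assumptions for illustration; the slides do not say which similarity measure version 1.0 used.

```python
from collections import defaultdict

def jaccard(a, b):
    """Overlap between two card sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_user_based(target, decks, k=2, top_n=3):
    """Recommend cards found in the k decks most similar to `target`."""
    # Rank every other deck ("user") by its similarity to the target deck.
    neighbors = sorted(
        ((jaccard(target, deck), deck) for deck in decks if deck != target),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Each neighbor votes for the cards the target lacks, weighted by similarity.
    scores = defaultdict(float)
    for similarity, deck in neighbors:
        for card in deck - target:
            scores[card] += similarity
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [card for card, _ in ranked[:top_n]]

# Hypothetical toy data: each deck is a set of card names.
decks = [
    {"Sol Ring", "Sanguine Bond", "Exquisite Blood", "Swamp"},
    {"Sol Ring", "Sanguine Bond", "Swamp", "Vampire Nighthawk"},
    {"Sol Ring", "Island", "Counterspell"},
]
my_deck = {"Sol Ring", "Sanguine Bond", "Swamp"}
suggestions = recommend_user_based(my_deck, decks)
print(suggestions)
```

Because recommendations come from whole neighboring decks, broad themes carry over easily, which fits the pros listed above; a two-card combo that appears in only a few neighbors gets little weight.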
Recommendation Engine 2.0 Algorithm
31,000 decks
Step 1: Card Affinity Matrix
Jaccard / Tanimoto coefficient:
(Decks that contain Sanguine Bond AND Exquisite Blood) ÷ (Decks that contain Sanguine Bond OR Exquisite Blood)
Repeat for every card combination (15,000 cards)
This is the basis of the Card Analysis page
This matrix is built offline in batch
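The ratio above can be sketched for every pair of cards at once. This is an illustrative version, not EDHREC's actual batch job: the toy decks and the `card_affinity` name are assumptions, and each deck is treated as a set of card names (which suits a singleton format).

```python
from collections import defaultdict
from itertools import combinations

def card_affinity(decks):
    """Return {(card_a, card_b): Tanimoto coefficient} for cards that share a deck."""
    deck_count = defaultdict(int)  # number of decks containing each card
    pair_count = defaultdict(int)  # number of decks containing both cards of a pair
    for deck in decks:
        for card in deck:
            deck_count[card] += 1
        # Sorting gives each pair one canonical (alphabetical) key.
        for a, b in combinations(sorted(deck), 2):
            pair_count[(a, b)] += 1
    affinity = {}
    for (a, b), both in pair_count.items():
        either = deck_count[a] + deck_count[b] - both  # decks with a OR b
        affinity[(a, b)] = both / either
    return affinity

# Hypothetical toy data standing in for the real deck collection.
decks = [
    {"Sanguine Bond", "Exquisite Blood", "Sol Ring"},
    {"Sanguine Bond", "Exquisite Blood"},
    {"Sanguine Bond", "Sol Ring"},
    {"Sol Ring", "Counterspell"},
]
affinity = card_affinity(decks)
print(affinity[("Exquisite Blood", "Sanguine Bond")])
```

Pairs that never appear in the same deck are simply absent from the result, which keeps the matrix sparse; their coefficient is implicitly zero.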
Recommendation Engine 2.0 Algorithm
31,000 decks
Step 2: Calculate Scores
1. Select each row of the Tanimoto matrix corresponding to cards in Deck D
2. Sum the columns
3. Sort by score, display results
This gives you a sum of the Tanimoto coefficients
I really have no idea what this algorithm is called… I’m not sure if it’s novel or not
This is performed in real time
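The three steps can be sketched as follows. The nested-dict layout of the matrix, the toy coefficients, and the choice to skip cards already in the deck are assumptions for illustration; the slides do not describe how the real matrix is stored.

```python
from collections import defaultdict

def score_candidates(deck, affinity, top_n=5):
    """Sum the affinity rows for the deck's cards and rank the cards not in the deck."""
    scores = defaultdict(float)
    for card in deck:  # 1. select the row for each card in Deck D
        for other, coefficient in affinity.get(card, {}).items():
            if other not in deck:  # assumption: do not recommend cards already in the deck
                scores[other] += coefficient  # 2. sum the columns
    # 3. sort by score (ties broken alphabetically)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

# Hypothetical precomputed matrix stored as affinity[row][column].
affinity = {
    "Sanguine Bond": {"Exquisite Blood": 0.6, "Sol Ring": 0.3, "Counterspell": 0.0},
    "Sol Ring": {"Sanguine Bond": 0.3, "Exquisite Blood": 0.2, "Counterspell": 0.25},
    "Swamp": {"Exquisite Blood": 0.4},
}
deck = {"Sanguine Bond", "Sol Ring", "Swamp"}
results = score_candidates(deck, affinity)
for card, score in results:
    print(card, round(score, 3))
```

Since the expensive pairwise work is already done offline, the per-request cost is only a handful of row lookups and additions, which is consistent with running this step in real time.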
Lessons learned: Taking out the garbage
• A lot of garbage gets submitted to EDHREC: decks with <20 cards, decks with invalid commanders, decks with illegal cards
• The algorithms handle this well and rarely do problem cards show up
• However, pruning “worthless” decks significantly improves performance due to all the O(N^2) algorithms going on
• General advice: Think about which pieces of data are worthless in your data set
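A minimal sketch of the three garbage checks named above, applied before any batch computation. The deck schema (`commander` and `cards` keys), the placeholder card names, and the lookup sets are hypothetical; only the 20-card threshold comes from the slide.

```python
MIN_DECK_SIZE = 20  # from the "<20 cards" rule on the slide

def is_usable(deck, legal_commanders, legal_cards):
    """Return True if a submitted deck passes all three garbage checks."""
    if len(deck["cards"]) < MIN_DECK_SIZE:
        return False  # too small to say anything useful
    if deck["commander"] not in legal_commanders:
        return False  # invalid commander
    return all(card in legal_cards for card in deck["cards"])  # no illegal cards

# Hypothetical reference data and submissions with placeholder names.
legal_cards = {f"Card {i}" for i in range(30)}
legal_commanders = {"Commander A"}
full_list = [f"Card {i}" for i in range(25)]
submissions = [
    {"commander": "Commander A", "cards": full_list},                          # fine
    {"commander": "Commander A", "cards": ["Card 1", "Card 2"]},               # too small
    {"commander": "Not A Commander", "cards": full_list},                      # invalid commander
    {"commander": "Commander A", "cards": full_list[:-1] + ["Illegal Card"]},  # illegal card
]
clean = [deck for deck in submissions if is_usable(deck, legal_commanders, legal_cards)]
print(len(submissions), "submitted,", len(clean), "kept")
```

With pairwise algorithms, every deck dropped here shrinks the work done downstream, so a cheap linear filter pays for itself quickly.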
Lessons learned: Partitioning (too much or too little)
• Partitioning the user/deck space into subgroups is a great way to speed things up in recommendation engines
• The 31,000 EDHREC decks are partitioned into 27 partitions (one per possible color combination)
• Algorithms are typically run on a single partition (e.g., Red/Blue deck recommendations only come from other Red/Blue decks)
• However, themes that span color combinations suffer worse recommendations
• However, partitioning too deep causes problems
• I tried partitioning by commander, and that was awful: new commanders and themes that span commanders suffer
• General advice: There is no good way to figure out a partition scheme, just try it out
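A minimal sketch of partitioning decks by color combination, so that later steps only ever look at one bucket. The deck schema, the single-letter color codes, and the helper names are assumptions for illustration.

```python
from collections import defaultdict

def color_key(colors):
    """Canonical partition key in WUBRG order, e.g. {'R', 'U'} -> 'UR'."""
    return "".join(c for c in "WUBRG" if c in colors)

def partition_decks(decks):
    """Group decks by the color combination of their commander."""
    partitions = defaultdict(list)
    for deck in decks:
        partitions[color_key(deck["colors"])].append(deck)
    return partitions

# Hypothetical decks; only the color combination matters here.
decks = [
    {"name": "deck-1", "colors": {"R", "U"}},
    {"name": "deck-2", "colors": {"U", "R"}},
    {"name": "deck-3", "colors": {"B"}},
]
partitions = partition_decks(decks)
# Recommendations for a Red/Blue deck would draw on this bucket only.
red_blue_peers = partitions[color_key({"R", "U"})]
print([deck["name"] for deck in red_blue_peers])
```

The trade-off described above is visible in the structure: a theme shared by `deck-1` and `deck-3` can never be discovered, because the two decks never meet in the same bucket.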
EDHREC Architecture
(Architecture diagram. Components: Batch Processes (cron); Reddit Bot (praw); New Decks; Pre-calculated Stats; All Decks)
Redis
• In-memory key/value data store
• Stores website state
• Utilized as a cache
• Stores all of the decks
• Stores all of the pre-computed stats
• Stores all metadata about Magic cards
• EDHREC serializes most things to common internal json data formats
• Very fast
• Very easy to use
• Good support with Python
• Getting harder to do “analysis”
• Going to move to Redshift SQL database for analytical things
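A minimal sketch of the "serialize to JSON, store by key" pattern described above. The `deck:<id>` key scheme, the deck fields, and the helper names are hypothetical. `InMemoryStore` is a stand-in with the same `get`/`set` shape as a redis-py client, so the example runs without a Redis server.

```python
import json

class InMemoryStore:
    """Dict-backed stand-in exposing get/set like a key/value client."""
    def __init__(self):
        self._data = {}

    def set(self, name, value):
        self._data[name] = value
        return True

    def get(self, name):
        return self._data.get(name)  # None when the key is missing

def save_deck(store, deck_id, deck):
    """Serialize a deck to JSON and store it under a hypothetical deck:<id> key."""
    store.set(f"deck:{deck_id}", json.dumps(deck, sort_keys=True))

def load_deck(store, deck_id):
    """Fetch and deserialize a deck, or return None if it is not stored."""
    raw = store.get(f"deck:{deck_id}")
    return json.loads(raw) if raw is not None else None

store = InMemoryStore()  # with a real server, a redis-py client could be passed instead
deck = {"commander": "Example Commander", "cards": ["Sol Ring", "Swamp"]}
save_deck(store, "abc123", deck)
print(load_deck(store, "abc123"))
```

Keeping one JSON shape per object type means the web layer and the batch jobs can share the same load and save helpers, at the cost of making ad hoc analytical queries awkward, which matches the last two bullets.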
CherryPy
• “A Minimalist Python Web Framework”
• Runs the website
• Pulls data from Redis and then renders the results as HTML
• Most of the data from Redis is cached in memory objects (IPC to Redis too slow)
• EDHREC runs 6 of these in parallel behind an NGINX round robin proxy
• Very easy to use, doesn’t get in your way
• Very easy to expose Python data science
• Running into problems with maintainability due to my own sloppiness
Python
• Programming language
• Plenty of good libraries for data analysis: numpy, pandas in this case
• Can handle the “full stack” well (from data analysis to web front end)
• PRAW is a great framework for building Reddit bots
• Most things run every few hours
Amazon Web Services
• Infrastructure as a Service
• Easily spin up new servers with pre-built operating system
• EDHREC runs on one m4.2xlarge: 8 CPUs, 32GB RAM, better network, 10 cents per hour ($72/month)
• Great for recovering from failures
• Easy to upgrade machine
• Very good uptime so far
• Easy to back up to S3
Some observations about User Experience and AI applications
LOL! Look at the dumb bot!
Lesson learned: Humans LOVE pointing out when something the AI is doing is strange or wrong, even if it gets it right 90% of the time. Therefore, I am very conservative about what I end up publishing, as I’ve gotten burned a few times. Which can be a shame sometimes.
(just a couple examples)
The apocalypse is near
• “EDHREC is ruining EDH/Commander”
• “EDHREC is taking the fun out of deck construction”
• “EDHREC kills conversation”
MapQuest takes the fun out of planning trips!
• Mostly these are taken as compliments
• AI is going to have resistance from people who liked the manual labor
• I don’t think the commentary is entirely off base… but...
Sometimes too much is too much
• Over-engineering and doing too much is an easy trap
• You want to make it better and provide more “intelligence”
• Give the users the ability to discover and find things: increases user engagement, better results
• Philosophy: EDHREC is a tool, not a solution
• I’m starting to see my other data science projects this way
Lesson learned: Spend more time on interactive “discovery tools” than intelligent do-everything algorithms
Interesting related things to look at
RoboRosewater
• Rosewater is the name of the Magic lead designer
• RoboRosewater is a “backwards” neural network, trained on Magic cards
MTG Finance
Lots of analysis around Magic finance!
mtgstocks.com
Diablo 3 build clustering
Virtues of this whole thing
Community
• Most hobbies are defined by communities
• Technology can bring communities together
Self-Development
• Data has value and getting data of value is hard
• Hobby-based data is relatively easy to acquire (compared to, say, data used by health care companies)
• A great way to do real data science on real data (as opposed to synthetic data on a more valuable data set)
Profit!
• Hobbyists are passionate about their hobby and willing to spend money on it
• They will pay for and support services they like