edhrec @ data science md

Upload: donald-miner

Post on 15-Feb-2017

TRANSCRIPT

Page 1: EDHREC @ Data Science MD

EDHREC, Magic: TG Recommendation Engine

(and data science on games)

Donald Miner @donaldpminer

[email protected] 21st, 2015 - Data Science MD Meetup

Games & Stuff in Glen Burnie, MD

Page 2: EDHREC @ Data Science MD

About Don

Page 3: EDHREC @ Data Science MD

About Don, Planeswalker

Page 4: EDHREC @ Data Science MD

Talk agenda
• Background
• EDHREC Overview
• EDHREC Data Analysis
• EDHREC Architecture
• Data Science Application UX Lessons Learned
• Related Work in Magic and Other Domains
• Virtues of Data Science on Games

Page 5: EDHREC @ Data Science MD

Magic: The Gathering
• Trading card game
• First published in 1993
• 20 million players in 2015 (World of Warcraft has 7.1 million subscribers)
• Organized tournaments
• Secondary market

(image labels: 1993, $27,000)

Page 6: EDHREC @ Data Science MD

Elder Dragon Highlander / Commander

• One of the Magic “formats”
• Started independently from WOTC in the late 2000s
• Officially supported starting 2011
• Typically multiplayer
• 100-card singleton deck (instead of 60-card, up to 4x copies)
• Each deck has a single “commander” (unique to this format)

Page 7: EDHREC @ Data Science MD

Data Science
• Term coined around 2008
• Represents a shift in data analysis in industry
• A mix of computer science, machine learning, statistics, programming, visualization, and domain knowledge

Page 8: EDHREC @ Data Science MD

EDHREC Overview

Page 9: EDHREC @ Data Science MD
Page 10: EDHREC @ Data Science MD

EDHREC Deck Recommendations

Page 11: EDHREC @ Data Science MD

EDHREC Commander Stats

Page 12: EDHREC @ Data Science MD

EDHREC Card Stats

Page 13: EDHREC @ Data Science MD

EDHREC Recommendation Engine

Page 14: EDHREC @ Data Science MD

EDHREC Algorithm 1.0: User-based Collaborative Filtering

Image from http://blog.comsysto.com/2013/04/03/background-of-collaborative-filtering-with-mahout/

Analogy:
• Deck -> User
• Card -> Item

Pros:
• Better at picking up bigger themes in decks
• Easy to implement

Cons:
• Had issues discovering subtle deck themes
• Had issues pointing out combos
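A minimal sketch of the deck-as-user, card-as-item analogy, assuming decks are plain sets of card names and that deck similarity is measured by set overlap. The slides do not say which similarity measure or neighborhood size Algorithm 1.0 used, so `jaccard`, `recommend_user_based`, `k`, and the toy decks are all illustrative assumptions.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap between two decks treated as sets of card names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(deck, all_decks, k=2, top_n=3):
    """Recommend cards that appear in the k decks most similar to `deck`.

    Hypothetical helper: each deck plays the role of a "user" and each
    card the role of an "item".
    """
    deck = set(deck)
    # Rank the known decks by how much they overlap with the input deck.
    neighbors = sorted(all_decks, key=lambda d: jaccard(deck, d), reverse=True)[:k]
    scores = Counter()
    for other in neighbors:
        weight = jaccard(deck, other)
        # Cards the neighbor runs that the input deck does not run yet.
        for card in set(other) - deck:
            scores[card] += weight
    return [card for card, _ in scores.most_common(top_n)]


# Toy data, made up for illustration only.
decks = [
    {"Sanguine Bond", "Exquisite Blood", "Sol Ring"},
    {"Sanguine Bond", "Exquisite Blood", "Vampire Nighthawk"},
    {"Lightning Bolt", "Sol Ring", "Goblin Guide"},
]
recommended = recommend_user_based({"Sanguine Bond", "Sol Ring"}, decks)
print(recommended)
```

Because every recommendation is filtered through whole-deck similarity, a two-card combo only surfaces when the surrounding decks also look alike, which may be one reason this approach struggled with combos.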

Page 15: EDHREC @ Data Science MD

Recommendation Engine 2.0 Algorithm

Step 1: Card Affinity Matrix

Data set: 31,000 decks

Jaccard / Tanimoto coefficient for a pair of cards:

(Decks that contain Sanguine Bond AND Exquisite Blood) ÷ (Decks that contain Sanguine Bond OR Exquisite Blood)

Repeat for every card combination (15,000 cards)

• This is the basis of the Card Analysis page
• This matrix is built offline in batch

Image from http://blog.comsysto.com/2013/04/03/background-of-collaborative-filtering-with-mahout/
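The AND ÷ OR ratio on this slide can be sketched in a few lines. This is an illustration under assumptions, not the EDHREC batch job: `build_affinity` is a hypothetical name, decks are assumed to be sets of card names, and pairs that never share a deck are simply left out of the result.

```python
from collections import defaultdict
from itertools import combinations


def build_affinity(decks):
    """Return {(card_a, card_b): Tanimoto coefficient} for co-occurring pairs.

    Keys are alphabetically ordered pairs; pairs that never appear in the
    same deck are omitted (their coefficient would be 0).
    """
    containing = defaultdict(set)  # card name -> ids of decks that contain it
    for deck_id, deck in enumerate(decks):
        for card in deck:
            containing[card].add(deck_id)

    affinity = {}
    for a, b in combinations(sorted(containing), 2):
        both = len(containing[a] & containing[b])        # decks with a AND b
        if both:
            either = len(containing[a] | containing[b])  # decks with a OR b
            affinity[(a, b)] = both / either
    return affinity


# Toy data, made up for illustration only.
toy_decks = [
    {"Sanguine Bond", "Exquisite Blood"},
    {"Sanguine Bond", "Exquisite Blood", "Sol Ring"},
    {"Sanguine Bond", "Sol Ring"},
    {"Sol Ring"},
]
affinity = build_affinity(toy_decks)
print(affinity[("Exquisite Blood", "Sanguine Bond")])
```

Looping over every pair is quadratic in the number of cards, which fits the slide's note that the matrix is built offline in batch.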

Page 16: EDHREC @ Data Science MD

Recommendation Engine 2.0 Algorithm

Step 2: Calculate Scores

Data set: 31,000 decks

1. Select each row of the Tanimoto matrix corresponding to cards in Deck D
2. Sum the columns
3. Sort by score, display results

This gives you a sum of the Tanimoto coefficients

I really have no idea what this algorithm is called… I’m not sure if it’s novel or not

This is performed in real time
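The three scoring steps can be sketched as follows, assuming the affinity matrix is held as a symmetric dict of dicts (`matrix[row_card][column_card]`). That layout, the name `score_cards`, the coefficient values, and the choice to skip cards already in the deck are assumptions made for this illustration.

```python
from collections import defaultdict


def score_cards(deck, matrix, top_n=5):
    """Sum the matrix rows for the cards in `deck`, then rank the columns."""
    deck = set(deck)
    totals = defaultdict(float)
    for card in deck:                              # 1. select the row for each deck card
        for other, coeff in matrix.get(card, {}).items():
            if other not in deck:                  # assumption: do not recommend owned cards
                totals[other] += coeff             # 2. sum the columns
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)  # 3. sort by score
    return ranked[:top_n]


# Toy coefficients, made up for illustration only.
matrix = {
    "Sanguine Bond": {"Exquisite Blood": 0.67, "Sol Ring": 0.50},
    "Sol Ring": {"Sanguine Bond": 0.50, "Exquisite Blood": 0.25, "Command Tower": 0.80},
    "Exquisite Blood": {"Sanguine Bond": 0.67, "Sol Ring": 0.25},
    "Command Tower": {"Sol Ring": 0.80},
}
ranking = score_cards({"Sanguine Bond", "Sol Ring"}, matrix)
print([card for card, _ in ranking])
```

Only the rows for the cards in one deck are touched, so the work per request is small compared with building the matrix, which is consistent with running this step in real time.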

Page 17: EDHREC @ Data Science MD

Lessons learned: Taking out the garbage

A lot of garbage gets submitted to EDHREC:
• Decks with <20 cards
• Decks with invalid commanders
• Decks with illegal cards

The algorithms handle this well and rarely do problem cards show up

However, pruning “worthless” decks significantly improves performance due to all the O(N^2) algorithms going on

General advice: Think about which pieces of data are worthless in your data set
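A small sketch of the pruning rules listed on this slide. The deck record shape (`commander` and `cards` keys), the function names, and the toy lookup sets are hypothetical; only the three rejection rules and the 20-card threshold come from the slide.

```python
MIN_CARDS = 20  # decks with fewer cards than this are treated as garbage


def is_usable(deck, legal_commanders, banned_cards):
    """Return True if a deck passes all three checks from the slide."""
    cards = deck["cards"]
    return (
        len(cards) >= MIN_CARDS                       # not a tiny or partial deck
        and deck["commander"] in legal_commanders     # commander is valid
        and not (set(cards) & banned_cards)           # no illegal cards
    )


def prune(decks, legal_commanders, banned_cards):
    """Keep only the decks worth feeding to the pairwise algorithms."""
    return [d for d in decks if is_usable(d, legal_commanders, banned_cards)]


# Toy inputs, made up for illustration only.
legal_commanders = {"Oloro, Ageless Ascetic"}
banned_cards = {"Black Lotus"}
filler = [f"Card {i}" for i in range(99)]
submitted = [
    {"commander": "Oloro, Ageless Ascetic", "cards": filler},            # kept
    {"commander": "Oloro, Ageless Ascetic", "cards": filler[:10]},       # too few cards
    {"commander": "Sol Ring", "cards": filler},                          # invalid commander
    {"commander": "Oloro, Ageless Ascetic",
     "cards": filler[:98] + ["Black Lotus"]},                            # illegal card
]
kept = prune(submitted, legal_commanders, banned_cards)
print(len(submitted), "->", len(kept))
```

With quadratic algorithms downstream, each deck removed here saves a comparison against every remaining deck.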

Page 18: EDHREC @ Data Science MD

Lessons learned: Partitioning (too much or too little)

• Partitioning the user/deck space into subgroups is a great way to speed things up in recommendation engines
• The 31,000 EDHREC decks are partitioned into 27 partitions (one per possible color combination)
• Algorithms are typically run on a single partition (e.g., Red/Blue deck recommendations only come from other Red/Blue decks)
• However, themes that span color combinations suffer worse recommendations
• However, partitioning too deep causes problems
• I tried partitioning by commander, and that was awful: new commanders and themes that span commanders suffer

General advice: There is no good way to figure out a partition scheme, just try it out
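A minimal sketch of partitioning by color combination, assuming each deck record carries a `colors` set using the usual one-letter codes (W, U, B, R, G). The record shape and the names `color_key`, `partition_by_colors`, and `candidates_for` are hypothetical.

```python
from collections import defaultdict

COLOR_ORDER = "WUBRG"  # white, blue, black, red, green


def color_key(colors):
    """Canonical key for a color combination, e.g. {'R', 'U'} -> 'UR'."""
    return "".join(c for c in COLOR_ORDER if c in colors)


def partition_by_colors(decks):
    """Group decks so each partition holds exactly one color combination."""
    partitions = defaultdict(list)
    for deck in decks:
        partitions[color_key(deck["colors"])].append(deck)
    return partitions


def candidates_for(deck, partitions):
    """Decks that recommendations for `deck` may draw from (same partition only)."""
    same_colors = partitions.get(color_key(deck["colors"]), [])
    return [d for d in same_colors if d is not deck]


# Toy data, made up for illustration only.
all_decks = [
    {"name": "deck-1", "colors": {"U", "R"}},
    {"name": "deck-2", "colors": {"R", "U"}},
    {"name": "deck-3", "colors": {"W", "B"}},
]
partitions = partition_by_colors(all_decks)
print([d["name"] for d in candidates_for(all_decks[0], partitions)])
```

The trade-off from the slide shows up directly: `deck-3` can never inform a Red/Blue recommendation, even if it shares a theme with `deck-1`.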

Page 19: EDHREC @ Data Science MD

EDHREC Architecture

Page 20: EDHREC @ Data Science MD

EDHREC Architecture

[Diagram labels: Batch Processes (cron), Reddit Bot (praw), New Decks, Pre-calculated Stats, All Decks]

Page 21: EDHREC @ Data Science MD

Redis
• In-memory key/value data store
• Stores website state
• Utilized as a cache
• Stores all of the decks
• Stores all of the pre-computed stats
• Stores all metadata about Magic cards
• EDHREC serializes most things to common internal JSON data formats
• Very fast
• Very easy to use
• Good support with Python
• Getting harder to do “analysis”
• Going to move to Redshift SQL database for analytical things
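A sketch of the "serialize to JSON, store under a key" pattern described above. To stay runnable without a Redis server it uses a small in-memory stand-in; a `redis.Redis()` client from the redis-py package offers `set` and `get` methods with the same basic shape and could be passed in instead. The `deck:<id>` key scheme, the record fields, and the helper names are assumptions, not EDHREC's actual formats.

```python
import json


class InMemoryStore:
    """Stand-in with the same basic get/set calls as a redis-py client."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)  # None when the key is missing


def save_deck(store, deck_id, deck):
    """Serialize a deck to JSON and store it under a hypothetical key scheme."""
    store.set(f"deck:{deck_id}", json.dumps(deck))


def load_deck(store, deck_id):
    """Fetch and decode a deck, or return None if it was never stored."""
    raw = store.get(f"deck:{deck_id}")
    # json.loads accepts str and bytes; redis-py returns bytes by default.
    return json.loads(raw) if raw is not None else None


store = InMemoryStore()
save_deck(store, 42, {"commander": "Oloro, Ageless Ascetic",
                      "cards": ["Sanguine Bond", "Exquisite Blood"]})
loaded = load_deck(store, 42)
print(loaded["commander"])
```

Opaque JSON values are quick to read and write by key, but the store cannot query inside them, which is in line with the slide's remark that analysis is getting harder.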

Page 22: EDHREC @ Data Science MD

CherryPy
• “A Minimalist Python Web Framework”
• Runs the website
• Pulls data from Redis and then renders the results as HTML
• Most of the data from Redis is cached in in-memory objects (IPC to Redis is too slow)
• EDHREC runs 6 of these in parallel behind an NGINX round robin proxy
• Very easy to use, doesn’t get in your way
• Very easy to expose Python data science
• Running into problems with maintainability due to my own sloppiness
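The point about keeping Redis data in in-process objects can be illustrated with a small read-through cache. This is a sketch under assumptions: the class name, the `fetch` callable, and the one-hour expiry are hypothetical, and the slides do not describe how EDHREC actually invalidates its cached objects.

```python
import json
import time


class ReadThroughCache:
    """Keep decoded values in process memory; call `fetch` only on a miss."""

    def __init__(self, fetch, ttl_seconds=3600.0):
        self._fetch = fetch      # callable(key) -> JSON text, or None if missing
        self._ttl = ttl_seconds  # assumption: stale data is acceptable for a while
        self._entries = {}       # key -> (expires_at, decoded value)

    def get(self, key):
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # served from memory, no round trip to the store
        raw = self._fetch(key)
        value = json.loads(raw) if raw is not None else None
        self._entries[key] = (now + self._ttl, value)
        return value


# Fake backing store that records how often it is hit.
fetch_calls = []


def fake_fetch(key):
    fetch_calls.append(key)
    return '{"name": "Sol Ring"}'


cache = ReadThroughCache(fake_fetch)
first = cache.get("card:sol-ring")
second = cache.get("card:sol-ring")
print(first, len(fetch_calls))
```

With several web processes running in parallel, each process would hold its own copy of the cache, so memory use grows with the number of workers.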

Page 23: EDHREC @ Data Science MD

Python
• Programming language
• Plenty of good libraries for data analysis: numpy, pandas in this case

• Can handle the “full stack” well (from data analysis to web front end)

• PRAW is a great framework for building Reddit bots

• Most things run every few hours

Page 24: EDHREC @ Data Science MD

Amazon Web Services

• Infrastructure as a Service

• Easily spin up new servers with pre-built operating system

• EDHREC runs on one m4.2xlarge: 8 CPUs, 32GB RAM, better network, 10 cents per hour ($72/month)

• Great for recovering from failures

• Easy to upgrade machine

• Very good uptime so far

• Easy to back up to S3

Page 25: EDHREC @ Data Science MD

Some observations about User Experience and AI applications

Page 26: EDHREC @ Data Science MD

LOL! Look at the dumb bot!

Lesson learned: Humans LOVE pointing out when something the AI is doing is strange or wrong, even if it gets it right 90% of the time. Therefore, I am very conservative about what I end up publishing, as I’ve gotten burned a few times. That can be a shame sometimes.

(just a couple examples)

Page 27: EDHREC @ Data Science MD

The apocalypse is near
• “EDHREC is ruining EDH/Commander”
• “EDHREC is taking the fun out of deck construction”
• “EDHREC kills conversation”

MapQuest takes the fun out of planning trips!

• Mostly these are taken as compliments
• AI is going to have resistance from people who liked the manual labor
• I don’t think the commentary is entirely off base… but...

Page 28: EDHREC @ Data Science MD

Sometimes too much is too much
• Over-engineering and doing too much is an easy trap
• You want to make it better and provide more “intelligence”
• Give the users the ability to discover and find things
• Increases user engagement
• Better results
• Philosophy: EDHREC is a tool, not a solution
• I’m starting to see my other data science projects this way

Lesson learned: Spend more time on interactive “discovery tools” than on intelligent do-everything algorithms

Page 29: EDHREC @ Data Science MD

Interesting related things to look at

Page 30: EDHREC @ Data Science MD

RoboRosewater
• Rosewater is the name of the Magic lead designer
• RoboRosewater is a “backwards” neural network, trained on Magic cards

Page 31: EDHREC @ Data Science MD

MTG Finance

Lots of analysis around Magic finance!

mtgstocks.com

Page 32: EDHREC @ Data Science MD

Diablo 3 build clustering

Page 33: EDHREC @ Data Science MD

Virtues of this whole thing

Community
• Most hobbies are defined by communities
• Technology can bring communities together

Self-Development
• Data has value and getting data of value is hard
• Hobby-based data is relatively easy to acquire (compared to, say, data used by health care companies)
• A great way to do real data science on real data (as opposed to synthetic data on a more valuable data set)

Profit!
• Hobbyists are passionate about their hobby and willing to spend money on it
• They will pay for and support services they like

Page 34: EDHREC @ Data Science MD

EDHREC, Magic: TG Recommendation Engine

(and data science on games)

Donald Miner @donaldpminer

[email protected] 21st, 2015 - Data Science MD Meetup

Games & Stuff in Glen Burnie, MD