Building and Improving Products with Hadoop

Matthew Rathbone

What is Foursquare

Foursquare helps you explore the world around you.

Meet up with friends, discover new places, and save money using your phone.

4 billion check-ins

35 million users

50 million points of interest (POI)

150 employees

1 TB+ of data a day

FIRST, A STORY

http://www.flickr.com/photos/shannonpatrick17

The Right Tool for the Job

• Nginx – Serving static files

• Perl – Regular expressions

• XML – Frustrating people

• Hadoop (MapReduce) – Counting
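To make the "counting" point concrete, here is a minimal sketch of the map, shuffle, and reduce steps written as plain Python that runs locally. The function names and the sample tips are made up for illustration; a real Hadoop job would spread the same three steps across many machines.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)

def count_words(records):
    """Run map, shuffle, and reduce in-process and return {word: count}."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Made-up tips, for illustration only.
tips = ["great coffee", "Great bagels and coffee"]
counts = count_words(tips)
```

Because each step only sees one record or one key at a time, the same logic can be split across workers without changing the result.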

COUNTING – WHAT IS IT GOOD FOR?

http://www.flickr.com/photos/blaahhi/

Statistically Improbable Phrases

SIPS use cases

• menu extraction

• sentiment analysis

• venue ratings

• specific recommendations

• search indexing

• pricing data

• facility information

How is SIPS built?

Basically lots of counting.

• Tokenize data with a language model (into n-grams)

  • The model is built using tips, shouts, menu items, likes, etc.

• Apply a TF-IDF algorithm (term frequency, inverse document frequency)

  • Global phrase count

  • Local phrase count (in a venue)

• Some Filtering and ranking

• Re-compute & deploy nightly
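The pipeline above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production implementation: a whitespace tokenizer stands in for the language model, each venue is treated as one "document", the global count is taken as the number of venues containing a phrase, and the function names and sample tips are invented.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Whitespace tokenizer emitting 1..n_max-grams (a stand-in for a real language model)."""
    words = text.lower().split()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def score_phrases(tips_by_venue):
    """Return {venue: [(phrase, tf-idf score), ...]} sorted by descending score."""
    # Local phrase count: occurrences of each phrase within one venue.
    local = {
        venue: Counter(p for tip in tips for p in ngrams(tip))
        for venue, tips in tips_by_venue.items()
    }
    # Global count used for IDF: number of venues whose text contains the phrase.
    venues_with = Counter(p for counts in local.values() for p in counts)
    n_venues = len(local)

    ranked = {}
    for venue, counts in local.items():
        total = sum(counts.values())
        scores = {
            p: (c / total) * math.log(n_venues / venues_with[p])
            for p, c in counts.items()
        }
        # Ranking step: highest score first, ties broken alphabetically.
        ranked[venue] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked

# Made-up tips, for illustration only.
tips_by_venue = {
    "pizzeria": ["great pizza", "pizza by the slice"],
    "cafe": ["great coffee", "coffee and wifi"],
}
ranked = score_phrases(tips_by_venue)
```

In this toy data "great" appears at both venues, so its inverse document frequency is zero and it drops out of the ranking, while "pizza" and "coffee" rise to the top of their venues.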

WHY USE HADOOP?

http://www.flickr.com/photos/dbrekke/

SIPS – Without Hadoop

Potential Problems

• Database Query Throttling

• Venues are out of sync

• Altering the algorithm could take forever to populate for all venues

• Where would you store the results?

• What about debug data?

• Does it scale to 10x, 100x?

• What about other, similar workflows?

SIPS – Hadoop Benefits

• Quick Deployment

• Modular & Reusable

• Arbitrarily complex combination of many datasets

• Every step of the workflow creates value

Apple Store - Downtown San Francisco

1 tip mentions "haircuts"

A search for "haircuts" in "san francisco" returns the Apple Store???

Fixed by looking at the percentage of the venue's tips that mention a phrase and the phrase's overall frequency

“Hey Apple, how bout less shiny pizzazz and fancy haircuts and more fix-my-f!@#$-imac”
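One way to picture the fix is a filter that keeps a phrase only when a meaningful share of a venue's tips mention it and that share is well above the phrase's overall frequency. The sketch below is an assumption about how such a filter could look; the function name, the thresholds, and the numbers are all hypothetical.

```python
def keep_phrase(tips_with_phrase, total_tips, global_rate,
                min_share=0.05, min_lift=2.0):
    """Decide whether a phrase is representative of a venue.

    tips_with_phrase: number of the venue's tips mentioning the phrase
    total_tips:       number of tips at the venue
    global_rate:      share of all tips, across all venues, mentioning the phrase
    min_share, min_lift: hypothetical thresholds chosen for this example
    """
    if total_tips == 0 or global_rate <= 0:
        return False
    share = tips_with_phrase / total_tips
    # Require both a minimum share of the venue's tips and a share
    # clearly higher than the phrase's overall frequency.
    return share >= min_share and share / global_rate >= min_lift

# Made-up numbers: one "haircuts" tip out of 500 at a busy venue,
# versus 40 out of 100 at a barber shop.
apple_store = keep_phrase(1, 500, global_rate=0.001)
barber_shop = keep_phrase(40, 100, global_rate=0.001)
```

With these numbers a single stray mention at a busy venue is filtered out, while a venue where the phrase is common keeps it.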

Data & Modularity

ACTUALLY, IT’S A BIT MORE COMPLICATED

http://www.flickr.com/photos/bfishadow

These benefits require infrastructure

Dependency Management

Many options

• Oozie (Apache)

• Azkaban (LinkedIn)

• Luigi (Spotify, we <3 this)

• Hamake (Codeminders)

• Chronos (Airbnb)
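Whichever tool is chosen, the core job is the same: run each step only after the steps it depends on. The sketch below shows that idea with Python's standard-library `graphlib` rather than any of the tools listed; the job names are hypothetical, and real schedulers add retries, scheduling, and failure handling on top.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job names; each job maps to the set of jobs it depends on.
jobs = {
    "tokenize_tips": set(),
    "global_phrase_counts": {"tokenize_tips"},
    "local_phrase_counts": {"tokenize_tips"},
    "score_and_rank": {"global_phrase_counts", "local_phrase_counts"},
    "deploy": {"score_and_rank"},
}

def run_order(graph):
    """Return the jobs in an order where every dependency comes first."""
    return list(TopologicalSorter(graph).static_order())

order = run_order(jobs)
```

The two counting jobs have no dependency on each other, so a scheduler is free to run them in parallel once tokenization finishes.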

Database / Log Ingestion

• Sqoop

• Mongo-Hadoop

• Kafka

• Flume

• Scribe

• etc

MapReduce Friendly Datastore

A few obvious ones:

• HBase

• Cassandra

• Voldemort

We built our own; it’s very similar to Voldemort and uses the HFile API

Getting started without all that stuff

Components you likely don’t have

The best way to start

Don’t use Hadoop.

*but pretend you do

Other reasons to not use Hadoop

• Your idea might not be very good

• Hadoop will slow you down to start with

• You don’t have enough infrastructure yet

  • Build it when you need it

• V1 might not be that complex

  • V1 could be a spreadsheet

Version 1

• Off the shelf language model

• A subset of Venues & Tips

• Did not use MapReduce

• Did not push to production at all

Version 2

• Started building our own language model

• Rewritten as a MapReduce job

• Manually loaded data to production

• Filters for English data only.

Tweak, improve, etc

Version 3

• Incorporated more data sources into our language model

• Automated deployment to the KV store

• Incorporated lots of debug output

• Language pipeline also feeds sentiment analysis

Now we’re in the perfect place to iterate & improve

…to explore data

In Summary

• Hadoop is good for counting, so use it for counting

• Move quickly whenever possible and don’t worry about automation

• Bring in new production services as you need them

• Freedom!

2013

Thanks!

matthew@foursquare.com

@rathboma

Bonus:

http://hadoopweekly.com

from my colleague, Joe Crobak (presenting later!)
