Hadoop World - Oct 2009

Cheap Parlor Tricks, Counting, and Clustering

Derek Gottfrid, The New York Times

DESCRIPTION

A review of what NYTimes.com has been doing with Hadoop, from the simple to the less simple.

TRANSCRIPT

Page 1: Hadoop World - Oct 2009

Cheap Parlor Tricks, Counting, and Clustering

Derek Gottfrid, The New York Times, October 2009

Page 2: Hadoop World - Oct 2009

Evolution of Hadoop @ NYTimes.com

Page 3: Hadoop World - Oct 2009

Early Days - 2007

Solution looking for a problem

Page 4: Hadoop World - Oct 2009

Solution

Wouldn’t it be cool to use lots of EC2 instances

(it’s cheap; nobody will notice)

Wouldn’t it be cool to use Hadoop

(MapReduce Google style is awesome)

Page 5: Hadoop World - Oct 2009

Found a Problem

Freeing up historical archives of NYTimes.com 1851-1922

Page 6: Hadoop World - Oct 2009

Problem Bits

Articles are served as PDFs

Really need PDFs from 1851-1981

PDFs are dynamically generated

Free = more traffic

Real deadline

Page 7: Hadoop World - Oct 2009

Background

What goes into making a PDF of a NYTimes.com article?

Each article is made up of many different pieces - multiple columns, different sized headings, multiple pages, photos.

Page 8: Hadoop World - Oct 2009

Simple Answer

Pre-generate all 11 million PDFs and serve them statically.

Page 9: Hadoop World - Oct 2009

Solution

Copy all the source data to S3

Use a bunch of EC2 instances and some Hadoop code to generate all the PDFs

Store the output PDFs in S3

Serve the PDFs out of S3 w/ a signed query string
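The slides do not say how the query string was signed. As a rough sketch only, the following assumes the legacy S3 query-string authentication scheme (HMAC-SHA1 over a short string to sign), which newer AWS signing versions have since replaced. The bucket name, object key, and credentials are hypothetical placeholders.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def signed_s3_url(bucket, key, access_key_id, secret_key, expires_epoch):
    """Build a time-limited GET URL in the legacy S3 query-string style."""
    # String to sign: verb, Content-MD5, Content-Type, expiry, resource path.
    resource = "/{}/{}".format(bucket, quote(key))
    string_to_sign = "GET\n\n\n{}\n{}".format(expires_epoch, resource)
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode("ascii")
    return ("https://{}.s3.amazonaws.com/{}"
            "?AWSAccessKeyId={}&Expires={}&Signature={}").format(
                bucket, quote(key), access_key_id, expires_epoch,
                quote(signature, safe=""))


# Hypothetical values, for illustration only.
url = signed_s3_url("example-archive", "pdfs/1851/article-1.pdf",
                    "AKIDEXAMPLE", "not-a-real-secret", 1257000000)
```

The point of the design is that the PDFs stay private in S3 while the web tier hands out short-lived links, so no application server sits in the download path.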

Page 10: Hadoop World - Oct 2009

A Few Details

Limited HDFS - everything loaded in and out of S3

Reduce = 0 - only used for some stats and error reporting
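One reading of this slide is a map-only job: each map task renders PDFs and writes them straight to storage, and the framework's counters carry the stats and error counts. The sketch below simulates that shape locally in Python rather than on Hadoop. `render_pdf`, the record layout, and the `store` dictionary standing in for S3 are all hypothetical.

```python
from collections import Counter


def render_pdf(article_id, pieces):
    """Hypothetical stand-in for the real PDF renderer; returns fake bytes."""
    if not pieces:
        raise ValueError("article has no pieces")
    return ("PDF:" + article_id + ":" + ",".join(pieces)).encode("utf-8")


def run_map_only(records, store):
    """Apply the map step to every record; there is no shuffle or reduce."""
    counters = Counter()
    for article_id, pieces in records:
        try:
            store[article_id + ".pdf"] = render_pdf(article_id, pieces)
            counters["pdfs_written"] += 1
        except ValueError:
            # Failures are counted, not fatal, so one bad article
            # does not stop the rest of the batch.
            counters["errors"] += 1
    return counters


records = [("1851-09-18-p1", ["headline.tif", "col1.tif"]),
           ("1851-09-18-p2", [])]
store = {}  # stands in for the S3 output bucket
counters = run_map_only(records, store)
```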

Page 11: Hadoop World - Oct 2009

Breakdown

4.3 TB of source data into S3

11M PDFs - 1.5 TB output

$240 for EC2 - 24hrs x 100 machines
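The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above and derives the implied hourly rate, throughput, and average PDF size. Decimal units (1 TB = 10^12 bytes) are an assumption.

```python
machines = 100
hours = 24
total_cost_usd = 240
pdfs = 11_000_000
output_tb = 1.5

instance_hours = machines * hours                # 2,400 instance-hours
implied_rate = total_cost_usd / instance_hours   # USD per instance-hour
pdfs_per_instance_hour = pdfs / instance_hours
avg_pdf_kb = output_tb * 1e12 / pdfs / 1e3       # decimal units assumed

print("implied rate: ${:.2f} per instance-hour".format(implied_rate))
print("throughput: about {:.0f} PDFs per instance-hour".format(pdfs_per_instance_hour))
print("average PDF size: about {:.0f} kB".format(avg_pdf_kb))
```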

Page 12: Hadoop World - Oct 2009

TimesMachine

http://timesmachine.nytimes.com

Page 13: Hadoop World - Oct 2009

Currently - 2009

All that darn data - Web Analytics

Page 14: Hadoop World - Oct 2009

Data

Registration / Demographic

Articles 1851 - today

Usage Data / Web Logs

Page 15: Hadoop World - Oct 2009

Counting

Classic cookie tracking - let’s add it up

Total PV

Total unique users

PV per user
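The three counts above fall out of a single map and reduce pass keyed on the cookie. The sketch below simulates that locally in Python. The tab-separated log format and the cookie values are assumptions, and the actual jobs were written as MapReduce in Java.

```python
from collections import defaultdict


def map_log_line(line):
    """Emit (cookie_id, 1) for one page view; the log format is an assumption."""
    cookie_id, _url = line.split("\t", 1)
    return cookie_id, 1


def reduce_counts(pairs):
    """Sum the counts for each key, as a reducer would after the shuffle."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


log_lines = [
    "cookie-a\t/2009/07/01/world/story.html",
    "cookie-b\t/2009/07/01/sports/game.html",
    "cookie-a\t/2009/07/02/opinion/column.html",
]
views_per_user = reduce_counts(map_log_line(line) for line in log_lines)
total_pv = sum(views_per_user.values())   # total page views
unique_users = len(views_per_user)        # distinct cookies
pv_per_user = total_pv / unique_users
```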

Page 16: Hadoop World - Oct 2009

A Few Details

Using EC2 - 20 Machines

Hadoop 0.20.0

12+ TB of data

Straight MR in Java

Page 17: Hadoop World - Oct 2009

Usage Data

July 2009

???M Page Views

??M Unique Users

Page 18: Hadoop World - Oct 2009

Merging Data

Usage data combined with demographic data.
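Combining the two data sets amounts to a join on a shared user key. The sketch below shows a reduce-side join in miniature: records from both sources are grouped by user, then usage events are rolled up by age group. The field names, age buckets, and sample rows are hypothetical.

```python
from collections import defaultdict


def join_usage_with_demographics(usage, demographics):
    """Reduce-side join on user_id, then count usage events per age group."""
    grouped = defaultdict(lambda: {"age_group": None, "events": 0})
    # "Map" phase: key every record from both sources on user_id.
    for user_id, age_group in demographics:
        grouped[user_id]["age_group"] = age_group
    for user_id, _url in usage:
        grouped[user_id]["events"] += 1
    # "Reduce" phase: keep users present in both sources (inner join).
    by_age = defaultdict(int)
    for record in grouped.values():
        if record["age_group"] is not None and record["events"] > 0:
            by_age[record["age_group"]] += record["events"]
    return dict(by_age)


demographics = [("u1", "18-24"), ("u2", "35-44")]
usage = [("u1", "/a"), ("u1", "/b"), ("u2", "/a"), ("u3", "/c")]
events_by_age = join_usage_with_demographics(usage, demographics)
```

Here the unregistered user `u3` drops out of the result, which is the usual trade-off of an inner join: only traffic that can be tied to a registration record is attributed to an age group.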

Page 19: Hadoop World - Oct 2009

Twitter Click Backs By Age Group

July 2009

Page 20: Hadoop World - Oct 2009

Merging Data

Usage data with article meta data

Page 21: Hadoop World - Oct 2009

Usage Data combined with Article Data

July 2009

40 Articles

Page 22: Hadoop World - Oct 2009

Usage Data combined with Article Data

July 2009

40 Articles

Page 23: Hadoop World - Oct 2009

Products

Coming soon...

Page 24: Hadoop World - Oct 2009

Clustering

Moving beyond simple counting and joining

Join usage data, demographic information, and article meta data

Apply simple k-means clustering
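For readers unfamiliar with k-means, here is a minimal single-machine sketch of the plain Lloyd's algorithm. It is not the distributed version used in the talk, and the per-user feature vectors are hypothetical. Seeding the centroids from the first k points is a simplification that keeps the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments


# Hypothetical per-user features: (share of views in news, share in sports).
users = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
centroids, assignments = kmeans(users, k=2)
```

In a MapReduce setting the assignment step maps naturally onto mappers and the centroid update onto reducers, with one job per iteration.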

Page 25: Hadoop World - Oct 2009

Clustering

Page 26: Hadoop World - Oct 2009

Clustering

Page 27: Hadoop World - Oct 2009

Conclusion

Large-scale computing is transformative for NYTimes.com.

Page 28: Hadoop World - Oct 2009

Questions?

[email protected]

@derekg

http://open.nytimes.com/