hw09 counting and clustering and other data tricks

Cheap Parlor Tricks, Counting, and Clustering Derek Gottfrid The New York Times October 2009


Cheap Parlor Tricks, Counting, and Clustering

Derek Gottfrid, The New York Times, October 2009

Evolution of Hadoop @ NYTimes.com

Early Days - 2007 Solution looking for a problem

Solution Wouldn’t it be cool to use lots of EC2 instances

(it’s cheap; nobody will notice)

Wouldn’t it be cool to use Hadoop

(MapReduce Google style is awesome)

Found a Problem Freeing up historical archives of NYTimes.com 1851-1922

Problem Bits Articles are served as PDFs

Really need PDFs from 1851-1981

PDFs are dynamically generated

Free = more traffic

Real deadline

Background What goes into making a PDF of a NYTimes.com article?

Each article is made up of many different pieces - multiple columns, different sized headings, multiple pages, photos.

Simple Answer Pre-generate all 11 million PDFs and serve them statically.

Solution Copy all the source data to S3

Use a bunch of EC2 instances and some Hadoop code to generate all the PDFs

Store the output PDFs in S3

Serve the PDFs out of S3 w/ a signed query string
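The last step above can be sketched in code. The talk does not show how the signed query strings were built; what follows is a minimal Python sketch of the query-string authentication scheme S3 offered at the time (now called Signature Version 2, superseded by Version 4). The function name, bucket, key, and credentials are hypothetical placeholders.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote, urlencode


def sign_s3_get_url(bucket, key, access_key_id, secret_key, expires_epoch):
    """Build a legacy (Signature Version 2) signed GET URL for an S3 object."""
    # String-to-sign for a plain GET: verb, empty Content-MD5, empty
    # Content-Type, the expiry time, then the canonical resource path.
    resource = "/{}/{}".format(bucket, quote(key))
    string_to_sign = "GET\n\n\n{}\n{}".format(expires_epoch, resource)

    # The signature is the base64-encoded HMAC-SHA1 of that string.
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode("ascii")

    # urlencode percent-escapes the '+', '/' and '=' in the signature.
    query = urlencode({
        "AWSAccessKeyId": access_key_id,
        "Expires": str(expires_epoch),
        "Signature": signature,
    })
    return "https://{}.s3.amazonaws.com/{}?{}".format(bucket, quote(key), query)


# Hypothetical bucket, object key, and credentials, for illustration only.
url = sign_s3_get_url("example-archive-bucket", "1851/09/18/front-page.pdf",
                      "AKIDEXAMPLE", "not-a-real-secret", 1256000000)
```

Anyone holding the URL can fetch the object until the Expires time passes, so the bucket itself can stay private while the PDFs are still served straight from S3.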

A Few Details Limited HDFS - everything loaded in and out of S3

Reduce = 0 - only used for some stats and error reporting
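One reading of the slide above is an essentially map-only job: each mapper renders articles and writes the PDFs directly to storage, and the only aggregated output is a handful of counters. The Python sketch below illustrates that shape under stated assumptions. It is not the original Hadoop code, and render_pdf, the record layout, and the in-memory store standing in for S3 are all hypothetical.

```python
from collections import Counter


def render_pdf(article):
    """Hypothetical stand-in for the real PDF renderer."""
    if not article.get("pieces"):
        raise ValueError("article has no source pieces")
    return "%PDF-stub:{}".format(article["id"]).encode("ascii")


def map_article(article, store, counters):
    """Map step: render one article, write the result, update counters."""
    try:
        pdf = render_pdf(article)
    except ValueError:
        counters["errors"] += 1          # error reporting only, no retry
        return
    store["pdf/{}.pdf".format(article["id"])] = pdf
    counters["pdfs_written"] += 1
    counters["bytes_written"] += len(pdf)


def run_map_only(articles):
    """Apply the map step to every input; there is no shuffle or reduce."""
    store, counters = {}, Counter()      # store: in-memory stand-in for S3
    for article in articles:
        map_article(article, store, counters)
    return store, counters


store, counters = run_map_only([
    {"id": "a1", "pieces": ["column-1", "photo-1"]},
    {"id": "a2", "pieces": []},          # will be counted as an error
])
```

Because every article is independent, the work needs no coordination between machines, which is why adding EC2 instances shortens the run almost linearly.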

Breakdown 4.3 TB of source data into S3

11M PDFs - 1.5 TB output

$240 for EC2 - 24hrs x 100 machines
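The figures on this slide imply an hourly rate and a per-machine throughput. The short Python calculation below derives them. The inputs come from the slide, "11M" is a rounded count, and terabytes are assumed to be decimal, so the derived values are approximate.

```python
machines = 100
hours = 24
total_cost_usd = 240
pdf_count = 11_000_000             # "11M" on the slide, rounded
output_bytes = 1.5e12              # 1.5 TB, assuming decimal terabytes

machine_hours = machines * hours                   # 2400 machine-hours
implied_rate_usd = total_cost_usd / machine_hours  # 0.10 USD per machine-hour

pdfs_per_machine_hour = pdf_count / machine_hours
cost_per_thousand_pdfs = total_cost_usd / (pdf_count / 1000)
average_pdf_bytes = output_bytes / pdf_count

print("machine-hours:", machine_hours)
print("implied USD per machine-hour:", round(implied_rate_usd, 2))
print("PDFs per machine-hour:", round(pdfs_per_machine_hour))
print("USD per 1000 PDFs:", round(cost_per_thousand_pdfs, 4))
print("average PDF size in KB:", round(average_pdf_bytes / 1000))
```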

TimesMachine http://timesmachine.nytimes.com

Currently - 2009 All that darn data - Web Analytics

Data Registration / Demographic

Articles 1851 - today

Usage Data / Web Logs

Counting Classic cookie tracking - let’s add it up

Total PV

Total unique users

PV per user
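The three counts above fit the classic map and reduce pattern: emit one record per page view keyed by cookie, then sum per key. The production jobs were written in Java on Hadoop; the sketch below uses Python only to show the shape of the computation, and the tab-separated log format (cookie id, then URL) is an assumption.

```python
from collections import defaultdict


def map_log_line(line):
    """Map: emit (cookie_id, 1) for the page view recorded on this line."""
    cookie_id, _url = line.split("\t", 1)
    yield cookie_id, 1


def reduce_page_views(pairs):
    """Reduce: sum the page views emitted for each cookie id."""
    per_user = defaultdict(int)
    for cookie_id, count in pairs:
        per_user[cookie_id] += count
    return dict(per_user)


def summarize(lines):
    """Return (total page views, unique users, page views per user)."""
    pairs = (pair for line in lines for pair in map_log_line(line))
    per_user = reduce_page_views(pairs)
    total_pv = sum(per_user.values())
    unique_users = len(per_user)
    pv_per_user = total_pv / unique_users if unique_users else 0.0
    return total_pv, unique_users, pv_per_user


# Hypothetical log lines: cookie id, a tab, then the requested URL.
sample = ["u1\t/pages/world", "u2\t/pages/arts", "u1\t/pages/sports"]
print(summarize(sample))  # (3, 2, 1.5)
```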

A Few Details Using EC2 - 20 Machines

Hadoop 0.20.0

12+TB of data

Straight MR in Java

Usage Data

July 2009

???M Page Views ??M Unique Users

Merging Data Usage data combined with demographic data.

Twitter Click Backs By Age Group

July 2009
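A chart like this one can come from a reduce-side join: group usage records and demographic records by a shared user id, then count per age group. The Python sketch below shows the idea under stated assumptions. The record layouts are hypothetical, and treating a "click back" as a page view whose referrer contains twitter.com is a guess, not something the talk defines.

```python
from collections import Counter, defaultdict


def click_backs_by_age_group(usage_records, demographic_records):
    """Join usage to demographics on user id, count Twitter referrals."""
    # "Shuffle": bring both inputs together under the join key, user_id.
    grouped = defaultdict(lambda: {"age_group": None, "clicks": 0})
    for user_id, age_group in demographic_records:
        grouped[user_id]["age_group"] = age_group
    for user_id, referrer in usage_records:
        if "twitter.com" in referrer:
            grouped[user_id]["clicks"] += 1

    # "Reduce": one pass per user, adding clicks to that user's age group.
    totals = Counter()
    for row in grouped.values():
        if row["clicks"]:
            totals[row["age_group"] or "unknown"] += row["clicks"]
    return dict(totals)


# Hypothetical inputs: (user_id, referrer) and (user_id, age_group).
usage = [("u1", "http://twitter.com/x"), ("u2", "http://example.com/"),
         ("u3", "http://twitter.com/y"), ("u1", "http://twitter.com/z")]
demographics = [("u1", "25-34"), ("u2", "35-44")]
print(click_backs_by_age_group(usage, demographics))
```

Users with usage but no registration record fall into an "unknown" bucket rather than being dropped, which keeps the totals consistent with the raw counts.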

Merging Data Usage data combined with article metadata.

Usage Data combined with Article Data

July 2009

40 Articles

Products Coming soon...

Clustering Moving beyond simple counting and joining

Join usage data, demographic information, and article metadata

Apply simple k-means clustering
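The talk does not say which features were clustered or which implementation was used. As a minimal sketch of plain k-means (Lloyd's algorithm), the Python below alternates an assignment step and an update step over small numeric feature vectors. The two-dimensional sample points are invented for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=20, seed=0):
    """Return (centroids, assignments) after at most `iterations` rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # start from k distinct input points
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, assignments


# Invented 2-D feature vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(points, k=2)
```

In a MapReduce setting the assignment step maps naturally onto mappers and the update step onto reducers, with one job per iteration.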

Clustering

Conclusion Large scale computing is transformative for NYTimes.com.

Questions?

[email protected] @derekg http://open.nytimes.com/