large-scale data processing with hadoop and php (confoo2012 2012-03-02)
DESCRIPTION
Presentation given at the ConFoo 2012 conference on March 2, 2012 in Montréal, QC, Canada.
TRANSCRIPT
David Zuelke
David Zülke
http://en.wikipedia.org/wiki/File:München_Panorama.JPG
Founder
Lead Developer
THE BIG DATA CHALLENGE
Distributed And Parallel Computing
we want to process data
how much data exactly?
SOME NUMBERS
• New data per day:
• 200 GB (March 2008)
• 2 TB (April 2009)
• 4 TB (October 2009)
• 12 TB (March 2010)
• Data processed per month: 400 PB (in 2007!)
• Average job size: 180 GB
what if you have that much data?
what if you have just 1% of that amount?
“No Problemo”, you say?
reading 180 GB sequentially off a disk will take ~45 minutes
and you only have 16 to 64 GB of RAM per computer
so you can't process everything at once
general rule of modern computers:
data can be processed much faster than it can be read
solution: parallelize your I/O
but now you need to coordinate what you’re doing
and that’s hard
what if a node dies?
is data lost?
will other nodes in the grid have to restart?
how do you coordinate this?
ENTER: OUR HERO
Introducing MapReduce
in the olden days, the workload was distributed across a grid
and the data was shipped around between nodes
or even stored centrally on something like a SAN
which was fine for small amounts of information
but today, on the web, we have big data
I/O bottleneck
along came a Google publication in 2004
MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html
now the data is distributed
computing happens on the nodes where the data already is
processes are isolated and don’t communicate (share-nothing)
BASIC PRINCIPLE: MAPPER
• A Mapper reads records and emits <key, value> pairs
• Example: Apache access.log
• Each line is a record
• Extract client IP address and number of bytes transferred
• Emit IP address as key, number of bytes as value
• For hourly rotating logs, the job can be split across 24 nodes*
* In practice, it’s a lot smarter than that
BASIC PRINCIPLE: REDUCER
• A Reducer is given a key and all values for this specific key
• Even if there are many Mappers on many computers, the results are aggregated before they are handed to the Reducers
• Example: Apache access.log
• The Reducer is called once for each client IP (that’s our key), with a list of values (transferred bytes)
• We simply sum up the bytes to get the total traffic per IP!
EXAMPLE OF MAPPED INPUT
IP Bytes
212.122.174.13 18271
212.122.174.13 191726
212.122.174.13 198
74.119.8.111 91272
74.119.8.111 8371
212.122.174.13 43
REDUCER WILL RECEIVE THIS
IP Bytes
212.122.174.13 18271
212.122.174.13 191726
212.122.174.13 198
212.122.174.13 43
74.119.8.111 91272
74.119.8.111 8371
AFTER REDUCTION
IP Bytes
212.122.174.13 210238
74.119.8.111 99643
PSEUDOCODE
function map($line_number, $line_text) {
    $parts = parse_apache_log($line_text);
    emit($parts['ip'], $parts['bytes']);
}
function reduce($key, $values) {
    $bytes = array_sum($values);
    emit($key, $bytes);
}
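The pseudocode can be fleshed out into runnable PHP. The helper names parse_apache_log() and emit() come from the slide; their implementations here are a sketch based on the common access.log format shown in this example:

```php
<?php
// Runnable sketch of the pseudocode above. parse_apache_log() and emit()
// are the slide's helper names; the bodies below are assumptions based on
// the common Apache log format, not HadooPHP's actual API.
function parse_apache_log($line_text) {
    // e.g.: 212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
    preg_match('/^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ (\d+)/', $line_text, $m);
    return array('ip' => $m[1], 'bytes' => (int)$m[2]);
}

function emit($key, $value) {
    // Hadoop Streaming expects <key>\t<value>\n on STDOUT
    echo $key, "\t", $value, "\n";
}

function map($line_number, $line_text) {
    $parts = parse_apache_log($line_text);
    emit($parts['ip'], $parts['bytes']);
}

function reduce($key, $values) {
    $bytes = array_sum($values);
    emit($key, $bytes);
}
```

Feeding the example key 212.122.174.13 with its values 18271, 191726, 198 and 43 into reduce() yields the 210238 total from the slides.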
212.122.174.13 210238
74.119.8.111 99643
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43
A YELLOW ELEPHANTIntroducing Apache Hadoop
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.
Doug Cutting
Hadoop is a MapReduce framework
it allows us to focus on writing Mappers, Reducers etc.
and it works extremely well
how well exactly?
HADOOP AT FACEBOOK (I)
• Predominantly used in combination with Hive (~95%)
• 8400 cores with ~12.5 PB of total storage
• 8 cores, 12 TB storage and 32 GB RAM per node
• 1x Gigabit Ethernet for each server in a rack
• 4x Gigabit Ethernet from rack switch to core
http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop
Hadoop is aware of racks and locality of nodes
HADOOP AT FACEBOOK (II)
• Daily stats:
• 25 TB logged by Scribe
• 135 TB of compressed data scanned
• 7500+ Hive jobs
• ~80k compute hours
• New data per day:
• I/08: 200 GB
• II/09: 2 TB (compressed)
• III/09: 4 TB (compressed)
• I/10: 12 TB (compressed)
http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop
HADOOP AT YAHOO!
• Over 25,000 computers with over 100,000 CPUs
• Biggest Cluster:
• 4000 Nodes
• 2x4 CPU cores each
• 16 GB RAM each
• Over 40% of jobs run using Pig
http://wiki.apache.org/hadoop/PoweredBy
OTHER NOTABLE USERS
• Twitter (storage, logging, analysis. Heavy users of Pig)
• Rackspace (log analysis; data pumped into Lucene/Solr)
• LinkedIn (contact suggestions)
• Last.fm (charts, log analysis, A/B testing)
• The New York Times (converted 4 TB of scans using EC2)
JOB PROCESSINGHow Hadoop Works
Just like I already described! It’s MapReduce! \o/
BASIC RULES
• Uses Input Formats to split up your data into single records
• You can optimize using combiners to reduce locally on a node
• Only possible in some cases, e.g. for max(), but not avg()
• You can control partitioning of map output yourself
• Rarely useful, the default partitioner (key hash) is enough
• And a million other things that really don’t matter right now ;)
HDFSHadoop Distributed File System
HDFS
• Stores data in blocks (default block size: 64 MB)
• Designed for very large data sets
• Designed for streaming rather than random reads
• Write-once, read-many (although appending is possible)
• Capable of compression and other cool things
HDFS CONCEPTS
• Large blocks minimize amount of seeks, maximize throughput
• Blocks are stored redundantly (3 replicas as default)
• Aware of infrastructure characteristics (nodes, racks, ...)
• Datanodes hold blocks
• Namenode holds the metadata
The Namenode is a critical component of an HDFS cluster: a single point of failure (SPOF) unless deployed for high availability (HA)
there’s just one little problem
you need to write Java code
however, there is hope...
STREAMINGHadoop Won’t Force Us To Use Java
Hadoop Streaming can use any script as Mapper or Reducer
many configuration options (parsers, formats, combining, …)
it works using STDIN and STDOUT
Mappers are streamed the records (usually by line: <line>\n)
and emit key/value pairs: <key>\t<value>\n
Reducers are streamed key/value pairs:
<keyA>\t<value1>\n
<keyA>\t<value2>\n
<keyA>\t<value3>\n
<keyB>\t<value4>\n
Caution: no separate Reducer process per key (but keys are sorted)
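Because there is only one Reducer process per stream, the script has to detect key changes itself. A minimal sketch in plain PHP (reduce_stream is a name invented here, not part of Hadoop or HadooPHP), relying on the keys arriving sorted:

```php
<?php
// Sketch of a streaming reducer: Hadoop Streaming hands the process sorted
// <key>\t<value>\n lines on STDIN; we sum values per key and emit a result
// line whenever the key changes.
function reduce_stream($in, $out)
{
    $currentKey = null;
    $sum = 0;
    while (($line = fgets($in)) !== false) {
        list($key, $value) = explode("\t", rtrim($line, "\n"), 2);
        if ($key !== $currentKey) {
            if ($currentKey !== null) {
                fwrite($out, "$currentKey\t$sum\n");
            }
            $currentKey = $key;
            $sum = 0;
        }
        $sum += (int) $value;
    }
    if ($currentKey !== null) {
        fwrite($out, "$currentKey\t$sum\n"); // flush the last key
    }
}

// In a real job this would simply be: reduce_stream(STDIN, STDOUT);
```

This only works because Hadoop sorts the keys before handing them to the Reducer; with unsorted input, a key could reappear after its total was already emitted.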
STREAMING WITH PHPIntroducing HadooPHP
HADOOPHP
• A little framework to help with writing mapred jobs in PHP
• Takes care of input splitting, can do basic decoding et cetera
• Automatically detects and handles Hadoop settings such as key length or field separators
• Packages jobs as one .phar archive to ease deployment
• Also creates a ready-to-rock shell script to invoke the job
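For a feel of what such an invocation looks like, here is a generic Hadoop Streaming job with PHP scripts; the jar location, HDFS paths, and script names are placeholders, not HadooPHP's actual generated output:

```shell
# Hypothetical Hadoop Streaming invocation with a PHP mapper and reducer;
# jar path and HDFS paths vary by installation.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input  /logs/access_log \
    -output /out/traffic-per-ip \
    -mapper  map.php \
    -reducer reduce.php \
    -file map.php -file reduce.php

# The same pipeline can be smoke-tested locally, with sort standing in
# for Hadoop's shuffle phase:
cat access_log | php map.php | sort | php reduce.php
```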
written by
DEMOHadoop Streaming & PHP in Action
The End
RESOURCES
• Book: Tom White: Hadoop. The Definitive Guide. O’Reilly, 2009
• Cloudera Distribution: http://www.cloudera.com/hadoop/
• Also: http://www.cloudera.com/developers/learn-hadoop/
• From this talk:
• Logs: http://infochimps.com/datasets/star-wars-kid-data-dump
• HadooPHP: http://github.com/dzuelke/hadoophp
Questions?
THANK YOU!
This was http://joind.in/6072 by @dzuelke.
Contact me or hire us: