AN INTRODUCTION TO MAPREDUCE


DESCRIPTION

Presentation given at ConFoo 2010

TRANSCRIPT

Page 1: An Introduction to MapReduce

AN INTRODUCTION TO

MAPREDUCE

Page 2: An Introduction to MapReduce

David Zülke

Page 3: An Introduction to MapReduce

David Zuelke

Page 4: An Introduction to MapReduce
Page 5: An Introduction to MapReduce

http://en.wikipedia.org/wiki/File:München_Panorama.JPG

Page 6: An Introduction to MapReduce

Founder

Page 8: An Introduction to MapReduce

Lead Developer

Page 11: An Introduction to MapReduce

BEFORE WE BEGIN...
Installing Prerequisites

Page 12: An Introduction to MapReduce

I brought a pre-configured VM

Page 13: An Introduction to MapReduce

(I stole it from the nice folks over at Cloudera)

Page 14: An Introduction to MapReduce

to save some time

Page 15: An Introduction to MapReduce

• /cloudera-training-0.3.2/

• Get VMware for Windows, Linux (i386 or x86_64), or Mac OS from /vmware/ if you don’t have it.

• For Fusion, go to vmware.com and get an evaluation key.

• /php/

PLEASE COPY FROM THE HD

Page 16: An Introduction to MapReduce

(but be so kind as to pretend to be still listening)

Page 17: An Introduction to MapReduce

FROM 30,000 FEET
Distributed And Parallel Computing

Page 18: An Introduction to MapReduce

we want to process data

Page 19: An Introduction to MapReduce

how much data exactly?

Page 20: An Introduction to MapReduce

SOME NUMBERS

• Google

• Data processed per month: 400 PB (in 2007!)

• Average job size: 180 GB

• Facebook

• New data per day:

• 200 GB (March 2008)

• 2 TB (April 2009)

• 4 TB (October 2009)

Page 21: An Introduction to MapReduce

what if you have that much data?

Page 22: An Introduction to MapReduce

what if you have just 1% of that amount?

Page 23: An Introduction to MapReduce

“no problemo”, you say?

Page 24: An Introduction to MapReduce

reading 180 GB sequentially off a disk will take ~45 minutes
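
(as a rough sanity check, assuming a 2009-era disk sustains sequential reads of roughly 65–70 MB/s; that throughput figure is an assumption, not from the slides:

180 GB ≈ 184,320 MB
184,320 MB ÷ ~68 MB/s ≈ 2,700 s ≈ 45 minutes)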

Page 25: An Introduction to MapReduce

but you only have 16 GB or so of RAM per computer

Page 26: An Introduction to MapReduce

data can be processed much faster than it can be read

Page 27: An Introduction to MapReduce

solution: parallelize your I/O

Page 28: An Introduction to MapReduce

but now you need to coordinate what you’re doing

Page 29: An Introduction to MapReduce

and that’s hard

Page 30: An Introduction to MapReduce

what if a node dies?

Page 31: An Introduction to MapReduce

is data lost? will other nodes in the grid have to restart?

how do you coordinate this?

Page 32: An Introduction to MapReduce

ENTER: OUR HERO
Introducing MapReduce

Page 33: An Introduction to MapReduce

in the olden days, the workload was distributed across a grid

Page 34: An Introduction to MapReduce

but the data was shipped around between nodes

Page 35: An Introduction to MapReduce

or even stored centrally on something like a SAN

Page 36: An Introduction to MapReduce

I/O bottleneck

Page 37: An Introduction to MapReduce

Google published a paper in 2004

Page 38: An Introduction to MapReduce

MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html

Page 39: An Introduction to MapReduce

now the data is distributed

Page 40: An Introduction to MapReduce

computing happens on the nodes where the data already is

Page 41: An Introduction to MapReduce

processes are isolated and don’t communicate (share-nothing)

Page 42: An Introduction to MapReduce

BASIC PRINCIPLE: MAPPER

• A Mapper reads records and emits <key, value> pairs

• Example: Apache access.log

• Each line is a record

• Extract client IP address and number of bytes transferred

• Emit IP address as key, number of bytes as value

• For hourly rotating logs, the job can be split across 24 nodes*

* In practice, it’s a lot smarter than that

Page 43: An Introduction to MapReduce

BASIC PRINCIPLE: REDUCER

• A Reducer is given a key and all values for this specific key

• Even if there are many Mappers on many computers, the results are aggregated before they are handed to the Reducers

• Example: Apache access.log

• The Reducer is called once for each client IP (that’s our key), with a list of values (transferred bytes)

• We simply sum up the bytes to get the total traffic per IP!

Page 44: An Introduction to MapReduce

EXAMPLE OF MAPPED INPUT

IP Bytes

212.122.174.13 18271

212.122.174.13 191726

212.122.174.13 198

74.119.8.111 91272

74.119.8.111 8371

212.122.174.13 43

Page 45: An Introduction to MapReduce

REDUCER WILL RECEIVE THIS

IP Bytes

212.122.174.13  18271

212.122.174.13  191726

212.122.174.13  198

212.122.174.13  43

74.119.8.111    91272

74.119.8.111    8371

Page 46: An Introduction to MapReduce

AFTER REDUCTION

IP Bytes

212.122.174.13 210238

74.119.8.111 99643

Page 47: An Introduction to MapReduce

PSEUDOCODE

function map($line_number, $line_text) {
  $parts = parse_apache_log($line_text);
  emit($parts['ip'], $parts['bytes']);
}

function reduce($key, $values) {
  $bytes = array_sum($values);
  emit($key, $bytes);
}

212.122.174.13 210238
74.119.8.111   99643

212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 43
74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 91272
74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 8371
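
For reference, here is a minimal, runnable sketch that simulates the map → group-by-key → reduce flow from the pseudocode above in plain PHP, with no Hadoop involved. The parse_apache_log() helper from the slide is reduced to a simple regex here, and the in-memory grouping stands in for the framework’s shuffle/sort step; both are illustrative assumptions, not the framework’s actual implementation.

<?php
// Simulate MapReduce on the sample access.log lines from this slide.

// Stand-in for the parse_apache_log() helper used in the pseudocode:
// extract the client IP and the number of transferred bytes.
function parse_apache_log($line) {
    preg_match('/^(\S+)\s+\S+\s+\S+\s+\[[^\]]+\]\s+"[^"]*"\s+\d+\s+(\d+)/', $line, $m);
    return ['ip' => $m[1], 'bytes' => (int) $m[2]];
}

$log = [
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 43',
    '74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 91272',
    '74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 8371',
];

// Map step: emit <ip, bytes> pairs; collecting them by key in an array
// plays the role of the framework's shuffle/sort.
$grouped = [];
foreach ($log as $line) {
    $parts = parse_apache_log($line);
    $grouped[$parts['ip']][] = $parts['bytes'];
}

// Reduce step: called once per key with all of its values.
foreach ($grouped as $ip => $bytes) {
    echo $ip, "\t", array_sum($bytes), "\n";   // 212.122.174.13 210238 / 74.119.8.111 99643
}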

Page 48: An Introduction to MapReduce

FINGER EXERCISE
Let’s Try PHP First

Page 49: An Introduction to MapReduce

HANDS-ON
Time To Write Some Code!

Page 50: An Introduction to MapReduce

ANOTHER ELEPHANT
Introducing Apache Hadoop

Page 52: An Introduction to MapReduce

Hadoop is a MapReduce framework

Page 53: An Introduction to MapReduce

it allows us to focus on writing Mappers, Reducers etc.

Page 54: An Introduction to MapReduce

and it works extremely well

Page 55: An Introduction to MapReduce

how well exactly?

Page 56: An Introduction to MapReduce

HADOOP AT FACEBOOK

• Predominantly used in combination with Hive (~95%)

• 4800 cores with 12 TB of storage per node

• Per day:

• 4 TB of new data (compressed)

• 135 TB of data scanned (compressed)

• 7500+ Hive jobs per day, ~80k compute hours

http://www.slideshare.net/cloudera/hw09-rethinking-the-data-warehouse-with-hadoop-and-hive

Page 57: An Introduction to MapReduce

HADOOP AT YAHOO!

• Over 25,000 computers with over 100,000 CPUs

• Biggest Cluster:

• 4000 Nodes

• 2x4 CPU cores each

• 16 GB RAM each

• Over 40% of jobs run using Pig

http://wiki.apache.org/hadoop/PoweredBy

Page 58: An Introduction to MapReduce

there’s just one little problem

Page 59: An Introduction to MapReduce

it’s written in Java

Page 60: An Introduction to MapReduce

however, there is hope...

Page 61: An Introduction to MapReduce

STREAMING
Hadoop Won’t Force Us To Use Java

Page 62: An Introduction to MapReduce

Hadoop Streaming can use any script as Mapper or Reducer

Page 63: An Introduction to MapReduce

many configuration options (parsers, formats, combining, …)

Page 64: An Introduction to MapReduce

it works using STDIN and STDOUT

Page 65: An Introduction to MapReduce

Mappers are streamed the records (usually one per line: <byteoffset>\t<line>\n)

and emit key/value pairs: <key>\t<value>\n

Page 66: An Introduction to MapReduce

Reducers are streamed key/value pairs:
<keyA>\t<value1>\n
<keyA>\t<value2>\n
<keyA>\t<value3>\n
<keyB>\t<value4>\n

Page 67: An Introduction to MapReduce

Caution: no separate Reducer processes per key (but keys are sorted)
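
To make the protocol concrete, here is a minimal Streaming sketch in PHP for the access.log example. The file names mapper.php and reducer.php, the log-parsing regex, and the assumption that records arrive as raw log lines (possibly prefixed with a byte offset, as noted above) are illustrative choices, not prescribed by the slides. The Reducer shows how to cope with the caveat on this slide: one process sees many keys, so it sums values until the key changes in the sorted stream.

mapper.php:

#!/usr/bin/env php
<?php
// Streaming Mapper: read records from STDIN, emit <ip>\t<bytes>\n on STDOUT.
while (($line = fgets(STDIN)) !== false) {
    $line = rtrim($line, "\r\n");
    // Depending on the input format, a record may arrive as
    // "<byteoffset>\t<line>" (see the slide above); strip that prefix.
    if (preg_match('/^\d+\t(.*)$/', $line, $m)) {
        $line = $m[1];
    }
    // Common Log Format: <ip> - - [date] "request" <status> <bytes>
    if (preg_match('/^(\S+)\s+\S+\s+\S+\s+\[[^\]]+\]\s+"[^"]*"\s+\d+\s+(\d+)/', $line, $m)) {
        echo $m[1], "\t", $m[2], "\n";
    }
}

reducer.php:

#!/usr/bin/env php
<?php
// Streaming Reducer: input is <key>\t<value>\n lines, sorted by key.
// There is no separate process per key, so detect key changes manually.
$currentKey = null;
$sum = 0;
while (($line = fgets(STDIN)) !== false) {
    $line = rtrim($line, "\r\n");
    if (strpos($line, "\t") === false) {
        continue;   // skip empty or malformed lines
    }
    list($key, $value) = explode("\t", $line, 2);
    if ($key !== $currentKey) {
        if ($currentKey !== null) {
            echo $currentKey, "\t", $sum, "\n";   // all values for the previous key are done
        }
        $currentKey = $key;
        $sum = 0;
    }
    $sum += (int) $value;
}
if ($currentKey !== null) {
    echo $currentKey, "\t", $sum, "\n";   // flush the last key
}

A job could then be launched with the streaming jar roughly like this (the exact jar path depends on the Hadoop version, so treat it as a placeholder):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input access_logs -output traffic \
  -mapper mapper.php -reducer reducer.php \
  -file mapper.php -file reducer.php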

Page 68: An Introduction to MapReduce

HANDS-ON
Let’s Say Hello To Our Hadoop VM

Page 69: An Introduction to MapReduce

THE HADOOP ECOSYSTEM
A Little Tour

Page 70: An Introduction to MapReduce

APACHE AVRO
Efficient Data Serialization System With Schemas

(compare: Facebook’s Thrift)

Page 71: An Introduction to MapReduce

APACHE CHUKWA
Distributed Data Collection System

(compare: Facebook’s Scribe)

Page 72: An Introduction to MapReduce

APACHE HBASE
Like Google’s BigTable, Only That You Can Have It, Too!

Page 73: An Introduction to MapReduce

HDFS
Your Friendly Distributed File System

Page 74: An Introduction to MapReduce

HIVE
Data Warehousing Made Simple With An SQL Interface

Page 75: An Introduction to MapReduce

PIG
A High-Level Language For Modelling Data Processing Tasks

Page 76: An Introduction to MapReduce

ZOOKEEPER
Your Distributed Applications, Coordinated

Page 77: An Introduction to MapReduce

The End

Page 78: An Introduction to MapReduce

Questions?

Page 79: An Introduction to MapReduce

THANK YOU!
This was http://joind.in/1394
by @dzuelke