AN INTRODUCTION TO MAPREDUCE


DESCRIPTION

Presentation given at ConFoo 2010

TRANSCRIPT

Page 1: An Introduction to MapReduce

AN INTRODUCTION TO

MAPREDUCE

Page 2: An Introduction to MapReduce

David Zülke

Page 3: An Introduction to MapReduce

David Zuelke

Page 4: An Introduction to MapReduce
Page 5: An Introduction to MapReduce

http://en.wikipedia.org/wiki/File:München_Panorama.JPG

Page 6: An Introduction to MapReduce

Founder

Page 8: An Introduction to MapReduce

Lead Developer

Page 11: An Introduction to MapReduce

BEFORE WE BEGIN...
Installing Prerequisites

Page 12: An Introduction to MapReduce

I brought a pre-configured VM

Page 13: An Introduction to MapReduce

(I stole it from the nice folks over at Cloudera)

Page 14: An Introduction to MapReduce

to save some time

Page 15: An Introduction to MapReduce

• /cloudera-training-0.3.2/

• Get VMware for Windows, Linux (i386 or x86_64), or Mac OS from /vmware/ if you don’t have it.

• For Fusion, go to vmware.com and get an evaluation key.

• /php/

PLEASE COPY FROM THE HD

Page 16: An Introduction to MapReduce

(but be so kind as to pretend to be still listening)

Page 17: An Introduction to MapReduce

FROM 30,000 FEET
Distributed And Parallel Computing

Page 18: An Introduction to MapReduce

we want to process data

Page 19: An Introduction to MapReduce

how much data exactly?

Page 20: An Introduction to MapReduce

SOME NUMBERS

• Google

• Data processed per month: 400 PB (in 2007!)

• Average job size: 180 GB

• Facebook

• New data per day:

• 200 GB (March 2008)

• 2 TB (April 2009)

• 4 TB (October 2009)

Page 21: An Introduction to MapReduce

what if you have that much data?

Page 22: An Introduction to MapReduce

what if you have just 1% of that amount?

Page 23: An Introduction to MapReduce

“no problemo”, you say?

Page 24: An Introduction to MapReduce

reading 180 GB sequentially off a disk will take ~45 minutes
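
(as a rough sanity check, assuming a 2009-era disk sustains sequential reads of roughly 65–70 MB/s; that throughput figure is an assumption, not from the slides:

180 GB ≈ 184,320 MB
184,320 MB ÷ ~68 MB/s ≈ 2,700 s ≈ 45 minutes)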

Page 25: An Introduction to MapReduce

but you only have 16 GB or so of RAM per computer

Page 26: An Introduction to MapReduce

data can be processed much faster than it can be read

Page 27: An Introduction to MapReduce

solution: parallelize your I/O

Page 28: An Introduction to MapReduce

but now you need to coordinate what you’re doing

Page 29: An Introduction to MapReduce

and that’s hard

Page 30: An Introduction to MapReduce

what if a node dies?

Page 31: An Introduction to MapReduce

is data lost? will other nodes in the grid have to restart?

how do you coordinate this?

Page 32: An Introduction to MapReduce

ENTER: OUR HERO
Introducing MapReduce

Page 33: An Introduction to MapReduce

in the olden days, the workload was distributed across a grid

Page 34: An Introduction to MapReduce

but the data was shipped around between nodes

Page 35: An Introduction to MapReduce

or even stored centrally on something like a SAN

Page 36: An Introduction to MapReduce

I/O bottleneck

Page 37: An Introduction to MapReduce

Google published a paper in 2004

Page 38: An Introduction to MapReduce

MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html

Page 39: An Introduction to MapReduce

now the data is distributed

Page 40: An Introduction to MapReduce

computing happens on the nodes where the data already is

Page 41: An Introduction to MapReduce

processes are isolated and don’t communicate (share-nothing)

Page 42: An Introduction to MapReduce

BASIC PRINCIPLE: MAPPER

• A Mapper reads records and emits <key, value> pairs

• Example: Apache access.log

• Each line is a record

• Extract client IP address and number of bytes transferred

• Emit IP address as key, number of bytes as value

• For hourly rotating logs, the job can be split across 24 nodes*

* In practice, it’s a lot smarter than that

Page 43: An Introduction to MapReduce

BASIC PRINCIPLE: REDUCER

• A Reducer is given a key and all values for this specific key

• Even if there are many Mappers on many computers, the results are aggregated before they are handed to the Reducers

• Example: Apache access.log

• The Reducer is called once for each client IP (that’s our key), with a list of values (transferred bytes)

• We simply sum up the bytes to get the total traffic per IP!

Page 44: An Introduction to MapReduce

EXAMPLE OF MAPPED INPUT

IP Bytes

212.122.174.13 18271

212.122.174.13 191726

212.122.174.13 198

74.119.8.111 91272

74.119.8.111 8371

212.122.174.13 43

Page 45: An Introduction to MapReduce

REDUCER WILL RECEIVE THIS

IP Bytes

212.122.174.13  18271

212.122.174.13  191726

212.122.174.13  198

212.122.174.13  43

74.119.8.111    91272

74.119.8.111    8371

Page 46: An Introduction to MapReduce

AFTER REDUCTION

IP Bytes

212.122.174.13 210238

74.119.8.111 99643

Page 47: An Introduction to MapReduce

PSEUDOCODE

function map($line_number, $line_text) {
  $parts = parse_apache_log($line_text);
  emit($parts['ip'], $parts['bytes']);
}

function reduce($key, $values) {
  $bytes = array_sum($values);
  emit($key, $bytes);
}

212.122.174.13 210238
74.119.8.111   99643

212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 43
74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 91272
74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 8371
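
For reference, here is a minimal, runnable sketch that simulates the map → group-by-key → reduce flow from the pseudocode above in plain PHP, with no Hadoop involved. The parse_apache_log() helper from the slide is reduced to a simple regex here, and the in-memory grouping stands in for the framework’s shuffle/sort step; both are illustrative assumptions, not the framework’s actual implementation.

<?php
// Simulate MapReduce on the sample access.log lines from this slide.

// Stand-in for the parse_apache_log() helper used in the pseudocode:
// extract the client IP and the number of transferred bytes.
function parse_apache_log($line) {
    preg_match('/^(\S+)\s+\S+\s+\S+\s+\[[^\]]+\]\s+"[^"]*"\s+\d+\s+(\d+)/', $line, $m);
    return ['ip' => $m[1], 'bytes' => (int) $m[2]];
}

$log = [
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 43',
    '74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 91272',
    '74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 8371',
];

// Map step: emit <ip, bytes> pairs; collecting them by key in an array
// plays the role of the framework's shuffle/sort.
$grouped = [];
foreach ($log as $line) {
    $parts = parse_apache_log($line);
    $grouped[$parts['ip']][] = $parts['bytes'];
}

// Reduce step: called once per key with all of its values.
foreach ($grouped as $ip => $bytes) {
    echo $ip, "\t", array_sum($bytes), "\n";   // 212.122.174.13 210238 / 74.119.8.111 99643
}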

Page 48: An Introduction to MapReduce

FINGER EXERCISE
Let’s Try PHP First

Page 49: An Introduction to MapReduce

HANDS-ON
Time To Write Some Code!

Page 50: An Introduction to MapReduce

ANOTHER ELEPHANT
Introducing Apache Hadoop

Page 52: An Introduction to MapReduce

Hadoop is a MapReduce framework

Page 53: An Introduction to MapReduce

it allows us to focus on writing Mappers, Reducers etc.

Page 54: An Introduction to MapReduce

and it works extremely well

Page 55: An Introduction to MapReduce

how well exactly?

Page 56: An Introduction to MapReduce

HADOOP AT FACEBOOK

• Predominantly used in combination with Hive (~95%)

• 4800 cores with 12 TB of storage per node

• Per day:

• 4 TB of new data (compressed)

• 135 TB of data scanned (compressed)

• 7500+ Hive jobs per day, ~80k compute hours

http://www.slideshare.net/cloudera/hw09-rethinking-the-data-warehouse-with-hadoop-and-hive

Page 57: An Introduction to MapReduce

HADOOP AT YAHOO!

• Over 25,000 computers with over 100,000 CPUs

• Biggest Cluster:

• 4000 Nodes

• 2x4 CPU cores each

• 16 GB RAM each

• Over 40% of jobs run using Pig

http://wiki.apache.org/hadoop/PoweredBy

Page 58: An Introduction to MapReduce

there’s just one little problem

Page 59: An Introduction to MapReduce

it’s written in Java

Page 60: An Introduction to MapReduce

however, there is hope...

Page 61: An Introduction to MapReduce

STREAMING
Hadoop Won’t Force Us To Use Java

Page 62: An Introduction to MapReduce

Hadoop Streaming can use any script as Mapper or Reducer

Page 63: An Introduction to MapReduce

many configuration options (parsers, formats, combining, …)

Page 64: An Introduction to MapReduce

it works using STDIN and STDOUT

Page 65: An Introduction to MapReduce

Mappers are streamed the records (usually one per line: <byteoffset>\t<line>\n)

and emit key/value pairs: <key>\t<value>\n

Page 66: An Introduction to MapReduce

Reducers are streamed key/value pairs:
<keyA>\t<value1>\n
<keyA>\t<value2>\n
<keyA>\t<value3>\n
<keyB>\t<value4>\n

Page 67: An Introduction to MapReduce

Caution: no separate Reducer processes per key (but keys are sorted)
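
To make the protocol concrete, here is a minimal Streaming sketch in PHP for the access.log example. The file names mapper.php and reducer.php, the log-parsing regex, and the assumption that records arrive as raw log lines (possibly prefixed with a byte offset, as noted above) are illustrative choices, not prescribed by the slides. The Reducer shows how to cope with the caveat on this slide: one process sees many keys, so it sums values until the key changes in the sorted stream.

mapper.php:

#!/usr/bin/env php
<?php
// Streaming Mapper: read records from STDIN, emit <ip>\t<bytes>\n on STDOUT.
while (($line = fgets(STDIN)) !== false) {
    $line = rtrim($line, "\r\n");
    // Depending on the input format, a record may arrive as
    // "<byteoffset>\t<line>" (see the slide above); strip that prefix.
    if (preg_match('/^\d+\t(.*)$/', $line, $m)) {
        $line = $m[1];
    }
    // Common Log Format: <ip> - - [date] "request" <status> <bytes>
    if (preg_match('/^(\S+)\s+\S+\s+\S+\s+\[[^\]]+\]\s+"[^"]*"\s+\d+\s+(\d+)/', $line, $m)) {
        echo $m[1], "\t", $m[2], "\n";
    }
}

reducer.php:

#!/usr/bin/env php
<?php
// Streaming Reducer: input is <key>\t<value>\n lines, sorted by key.
// There is no separate process per key, so detect key changes manually.
$currentKey = null;
$sum = 0;
while (($line = fgets(STDIN)) !== false) {
    $line = rtrim($line, "\r\n");
    if (strpos($line, "\t") === false) {
        continue;   // skip empty or malformed lines
    }
    list($key, $value) = explode("\t", $line, 2);
    if ($key !== $currentKey) {
        if ($currentKey !== null) {
            echo $currentKey, "\t", $sum, "\n";   // all values for the previous key are done
        }
        $currentKey = $key;
        $sum = 0;
    }
    $sum += (int) $value;
}
if ($currentKey !== null) {
    echo $currentKey, "\t", $sum, "\n";   // flush the last key
}

A job could then be launched with the streaming jar roughly like this (the exact jar path depends on the Hadoop version, so treat it as a placeholder):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input access_logs -output traffic \
  -mapper mapper.php -reducer reducer.php \
  -file mapper.php -file reducer.php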

Page 68: An Introduction to MapReduce

HANDS-ON
Let’s Say Hello To Our Hadoop VM

Page 69: An Introduction to MapReduce

THE HADOOP ECOSYSTEM
A Little Tour

Page 70: An Introduction to MapReduce

APACHE AVRO
Efficient Data Serialization System With Schemas

(compare: Facebook’s Thrift)

Page 71: An Introduction to MapReduce

APACHE CHUKWA
Distributed Data Collection System

(compare: Facebook’s Scribe)

Page 72: An Introduction to MapReduce

APACHE HBASE
Like Google’s BigTable, Only That You Can Have It, Too!

Page 73: An Introduction to MapReduce

HDFS
Your Friendly Distributed File System

Page 74: An Introduction to MapReduce

HIVE
Data Warehousing Made Simple With An SQL Interface

Page 75: An Introduction to MapReduce

PIG
A High-Level Language For Modelling Data Processing Tasks

Page 76: An Introduction to MapReduce

ZOOKEEPER
Your Distributed Applications, Coordinated

Page 77: An Introduction to MapReduce

The End

Page 78: An Introduction to MapReduce

Questions?

Page 79: An Introduction to MapReduce

THANK YOU!
This was http://joind.in/1394
by @dzuelke