map reduce
DESCRIPTION
Talk given by Michael Bevilaqua-Linn at Philly.rb on March 9th, 2010

TRANSCRIPT
MapReduce
A Gentle Introduction, In Four Acts
Act I
Introduction
What is Map
• Map is a higher-order procedure that takes as its arguments a procedure of one argument and a list.

>> l = (1..10)
=> 1..10
>> l.map { |i| i + 1 }
=> [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
What is Reduce
• Reduce is a higher-order procedure that takes as its arguments a procedure of two arguments and a list.
• Has some other names. In Ruby, it's inject.

>> l = (1..10)
=> 1..10
>> l.inject { |i, j| i + j }
=> 55
What Is MapReduce
• An algorithm inspired by map and reduce, used to perform 'embarrassingly parallel' computations.
• A framework based on that algorithm, used inside Google.
• A handy way to deal with large (like, really, really large) amounts of semi-structured data.
Semi-Structured Data?
The Web Is Kind Of A Mess
But There Is Some Order

<html>
  <head>
    <title>Marmots I've Loved</title>
  </head>
  <body>
    <h1>Marmot List</h1>
    <ul>
      <li>Marcy</li>
      <li>Stacy</li>
    </ul>
  </body>
</html>
12:00:23 GET /marmots/index.html
12:00:55 GET /marmots/stacy.jpg
12:01:07 GET /marmots/marcy.jpg
• So, what do you do if you've got gigabytes (or terabytes) of this sort of data, and you want to analyze it?
But What To Do With It?
• You could buy a distributed data warehouse. Pricy!
• And you still need to do ETL for everything.
• And you've got nulls all over the place.
• And maybe your schema changes. A lot.
Act II
Enter Stage Left – MapReduce
What Is Map, Part Deux
• Conceptually, it's easy to make map parallel.
• If you have 10 million records and 10 nodes, send 1 million records to each node along with the map code.
• That's it!
• Well, not really. It's a hard engineering problem. (Need a distributed data store to store results, nodes fail, and on and on…)
What Is Reduce, Part Deux
• Reduce is harder: in general, you can't split the list up among nodes and recombine the results. Evaluation order matters!
• (1 / 2 / 3 / 4) != (1 / 2) / (3 / 4)
• But what if we constrain ourselves to work only on key-value pairs?
• Then we can distribute all the records that correspond to a particular key to the same node, and get an answer for that key.
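The division example can be checked in a quick Ruby sketch (not from the talk): reducing the whole list in one pass and reducing two halves separately give different answers, because division isn't associative.

```ruby
# Reduce the whole list in one left-to-right pass: ((1/2)/3)/4.
all_at_once = [1.0, 2.0, 3.0, 4.0].inject { |a, b| a / b }

# Now pretend two nodes each reduced half the list, then we combine.
left  = [1.0, 2.0].inject { |a, b| a / b }  # 1/2
right = [3.0, 4.0].inject { |a, b| a / b }  # 3/4
split = left / right                        # (1/2) / (3/4)

# all_at_once is 1/24, split is 2/3 — the two strategies disagree,
# so a plain reduce can't be naively split across nodes.
```

With an associative operation like `+`, both strategies agree, which is exactly the property the key-value grouping trick exploits.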
What Is Reduce, Part Deux, Part Deux
• Now we're back in the same place that we were with Map: conceptually easy to make parallel, still a hard engineering problem.
• But how useful is it?
MapReduce Pseudocode
Distributed Word Count*
*This example is legally required to be in all introductions to MapReduce
map(record)
  words = split(record, ' ')
  for word in words
    emit(word, 1)

reduce(key, values)
  count = 0
  for value in values
    count += 1
  emit(key, count)
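The same word count can be sketched in plain, single-machine Ruby (not from the talk), with `group_by` standing in for the framework's shuffle step that routes all records for a key to one node:

```ruby
records = ["a rose is a rose", "a marmot is not"]

# Map: each record becomes a list of (word, 1) pairs.
pairs = records.flat_map { |record| record.split(' ').map { |word| [word, 1] } }

# Shuffle: group the pairs by key, as the framework would across nodes.
grouped = pairs.group_by { |word, _| word }

# Reduce: count the values for each key.
counts = grouped.map { |word, group| [word, group.length] }.to_h
# counts => {"a"=>3, "rose"=>2, "is"=>2, "marmot"=>1, "not"=>1}
```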
Act III
Hadoop (Streaming Mode)
Hadoop!
• Apache umbrella project (what isn’t, nowadays?)
• Open source MapReduce implementation, distributed filesystem (HDFS), non-relational data store (HBase), declarative language for processing semi-structured data (Pig).
• I've really only used the MapReduce implementation, in 'Streaming Mode'.
MapReduce Mapper
Distributed Word Count*
#!/usr/bin/ruby
STDIN.each_line do |line|
  words = line.split(' ')
  # Emit key<TAB>value — a tab, not a space, since the reducer (and
  # Hadoop Streaming's default separator) splits on "\t".
  words.each { |word| puts "#{word}\t1" }
end
MapReduce Reducer
Distributed Word Count*
#!/usr/bin/ruby
count = 0
current_word = nil

STDIN.each_line do |line|
  key, value = line.chomp.split("\t")
  current_word = key if current_word.nil?
  if key != current_word
    puts "#{current_word}\t#{count}"
    count = 0
    current_word = key
  end
  count += value.to_i
end

puts "#{current_word}\t#{count}"
Streaming Mode
• Jobs read from STDIN, write to STDOUT.
• Framework guarantees that a given reduce job will process an entire set of keys (i.e. the key 'marmot' will not be split across two nodes).
• Can use any language you want.
• Probably pretty slow, with all the STDIN/STDOUTing going on.
• Probably should use Pig instead.
Act IV
Amazon Elastic Map Reduce
So I’ve Got This Pile Of Data,Now What?
Buy A Bunch Of Servers?
Elastic Map Reduce
• Cloudy Hadoop.
• Pay for processing time by the hour.
• Works with streaming mode, regular mode, Pig.
• Kinda sorta demonstration!
Tips!
• Make sure to turn debugging on! Seriously, otherwise, world of pain.
• Don't use the console for anything complicated. Use the Ruby client (just google it).
• For multi-step MRing, don't write out to S3.