Introduction to MapReduce using Disco

Introduction to MapReduce using Disco (Erlang and Python), by @JimRoepcke. VanPyz, June 2, 2009.


DESCRIPTION

Presented to VanPyz, the Vancouver Python User Group, in June 2009.

TRANSCRIPT

1. Introduction to MapReduce using Disco (Erlang and Python), by @JimRoepcke. VanPyz, June 2, 2009.

2. Computing at Google Scale
Image source: http://ischool.tv/news/files/2006/12/computer-grid02s.jpg
Massive databases and data streams need to be processed quickly and reliably.
Thousands of commodity PCs are available in Google's cluster for computations.
Faults are statistically guaranteed to occur.

3. Google's Motivation
Google has thousands of programs to process user-generated data.
Even simple computations were being obscured by the complex code required to run efficiently and reliably on their clusters.
Engineers shouldn't have to be experts in distributed systems to write scalable data-processing software.

4. Why not just use threads?
Threads only add concurrency, and only on one node.
They do not scale to more than one node, a cluster, or a cloud.
Coordinating work between nodes requires distribution middleware.
MapReduce is distribution middleware.
MapReduce scales linearly with cores / nodes.

5. Hadoop
Apache Foundation project.
Written in Java.
Includes the Hadoop Distributed File System.

6. Disco
Created by Ville Tuulos of the Nokia Research Center.
Written in Erlang and Python.
Does not include a distributed file system: provide your own data distribution mechanism.

7. How MapReduce works

8. The big scary diagram...

9. [Figure 1: Execution overview, from the Google MapReduce paper. The user program forks a master and worker processes; the master assigns map and reduce tasks; map workers read input splits and write intermediate files to local disk; reduce workers read those files remotely and write the output files. Source: http://labs.google.com/papers/mapreduce-osdi04.pdf]

10. It's truly very simple...

11. Master splits input
The (typically huge) input is split into chunks, one or more for each map worker.

12. Splits fed to map workers
The master tells each map worker which split(s) it will process.
A split is a file containing some number of input records.
Each record has a key and its associated value.
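The splitting step described above can be sketched in a few lines of plain Python. This is an illustration only, not Disco's API; the function and variable names here are made up for the example:

```python
def split_input(records, num_workers):
    """Divide input records into roughly equal chunks,
    one or more per map worker (illustrative sketch)."""
    size = max(1, len(records) // num_workers)
    return [records[i:i + size] for i in range(0, len(records), size)]

# Each record is a key and its associated value,
# e.g. (document name, list of words in the document).
records = [("doc%d" % i, ["some", "words"]) for i in range(10)]
splits = split_input(records, 3)
```

Every record lands in exactly one split, and a map worker may be handed more than one split.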
13. Map each input
The map worker executes your problem-specific map algorithm.
It is called once for each record in its input.

14. Map emits (Key, Value) pairs
Your map algorithm emits zero or more intermediate key-value pairs for each record processed.
Let's call these (K,V) pairs from now on.
Keys and values are both strings.

15. (K,V) pairs hashed to buckets
Each map worker has its own set of buckets.
Each (K,V) pair is placed into one of these buckets; which bucket is determined by a hash function.
Advanced: if you know the distribution of your intermediate keys is skewed, provide a custom hash function that distributes (K,V) pairs evenly.

16. Buckets sent to reducers
Once all map workers are finished, corresponding buckets of (K,V) pairs are sent to reduce workers.
Example: each map worker placed (K,V) pairs into its own buckets A, B, and C.
Send bucket A from each map to reduce worker 1; send bucket B from each map to reduce worker 2; send bucket C from each map to reduce worker 3.

17. Reduce inputs sorted
The reduce worker first concatenates the buckets it received into one file.
Then the file of (K,V) pairs is sorted by K.
Now the (K,V) pairs are grouped by key.
This sorted list of (K,V) pairs is the input to the reduce worker.

18. Reduce the list of (K,V) pairs
The reduce worker executes your problem-specific reduce algorithm.
It is called once for each key in its input, and writes whatever it wants to its output file.

19. Output
The output of the MapReduce job is the set of output files generated by the reduce workers.
What you do with this output is up to you.
You might use this output as the input to another MapReduce job.

20. Example: Counting words
Modified from source: http://labs.google.com/papers/mapreduce-osdi04.pdf

    def map(key, value):
        # key: document name (ignored)
        # value: words in document (list)
        for word in value:
            EmitIntermediate(word, 1)

    def reduce(key, values):
        # key: a word
        # values: a list of counts
        result = 0
        for v in values:
            result += int(v)
        print key, result

21. Stand up! Let's do it!
Organize yourselves into approximately equal numbers of map and reduce workers. I'll be the master.

22. Disco demonstration
Wanted to demonstrate a cool puzzle solver.
No go, but I can show the code. It's really simple!
Instead you get count_words again, but scaled way up!

    python count_words.py disco://localhost
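The whole pipeline the slides walk through (map, hash into buckets, shuffle buckets to reducers, sort, reduce) can be simulated on one machine in a few lines of Python. This is a sketch of the data flow only, not Disco's job API; all names here are invented for illustration:

```python
from itertools import groupby

def map_words(key, value):
    # key: document name (ignored); value: list of words.
    # Emits one intermediate (K,V) pair per word; keys and values are strings.
    return [(word, "1") for word in value]

def partition(pairs, num_reducers):
    # Hash each (K,V) pair into one bucket per reduce worker.
    buckets = [[] for _ in range(num_reducers)]
    for k, v in pairs:
        buckets[hash(k) % num_reducers].append((k, v))
    return buckets

def reduce_counts(key, values):
    # key: a word; values: a list of string counts for that word.
    return key, sum(int(v) for v in values)

# --- run the "job" locally ---
docs = {"a.txt": ["to", "be", "or", "not", "to", "be"],
        "b.txt": ["to", "map", "and", "to", "reduce"]}

num_reducers = 2
# Map phase: each "map worker" hashes its own output into buckets.
worker_buckets = [partition(map_words(name, words), num_reducers)
                  for name, words in docs.items()]

results = {}
for r in range(num_reducers):
    # Shuffle: concatenate bucket r from every map worker, then sort by key.
    merged = sorted(p for b in worker_buckets for p in b[r])
    # The sorted pairs are now grouped by key; reduce once per key.
    for key, group in groupby(merged, key=lambda kv: kv[0]):
        word, total = reduce_counts(key, [v for _, v in group])
        results[word] = total
```

Because identical keys always hash to the same bucket number, every occurrence of a word reaches the same reduce worker, which is what makes the per-key totals correct.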