MapReduce Paradigm
Dilip Reddy Kancharla, Spring 2012

TRANSCRIPT

1. MapReduce Paradigm
   Dilip Reddy Kancharla, Spring 2012

2. Outline
   - Introduction
   - Motivating example
   - Hadoop
   - Hadoop MapReduce
   - HDFS
   - Pros and cons of MapReduce
   - Hadoop applicability to different workflows
   - Conclusions and future work

3. Execution Overview [DG08]
   (figure: the user program forks a master and workers; the master assigns map and reduce tasks; map workers read input splits and write intermediate key/value pairs to local disk; reduce workers read them remotely and write the output files; pipeline: input files, map phase, intermediate files, reduce phase, output files)

4. MapReduce Paradigm
   - Splits input files into blocks (typically 64 MB each)
   - Operates on key/value pairs
   - Mappers filter and transform input data
   - Reducers aggregate the mappers' output
   - Efficient way to use the cluster: move code to data, run code on all machines

5. Map and Reduce
   - Map: (k1, v1) -> list(k2, v2)
   - A hash function groups the intermediate pairs by k2
   - Reduce (aggregate function): (k2, list(v2)) -> list(k3, v3)
   (a WordCount sketch follows slide 17)

6. Advanced MapReduce
   - Hadoop Streaming: lets you run mappers and reducers written in other languages such as Python, Ruby, etc. (see the streaming command after slide 17)
   - Chaining MapReduce jobs
   - Joining data
   - Bloom filters

7. Hadoop
   - Open-source implementation of MapReduce by the Apache Software Foundation; created by Doug Cutting
   - Derived from Google's MapReduce and Google File System (GFS) papers
   - Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license
   - It enables applications to work with thousands of computationally independent computers and petabytes of data

8. Hadoop Architecture
   Hadoop MapReduce:
   - Single master node, many worker nodes
   - Client submits a job to the master node
   - Master splits each job into map and reduce tasks and assigns tasks to worker nodes (see the driver sketch after slide 17)
   Hadoop Distributed File System (HDFS):
   - Single name node, many data nodes
   - Files stored as large, fixed-size (e.g., 64 MB) blocks
   - HDFS typically holds map input and reduce output

9. Hadoop Architecture
   (figure: a NameNode, Secondary NameNode, and JobTracker coordinate DataNodes, each running a TaskTracker that hosts map and reduce tasks)

10. Job Scheduling in Hadoop
   - One map task for each block of the input file; applies the user-defined map function to each record (key/value pair) in the block
   - User-defined number of reduce tasks; each reduce task is assigned a set of record groups (see the partitioner sketch after slide 17)
   - For each group, the user-defined reduce function is applied to the record values in that group
   - Reduce tasks read from every map task; each read returns the record groups for that reduce task

11. Dataflow in Hadoop
   - Map tasks write their output to local disk; the output becomes available after the map task has completed
   - Reduce tasks write their output to HDFS
   - Once a job is finished, the next job's map tasks can be scheduled and will read their input from HDFS
   - Fault tolerance is therefore simple: just re-run tasks on failure; no consumers see partial operator output

12. Dataflow in Hadoop [CAHER10]
   (figure: the client submits a job; the master schedules its map and reduce tasks)

13. Dataflow in Hadoop [CAHER10]
   (figure: map tasks read the input file's blocks from HDFS)

14. Dataflow in Hadoop [CAHER10]
   (figure: map output is written to the local file system and fetched by reduce tasks over HTTP GET)

15. Dataflow in Hadoop [CAHER10]
   (figure: reduce tasks write the final answer to HDFS)

16. HDFS
   - Data is distributed and replicated over multiple machines
   - Files are not stored contiguously on servers; they are broken up into blocks
   - Designed for large files (large means GB or TB); block oriented
   - Linux-style commands (e.g., ls, cp, mkdir, mv) (see the shell commands after slide 17)

17. Different Workflows [MTAGS11]
   (figure)
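To make the map and reduce signatures on slides 4 and 5 concrete, here is a minimal WordCount sketch against the org.apache.hadoop.mapreduce API. The class names are illustrative, not from the slides: the mapper emits a (word, 1) pair for every token, and the reducer sums the list of counts grouped under each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: (k1, v1) = (byte offset, line of text) -> list(k2, v2) = (word, 1)
    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);  // emit one intermediate pair per token
        }
      }
    }

    // Reduce: (k2, list(v2)) = (word, [1, 1, ...]) -> list(k3, v3) = (word, count)
    class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();  // aggregate the mapper output for this key
        }
        context.write(key, new IntWritable(sum));
      }
    }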
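Slide 6's Hadoop Streaming runs any executable that reads records on stdin and writes key/value pairs on stdout. A hedged sketch of the invocation: mapper.py and reducer.py are hypothetical scripts, and the streaming jar's exact path varies by Hadoop version.

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -input  /user/dilip/input \
      -output /user/dilip/output \
      -mapper mapper.py \
      -reducer reducer.py \
      -file mapper.py -file reducer.py   # ship the scripts to the worker nodes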
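For the job-submission flow on slide 8, a minimal driver sketch, assuming the WordCountMapper and WordCountReducer classes above: the client configures a Job and submits it to the master, which splits it into map and reduce tasks and assigns them to worker nodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);  // user-defined number of reduce tasks (slide 10)
        FileInputFormat.addInputPath(job, new Path(args[0]));    // map input from HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // reduce output to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait
      }
    }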
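Slide 10's assignment of record groups to reduce tasks is decided by a partitioner. This sketch reproduces the logic of Hadoop's default HashPartitioner: every intermediate pair with the same key gets the same partition number, so a single reduce task sees the whole group.

    import org.apache.hadoop.mapreduce.Partitioner;

    // Same logic as Hadoop's default HashPartitioner; the class name is illustrative.
    public class SketchPartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }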
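The Linux-style commands on slide 16 are exposed through the hadoop fs shell; the paths below are made up for illustration.

    hadoop fs -mkdir /user/dilip/input                         # like mkdir
    hadoop fs -put books.txt /user/dilip/input                 # copy a local file into HDFS
    hadoop fs -ls /user/dilip/input                            # like ls
    hadoop fs -cp /user/dilip/input/books.txt /tmp/books.txt   # like cp
    hadoop fs -mv /tmp/books.txt /tmp/books-moved.txt          # like mv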
18. Hadoop Applicability by Workflow [MTAGS11]
   Score meaning:
   - 0 implies easily adaptable to the workflow
   - 0.5 implies moderately adaptable to the workflow
   - 1 indicates one of the potential workflow areas where Hadoop needs improvement

19. Relative Merits and Demerits of Hadoop over DBMS
   Pros:
   - Fault tolerance
   - Self-healing: rebalances files across the cluster
   - Highly scalable
   - Highly flexible, as it has no dependency on a data model or schema
   Cons:
   - No high-level language like the SQL of a DBMS
   - No schema and no indexes
   - Low efficiency
   - Very young (since 2004) compared to over 40 years of DBMS history

   Hadoop                        | Relational DBMS
   ------------------------------+--------------------------
   Scale out (add more machines) | Scaling is difficult
   Key/value pairs               | Tables
   Say how to process the data   | Say what you want (SQL)
   Offline / batch               | Online / realtime

20. Conclusions and Future Work
   - MapReduce is easy to program
   - Hadoop = HDFS + MapReduce: distributed, parallel processing
   - Designed for fault tolerance and high scalability
   - MapReduce is unlikely to substitute for a DBMS in data warehousing; instead, we expect the two to complement each other and help in the analysis of scientific data patterns
   - Finally, efficiency, and especially I/O cost, needs to be addressed for successful implementations

21. References
   [LLCCM12] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon, "Parallel data processing with MapReduce: a survey," SIGMOD Record, January 2012, pp. 11-20.
   [MTAGS11] Elif Dede, Madhusudhan Govindaraju, Daniel Gunter, and Lavanya Ramakrishnan, "Riding the Elephant: Managing Ensembles with Hadoop," Proceedings of the 2011 ACM International Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS '11), ACM, New York, NY, USA, pp. 49-58.
   [DG08] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, January 2008, pp. 107-113.
   [CAHER10] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears, "MapReduce online," Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI '10), USENIX Association, Berkeley, CA, USA, 2010, pp. 21-37.

22. Thank You! Questions?