OutlineMAPREDUCE11
Talks in HADOOP WORLD 2010Other Interesting Papers
Paper IntroductionOther Papers
An Introduction of Recent Research on MapReduce
Yu Liu
The Graduate University for Advanced Studies
July 8th, 2011
Yu Liu An Introduction of Recent Research on MapReduce
Outline
1 Papers in MAPREDUCE11
2 Talks in HADOOP WORLD 2010
3 Other Interesting Papers
MAPREDUCE11 Sessions
1 Environments and Extensions to the MapReduce Programming Model
2 MapReduce Applications
3 Performance and Feature Improvements of MapReduce
4 Keynote by Greg Malewicz, Google Research: Beyond MapReduce
Paper List
1 Otus: Resource Attribution and Metrics Correlation in Data-Intensive Clusters (1)
2 Phoenix++: Modular MapReduce for Shared-Memory Systems (1)
3 Static Type Checking of Hadoop MapReduce Programs (1)
4 Tall and Skinny QR factorizations in MapReduce architectures (2)
5 Rapid Parallel Genome Indexing with MapReduce (2)
6 Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics (2)
7 MapReducing a Genomic Sequencing Workflow (2)
8 Exploring MapReduce Efficiency with Highly-Distributed Data (3)
9 Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching (3)
The home page
Tyson Condie et al.: MapReduce Online, NSDI’10
James Demmel et al.: Communication-avoiding parallel and sequential QR factorizations, EECS-2008-74
Otus: Resource Attribution in Data-Intensive Clusters
Authors: Kai Ren, Julio Lopez, Garth Gibson @ Carnegie Mellon University
Basic content of this paper:
An approach for facilitating performance analyses of distributed data-intensive applications
Background:
Understanding the resource requirements of frameworks like Hadoop, Dryad, etc., and the performance characteristics of the applications is inherently difficult due to the distributed nature and scale of the computing platform.
Otus: Resource Attribution in Data-Intensive Clusters
Problems:
Traditional cluster monitoring tools fail to provide the information needed to answer fundamental questions about application performance in data-intensive environments.
Solutions:
Attribute resource utilization to the components of interest in the different layers of the cluster software stack. The data is then correlated to infer the resource utilization of each service component and job process in the cluster.
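The attribution idea can be sketched as follows. This is only a minimal illustration, not the Otus implementation: the sample format, the `owners` mapping and the function name `attribute` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical per-process samples: (timestamp, node, pid, cpu_seconds, rss_mb).
samples = [
    (0, "node1", 101, 2.0, 300),
    (0, "node1", 102, 1.0, 200),
    (0, "node2", 201, 4.0, 500),
    (0, "node1", 900, 0.5, 150),
]

# Hypothetical mapping from (node, pid) to the owning job or service.
owners = {
    ("node1", 101): "job_A",
    ("node1", 102): "job_B",
    ("node2", 201): "job_A",
    ("node1", 900): "datanode",
}

def attribute(samples, owners):
    """Sum CPU and memory per owner; unknown processes are counted as 'other'."""
    usage = defaultdict(lambda: {"cpu_seconds": 0.0, "rss_mb": 0})
    for _ts, node, pid, cpu, rss in samples:
        owner = owners.get((node, pid), "other")
        usage[owner]["cpu_seconds"] += cpu
        usage[owner]["rss_mb"] += rss
    return dict(usage)

usage = attribute(samples, owners)
```

Here `usage["job_A"]` aggregates the processes of one job across two nodes, which is the kind of per-job view that node-level monitoring alone does not give.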
Phoenix++: Modular MapReduce for Shared-Memory Systems
The Phoenix home page
Authors: Justin Talbot, Richard M. Yoo, Christos Kozyrakis @ Computer Systems Laboratory, Stanford University
Basic content of this paper:
Phoenix is a shared-memory implementation of Google’s MapReduce. Phoenix++ is a new implementation that, according to this paper, achieves a 4.7-fold performance improvement and increased scalability.
Problems:
Performance issues of Phoenix: it adopts a static MapReduce pipeline similar to cluster-based implementations.
Inefficient Key-Value Storage
Ineffective Combiner
Exposed Task Chunking
Solutions:
Abstractions for intermediate data: Containers
More effective combiner implementation: Combiner Objects
Hide the task chunking granularity
Other Modularity in Phoenix++
Sort is optional.
Custom sorting functions can be defined over key-value pairs
Custom memory allocators.
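The container and combiner-object ideas, together with the optional sort, can be sketched in a few lines. Phoenix++ itself is a C++ library; the Python classes below (`SumCombiner`, `HashContainer`) are hypothetical names used only to illustrate the design, not its real API.

```python
class SumCombiner:
    """Combiner object: keeps one running total instead of a list of values."""

    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value

    def result(self):
        return self.total


class HashContainer:
    """Container for intermediate data: hash table from key to combiner object."""

    def __init__(self, combiner_factory):
        self.combiner_factory = combiner_factory
        self.table = {}

    def emit(self, key, value):
        # Values are combined as soon as they are emitted by the map function.
        if key not in self.table:
            self.table[key] = self.combiner_factory()
        self.table[key].add(value)

    def items(self, sort=False):
        # Sorting is optional; callers that do not need ordered output skip it.
        pairs = [(key, comb.result()) for key, comb in self.table.items()]
        return sorted(pairs) if sort else pairs


def word_count(lines):
    container = HashContainer(SumCombiner)
    for line in lines:                 # map phase
        for word in line.split():
            container.emit(word, 1)
    return container.items(sort=True)


counts = word_count(["a b a", "b a"])
```

Because the combiner is applied at emit time, the intermediate storage holds one value per key rather than every emitted pair, which is the inefficiency the slide attributes to the original Phoenix.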
Static Type Checking of Hadoop MapReduce Programs
Authors: Jens Dorre, Sven Apel, Christian Lengauer @ University of Passau, Germany
Basic content of this paper:
Provide a static check for Hadoop programs without asking the user to write any more code.
Background:
Higher-order functions of functional languages can be strongly typed using parametric polymorphism, but in Hadoop the connection between the two phases of a MapReduce computation is unsafe: there is no static type check of the generic type parameters involved.
Problems:
In many MapReduce implementations, MapReduce programs are not type checked at compile time.
Solutions:
A static type checker for Hadoop, using the Java 5 compiler.
Users write code in the main function using the combinators.
The Hadoop job configuration can be generated automatically from the combinator code.
The real code:
Two important functions:
check: uses a chaining combinator to check the interface between the mapper and the combiner function, and another one to check the interface between the result and the reducer function.
configureTypeSafeJob: generates the Hadoop job configuration.
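The paper relies on Java generics and the Java compiler, so the check happens at compile time. The sketch below is only a rough Python analogue, under the assumption that mapper and reducer carry type annotations: it compares the annotations when the job is configured, before any data is processed. The functions `check` and `configure_type_safe_job` are hypothetical stand-ins, not the paper's API.

```python
from typing import List, Tuple, get_args, get_type_hints


def tokenize(line: str) -> List[Tuple[str, int]]:
    return [(word, 1) for word in line.split()]


def sum_counts(key: str, values: List[int]) -> Tuple[str, int]:
    return key, sum(values)


def bad_reducer(key: int, values: List[int]) -> Tuple[int, int]:
    return key, sum(values)


def check(mapper, reducer):
    """Raise TypeError unless mapper output (K, V) fits reducer input (K, List[V])."""
    map_out = get_type_hints(mapper)["return"]      # List[Tuple[K, V]]
    key_type, value_type = get_args(get_args(map_out)[0])
    red_in = get_type_hints(reducer)
    if red_in["key"] != key_type or red_in["values"] != List[value_type]:
        raise TypeError("mapper output does not match reducer input")
    return key_type, value_type


def configure_type_safe_job(mapper, reducer):
    """Build a (hypothetical) job configuration only if the interfaces match."""
    key_type, value_type = check(mapper, reducer)
    return {
        "mapper": mapper,
        "reducer": reducer,
        "map_output_key": key_type,
        "map_output_value": value_type,
    }


config = configure_type_safe_job(tokenize, sum_counts)
```

Calling `configure_type_safe_job(tokenize, bad_reducer)` raises `TypeError`, because the mapper emits `str` keys while the reducer declares `int` keys.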
Tall and Skinny QR factorizations in MapReduce architectures
Authors: Paul G. Constantine (1), David F. Gleich (2)
(1) Sandia National Laboratories, Albuquerque; (2) Sandia National Laboratories, Livermore
Basic content of this paper:
Implementation of the tall and skinny QR (TSQR) factorization in the MapReduce framework
Background:
Demmel et al. derived a communication-avoiding version of the QR factorization (CAQR) that trades flops for messages and is ideal for MapReduce, where computationally intensive processes operate locally on subsets of the data.
The Implementation
1. multi-Mapper-single-Reducer
2. 2 iterations of Map-Reduce
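A minimal sketch of the multi-mapper, single-reducer scheme: each mapper computes the R factor of its own block of rows, and the reducer stacks the small R factors and factorizes them once more. This is an illustration only, not the authors' code; it uses modified Gram-Schmidt for brevity and assumes every block has full column rank and at least as many rows as columns.

```python
import math


def local_r(rows):
    """Upper-triangular R of a QR factorization of `rows` (modified Gram-Schmidt)."""
    n = len(rows[0])
    cols = [[float(row[j]) for row in rows] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        q = [x / r[j][j] for x in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - r[j][k] * a for a, b in zip(q, cols[k])]
    return r


def tsqr_r(blocks):
    """Map: local R per block. Reduce: R of the stacked local R factors."""
    local_factors = [local_r(block) for block in blocks]          # mappers
    stacked = [row for factor in local_factors for row in factor]
    return local_r(stacked)                                       # single reducer


blocks = [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 10], [11, 13]],
]
r_tsqr = tsqr_r(blocks)
r_direct = local_r([row for block in blocks for row in block])
```

Up to rounding, `r_tsqr` equals `r_direct`, the R factor of the whole matrix, while only the small n-by-n factors have to be sent from the mappers to the reducer.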
It seems they don’t know our work... I think we can do better.
Rapid Parallel Genome Indexing with MapReduce
Authors: Rohith K. Menon et al. @ Department of Computer Science, Stony Brook University
Basic content of this paper:
A novel parallel algorithm for constructing the suffix array and the Burrows-Wheeler Transform (BWT) of a sequence, leveraging the unique features of the MapReduce parallel programming model.
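To make the two data structures concrete, here is a small sketch in MapReduce style. It is a simplification and not the paper's algorithm: suffixes are partitioned by their first character only, each partition is sorted independently, and the input is assumed to end with a unique terminator `$` that sorts before all other characters.

```python
from collections import defaultdict


def map_suffixes(text):
    """Map: emit (first character, start position) for every suffix."""
    for i in range(len(text)):
        yield text[i], i


def reduce_bucket(text, positions):
    """Reduce: sort the suffixes that fall into one bucket."""
    return sorted(positions, key=lambda i: text[i:])


def suffix_array(text):
    buckets = defaultdict(list)
    for key, pos in map_suffixes(text):        # shuffle: group by key
        buckets[key].append(pos)
    sa = []
    for key in sorted(buckets):                # buckets are already in global order
        sa.extend(reduce_bucket(text, buckets[key]))
    return sa


def bwt(text, sa):
    """BWT: the character preceding each suffix, taken in suffix-array order."""
    return "".join(text[i - 1] for i in sa)


text = "banana$"
sa = suffix_array(text)
transformed = bwt(text, sa)
```

For `banana$` this gives the suffix array `[6, 5, 3, 1, 0, 4, 2]` and the BWT `annb$aa`. Since the buckets are independent, each could be sorted by a separate reducer.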
Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics
Authors: Jimmy Lin et al. @ Twitter
Basic content of this paper:
This paper addresses one inefficient aspect of Hadoop-based processing: the need to perform a full scan of the entire dataset, even in cases where it is clearly not necessary to do so. It is possible to leverage a full-text index to optimize selection operations on text fields within records.
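The difference between a full scan and an index-assisted selection can be sketched as below. The record layout and the function names are assumptions for illustration only; the paper works at the level of Hadoop input data, not in-memory lists.

```python
from collections import defaultdict

# Hypothetical records with a free-text field.
records = [
    {"id": 0, "text": "hadoop job failed on node seven"},
    {"id": 1, "text": "user login succeeded"},
    {"id": 2, "text": "Hadoop job finished"},
    {"id": 3, "text": "disk usage warning"},
]


def build_index(records):
    """Inverted index: term -> set of record positions containing the term."""
    index = defaultdict(set)
    for pos, rec in enumerate(records):
        for term in rec["text"].lower().split():
            index[term].add(pos)
    return index


def select_full_scan(records, term):
    """Baseline: touch every record to evaluate the selection."""
    return [r["id"] for r in records if term in r["text"].lower().split()]


def select_with_index(records, index, term):
    """Read only the records that the index says contain the term."""
    return [records[pos]["id"] for pos in sorted(index.get(term, ()))]


index = build_index(records)
scan_result = select_full_scan(records, "hadoop")
index_result = select_with_index(records, index, "hadoop")
```

Both selections return the same records, but the indexed version reads two records instead of four; the gain grows as the selection becomes more selective.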
MapReducing a Genomic Sequencing Workflow
Authors: Luca Pireddu et al. @ CRS4
Main content:
A MapReduce workflow that harnesses Hadoop to post-process the data produced by DNA sequencing machines.
Exploring MapReduce Efficiency with Highly-Distributed Data
Authors: Michael Cardosa et al. @ University of Minnesota
Basic content of this paper:
Proposes recommendations for alternative (and even hierarchical) distributed MapReduce setup configurations, depending on the workload and data set.
Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching
Authors: Ekaterina Gonina et al. @ UC Berkeley
A case study of parallelizing an example large-scale application (offer matching, a core part of online shopping) on an example MapReduce-based distributed computation engine (DryadLINQ).
The end
Questions?