an introduction of recent research on mapreduce (2011)


An Introduction of Recent Research on MapReduce

Yu Liu

The Graduate University for Advanced Studies

July 8th, 2011




Outline

1 Papers in MAPREDUCE11

2 Talks in HADOOP WORLD 2010

3 Other Interesting Papers





MAPREDUCE11 Sessions

1 Environments and Extensions to the MapReduce Programming Model

2 MapReduce Applications

3 Performance and Feature Improvements of MapReduce

4 Keynote by Greg Malewicz, Google Research: Beyond MapReduce





Paper List

1 Otus: Resource Attribution and Metrics Correlation in Data-Intensive Clusters (1)

2 Phoenix++: Modular MapReduce for Shared-Memory Systems (1)

3 Static Type Checking of Hadoop MapReduce Programs (1)

4 Tall and Skinny QR factorizations in MapReduce architectures (2)

5 Rapid Parallel Genome Indexing with MapReduce (2)

6 Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics (2)

7 MapReducing a Genomic Sequencing Workflow (2)

8 Exploring MapReduce Efficiency with Highly-Distributed Data (3)

9 Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching (3)





The HADOOP WORLD 2010 home page


http://www.cloudera.com/company/press-center/hadoop-world-nyc/agenda/




Other Interesting Papers

Tyson Condie et al.: MapReduce Online, NSDI '10

James Demmel et al.: Communication-avoiding parallel and sequential QR factorizations, EECS-2008-74





Otus: Resource Attribution in Data-Intensive Clusters

Authors: Kai Ren, Julio Lopez, Garth Gibson @ Carnegie Mellon University

Basic content of this paper:

An approach for facilitating performance analyses of distributed data-intensive applications.

Background:

Understanding the resource requirements of frameworks like Hadoop, Dryad, etc., and the performance characteristics of the applications is inherently difficult due to the distributed nature and scale of the computing platform.





Otus: Resource Attribution in Data-Intensive Clusters

Problems:

Traditional cluster monitoring tools fail to provide the information needed to answer the fundamental questions about application performance in data-intensive environments.

Solutions:

Attributing the resource utilization to important components of interest, in different layers in the cluster software stack. The data is correlated to infer the resource utilization for each service component and job process in the cluster.
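
The attribution idea can be sketched in a few lines of Python. This is only an illustration, not the Otus implementation: the sample format (node, process id, CPU seconds, bytes read), the `pid_to_job` mapping, and the function name `attribute_usage` are all assumptions made for the example.

```python
from collections import defaultdict


def attribute_usage(samples, pid_to_job):
    """Sum per-process resource samples into per-job totals.

    samples:    iterable of (node, pid, cpu_seconds, bytes_read) tuples
                (hypothetical format, not taken from the paper)
    pid_to_job: dict mapping (node, pid) to a job identifier
    """
    totals = defaultdict(lambda: {"cpu_seconds": 0.0, "bytes_read": 0})
    for node, pid, cpu_seconds, bytes_read in samples:
        # Correlate the OS-level sample with the framework-level owner.
        owner = pid_to_job.get((node, pid), "unattributed")
        totals[owner]["cpu_seconds"] += cpu_seconds
        totals[owner]["bytes_read"] += bytes_read
    return dict(totals)
```

Samples from processes that no job claims are kept under an "unattributed" bucket in this sketch, so that nothing measured on a node silently disappears from the totals.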





Phoenix++: Modular MapReduce for Shared-Memory Systems

Authors: Justin Talbot, Richard M. Yoo, Christos Kozyrakis @ Computer Systems Laboratory, Stanford University

Basic content of this paper:

Phoenix is a shared-memory implementation of Google's MapReduce. Phoenix++ is a new implementation that, according to this paper, achieves a 4.7-fold performance improvement and increased scalability.

The Phoenix home page: http://mapreduce.stanford.edu/




Problems:

Performance issue of Phoenix: it adopts a static MapReduce pipeline similar to cluster-based implementations.

Inefficient Key-Value Storage

Ineffective Combiner

Exposed Task Chunking

Solutions:

Abstractions for intermediate data: Containers

More effective combiner implementation: Combiner Objects

Hide the task chunking granularity
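
A rough Python sketch of how a container and a combiner object can work together is given below. Phoenix++ itself is a C++ library; the class names `SumCombiner` and `HashContainer` and their methods are hypothetical names chosen for this illustration, not the Phoenix++ API.

```python
class SumCombiner:
    """Combiner object: folds each emitted value into a running total."""

    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value


class HashContainer:
    """Container: keeps one combiner per key instead of a list of values."""

    def __init__(self, combiner_factory):
        self._factory = combiner_factory
        self._slots = {}

    def emit(self, key, value):
        # The value is combined at emit time, so no intermediate list is stored.
        if key not in self._slots:
            self._slots[key] = self._factory()
        self._slots[key].add(value)

    def items(self):
        return {key: combiner.total for key, combiner in self._slots.items()}


def word_count(words):
    container = HashContainer(SumCombiner)
    for word in words:
        container.emit(word, 1)  # map phase emits (word, 1)
    return container.items()
```

Because values are folded in as soon as they are emitted, the intermediate storage in this sketch grows with the number of distinct keys rather than with the number of emitted pairs.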





Other Modularity in Phoenix++

Sort is optional.

Custom sorting functions can be defined over key-value pairs.

Custom memory allocators.





Static Type Checking of Hadoop MapReduce Programs

Authors: Jens Dorre, Sven Apel, Christian Lengauer @ University of Passau, Germany

Basic content of this paper:

Provide a static check for Hadoop programs without asking the user to write any more code.

Background:

Higher-order functions of functional languages can be strongly typed using parametric polymorphism, but in Hadoop the connection between the two phases of a MapReduce computation is unsafe: there is no static type check of the generic type parameters involved.





Problems:

In many MapReduce implementations, MapReduce programs are not type checked at compile time.

Solutions:

A static type checker for Hadoop, using the Java 5 compiler.

Users write code with the combinators in the main function.

The Hadoop job configuration can be generated automatically from the combinator code.





The real code: [code listing omitted]





Two important functions:

check: uses a chaining combinator to check the interface between the mapper and the combiner function, and another one to check the interface between the result and the reducer function.

configureTypeSafeJob: Generates the Hadoop job configuration.
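
The two functions can be imitated loosely in Python. This is an analogy only: in the paper the check is performed by the Java compiler through generics, whereas the sketch below compares type annotations when the job is configured, before any data is processed. It checks only a mapper against a reducer (no combiner), and `configure_job`, `tokenize`, `total`, and `bad_total` are hypothetical names, not part of Hadoop or of the paper's library.

```python
from typing import List, Tuple, get_args, get_type_hints


def check(mapper, reducer):
    """Compare the mapper's declared (key, value) output with the reducer's input."""
    out_key, out_value = get_args(get_type_hints(mapper)["return"])
    hints = get_type_hints(reducer)
    in_key = hints["key"]
    (in_value,) = get_args(hints["values"])
    if (out_key, out_value) != (in_key, in_value):
        raise TypeError(
            f"mapper emits ({out_key}, {out_value}) "
            f"but reducer expects ({in_key}, {in_value})"
        )


def configure_job(mapper, reducer):
    """Derive a job configuration from the functions, after checking them."""
    check(mapper, reducer)
    out_key, out_value = get_args(get_type_hints(mapper)["return"])
    return {
        "mapper": mapper.__name__,
        "reducer": reducer.__name__,
        "map_output_key": out_key.__name__,
        "map_output_value": out_value.__name__,
    }


def tokenize(line: str) -> Tuple[str, int]:
    return (line.split()[0], 1)


def total(key: str, values: List[int]) -> int:
    return sum(values)


def bad_total(key: str, values: List[str]) -> str:
    return "".join(values)
```

Calling `configure_job(tokenize, total)` returns a configuration dictionary, while `configure_job(tokenize, bad_total)` raises a `TypeError`, because the mapper emits `int` values and the reducer declares `str` values.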





Tall and Skinny QR factorizations in MapReduce architectures

Authors: Paul G. Constantine (1), David F. Gleich (2)

(1) Sandia National Laboratories, Albuquerque; (2) Sandia National Laboratories, Livermore

Basic content of this paper:

Implementation of the tall and skinny QR (TSQR) factorization in the MapReduce framework.

Background:

Demmel et al. derived a communication-avoiding version of the QR factorization (CAQR) that trades flops for messages and is ideal for MapReduce, where computationally intensive processes operate locally on subsets of the data.





The Implementation

1. multi-Mapper-single-Reducer

2. 2 iterations of Map-Reduce
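
The multi-mapper, single-reducer shape can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: it computes only the R factor, uses modified Gram-Schmidt for the local factorizations, and assumes every row block has at least as many rows as columns and full column rank. The names `r_factor` and `tsqr_r` are hypothetical.

```python
import math


def r_factor(rows):
    """Upper-triangular R of a thin QR of `rows` (m x n, m >= n, full rank),
    computed with modified Gram-Schmidt."""
    n = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n)]  # column-major copy
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        q = [x / r[j][j] for x in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - r[j][k] * a for a, b in zip(q, cols[k])]
    return r


def tsqr_r(rows, block_size):
    # "Map": each mapper factors its own block of rows and emits only R_i.
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    local_rs = [r_factor(block) for block in blocks]
    # "Reduce": a single reducer stacks the small R_i and factors once more.
    stacked = [row for r in local_rs for row in r]
    return r_factor(stacked)
```

Each mapper sends only an n x n triangle to the reducer, regardless of how many rows it processed, which is where the saving in communication comes from.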

It seems they don’t know our work... I think we can do better.





Rapid Parallel Genome Indexing with MapReduce

Authors: Rohith K. Menon et al. @ Department of Computer Science, Stony Brook University

Basic content of this paper:

A novel parallel algorithm for constructing the suffix array and the Burrows-Wheeler Transform (BWT) of a sequence, leveraging the unique features of the MapReduce parallel programming model.
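
For readers unfamiliar with the two structures, here is a small Python sketch. It is not the algorithm from the paper, which the slide does not describe; it is a naive construction arranged in a map/reduce shape, where suffixes are partitioned by their first character and each partition is sorted independently. It assumes the text ends with a unique sentinel such as `$`.

```python
from collections import defaultdict


def suffix_array(text):
    # "Map": key every suffix start position by its first character.
    partitions = defaultdict(list)
    for i, ch in enumerate(text):
        partitions[ch].append(i)
    # "Reduce": sort each partition on its own, then concatenate in key order.
    sa = []
    for ch in sorted(partitions):
        sa.extend(sorted(partitions[ch], key=lambda i: text[i:]))
    return sa


def bwt(text, sa):
    # The BWT takes the character preceding each suffix in sorted order;
    # for the suffix at position 0 this wraps around to the sentinel.
    return "".join(text[i - 1] for i in sa)
```

For example, with `text = "banana$"`, `suffix_array(text)` gives `[6, 5, 3, 1, 0, 4, 2]` and `bwt(text, suffix_array(text))` gives `"annb$aa"`.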





Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics

Authors: Jimmy Lin et al. @ Twitter

Basic content of this paper:

This paper addresses one inefficient aspect of Hadoop-based processing: the need to perform a full scan of the entire dataset, even in cases where it is clearly not necessary to do so. It is possible to leverage a full-text index to optimize selection operations on text fields within records.
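
The difference between a full scan and an index lookup can be illustrated with a toy inverted index in Python. This is a sketch of the general idea only, not the system described in the paper; the record layout and the names `build_index` and `select` are assumptions made for the example.

```python
from collections import defaultdict


def build_index(records, field):
    """Map each lower-cased term in `field` to the ids of records containing it."""
    index = defaultdict(set)
    for rec_id, record in enumerate(records):
        for term in record[field].lower().split():
            index[term].add(rec_id)
    return index


def select(records, index, term):
    """Return matching records by index lookup, without scanning all records."""
    return [records[i] for i in sorted(index.get(term.lower(), ()))]
```

Once the index is built, a selection touches only the records listed under the requested term, whereas a scan-based job would read every record to answer the same query.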





MapReducing a Genomic Sequencing Workflow

Authors: Luca Pireddu et al. @ CRS4

Main content:

A MapReduce workflow that harnesses Hadoop to post-process the data produced by DNA sequencing machines.





Exploring MapReduce Efficiency with Highly-Distributed Data

Authors: Michael Cardosa et al. @ University of Minnesota

Basic content of this paper:

Propose recommendations for alternative (and even hierarchical) distributed MapReduce setup configurations, depending on the workload and data set.





Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching

Authors: Ekaterina Gonina et al. @ UC Berkeley

A case study of parallelizing an example large-scale application (offer matching, a core part of online shopping) on an example MapReduce-based distributed computation engine (DryadLINQ).







The end

Questions?

