An Introduction of Recent Research on MapReduce (2011)




An Introduction of Recent Research on MapReduce

Yu Liu

The Graduate University for Advanced Studies

July 8th, 2011


Outline

1. Papers in MAPREDUCE11

2. Talks in HADOOP WORLD 2010

3. Other Interesting Papers


MAPREDUCE11 Sessions

1. Environments and Extensions to the MapReduce Programming Model

2. MapReduce Applications

3. Performance and Feature Improvements of MapReduce

4. Keynote by Greg Malewicz, Google Research: Beyond MapReduce


Paper List

1. Otus: Resource Attribution and Metrics Correlation in Data-Intensive Clusters (1)

2. Phoenix++: Modular MapReduce for Shared-Memory Systems (1)

3. Static Type Checking of Hadoop MapReduce Programs (1)

4. Tall and Skinny QR factorizations in MapReduce architectures (2)

5. Rapid Parallel Genome Indexing with MapReduce (2)

6. Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics (2)

7. MapReducing a Genomic Sequencing Workflow (2)

8. Exploring MapReduce Efficiency with Highly-Distributed Data (3)

9. Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching (3)


The home page


Tyson Condie, et al.: MapReduce Online, NSDI’10

James Demmel, et al.: Communication-avoiding parallel and sequential QR factorizations, EECS-2008-74


Otus: Resource Attribution in Data-Intensive Clusters

Authors: Kai Ren, Julio Lopez, Garth Gibson @ Carnegie Mellon University

Basic content of this paper: An approach for facilitating performance analyses of distributed data-intensive applications.

Background:

Understanding the resource requirements of frameworks like Hadoop, Dryad, etc., and the performance characteristics of the applications is inherently difficult due to the distributed nature and scale of the computing platform.



Problems:

Traditional cluster monitoring tools fail to provide the necessary information to answer the fundamental questions to understand application performance in data-intensive environments.

Solutions:

Attributing the resource utilization to important components of interest, in different layers in the cluster software stack. The data is correlated to infer the resource utilization for each service component and job process in the cluster.
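
A minimal sketch of the attribution idea, not the Otus implementation: per-process resource samples are joined with a table saying which job or service each process belongs to, and usage is summed per owner. The sample layout, the `owners` table and the name `attribute_usage` are assumptions made for illustration.

```python
from collections import defaultdict


def attribute_usage(samples, owners):
    """Sum CPU seconds per owning component.

    samples: iterable of (node, pid, cpu_seconds) tuples (hypothetical layout).
    owners:  dict mapping (node, pid) to a job or service name (hypothetical).
    """
    totals = defaultdict(float)
    for node, pid, cpu_seconds in samples:
        # Processes with no known owner stay visible instead of being dropped.
        owner = owners.get((node, pid), "unattributed")
        totals[owner] += cpu_seconds
    return dict(totals)


samples = [
    ("node1", 101, 1.5),
    ("node1", 102, 2.0),
    ("node2", 201, 0.5),
    ("node2", 999, 4.0),
]
owners = {
    ("node1", 101): "job_42/map",
    ("node2", 201): "job_42/map",
    ("node1", 102): "datanode",
}
usage = attribute_usage(samples, owners)
```

Here the two map processes on different nodes are reported together under one job, which is the kind of per-job view a node-level monitoring tool would not give directly.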


Phoenix++: Modular MapReduce for Shared-Memory Systems

The Phoenix home page

Authors: Justin Talbot, Richard M. Yoo, Christos Kozyrakis @ Computer Systems Laboratory, Stanford University

Basic content of this paper: Phoenix is a shared-memory implementation of Google’s MapReduce. Phoenix++ is a new implementation that, according to this paper, achieves a 4.7-fold performance improvement and increased scalability.


Problems:

Performance issue of Phoenix: it adopts a static MapReduce pipeline similar to cluster-based implementations.

Inefficient Key-Value Storage

Ineffective Combiner

Exposed Task Chunking

Solutions:

Abstractions for intermediate data: Containers

More effective combiner implementation: Combiner Objects

Hide the task chunking granularity
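
The container and combiner-object ideas can be sketched as follows. This is an illustration in Python under assumed names (`SumCombiner`, `BufferCombiner`, `HashContainer`, `word_count`), not the Phoenix++ C++ API: the container owns the intermediate key-value storage, and a combiner object per key folds each value in as soon as it is emitted.

```python
class SumCombiner:
    """Folds each emitted value into a running total (no per-key value list)."""

    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value

    def result(self):
        return self.total


class BufferCombiner:
    """Keeps every emitted value, for reducers that need the full list."""

    def __init__(self):
        self.values = []

    def add(self, value):
        self.values.append(value)

    def result(self):
        return list(self.values)


class HashContainer:
    """Intermediate storage backed by a hash table, one combiner per key."""

    def __init__(self, combiner_factory):
        self.combiner_factory = combiner_factory
        self.slots = {}

    def emit(self, key, value):
        if key not in self.slots:
            self.slots[key] = self.combiner_factory()
        self.slots[key].add(value)

    def results(self):
        return {key: comb.result() for key, comb in self.slots.items()}


def word_count(words, combiner_factory=SumCombiner):
    container = HashContainer(combiner_factory)
    for word in words:  # the "map" side: emit (word, 1)
        container.emit(word, 1)
    return container.results()


counts = word_count(["map", "reduce", "map"])
buffered = word_count(["map", "reduce", "map"], BufferCombiner)
```

Swapping the combiner or the container changes how intermediate data is stored without touching the map code, which is the modularity the slide refers to.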


Other Modularity in Phoenix++

Sort is optional.

Custom sorting functions can be defined over key-value pairs

Custom memory allocators.


Static Type Checking of Hadoop MapReduce Programs

Authors: Jens Dorre, Sven Apel, Christian Lengauer @ University of Passau, Germany

Basic content of this paper: Provide a static check for Hadoop programs without asking the user to write any more code.

Background:

Higher-order functions of functional languages can be strongly typed using parametric polymorphism, but in Hadoop the connection between the two phases of a MapReduce computation is unsafe: there is no static type check of the generic type parameters involved.


Problems:

In many MapReduce implementations, MapReduce programs are not type checked at compile time.

Solutions:

A static type checker for Hadoop, using the Java 5 compiler.

Users write the code in the main function using the combinators.

The Hadoop job configuration can be generated automatically from the combinator code.


The real code:


Two important functions:

check: uses a chaining combinator to check the interface between the mapper and the combiner function, and another one to check the interface between the result and the reducer function.

configureTypeSafeJob: Generates the Hadoop job configuration.
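
The idea behind a chaining combinator can be illustrated with a small sketch. It is written in Python with type hints rather than Java, and `Stage`, `then`, `mapper` and `reducer` are hypothetical names, not the paper's `check` or `configureTypeSafeJob`. Chaining only type-checks when the output type of one stage matches the input type of the next; in Python the check is done by an external static checker such as mypy, whereas the paper relies on the Java compiler.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Generic, List, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


@dataclass(frozen=True)
class Stage(Generic[A, B]):
    """A processing stage from a list of A to a list of B."""

    fn: Callable[[List[A]], List[B]]

    def then(self, nxt: "Stage[B, C]") -> "Stage[A, C]":
        # The shared type variable B ties this stage's output to nxt's input.
        return Stage(lambda xs: nxt.fn(self.fn(xs)))


def sum_by_key(pairs: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    totals: Dict[str, int] = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())


mapper: Stage[str, Tuple[str, int]] = Stage(
    lambda lines: [(word, 1) for line in lines for word in line.split()]
)
reducer: Stage[Tuple[str, int], Tuple[str, int]] = Stage(sum_by_key)

job = mapper.then(reducer)
result = job.fn(["a b a"])

# reducer.then(mapper) would be flagged by a static checker:
# the reducer emits Tuple[str, int], but the mapper expects str.
```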


Tall and Skinny QR factorizations in MapReduce architectures

Authors: Paul G. Constantine (1), David F. Gleich (2)

(1) Sandia National Laboratories, Albuquerque; (2) Sandia National Laboratories, Livermore

Basic content of this paper: Implementation of the tall and skinny QR (TSQR) factorization in the MapReduce framework.

Background:

Demmel et al. derived a communication-avoiding version of the QR factorization (CAQR), which trades flops for messages and is ideal for MapReduce, where computationally intensive processes operate locally on subsets of the data.


The Implementation

1. multi-Mapper-single-Reducer

2. 2 iterations of Map-Reduce

It seems they don’t know our work... I think we can do better.
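
The multi-mapper, single-reducer scheme can be sketched roughly as below. This is a plain-Python illustration, not the authors' code: each "map" task computes the R factor of its own row block, and one "reduce" task computes the R factor of the stacked local Rs. The function names are made up, modified Gram-Schmidt stands in for whatever local QR routine a real implementation would use, and every block is assumed to have full column rank and at least as many rows as columns.

```python
import math


def r_factor(rows):
    """Upper-triangular R of a full-column-rank matrix (modified Gram-Schmidt)."""
    n = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j]
        for i in range(j):
            # cols[i] already holds the i-th orthonormal column.
            r[i][j] = sum(q * x for q, x in zip(cols[i], v))
            v = [x - r[i][j] * q for q, x in zip(cols[i], v)]
        r[j][j] = math.sqrt(sum(x * x for x in v))
        cols[j] = [x / r[j][j] for x in v]
    return r


def tsqr_r(rows, block_size):
    """R factor of a tall matrix: local QR per block, then QR of the stacked Rs."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    local_rs = [r_factor(block) for block in blocks]  # "map" tasks
    stacked = [row for r in local_rs for row in r]    # sent to one reducer
    return r_factor(stacked)                          # single "reduce" task


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
     [7.0, 8.0], [9.0, 10.0], [11.0, 13.0]]
direct = r_factor(a)
blocked = tsqr_r(a, 3)
```

Only the small n-by-n R factors travel between the map and reduce steps, so communication does not grow with the number of rows; up to rounding, `blocked` agrees with `direct`.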


Rapid Parallel Genome Indexing with MapReduce

Authors: Rohith K. Menon et al. @ Department of Computer Science, Stony Brook University

Basic content of this paper: A novel parallel algorithm for constructing the suffix array and the Burrows-Wheeler Transform (BWT) of a sequence, leveraging the unique features of the MapReduce parallel programming model.
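
For readers unfamiliar with the two structures, here is a naive sequential sketch of what is being built. It is not the paper's parallel algorithm; it simply sorts all suffixes, and it assumes the text ends with a sentinel character `$` that sorts before every other symbol.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    """Burrows-Wheeler Transform: the character preceding each sorted suffix."""
    # For the suffix starting at 0, text[-1] wraps around to the sentinel.
    return "".join(text[i - 1] for i in suffix_array(text))


sa = suffix_array("banana$")   # [6, 5, 3, 1, 0, 4, 2]
transformed = bwt("banana$")   # "annb$aa"
```

Sorting whole suffixes like this is far too slow and memory-hungry for a genome, which is why a distributed construction is of interest.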


Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics

Authors: Jimmy Lin et al. @ Twitter

Basic content of this paper: This paper addresses one inefficient aspect of Hadoop-based processing: the need to perform a full scan of the entire dataset, even in cases where it is clearly not necessary to do so. It is possible to leverage a full-text index to optimize selection operations on text fields within records.
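
A toy sketch of the underlying idea, assuming an in-memory list of text records and whitespace tokenization (both simplifications, and `build_index` / `select` are made-up names): an inverted index maps each term to the records containing it, so a selection on a term reads only the matching records instead of scanning everything.

```python
from collections import defaultdict


def build_index(records):
    """Map each lower-cased term to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in enumerate(records):
        for term in text.lower().split():
            index[term].add(record_id)
    return index


def select(records, index, term):
    """Return the records containing term, touching only the matching ids."""
    matching_ids = sorted(index.get(term.lower(), ()))
    return [records[record_id] for record_id in matching_ids]


records = ["hadoop rocks", "pig latin", "Hadoop index"]
index = build_index(records)
hits = select(records, index, "hadoop")
misses = select(records, index, "spark")
```

The index is built once; after that, selective queries avoid the full scan described above.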


MapReducing a Genomic Sequencing Workflow

Authors: Luca Pireddu et al. @ CRS4

Main content

A MapReduce workflow that harnesses Hadoop to post-process the data produced by DNA sequencing machines.


Exploring MapReduce Efficiency with Highly-Distributed Data

Authors: Michael Cardosa et al. @ University of Minnesota

Basic content of this paper: Propose recommendations for alternative (and even hierarchical) distributed MapReduce setup configurations, depending on the workload and data set.


Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching

Authors: Ekaterina Gonina et al. @ UC Berkeley

A case study of parallelizing an example large-scale application (offer matching, a core part of online shopping) on an example MapReduce-based distributed computation engine (DryadLINQ).




The end

Questions?
