Data Processing in the World of NoSQL? An Introduction to Hadoop

Data Processing in NoSQL?

An Introduction to Map Reduce

By Dan Harvey

Thursday, 12 April 12

No SQL?

People thinking about their data storage.


Storage Patterns

Denormalization

Sharding / Hashing

Replication

...
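As a rough illustration of the sharding / hashing pattern listed above, here is a minimal sketch that routes keys to shards with a stable hash. The function name `shard_for`, the shard count, and the example keys are assumptions made for this sketch; they do not come from any particular NoSQL store.

```python
import hashlib

def shard_for(key, num_shards):
    """Return the shard index for a key using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and could route the same key differently after
    a restart.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards

NUM_SHARDS = 4  # hypothetical cluster size
keys = ["user:1", "user:2", "user:3"]  # hypothetical record keys
placement = {key: shard_for(key, NUM_SHARDS) for key in keys}
```

With plain modulo hashing, changing the number of shards remaps most keys. Consistent hashing is one common way to limit that movement.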


Ad Hoc Queries?

Hard to do...


The problem

[Figure: chart of query latency (low to high) against query entropy (low to high). Labels on the chart: Key-value, In-memory, Hadoop, "?", (Online), (Offline).]


“The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing”


MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis

OSDI ’04: 6th Symposium on Operating Systems Design and Implementation, USENIX Association, p. 137

Google’s MapReduce
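To make the programming model from the paper concrete, here is a single-process sketch: a map function emits intermediate key/value pairs, values are grouped by key, and a reduce function merges each group. It uses one of the tasks the introduction mentions (pages crawled per host). The driver `map_reduce`, the helper names, and the URLs are all assumptions for illustration; this is not Google's or Hadoop's API, and it does none of the distribution or fault handling the paper describes.

```python
from collections import defaultdict
from urllib.parse import urlparse

def map_reduce(records, mapper, reducer):
    """Toy driver: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def host_mapper(url):
    # Emit (host, 1) for each crawled URL.
    yield urlparse(url).netloc, 1

def count_reducer(host, ones):
    # Merge all values that share the same key.
    return sum(ones)

crawled = [  # hypothetical crawl log
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/",
]
pages_per_host = map_reduce(crawled, host_mapper, count_reducer)
```

Because the mapper and reducer only see their own inputs, a real framework is free to run many copies of each on different machines.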

Distributed Storage








Replicated Blocks
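The idea of replicated blocks can be sketched as follows: a file is cut into fixed-size blocks and each block is stored on several nodes. The round-robin placement, the tiny block size, and the node names are assumptions for this sketch only; actual HDFS placement is rack-aware and considerably more involved.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so the chosen nodes are distinct.
    """
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }

nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical data nodes
blocks = split_into_blocks(b"x" * 10, block_size=4)  # toy sizes, not HDFS defaults
placement = place_replicas(len(blocks), nodes)
```

If any single node is lost, every block still has two other copies elsewhere, which is what lets the storage layer tolerate machine failures.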


Map, Shuffle, Reduce

MapReduce Computational Model


Data Locality: Map
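Data locality for the map stage means running each map task on a machine that already holds the block it reads, so the input does not have to cross the network. The sketch below is a deliberately simplified assignment, assuming a hypothetical `block_locations` table and a set of free nodes; it does not model task slots, racks, or how Hadoop's scheduler actually works.

```python
def assign_map_tasks(block_locations, free_nodes):
    """Pick a node for each block's map task, preferring a node that stores it."""
    assignments = {}
    for block_id, holders in block_locations.items():
        local = [node for node in holders if node in free_nodes]
        if local:
            assignments[block_id] = local[0]  # data-local: no network read
        else:
            assignments[block_id] = sorted(free_nodes)[0]  # remote read fallback
    return assignments

block_locations = {  # hypothetical block -> nodes holding a replica
    0: ["node-a", "node-b"],
    1: ["node-c", "node-d"],
    2: ["node-x"],
}
free_nodes = {"node-b", "node-c", "node-e"}
assignments = assign_map_tasks(block_locations, free_nodes)
```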








Map Stage

MapReduce, Simplified, Data, Processing, Large, Clusters, Jeffrey, Dean, and, Sanjay, Ghemawat, Google, Inc, ...

MapReduce, programming, model, associated, implementation, processing, generating, large, data, sets, Users, ...
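The map stage for this word-count example can be sketched as a function that takes one piece of input text and emits a (word, 1) pair per word. The name `map_words`, the regular expression, and the choice to lowercase inside the mapper are assumptions for this sketch.

```python
import re

def map_words(text):
    """Emit a (word, 1) pair for every alphabetic word in one input split."""
    for word in re.findall(r"[A-Za-z]+", text):
        yield word.lower(), 1

split = "Jeffrey Dean and Sanjay Ghemawat, Google, Inc."
pairs = list(map_words(split))
```

Each split is mapped independently of the others, so many mappers can run at the same time on different blocks.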


Shuffle

MapReduce, Simplified, Data, Processing, Large, Clusters, Jeffrey, Dean, and, Sanjay, Ghemawat, Google, Inc, ...

MapReduce, programming, model, associated, implementation, processing, generating, large, data, sets, Users, ...

and, associated, clusters, data, data, dean, generating, ghemawat, google, google, implementation, inc, jeffrey, large, large, mapreduce, mapreduce, model, ...

processing, processing, programming, sanjay, sets, simplified, users, ...
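The shuffle routes every intermediate pair to one reducer and sorts each reducer's input by key, so that all values for a word end up together. The sketch below assumes two reducers and a toy alphabetical range partitioner, since that resembles the split on the slide; Hadoop's default partitioner hashes the key instead. The function names are hypothetical, and keys are assumed to be lowercase a to z words.

```python
from collections import defaultdict

def partition(word, num_reducers=2):
    """Toy range partitioner: split the alphabet evenly across reducers."""
    index = ord(word[0]) - ord("a")
    return min(index * num_reducers // 26, num_reducers - 1)

def shuffle(pairs, num_reducers=2):
    """Group values by key per reducer and sort each reducer's keys."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]

pairs = [
    ("mapreduce", 1), ("data", 1), ("processing", 1),
    ("mapreduce", 1), ("data", 1), ("users", 1),
]
reducer_inputs = shuffle(pairs)
```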


Reduce

and, associated, clusters, data, data, dean, generating, ghemawat, google, google, implementation, inc, jeffrey, large, large, mapreduce, mapreduce, model, ...

processing, processing, programming, sanjay, sets, simplified, users, ...

and, 1
associated, 1
clusters, 1
data, 2
...
generating, 1
ghemawat, 1
google, 2
...
mapreduce, 2
model, 1

processing, 2
programming, 1
sanjay, 1
...
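The reduce step for word count simply sums the values collected for each word. A minimal sketch, assuming the reducer receives (word, list of counts) pairs already grouped and sorted by the shuffle; `reduce_counts` and the sample input are hypothetical.

```python
def reduce_counts(word, counts):
    """Merge all values for one key into a single (word, total) pair."""
    return word, sum(counts)

reducer_input = [("data", [1, 1]), ("dean", [1]), ("mapreduce", [1, 1])]
totals = [reduce_counts(word, counts) for word, counts in reducer_input]
```

Each key is reduced independently, so reducers working on different key ranges never need to coordinate with each other.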


So Hadoop?

• MapReduce for processing

• Distributed File System (HDFS) for storage


Other projects


Uses

• Log analysis

• Data Warehouse

• Natural Language Processing

• Search indexes

• Machine Learning

• ...


1.0 and beyond

• Next generation Hadoop - YARN

• Not just MapReduce

• General resource allocation framework


Users


Want more?

www.meetup.com/Hadoop-Users-Group-UK/

(April 25th)

BigDataWeek.com

(April 23rd - 29th)

