utah distributed systems meetup and reading …map reduce & spark map reduce mapreduce: simplied...

98
Map Reduce & Spark Utah Distributed Systems Meetup and Reading Group - Map Reduce and Spark JT Olds Space Monkey Vivint R&D January 19 2016

Upload: others

Post on 27-Jun-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Utah Distributed Systems Meetup andReading Group - Map Reduce and Spark

JT Olds

Space MonkeyVivint R&D

January 19 2016

Page 2: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Outline

1 Map Reduce

2 Spark

3 Conclusion?

Page 3: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Outline

1 Map Reduce

2 Spark

3 Conclusion?

Page 4: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Map Reduce

1 Map ReduceContextOverall ideaExamplesArchitectureChallenges

Page 5: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

[email protected], [email protected]

Google, Inc.

Abstract

MapReduce is a programming model and an associ-ated implementation for processing and generating largedata sets. Users specify a map function that processes akey/value pair to generate a set of intermediate key/valuepairs, and a reduce function that merges all intermediatevalues associated with the same intermediate key. Manyreal world tasks are expressible in this model, as shownin the paper.

Programs written in this functional style are automati-cally parallelized and executed on a large cluster of com-modity machines. The run-time system takes care of thedetails of partitioning the input data, scheduling the pro-gram’s execution across a set of machines, handling ma-chine failures, and managing the required inter-machinecommunication. This allows programmers without anyexperience with parallel and distributed systems to eas-ily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a largecluster of commodity machines and is highly scalable:a typical MapReduce computation processes many ter-abytes of data on thousands of machines. Programmersfind the system easy to use: hundreds ofMapReduce pro-grams have been implemented and upwards of one thou-sand MapReduce jobs are executed on Google’s clustersevery day.

1 Introduction

Over the past five years, the authors and many others atGoogle have implemented hundreds of special-purposecomputations that process large amounts of raw data,such as crawled documents, web request logs, etc., tocompute various kinds of derived data, such as invertedindices, various representations of the graph structureof web documents, summaries of the number of pagescrawled per host, the set of most frequent queries in a

given day, etc. Most such computations are conceptu-ally straightforward. However, the input data is usuallylarge and the computations have to be distributed acrosshundreds or thousands of machines in order to finish ina reasonable amount of time. The issues of how to par-allelize the computation, distribute the data, and handlefailures conspire to obscure the original simple compu-tation with large amounts of complex code to deal withthese issues.As a reaction to this complexity, we designed a newabstraction that allows us to express the simple computa-tions we were trying to perform but hides the messy de-tails of parallelization, fault-tolerance, data distributionand load balancing in a library. Our abstraction is in-spired by the map and reduce primitives present in Lispand many other functional languages. We realized thatmost of our computations involved applying a map op-eration to each logical “record” in our input in order tocompute a set of intermediate key/value pairs, and thenapplying a reduce operation to all the values that sharedthe same key, in order to combine the derived data ap-propriately. Our use of a functional model with user-specified map and reduce operations allows us to paral-lelize large computations easily and to use re-executionas the primary mechanism for fault tolerance.The major contributions of this work are a simple andpowerful interface that enables automatic parallelizationand distribution of large-scale computations, combinedwith an implementation of this interface that achieveshigh performance on large clusters of commodity PCs.Section 2 describes the basic programming model andgives several examples. Section 3 describes an imple-mentation of the MapReduce interface tailored towardsour cluster-based computing environment. Section 4 de-scribes several refinements of the programming modelthat we have found useful. Section 5 has performancemeasurements of our implementation for a variety oftasks. Section 6 explores the use of MapReduce withinGoogle including our experiences in using it as the basis

OSDI ’04: 6th Symposium on Operating Systems Design and ImplementationUSENIX Association 137

Page 6: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Context

Map Reduce

1 Map ReduceContextOverall ideaExamplesArchitectureChallenges

Page 7: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Context

Google Map Reduce context

Lots of conceptually simple tasksOn an internet’s worth of dataSpread across thousands of commodity serversThat are constantly failing

Page 8: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Context

Google Map Reduce context

Lots of conceptually simple tasksOn an internet’s worth of dataSpread across thousands of commodity serversThat are constantly failing

Page 9: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Context

Google Map Reduce context

Lots of conceptually simple tasksOn an internet’s worth of dataSpread across thousands of commodity serversThat are constantly failing

Page 10: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Context

Google Map Reduce context

Lots of conceptually simple tasksOn an internet’s worth of dataSpread across thousands of commodity serversThat are constantly failing

Page 11: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Context

Google Map Reduce context: abstraction?

Parallelize the computationDistribute the dataHandle failuresWith simple code

Page 12: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Context

Google Map Reduce context: abstraction?

Parallelize the computationDistribute the dataHandle failuresWith simple code

Page 13: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Context

Google Map Reduce context: abstraction?

Parallelize the computationDistribute the dataHandle failuresWith simple code

Page 14: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Context

Google Map Reduce context: abstraction?

Parallelize the computationDistribute the dataHandle failuresWith simple code

Page 15: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Overall idea

Map Reduce

1 Map ReduceContextOverall ideaExamplesArchitectureChallenges

Page 16: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Overall idea

Google Map Reduce: Map

map (n1,d1) → [(k1, v1) , (k2, v2) , ...]

Page 17: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Overall idea

Google Map Reduce: Map

map (n1,d1) → [(k1, v1) , (k2, v2) , ...]

map (n2,d2) → [(k3, v3) , (k1, v4) , ...]

Page 18: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Overall idea

Google Map Reduce: Reduce

reduce (k1, [v1, v4, ...]) → r1

Page 19: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Overall idea

Google Map Reduce: Reduce

reduce (k1, [v1, v4, ...]) → r1

reduce (k2, [v2, ...]) → r2

Page 20: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Examples

Map Reduce

1 Map ReduceContextOverall ideaExamplesArchitectureChallenges

Page 21: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Examples

Example: Word Count

map(String key, String value):// key: document name// value: document contentsfor each word w in value:EmitIntermediate(w, "1");

Page 22: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Examples

Example: Word Count

reduce(String key, Iterator values):// key: a word// values: a list of countsint result = 0;for each v in values:result += ParseInt(v);

Emit(AsString(result));

Page 23: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Examples

Aside: Combiner

combine(String key, Iterator values):// key: a word// values: a list of countsint result = 0;for each v in values:result += ParseInt(v);

EmitIntermediate(key, AsString(result));

Page 24: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Examples

Example: Reverse Web Link Graph

map(String key, String value):// key: source url// value: document contentsfor each link target in value:EmitIntermediate(target, key);

Page 25: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Examples

Example: Reverse Web Link Graph

reduce(String key, Iterator values):// key: target url// values: a list of source urlsEmit(values.serialize());

Page 26: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Examples

Example: Inverted index

map(String key, String value):// key: document id// value: document contentsfor each unique word in value:EmitIntermediate(word, key);

Page 27: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Examples

Example: Inverted index

reduce(String key, Iterator values):// key: word// values: list of document idsEmit(values.serialize());

Page 28: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Examples

Example: Grep

map(String key, String value):// key: document name// value: document contentsfor each line lineno, line in value:if PATTERN matches line:EmitIntermediate(

"%s:%d" % (key, lineno), line);

Page 29: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Examples

Example: Grep

reduce(String key, Iterator values):// key: document/lineno pair// values: line contentsEmit(values.first);

Page 30: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Examples

Example: Distributed sort

map(String key, String value):// key: key// value: recordEmitIntermediate(key, value);

Page 31: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Examples

Example: Distributed sort

reduce(String key, Iterator values):// key: key// values: list of recordsfor each record in records:Emit(record);

Page 32: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Architecture

Map Reduce

1 Map ReduceContextOverall ideaExamplesArchitectureChallenges

Page 33: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Architecture

Architecture

Master with workersMaster pings all workers to track livenessMaster keeps track of map and reduce task state,restarting failed onesMap task output gets written to local disk. So, evencompleted tasks on failed workers need to be restarted.

Page 34: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Architecture

Architecture

Master with workersMaster pings all workers to track livenessMaster keeps track of map and reduce task state,restarting failed onesMap task output gets written to local disk. So, evencompleted tasks on failed workers need to be restarted.

Page 35: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Architecture

Architecture

Master with workersMaster pings all workers to track livenessMaster keeps track of map and reduce task state,restarting failed onesMap task output gets written to local disk. So, evencompleted tasks on failed workers need to be restarted.

Page 36: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Architecture

Architecture

Master with workersMaster pings all workers to track livenessMaster keeps track of map and reduce task state,restarting failed onesMap task output gets written to local disk. So, evencompleted tasks on failed workers need to be restarted.

Page 37: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Architecture

Architecture

Input comes from GFS (triplicate), master tries to schedulemap worker on or near server with input replicate.User configures reduce partitioning, within a partition allkeys are processed in sorted orderOutput often goes back to GFS (output is often the input toanother job)Stragglers a real problem - some jobs are started multipletimes.

Page 38: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Architecture

Architecture

Input comes from GFS (triplicate), master tries to schedulemap worker on or near server with input replicate.User configures reduce partitioning, within a partition allkeys are processed in sorted orderOutput often goes back to GFS (output is often the input toanother job)Stragglers a real problem - some jobs are started multipletimes.

Page 39: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Architecture

Architecture

Input comes from GFS (triplicate), master tries to schedulemap worker on or near server with input replicate.User configures reduce partitioning, within a partition allkeys are processed in sorted orderOutput often goes back to GFS (output is often the input toanother job)Stragglers a real problem - some jobs are started multipletimes.

Page 40: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Architecture

Architecture

Input comes from GFS (triplicate), master tries to schedulemap worker on or near server with input replicate.User configures reduce partitioning, within a partition allkeys are processed in sorted orderOutput often goes back to GFS (output is often the input toanother job)Stragglers a real problem - some jobs are started multipletimes.

Page 41: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Architecture

Stragglers

500 10000

5000

10000

15000

20000

Inpu

t (M

B/s

)

500 10000

5000

10000

15000

20000

Shuf

fle (M

B/s

)

500 1000

Seconds

0

5000

10000

15000

20000

Out

put (

MB

/s)

Done

(a) Normal execution

500 10000

5000

10000

15000

20000

Inpu

t (M

B/s

)

500 10000

5000

10000

15000

20000

Shuf

fle (M

B/s

)

500 1000

Seconds

0

5000

10000

15000

20000

Out

put (

MB

/s)

Done

(b) No backup tasks

500 10000

5000

10000

15000

20000

Inpu

t (M

B/s

)

500 10000

5000

10000

15000

20000

Shuf

fle (M

B/s

)

500 1000

Seconds

0

5000

10000

15000

20000

Out

put (

MB

/s)

Done

(c) 200 tasks killed

Figure 3: Data transfer rates over time for different executions of the sort program

original text line as the intermediate key/value pair. Weused a built-in Identity function as the Reduce operator.This functions passes the intermediate key/value pair un-changed as the output key/value pair. The final sortedoutput is written to a set of 2-way replicated GFS files(i.e., 2 terabytes are written as the output of the program).

As before, the input data is split into 64MB pieces(M = 15000). We partition the sorted output into 4000files (R = 4000). The partitioning function uses the ini-tial bytes of the key to segregate it into one of R pieces.

Our partitioning function for this benchmark has built-in knowledge of the distribution of keys. In a generalsorting program, we would add a pre-pass MapReduceoperation that would collect a sample of the keys anduse the distribution of the sampled keys to compute split-points for the final sorting pass.

Figure 3 (a) shows the progress of a normal executionof the sort program. The top-left graph shows the rateat which input is read. The rate peaks at about 13 GB/sand dies off fairly quickly since all map tasks finish be-fore 200 seconds have elapsed. Note that the input rateis less than for grep. This is because the sort map tasksspend about half their time and I/O bandwidth writing in-termediate output to their local disks. The correspondingintermediate output for grep had negligible size.

The middle-left graph shows the rate at which datais sent over the network from the map tasks to the re-duce tasks. This shuffling starts as soon as the firstmap task completes. The first hump in the graph is for

the first batch of approximately 1700 reduce tasks (theentire MapReduce was assigned about 1700 machines,and each machine executes at most one reduce task at atime). Roughly 300 seconds into the computation, someof these first batch of reduce tasks finish and we startshuffling data for the remaining reduce tasks. All of theshuffling is done about 600 seconds into the computation.The bottom-left graph shows the rate at which sorteddata is written to the final output files by the reduce tasks.There is a delay between the end of the first shuffling pe-riod and the start of the writing period because the ma-chines are busy sorting the intermediate data. The writescontinue at a rate of about 2-4 GB/s for a while. All ofthe writes finish about 850 seconds into the computation.Including startup overhead, the entire computation takes891 seconds. This is similar to the current best reportedresult of 1057 seconds for the TeraSort benchmark [18].A few things to note: the input rate is higher than theshuffle rate and the output rate because of our localityoptimization – most data is read from a local disk andbypasses our relatively bandwidth constrained network.The shuffle rate is higher than the output rate becausethe output phase writes two copies of the sorted data (wemake two replicas of the output for reliability and avail-ability reasons). We write two replicas because that isthe mechanism for reliability and availability providedby our underlying file system. Network bandwidth re-quirements for writing data would be reduced if the un-derlying file system used erasure coding [14] rather thanreplication.

OSDI ’04: 6th Symposium on Operating Systems Design and ImplementationUSENIX Association 145

Page 42: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Challenges

Map Reduce

1 Map ReduceContextOverall ideaExamplesArchitectureChallenges

Page 43: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Challenges

Challenges

Must fit into map/reduce framework.Everything is written to disk, often to GFS, which meanstriplicate. Lots of disk I/O and network I/O.I/O exacerbated when output is the input to another job.

Page 44: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Challenges

Challenges

Must fit into map/reduce framework.Everything is written to disk, often to GFS, which meanstriplicate. Lots of disk I/O and network I/O.I/O exacerbated when output is the input to another job.

Page 45: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Map Reduce

Challenges

Challenges

Must fit into map/reduce framework.Everything is written to disk, often to GFS, which meanstriplicate. Lots of disk I/O and network I/O.I/O exacerbated when output is the input to another job.

Page 46: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Outline

1 Map Reduce

2 Spark

3 Conclusion?

Page 47: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Spark

2 SparkContextOverall ideaSimple exampleSchedulingMore Examples

Page 48: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Resilient Distributed Datasets: A Fault-Tolerant Abstraction forIn-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma,Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica

University of California, Berkeley

AbstractWe present Resilient Distributed Datasets (RDDs), a dis-tributed memory abstraction that lets programmers per-form in-memory computations on large clusters in afault-tolerant manner. RDDs are motivated by two typesof applications that current computing frameworks han-dle inefficiently: iterative algorithms and interactive datamining tools. In both cases, keeping data in memorycan improve performance by an order of magnitude.To achieve fault tolerance efficiently, RDDs provide arestricted form of shared memory, based on coarse-grained transformations rather than fine-grained updatesto shared state. However, we show that RDDs are expres-sive enough to capture a wide class of computations, in-cluding recent specialized programming models for iter-ative jobs, such as Pregel, and new applications that thesemodels do not capture. We have implemented RDDs in asystem called Spark, which we evaluate through a varietyof user applications and benchmarks.

1 IntroductionCluster computing frameworks like MapReduce [10] andDryad [19] have been widely adopted for large-scale dataanalytics. These systems let users write parallel compu-tations using a set of high-level operators, without havingto worry about work distribution and fault tolerance.

Although current frameworks provide numerous ab-stractions for accessing a cluster’s computational re-sources, they lack abstractions for leveraging distributedmemory. This makes them inefficient for an importantclass of emerging applications: those that reuse interme-diate results across multiple computations. Data reuse iscommon in many iterative machine learning and graphalgorithms, including PageRank, K-means clustering,and logistic regression. Another compelling use case isinteractive data mining, where a user runs multiple ad-hoc queries on the same subset of the data. Unfortu-nately, in most current frameworks, the only way to reusedata between computations (e.g., between two MapRe-duce jobs) is to write it to an external stable storage sys-tem, e.g., a distributed file system. This incurs substantialoverheads due to data replication, disk I/O, and serializa-

tion, which can dominate application execution times.Recognizing this problem, researchers have developed

specialized frameworks for some applications that re-quire data reuse. For example, Pregel [22] is a system foriterative graph computations that keeps intermediate datain memory, while HaLoop [7] offers an iterative MapRe-duce interface. However, these frameworks only supportspecific computation patterns (e.g., looping a series ofMapReduce steps), and perform data sharing implicitlyfor these patterns. They do not provide abstractions formore general reuse, e.g., to let a user load several datasetsinto memory and run ad-hoc queries across them.

In this paper, we propose a new abstraction called re-silient distributed datasets (RDDs) that enables efficientdata reuse in a broad range of applications. RDDs arefault-tolerant, parallel data structures that let users ex-plicitly persist intermediate results in memory, controltheir partitioning to optimize data placement, and ma-nipulate them using a rich set of operators.

The main challenge in designing RDDs is defining aprogramming interface that can provide fault toleranceefficiently. Existing abstractions for in-memory storageon clusters, such as distributed shared memory [24], key-value stores [25], databases, and Piccolo [27], offer aninterface based on fine-grained updates to mutable state(e.g., cells in a table). With this interface, the only waysto provide fault tolerance are to replicate the data acrossmachines or to log updates across machines. Both ap-proaches are expensive for data-intensive workloads, asthey require copying large amounts of data over the clus-ter network, whose bandwidth is far lower than that ofRAM, and they incur substantial storage overhead.

In contrast to these systems, RDDs provide an inter-face based on coarse-grained transformations (e.g., map,filter and join) that apply the same operation to manydata items. This allows them to efficiently provide faulttolerance by logging the transformations used to build adataset (its lineage) rather than the actual data.1 If a parti-tion of an RDD is lost, the RDD has enough informationabout how it was derived from other RDDs to recompute

1Checkpointing the data in some RDDs may be useful when a lin-eage chain grows large, however, and we discuss how to do it in §5.4.

Page 49: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Context

Spark

2 SparkContextOverall ideaSimple exampleSchedulingMore Examples

Page 50: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Context

Resilient Distributed Datasets

Existing frameworks are a poor fit for:Iterative algorithmsInteractive data mining

Page 51: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Context

Resilient Distributed Datasets

Existing frameworks are a poor fit for:Iterative algorithmsInteractive data mining

Page 52: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Context

Resilient Distributed Datasets

Both things can be sped up orders of magnitude by keepingstuff in memory!

Page 53: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark

2 SparkContextOverall ideaSimple exampleSchedulingMore Examples

Page 54: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea

RDDs are implemented as lightweight objects inside of aScala shell, where each object represents a sequence ofdeterministic transformations on some data.RDDs can only be constructed in the Scala shell byreferencing some files on disk (usually HDFS), or byoperations on other RDDs.

Page 55: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea

RDDs are implemented as lightweight objects inside of aScala shell, where each object represents a sequence ofdeterministic transformations on some data.RDDs can only be constructed in the Scala shell byreferencing some files on disk (usually HDFS), or byoperations on other RDDs.

Page 56: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea (Example)

val lines = sc.textFile("hdfs://path/to/log")val errors = lines.filter(_.startsWith("ERROR"))val timestamps = (errors

.filter(_.contains("HDFS"))

.map(_.split("\t")(3)))

Page 57: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea

An RDD is a read-only, partitioned collection of records.

Page 58: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea (Map)

union

groupByKey

join with inputs not co-partitioned

join with inputs co-partitioned

map, filter

Narrow Dependencies: Wide Dependencies:

Figure 4: Examples of narrow and wide dependencies. Eachbox is an RDD, with partitions shown as shaded rectangles.

map to the parent’s records in its iterator method.

union: Calling union on two RDDs returns an RDDwhose partitions are the union of those of the parents.Each child partition is computed through a narrow de-pendency on the corresponding parent.7

sample: Sampling is similar to mapping, except thatthe RDD stores a random number generator seed for eachpartition to deterministically sample parent records.

join: Joining two RDDs may lead to either two nar-row dependencies (if they are both hash/range partitionedwith the same partitioner), two wide dependencies, or amix (if one parent has a partitioner and one does not). Ineither case, the output RDD has a partitioner (either oneinherited from the parents or a default hash partitioner).

5 ImplementationWe have implemented Spark in about 14,000 lines ofScala. The system runs over the Mesos cluster man-ager [17], allowing it to share resources with Hadoop,MPI and other applications. Each Spark program runs asa separate Mesos application, with its own driver (mas-ter) and workers, and resource sharing between these ap-plications is handled by Mesos.

Spark can read data from any Hadoop input source(e.g., HDFS or HBase) using Hadoop’s existing inputplugin APIs, and runs on an unmodified version of Scala.

We now sketch several of the technically interestingparts of the system: our job scheduler (§5.1), our Sparkinterpreter allowing interactive use (§5.2), memory man-agement (§5.3), and support for checkpointing (§5.4).

5.1 Job Scheduling

Spark’s scheduler uses our representation of RDDs, de-scribed in Section 4.

Overall, our scheduler is similar to Dryad’s [19], butit additionally takes into account which partitions of per-

7Note that our union operation does not drop duplicate values.

join

union

groupBy

map

Stage 3

Stage 1

Stage 2

A: B:

C: D:

E:

F:

G:

Figure 5: Example of how Spark computes job stages. Boxeswith solid outlines are RDDs. Partitions are shaded rectangles,in black if they are already in memory. To run an action on RDDG, we build build stages at wide dependencies and pipeline nar-row transformations inside each stage. In this case, stage 1’soutput RDD is already in RAM, so we run stage 2 and then 3.

sistent RDDs are available in memory. Whenever a userruns an action (e.g., count or save) on an RDD, the sched-uler examines that RDD’s lineage graph to build a DAGof stages to execute, as illustrated in Figure 5. Each stagecontains as many pipelined transformations with narrowdependencies as possible. The boundaries of the stagesare the shuffle operations required for wide dependen-cies, or any already computed partitions that can short-circuit the computation of a parent RDD. The schedulerthen launches tasks to compute missing partitions fromeach stage until it has computed the target RDD.

Our scheduler assigns tasks to machines based on datalocality using delay scheduling [32]. If a task needs toprocess a partition that is available in memory on a node,we send it to that node. Otherwise, if a task processesa partition for which the containing RDD provides pre-ferred locations (e.g., an HDFS file), we send it to those.

For wide dependencies (i.e., shuffle dependencies), wecurrently materialize intermediate records on the nodesholding parent partitions to simplify fault recovery, muchlike MapReduce materializes map outputs.

If a task fails, we re-run it on another node as longas its stage’s parents are still available. If some stageshave become unavailable (e.g., because an output fromthe “map side” of a shuffle was lost), we resubmit tasks tocompute the missing partitions in parallel. We do not yettolerate scheduler failures, though replicating the RDDlineage graph would be straightforward.

Finally, although all computations in Spark currentlyrun in response to actions called in the driver program,we are also experimenting with letting tasks on the clus-ter (e.g., maps) call the lookup operation, which providesrandom access to elements of hash-partitioned RDDs bykey. In this case, tasks would need to tell the scheduler tocompute the required partition if it is missing.

Map (flatMap in Spark)

Page 59: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea (Reduce)

union

groupByKey

join with inputs not co-partitioned

join with inputs co-partitioned

map, filter

Narrow Dependencies: Wide Dependencies:

Figure 4: Examples of narrow and wide dependencies. Eachbox is an RDD, with partitions shown as shaded rectangles.

map to the parent’s records in its iterator method.

union: Calling union on two RDDs returns an RDDwhose partitions are the union of those of the parents.Each child partition is computed through a narrow de-pendency on the corresponding parent.7

sample: Sampling is similar to mapping, except thatthe RDD stores a random number generator seed for eachpartition to deterministically sample parent records.

join: Joining two RDDs may lead to either two nar-row dependencies (if they are both hash/range partitionedwith the same partitioner), two wide dependencies, or amix (if one parent has a partitioner and one does not). Ineither case, the output RDD has a partitioner (either oneinherited from the parents or a default hash partitioner).

5 ImplementationWe have implemented Spark in about 14,000 lines ofScala. The system runs over the Mesos cluster man-ager [17], allowing it to share resources with Hadoop,MPI and other applications. Each Spark program runs asa separate Mesos application, with its own driver (mas-ter) and workers, and resource sharing between these ap-plications is handled by Mesos.

Spark can read data from any Hadoop input source(e.g., HDFS or HBase) using Hadoop’s existing inputplugin APIs, and runs on an unmodified version of Scala.

We now sketch several of the technically interestingparts of the system: our job scheduler (§5.1), our Sparkinterpreter allowing interactive use (§5.2), memory man-agement (§5.3), and support for checkpointing (§5.4).

5.1 Job Scheduling

Spark’s scheduler uses our representation of RDDs, de-scribed in Section 4.

Overall, our scheduler is similar to Dryad’s [19], butit additionally takes into account which partitions of per-

7Note that our union operation does not drop duplicate values.

join

union

groupBy

map

Stage 3

Stage 1

Stage 2

A: B:

C: D:

E:

F:

G:

Figure 5: Example of how Spark computes job stages. Boxeswith solid outlines are RDDs. Partitions are shaded rectangles,in black if they are already in memory. To run an action on RDDG, we build build stages at wide dependencies and pipeline nar-row transformations inside each stage. In this case, stage 1’soutput RDD is already in RAM, so we run stage 2 and then 3.

sistent RDDs are available in memory. Whenever a userruns an action (e.g., count or save) on an RDD, the sched-uler examines that RDD’s lineage graph to build a DAGof stages to execute, as illustrated in Figure 5. Each stagecontains as many pipelined transformations with narrowdependencies as possible. The boundaries of the stagesare the shuffle operations required for wide dependen-cies, or any already computed partitions that can short-circuit the computation of a parent RDD. The schedulerthen launches tasks to compute missing partitions fromeach stage until it has computed the target RDD.

Our scheduler assigns tasks to machines based on datalocality using delay scheduling [32]. If a task needs toprocess a partition that is available in memory on a node,we send it to that node. Otherwise, if a task processesa partition for which the containing RDD provides pre-ferred locations (e.g., an HDFS file), we send it to those.

For wide dependencies (i.e., shuffle dependencies), wecurrently materialize intermediate records on the nodesholding parent partitions to simplify fault recovery, muchlike MapReduce materializes map outputs.

If a task fails, we re-run it on another node as longas its stage’s parents are still available. If some stageshave become unavailable (e.g., because an output fromthe “map side” of a shuffle was lost), we resubmit tasks tocompute the missing partitions in parallel. We do not yettolerate scheduler failures, though replicating the RDDlineage graph would be straightforward.

Finally, although all computations in Spark currentlyrun in response to actions called in the driver program,we are also experimenting with letting tasks on the clus-ter (e.g., maps) call the lookup operation, which providesrandom access to elements of hash-partitioned RDDs bykey. In this case, tasks would need to tell the scheduler tocompute the required partition if it is missing.

Reduce (reduceByKey in Spark)

Page 60: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea (Transformations)

RDD operations are coarse-grained and high level, likemap, sample, filter, reduceByKey, etc.RDDs are evaluated lazily. Data is processed as late aspossible. The RDD simply records the high level operationorder, dependencies, and data partitions.

Page 61: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea (Transformations)

RDD operations are coarse-grained and high level, likemap, sample, filter, reduceByKey, etc.RDDs are evaluated lazily. Data is processed as late aspossible. The RDD simply records the high level operationorder, dependencies, and data partitions.

Page 62: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea (Transformations)

lines

errors

filter(_.startsWith(“ERROR”))

HDFS errors

time fields

filter(_.contains(“HDFS”)))

map(_.split(‘\t’)(3))

Figure 1: Lineage graph for the third query in our example.Boxes represent RDDs and arrows represent transformations.

lines = spark.textFile("hdfs://...")

errors = lines.filter(_.startsWith("ERROR"))

errors.persist()

Line 1 defines an RDD backed by an HDFS file (as acollection of lines of text), while line 2 derives a filteredRDD from it. Line 3 then asks for errors to persist inmemory so that it can be shared across queries. Note thatthe argument to filter is Scala syntax for a closure.

At this point, no work has been performed on the clus-ter. However, the user can now use the RDD in actions,e.g., to count the number of messages:

errors.count()

The user can also perform further transformations onthe RDD and use their results, as in the following lines:

// Count errors mentioning MySQL:

errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning

// HDFS as an array (assuming time is field

// number 3 in a tab-separated format):

errors.filter(_.contains("HDFS"))

.map(_.split(’\t’)(3))

.collect()

After the first action involving errors runs, Spark willstore the partitions of errors in memory, greatly speed-ing up subsequent computations on it. Note that the baseRDD, lines, is not loaded into RAM. This is desirablebecause the error messages might only be a small frac-tion of the data (small enough to fit into memory).

Finally, to illustrate how our model achieves fault tol-erance, we show the lineage graph for the RDDs in ourthird query in Figure 1. In this query, we started witherrors, the result of a filter on lines, and applied a fur-ther filter and map before running a collect. The Sparkscheduler will pipeline the latter two transformations andsend a set of tasks to compute them to the nodes holdingthe cached partitions of errors. In addition, if a partitionof errors is lost, Spark rebuilds it by applying a filter ononly the corresponding partition of lines.

Aspect RDDs Distr. Shared Mem. Reads Coarse- or fine-grained Fine-grained Writes Coarse-grained Fine-grained Consistency Trivial (immutable) Up to app / runtime Fault recovery Fine-grained and low-

overhead using lineage Requires checkpoints and program rollback

Straggler mitigation

Possible using backup tasks

Difficult

Work placement

Automatic based on data locality

Up to app (runtimes aim for transparency)

Behavior if not enough RAM

Similar to existing data flow systems

Poor performance (swapping?)

Table 1: Comparison of RDDs with distributed shared memory.

2.3 Advantages of the RDD Model

To understand the benefits of RDDs as a distributedmemory abstraction, we compare them against dis-tributed shared memory (DSM) in Table 1. In DSM sys-tems, applications read and write to arbitrary locations ina global address space. Note that under this definition, weinclude not only traditional shared memory systems [24],but also other systems where applications make fine-grained writes to shared state, including Piccolo [27],which provides a shared DHT, and distributed databases.DSM is a very general abstraction, but this generalitymakes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.

The main difference between RDDs and DSM is thatRDDs can only be created (“written”) through coarse-grained transformations, while DSM allows reads andwrites to each memory location.3 This restricts RDDsto applications that perform bulk writes, but allows formore efficient fault tolerance. In particular, RDDs do notneed to incur the overhead of checkpointing, as they canbe recovered using lineage.4 Furthermore, only the lostpartitions of an RDD need to be recomputed upon fail-ure, and they can be recomputed in parallel on differentnodes, without having to roll back the whole program.

A second benefit of RDDs is that their immutable na-ture lets a system mitigate slow nodes (stragglers) by run-ning backup copies of slow tasks as in MapReduce [10].Backup tasks would be hard to implement with DSM, asthe two copies of a task would access the same memorylocations and interfere with each other’s updates.

Finally, RDDs provide two other benefits over DSM.First, in bulk operations on RDDs, a runtime can sched-

3Note that reads on RDDs can still be fine-grained. For example, anapplication can treat an RDD as a large read-only lookup table.

4In some applications, it can still help to checkpoint RDDs withlong lineage chains, as we discuss in Section 5.4. However, this can bedone in the background because RDDs are immutable, and there is noneed to take a snapshot of the whole application as in DSM.

Page 63: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea

Transformations

map( f : T⇒ U) : RDD[T]⇒ RDD[U]filter( f : T⇒ Bool) : RDD[T]⇒ RDD[T]

flatMap( f : T⇒ Seq[U]) : RDD[T]⇒ RDD[U]sample(fraction : Float) : RDD[T]⇒ RDD[T] (Deterministic sampling)

groupByKey() : RDD[(K, V)]⇒ RDD[(K, Seq[V])]reduceByKey( f : (V,V)⇒ V) : RDD[(K, V)]⇒ RDD[(K, V)]

union() : (RDD[T],RDD[T])⇒ RDD[T]join() : (RDD[(K, V)],RDD[(K, W)])⇒ RDD[(K, (V, W))]

cogroup() : (RDD[(K, V)],RDD[(K, W)])⇒ RDD[(K, (Seq[V], Seq[W]))]crossProduct() : (RDD[T],RDD[U])⇒ RDD[(T, U)]

mapValues( f : V⇒W) : RDD[(K, V)]⇒ RDD[(K, W)] (Preserves partitioning)sort(c : Comparator[K]) : RDD[(K, V)]⇒ RDD[(K, V)]

partitionBy(p : Partitioner[K]) : RDD[(K, V)]⇒ RDD[(K, V)]

Actions

count() : RDD[T]⇒ Longcollect() : RDD[T]⇒ Seq[T]

reduce( f : (T,T)⇒ T) : RDD[T]⇒ Tlookup(k : K) : RDD[(K, V)]⇒ Seq[V] (On hash/range partitioned RDDs)

save(path : String) : Outputs RDD to a storage system, e.g., HDFS

Table 2: Transformations and actions available on RDDs in Spark. Seq[T] denotes a sequence of elements of type T.

that searches for a hyperplane w that best separates twosets of points (e.g., spam and non-spam emails). The al-gorithm uses gradient descent: it starts w at a randomvalue, and on each iteration, it sums a function of w overthe data to move w in a direction that improves it.

val points = spark.textFile(...)

.map(parsePoint).persist()

var w = // random initial vector

for (i <- 1 to ITERATIONS) {

val gradient = points.map{ p =>

p.x * (1/(1+exp(-p.y*(w dot p.x)))-1)*p.y

}.reduce((a,b) => a+b)

w -= gradient

}

We start by defining a persistent RDD called pointsas the result of a map transformation on a text file thatparses each line of text into a Point object. We then re-peatedly run map and reduce on points to compute thegradient at each step by summing a function of the cur-rent w. Keeping points in memory across iterations canyield a 20× speedup, as we show in Section 6.1.

3.2.2 PageRank

A more complex pattern of data sharing occurs inPageRank [6]. The algorithm iteratively updates a rankfor each document by adding up contributions from doc-uments that link to it. On each iteration, each documentsends a contribution of r

n to its neighbors, where r is itsrank and n is its number of neighbors. It then updatesits rank to α/N + (1− α)∑ci, where the sum is overthe contributions it received and N is the total number ofdocuments. We can write PageRank in Spark as follows:

// Load graph as an RDD of (URL, outlinks) pairs

ranks0 input file map

contribs0

ranks1

contribs1

ranks2

contribs2

links join

reduce + map

. . .

Figure 3: Lineage graph for datasets in PageRank.

val links = spark.textFile(...).map(...).persist()

var ranks = // RDD of (URL, rank) pairs

for (i <- 1 to ITERATIONS) {

// Build an RDD of (targetURL, float) pairs

// with the contributions sent by each page

val contribs = links.join(ranks).flatMap {

(url, (links, rank)) =>

links.map(dest => (dest, rank/links.size))

}

// Sum contributions by URL and get new ranks

ranks = contribs.reduceByKey((x,y) => x+y)

.mapValues(sum => a/N + (1-a)*sum)

}

This program leads to the RDD lineage graph in Fig-ure 3. On each iteration, we create a new ranks datasetbased on the contribs and ranks from the previous iter-ation and the static links dataset.6 One interesting fea-ture of this graph is that it grows longer with the number

6Note that although RDDs are immutable, the variables ranks andcontribs in the program point to different RDDs on each iteration.

Page 64: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea (Tuning)

A user can indicate which RDDs may get reused andshould be persisted (usually in-memory, but can be spilledto disk via priority) (persist or cache).A user can also configure the partitioning of the data, andthe algorithm through which nodes are chosen by key(helps join efficiency, etc).

Page 65: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea (Tuning)

A user can indicate which RDDs may get reused andshould be persisted (usually in-memory, but can be spilledto disk via priority) (persist or cache).A user can also configure the partitioning of the data, andthe algorithm through which nodes are chosen by key(helps join efficiency, etc).

Page 66: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea (Actions)

Once you have constructed an RDD with approriatetransformations, the user can perform an action.Actions materialize an RDD, fire up the job scheduler, andcause work to be performed.Actions include count, collect, reduce, save.

Page 67: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea (Actions)

Once you have constructed an RDD with approriatetransformations, the user can perform an action.Actions materialize an RDD, fire up the job scheduler, andcause work to be performed.Actions include count, collect, reduce, save.

Page 68: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea (Actions)

Once you have constructed an RDD with approriatetransformations, the user can perform an action.Actions materialize an RDD, fire up the job scheduler, andcause work to be performed.Actions include count, collect, reduce, save.

Page 69: Utah Distributed Systems Meetup and Reading …Map Reduce & Spark Map Reduce MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com,

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea

Transformations

map(f : T ⇒ U) : RDD[T] ⇒ RDD[U]
filter(f : T ⇒ Bool) : RDD[T] ⇒ RDD[T]
flatMap(f : T ⇒ Seq[U]) : RDD[T] ⇒ RDD[U]
sample(fraction : Float) : RDD[T] ⇒ RDD[T]  (deterministic sampling)
groupByKey() : RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
reduceByKey(f : (V, V) ⇒ V) : RDD[(K, V)] ⇒ RDD[(K, V)]
union() : (RDD[T], RDD[T]) ⇒ RDD[T]
join() : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
cogroup() : (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (Seq[V], Seq[W]))]
crossProduct() : (RDD[T], RDD[U]) ⇒ RDD[(T, U)]
mapValues(f : V ⇒ W) : RDD[(K, V)] ⇒ RDD[(K, W)]  (preserves partitioning)
sort(c : Comparator[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]
partitionBy(p : Partitioner[K]) : RDD[(K, V)] ⇒ RDD[(K, V)]

Actions

count() : RDD[T] ⇒ Long
collect() : RDD[T] ⇒ Seq[T]
reduce(f : (T, T) ⇒ T) : RDD[T] ⇒ T
lookup(k : K) : RDD[(K, V)] ⇒ Seq[V]  (on hash/range-partitioned RDDs)
save(path : String) : outputs RDD to a storage system, e.g., HDFS

Table 2: Transformations and actions available on RDDs in Spark. Seq[T] denotes a sequence of elements of type T.

Consider a classification algorithm, such as logistic regression, that searches for a hyperplane w that best separates two sets of points (e.g., spam and non-spam emails). The algorithm uses gradient descent: it starts w at a random value, and on each iteration, it sums a function of w over the data to move w in a direction that improves it.

val points = spark.textFile(...)
                  .map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1/(1+exp(-p.y*(w dot p.x)))-1) * p.y
  }.reduce((a,b) => a+b)
  w -= gradient
}

We start by defining a persistent RDD called points as the result of a map transformation on a text file that parses each line of text into a Point object. We then repeatedly run map and reduce on points to compute the gradient at each step by summing a function of the current w. Keeping points in memory across iterations can yield a 20× speedup, as we show in Section 6.1.
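For reference, a self-contained variant of the gradient-descent example above, written against the public API as you might type it into spark-shell; the Point class, parser, feature count, and iteration count are all assumptions the paper leaves open:

import scala.math.exp

// Assumed point representation: comma-separated features, label (+1/-1) last.
case class Point(x: Array[Double], y: Double)
def parsePoint(line: String): Point = {
  val cols = line.split(",").map(_.toDouble)
  Point(cols.init, cols.last)
}
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (ai, bi) => ai * bi }.sum

val points = sc.textFile("hdfs://path/to/points").map(parsePoint).persist()

val dims = 2                                              // assumed feature count
var w = Array.fill(dims)(scala.util.Random.nextDouble() - 0.5)

for (i <- 1 to 10) {                                      // ITERATIONS
  val gradient = points.map { p =>
    val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
    p.x.map(_ * scale)                                    // per-point gradient vector
  }.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }
}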

3.2.2 PageRank

A more complex pattern of data sharing occurs in PageRank [6]. The algorithm iteratively updates a rank for each document by adding up contributions from documents that link to it. On each iteration, each document sends a contribution of r/n to its neighbors, where r is its rank and n is its number of neighbors. It then updates its rank to α/N + (1 − α)∑ci, where the sum is over the contributions it received and N is the total number of documents. We can write PageRank in Spark as follows:

Figure 3: Lineage graph for datasets in PageRank. The input file is mapped into links; each iteration joins links with ranks_i, then reduces and maps the resulting contribs_i into ranks_{i+1} (ranks0 → contribs0 → ranks1 → contribs1 → ranks2 → contribs2 → ...).

// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank/links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x,y) => x+y)
                  .mapValues(sum => a/N + (1-a)*sum)
}

This program leads to the RDD lineage graph in Figure 3. On each iteration, we create a new ranks dataset based on the contribs and ranks from the previous iteration and the static links dataset.6 One interesting feature of this graph is that it grows longer with the number of iterations.

6 Note that although RDDs are immutable, the variables ranks and contribs in the program point to different RDDs on each iteration.

Page 70

Map Reduce & Spark

Spark

Overall idea

Spark: Overall idea

RDDs are immutable and deterministic.
Easy to launch backup workers for stragglers (a clear win carried over from Map Reduce).
Easy to relaunch computations from failed nodes.
Computations are evaluated lazily, which allows for rich optimizations for locality/partitioning by a job scheduler and query planner.


Page 74

Map Reduce & Spark

Spark

Simple example

Spark

2 Spark
Context
Overall idea
Simple example
Scheduling
More Examples

Page 75

Map Reduce & Spark

Spark

Simple example

Example: Log querying

val lines = sc.textFile("hdfs://path/to/log")
val errors = lines.filter(_.startsWith("ERROR"))
errors.cache()

Page 76

Map Reduce & Spark

Spark

Simple example

Example: Log querying

errors.count()

Page 77

Map Reduce & Spark

Spark

Simple example

Example: Log querying

errors.filter(_.contains("MySQL")).count()

Page 78

Map Reduce & Spark

Spark

Simple example

Example: Log querying

errors.filter(_.contains("HDFS"))
      .map(_.split("\t")(3))
      .collect()

Page 79

Map Reduce & Spark

Spark

Simple example

Example: Log querying

Figure 1: Lineage graph for the third query in our example. Boxes represent RDDs and arrows represent transformations: lines is filtered with _.startsWith("ERROR") into errors, which is filtered with _.contains("HDFS") into HDFS errors and then mapped with _.split('\t')(3) into the time fields.

lines = spark.textFile("hdfs://...")

errors = lines.filter(_.startsWith("ERROR"))

errors.persist()

Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 then asks for errors to persist in memory so that it can be shared across queries. Note that the argument to filter is Scala syntax for a closure.

At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions, e.g., to count the number of messages:

errors.count()

The user can also perform further transformations on the RDD and use their results, as in the following lines:

// Count errors mentioning MySQL:

errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning

// HDFS as an array (assuming time is field

// number 3 in a tab-separated format):

errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()

After the first action involving errors runs, Spark will store the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base RDD, lines, is not loaded into RAM. This is desirable because the error messages might only be a small fraction of the data (small enough to fit into memory).
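A rough way to observe this from a shell session, reusing the errors RDD defined above (the helper is illustrative, and timings depend on data size and cluster state):

def timed[A](body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1e9)
}

val (n1, coldSeconds) = timed(errors.count()) // first action: scans input, fills the cache
val (n2, warmSeconds) = timed(errors.count()) // same answer, typically much faster from memory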

Finally, to illustrate how our model achieves fault tolerance, we show the lineage graph for the RDDs in our third query in Figure 1. In this query, we started with errors, the result of a filter on lines, and applied a further filter and map before running a collect. The Spark scheduler will pipeline the latter two transformations and send a set of tasks to compute them to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying a filter on only the corresponding partition of lines.

Reads: RDDs allow coarse- or fine-grained reads; DSM allows fine-grained reads.
Writes: RDDs allow only coarse-grained writes; DSM allows fine-grained writes.
Consistency: trivial for RDDs (immutable); up to the app / runtime for DSM.
Fault recovery: fine-grained and low-overhead for RDDs using lineage; DSM requires checkpoints and program rollback.
Straggler mitigation: possible for RDDs using backup tasks; difficult for DSM.
Work placement: automatic for RDDs based on data locality; up to the app for DSM (runtimes aim for transparency).
Behavior if not enough RAM: RDDs behave similarly to existing data flow systems; DSM suffers poor performance (swapping?).

Table 1: Comparison of RDDs with distributed shared memory (DSM).

2.3 Advantages of the RDD Model

To understand the benefits of RDDs as a distributed memory abstraction, we compare them against distributed shared memory (DSM) in Table 1. In DSM systems, applications read and write to arbitrary locations in a global address space. Note that under this definition, we include not only traditional shared memory systems [24], but also other systems where applications make fine-grained writes to shared state, including Piccolo [27], which provides a shared DHT, and distributed databases. DSM is a very general abstraction, but this generality makes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.

The main difference between RDDs and DSM is that RDDs can only be created ("written") through coarse-grained transformations, while DSM allows reads and writes to each memory location.3 This restricts RDDs to applications that perform bulk writes, but allows for more efficient fault tolerance. In particular, RDDs do not need to incur the overhead of checkpointing, as they can be recovered using lineage.4 Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.

A second benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce [10]. Backup tasks would be hard to implement with DSM, as the two copies of a task would access the same memory locations and interfere with each other's updates.

Finally, RDDs provide two other benefits over DSM. First, in bulk operations on RDDs, a runtime can schedule tasks based on data locality (see the work placement row of Table 1).

3 Note that reads on RDDs can still be fine-grained. For example, an application can treat an RDD as a large read-only lookup table.

4 In some applications, it can still help to checkpoint RDDs with long lineage chains, as we discuss in Section 5.4. However, this can be done in the background because RDDs are immutable, and there is no need to take a snapshot of the whole application as in DSM.

Page 80

Map Reduce & Spark

Spark

Scheduling

Spark

2 Spark
Context
Overall idea
Simple example
Scheduling
More Examples

Page 81

Map Reduce & Spark

Spark

Scheduling

RDD Representation

Thus, in a job with many iterations, it may be necessary to reliably replicate some of the versions of ranks to reduce fault recovery times [20]. The user can call persist with a RELIABLE flag to do this. However, note that the links dataset does not need to be replicated, because partitions of it can be rebuilt efficiently by rerunning a map on blocks of the input file. This dataset will typically be much larger than ranks, because each document has many links but only one number as its rank, so recovering it using lineage saves time over systems that checkpoint a program's entire in-memory state.
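The RELIABLE flag is the paper's proposed API. In the open-source Spark releases, the closest public mechanism is RDD checkpointing, which writes a materialized copy to reliable storage so a long lineage chain does not have to be replayed; a rough sketch, with the directory and cadence as assumptions:

sc.setCheckpointDir("hdfs://path/to/checkpoints")

// Inside the PageRank loop, occasionally truncate the lineage of ranks.
if (i % 10 == 0) {
  ranks.checkpoint() // marked now, written out when the next job computes ranks
  ranks.count()      // force that job so the checkpoint actually materializes
}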

Finally, we can optimize communication in PageRank by controlling the partitioning of the RDDs. If we specify a partitioning for links (e.g., hash-partition the link lists by URL across nodes), we can partition ranks in the same way and ensure that the join operation between links and ranks requires no communication (as each URL's rank will be on the same machine as its link list). We can also write a custom Partitioner class to group pages that link to each other together (e.g., partition the URLs by domain name). Both optimizations can be expressed by calling partitionBy when we define links:

links = spark.textFile(...).map(...)

.partitionBy(myPartFunc).persist()

After this initial call, the join operation between links and ranks will automatically aggregate the contributions for each URL to the machine that its link list is on, calculate its new rank there, and join it with its links. This type of consistent partitioning across iterations is one of the main optimizations in specialized frameworks like Pregel. RDDs let the user express this goal directly.
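A hedged, self-contained rendering of that co-partitioning idea with the public API; the edge-file format, partition count, damping constant a, and iteration count are assumptions:

import org.apache.spark.HashPartitioner

val parts = 64
// Hash-partition the link table once and cache it.
val links = sc.textFile("hdfs://path/to/edges")            // assumed "url<TAB>dest" lines
  .map { line => val f = line.split("\t"); (f(0), f(1)) }
  .groupByKey(new HashPartitioner(parts))                  // RDD[(url, Iterable[dest])]
  .persist()

val a = 0.15                                               // damping term, "a" in the slide code
val N = links.count()                                      // total number of documents
var ranks = links.mapValues(_ => 1.0)                      // inherits links' partitioner

for (i <- 1 to 10) {
  val contribs = links.join(ranks).flatMap {               // narrow join: inputs co-partitioned
    case (_, (outlinks, rank)) =>
      outlinks.map(dest => (dest, rank / outlinks.size))
  }
  ranks = contribs
    .reduceByKey(new HashPartitioner(parts), _ + _)        // shuffle lands back on links' layout
    .mapValues(sum => a / N + (1 - a) * sum)
}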

4 Representing RDDs

One of the challenges in providing RDDs as an abstraction is choosing a representation for them that can track lineage across a wide range of transformations. Ideally, a system implementing RDDs should provide as rich a set of transformation operators as possible (e.g., the ones in Table 2), and let users compose them in arbitrary ways. We propose a simple graph-based representation for RDDs that facilitates these goals. We have used this representation in Spark to support a wide range of transformations without adding special logic to the scheduler for each one, which greatly simplified the system design.

In a nutshell, we propose representing each RDD through a common interface that exposes five pieces of information: a set of partitions, which are atomic pieces of the dataset; a set of dependencies on parent RDDs; a function for computing the dataset based on its parents; and metadata about its partitioning scheme and data placement. For example, an RDD representing an HDFS file has a partition for each block of the file and knows which machines each block is on.

partitions() : Return a list of Partition objects
preferredLocations(p) : List nodes where partition p can be accessed faster due to data locality
dependencies() : Return a list of dependencies
iterator(p, parentIters) : Compute the elements of partition p given iterators for its parent partitions
partitioner() : Return metadata specifying whether the RDD is hash/range partitioned

Table 3: Interface used to represent RDDs in Spark.

Meanwhile, the result of a map on this RDD has the same partitions, but applies the map function to the parent's data when computing its elements. We summarize this interface in Table 3.
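Read as a sketch, that five-part interface might look roughly like the following trait; the names mirror Table 3 but are illustrative, not Spark's actual internal classes:

trait Partition { def index: Int }
trait Dependency
trait Partitioner { def numPartitions: Int; def getPartition(key: Any): Int }

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                              // atomic pieces of the dataset
  def dependencies: Seq[Dependency]                           // links to parent RDDs
  def iterator(p: Partition,
               parentIters: Seq[Iterator[_]]): Iterator[T]    // compute p from its parents
  def preferredLocations(p: Partition): Seq[String]           // data-placement hints
  def partitioner: Option[Partitioner]                        // hash/range metadata, if any
}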

The most interesting question in designing this interface is how to represent dependencies between RDDs. We found it both sufficient and useful to classify dependencies into two types: narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD, and wide dependencies, where multiple child partitions may depend on it. For example, map leads to a narrow dependency, while join leads to wide dependencies (unless the parents are hash-partitioned). Figure 4 shows other examples.

This distinction is useful for two reasons. First, narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. For example, one can apply a map followed by a filter on an element-by-element basis. In contrast, wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation. Second, recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. In contrast, in a lineage graph with wide dependencies, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution.
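A tiny illustration of where a stage boundary falls, assuming tab-separated "user, action" log lines and an existing sc:

// map and filter are narrow, so they pipeline within one stage;
// groupByKey is wide, so the scheduler cuts the stage here and shuffles by key.
val events = sc.textFile("hdfs://path/to/events")
val parsed = events.map(_.split("\t")).filter(_.length == 2)
val byUser = parsed.map(f => (f(0), f(1))).groupByKey()
byUser.mapValues(_.size).collect()   // the action triggers both stages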

This common interface for RDDs made it possible to implement most transformations in Spark in less than 20 lines of code. Indeed, even new Spark users have implemented new transformations (e.g., sampling and various types of joins) without knowing the details of the scheduler. We sketch some RDD implementations below.

HDFS files: The input RDDs in our samples have been files in HDFS. For these RDDs, partitions returns one partition for each block of the file (with the block's offset stored in each Partition object), preferredLocations gives the nodes the block is on, and iterator reads the block.

map: Calling map on any RDD returns a MappedRDD object. This object has the same partitions and preferred locations as its parent, but applies the function passed to map to the parent's records in its iterator method.

Page 82

Map Reduce & Spark

Spark

Scheduling

Narrow vs Wide Dependencies

Figure 4: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles. Narrow dependencies: map, filter, union, and join with co-partitioned inputs. Wide dependencies: groupByKey and join with inputs not co-partitioned.

union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents. Each child partition is computed through a narrow dependency on the corresponding parent.7

sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records.

join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not). In either case, the output RDD has a partitioner (either one inherited from the parents or a default hash partitioner).


7 Note that our union operation does not drop duplicate values.


Page 83

Map Reduce & Spark

Spark

Scheduling

Job Scheduling


5 Implementation

We have implemented Spark in about 14,000 lines of Scala. The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop, MPI and other applications. Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.

Spark can read data from any Hadoop input source (e.g., HDFS or HBase) using Hadoop's existing input plugin APIs, and runs on an unmodified version of Scala.

We now sketch several of the technically interesting parts of the system: our job scheduler (§5.1), our Spark interpreter allowing interactive use (§5.2), memory management (§5.3), and support for checkpointing (§5.4).

5.1 Job Scheduling

Spark's scheduler uses our representation of RDDs, described in Section 4.

Overall, our scheduler is similar to Dryad's [19], but it additionally takes into account which partitions of persistent RDDs are available in memory.


Figure 5: Example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles, in black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, stage 1's output RDD is already in RAM, so we run stage 2 and then 3. (The example DAG applies groupBy, map, union, and join to RDDs A through G across the three stages.)

Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute, as illustrated in Figure 5. Each stage contains as many pipelined transformations with narrow dependencies as possible. The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD. The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD.

Our scheduler assigns tasks to machines based on data locality using delay scheduling [32]. If a task needs to process a partition that is available in memory on a node, we send it to that node. Otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), we send it to those.

For wide dependencies (i.e., shuffle dependencies), we currently materialize intermediate records on the nodes holding parent partitions to simplify fault recovery, much like MapReduce materializes map outputs.

If a task fails, we re-run it on another node as long as its stage's parents are still available. If some stages have become unavailable (e.g., because an output from the "map side" of a shuffle was lost), we resubmit tasks to compute the missing partitions in parallel. We do not yet tolerate scheduler failures, though replicating the RDD lineage graph would be straightforward.

Finally, although all computations in Spark currently run in response to actions called in the driver program, we are also experimenting with letting tasks on the cluster (e.g., maps) call the lookup operation, which provides random access to elements of hash-partitioned RDDs by key. In this case, tasks would need to tell the scheduler to compute the required partition if it is missing.

Page 84

Map Reduce & Spark

Spark

More Examples

Spark

2 Spark
Context
Overall idea
Simple example
Scheduling
More Examples

Page 85

Map Reduce & Spark

Spark

More Examples

Example: Word Count

val m = documents.flatMap(_._2.split("\\s+")
                              .map(word => (word, "1")))
val r = m.reduceByKey((a, b) => (a.toInt + b.toInt).toString)

// do we need a combine step?
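One way to answer the slide's question: reduceByKey already merges values within each map-side partition before shuffling, so no separate combine step is needed; keeping the counts as Ints also avoids the string conversions (same assumed (id, text) shape for documents):

val counts = documents
  .flatMap(_._2.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)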

Page 86

Map Reduce & Spark

Spark

More Examples

Example: Reverse Index

val m = documents.flatMap(doc =>
  doc._2.split("\\s+").distinct.map(word => (word, doc._1)))
val r = m.reduceByKey((a, b) => a + "\n" + b)
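An alternative sketch that keeps each posting list as a collection rather than a newline-joined string, under the same assumed (docId, text) shape:

val index = documents
  .flatMap { case (docId, text) =>
    text.split("\\s+").distinct.map(word => (word, docId))
  }
  .groupByKey()   // RDD[(word, Iterable[docId])]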

Page 87

Map Reduce & Spark

Spark

More Examples

Example: Grep

def search(keyword: String) = {
  documents.filter(_._2.contains(keyword))
           .map(_._1)
           .reduce((a, b) => a + "\n" + b)
}

Page 88

Map Reduce & Spark

Spark

More Examples

Example: Page Rank

val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank/links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x,y) => x+y)
                  .mapValues(sum => a/N + (1-a)*sum)
}


Page 90

Map Reduce & Spark

Conclusion?

Outline

1 Map Reduce

2 Spark

3 Conclusion?

Page 91

Map Reduce & Spark

Conclusion?

Conclusion?

3 Conclusion?

Page 92

Map Reduce & Spark

Conclusion?

Conclusion?

Can anything really be dead?
Is Map Reduce dead?
Is Spark the last word?
Questions?


Page 96

Map Reduce & Spark

Meetup Wrap-up

Shameless plug

Page 97

Map Reduce & Spark

Meetup Wrap-up

Shameless plug

Space Monkey!

Large scale storage
Distributed system design and implementation
Security and cryptography engineering
Erasure codes
Monitoring and sooo much data

Page 98

Map Reduce & Spark

Meetup Wrap-up

Shameless plug

Space Monkey!

Come work with us!