map reduce vs spark

MapReduce vs/and Spark

Tudor Lapusan

BigData Romanian Tour - Timisoara

History

MapReduce basic functionalities

● Fault tolerance

● Monitoring & status updates

● Scalability

Hadoop MapReduce

Input → Map → Reduce → Output

Hadoop MapReduce

Input → Map → Shuffle → Reduce → Output
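The pipeline above can be imitated on a single machine with plain Scala collections. This is only a sketch of the three phases using word count as the job, not Hadoop's API: the object name `MapReducePhases` and its helper functions are hypothetical, and a real job would run each phase in parallel across many servers.

```scala
object MapReducePhases {
  // Map phase: emit one (word, 1) pair for every word of every input line.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap { line =>
      line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq
    }

  // Shuffle phase: bring together all values that share the same key.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: combine the grouped values of each key into one result.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  // Input -> Map -> Shuffle -> Reduce -> Output
  def run(lines: Seq[String]): Map[String, Int] =
    reducePhase(shufflePhase(mapPhase(lines)))

  def main(args: Array[String]): Unit = {
    val input = Seq("spark and mapreduce", "spark and hadoop")
    println(run(input))
  }
}
```

In Hadoop the shuffle is performed by the framework between the user-written map and reduce functions; here it is written out explicitly only to make the data movement visible.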

MapReduce DAG

[Diagram: DAG with nodes A, B, C, D, E, F]

Spark

● RDD

● Operations: Transformations and Actions

RDD - Resilient Distributed Dataset

An RDD is a fault-tolerant collection of elements, distributed across many servers, on which we can perform parallel operations.

RDD

Scala code

// Create an RDD from a local collection (sc is the SparkContext)
val data = Array(1, 2, 3, 4, 5, 6, 7, 8)

val rddData = sc.parallelize(data)

RDD

Scala code

// Create an RDD with one element per line of a text file
val rddFile = sc.textFile("data.txt")

RDD persistence

MEMORY_ONLY

MEMORY_AND_DISK

MEMORY_ONLY_SER

MEMORY_AND_DISK_SER

DISK_ONLY

MEMORY_ONLY_2

MEMORY_AND_DISK_2

OFF_HEAP
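A storage level from the list above is chosen by passing it to `persist`. The sketch below assumes a local Spark installation on the classpath; the object name `PersistenceSketch`, the helper `cached`, and the application name are hypothetical and used only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistenceSketch {
  // Hypothetical helper: keep partitions in memory and spill to disk if they do not fit.
  def cached(rdd: RDD[Int]): RDD[Int] =
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

  def main(args: Array[String]): Unit = {
    // Assumption: run locally, using all available cores.
    val conf = new SparkConf().setAppName("persistence-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val squares = cached(sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8)).map(x => x * x))

      // The first action computes the RDD and stores its partitions.
      val total = squares.reduce(_ + _)
      // The second action can reuse the stored partitions instead of recomputing them.
      val count = squares.count()

      println(s"total=$total count=$count level=${squares.getStorageLevel}")
      squares.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Calling `cache()` on an RDD is shorthand for `persist(StorageLevel.MEMORY_ONLY)`.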

Transformations

Transformations are operations on RDDs that return new RDDs.

[Diagram: RDD 1 → RDD 2]

Transformations

InputRDD {1,2,3,4,5,6}

map x => x + 1 → MapRDD {2,3,4,5,6,7}

filter x => x != 4 → FilterRDD {1,2,3,5,6}
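The two transformations on this slide can be written as follows. This is a minimal sketch assuming a local Spark installation; the object name `TransformationsSketch` and the helpers `plusOne` and `dropFour` are hypothetical. Both helpers are applied to the same input RDD, which matches the values shown on the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TransformationsSketch {
  // map: build a new RDD by applying a function to every element.
  def plusOne(input: RDD[Int]): RDD[Int] = input.map(x => x + 1)

  // filter: build a new RDD that keeps only the elements passing the predicate.
  def dropFour(input: RDD[Int]): RDD[Int] = input.filter(x => x != 4)

  def main(args: Array[String]): Unit = {
    // Assumption: run locally, using all available cores.
    val conf = new SparkConf().setAppName("transformations-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val inputRDD = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))

      // Transformations are lazy: these two lines only describe new RDDs.
      val mapRDD = plusOne(inputRDD)
      val filterRDD = dropFour(inputRDD)

      // collect() is an action, used here only to bring the results back for printing.
      println(mapRDD.collect().mkString("{", ",", "}"))
      println(filterRDD.collect().mkString("{", ",", "}"))
    } finally {
      sc.stop()
    }
  }
}
```

The input RDD is never modified; each transformation returns a new RDD that remembers how it was derived from its parent.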

Actions

Actions are operations on an RDD that return a final value or write data to an external storage system.

Actions

InputRDD {1,2,3,4,5,6}

map x => x + 1 → MapRDD {2,3,4,5,6,7}

filter x => x != 4 → FilterRDD {1,2,3,5,6}

count() = 6

take(2) = {1,2}

saveAsTextFile()
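The three actions named on this slide could be invoked as in the sketch below, assuming a local Spark installation. The object name `ActionsSketch`, the helper `summarize`, and the temporary output directory are hypothetical choices for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ActionsSketch {
  // Hypothetical helper: run two actions and return their results to the driver.
  def summarize(rdd: RDD[Int]): (Long, Seq[Int]) =
    (rdd.count(), rdd.take(2).toSeq)

  def main(args: Array[String]): Unit = {
    // Assumption: run locally, using all available cores.
    val conf = new SparkConf().setAppName("actions-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val inputRDD = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))

      // count() and take(n) return values to the driver program.
      val (n, firstTwo) = summarize(inputRDD)
      println(s"count=$n take(2)=${firstTwo.mkString("{", ",", "}")}")

      // saveAsTextFile() writes one part file per partition.
      // The target directory must not exist yet, so a fresh child path is used.
      val outputDir = Files.createTempDirectory("actions-sketch").resolve("out").toString
      inputRDD.saveAsTextFile(outputDir)
      println(s"saved to $outputDir")
    } finally {
      sc.stop()
    }
  }
}
```

Unlike transformations, each of these calls triggers an actual computation of the RDD.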

Spark DAG

[Diagram: DAG of RDD 1 through RDD 6, with legend entries Action, Transformation, and Stage]

Spark DAG vs MapReduce DAG

[Diagram: the Spark DAG (RDD 1 through RDD 6) side by side with the MapReduce DAG (nodes A through F)]

Programming languages

MapReduce: Java, Ruby, Perl, Python, PHP, R, C++

Spark: Java, Scala, Python

Ease of use

- Spark is easier to program and includes an interactive mode.

- Hadoop MapReduce is harder to program, but many tools are available to make it easier.

Performance : Sort Benchmark 2013

Performance : Sort Benchmark 2014

Costs: hardware recommendation

           Spark                         Hadoop MapReduce
Cores      8-16                          4
Memory     8 GB to hundreds of GB        24 GB
Disks      4-8                           4-6 one-TB disks
Network    10 Gigabit or more            1 Gigabit Ethernet
Source     Spark recommendation          Hortonworks recommendation

Costs : developers

Questions

@tlapusan

