map reduce vs spark
MapReduce vs/and Spark
Tudor Lapusan
BigData Romanian Tour - Timisoara
History
MapReduce basic functionalities
● Fault tolerance
● Monitoring & status updates
● Scalability
Hadoop MapReduce
Input -> Map -> Reduce -> Output
Hadoop MapReduce
Input -> Map -> Shuffle -> Reduce -> Output
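The Map, Shuffle and Reduce phases can be sketched with plain Scala collections. This is only a single-machine illustration of the data flow, not Hadoop code; the word-count task and the sample input lines are assumptions made for the example.

```scala
// Word count as three explicit phases, using plain Scala collections.
// Illustrative only: the sample lines are made up and no Hadoop API is involved.
val input = Seq("spark and hadoop", "hadoop mapreduce")

// Map: emit one (word, 1) pair per word.
val mapped: Seq[(String, Int)] =
  input.flatMap(_.split("\\s+")).map(word => (word, 1))

// Shuffle: bring all values that share a key together.
val shuffled: Map[String, Seq[Int]] =
  mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

// Reduce: combine the values of each key into a single result.
val reduced: Map[String, Int] =
  shuffled.map { case (word, ones) => (word, ones.sum) }

println(reduced("hadoop")) // 2
```

In a real cluster the shuffle moves data between machines over the network, which is why it appears as its own step on the slide.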
MapReduce DAG
[Figure: a DAG of MapReduce jobs A, B, C, D, E, F]
Spark
● RDD
● Operations: Transformations and Actions
RDD - Resilient Distributed Dataset
An RDD is a fault-tolerant collection of elements, distributed across many servers, on which we can perform parallel operations.
RDD
Scala code
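// sc is the SparkContext (predefined in spark-shell)
val data = Array(1, 2, 3, 4, 5, 6, 7, 8)
// Distribute the local array across the cluster as an RDD
val rddData = sc.parallelize(data)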
RDD
Scala code
// Each line of the file becomes one element of the RDD
val rddFile = sc.textFile("data.txt")
RDD persistence
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
MEMORY_ONLY_2
MEMORY_AND_DISK_2
OFF_HEAP
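A storage level from the list above is chosen by passing it to persist(). The following is a minimal sketch that assumes a local Spark installation; the application name, the local[*] master and the sample data are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Local context for experimenting; inside spark-shell, getOrCreate returns the existing one.
val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val squares = sc.parallelize(1 to 8).map(x => x * x)

// Keep partitions in memory and spill the ones that do not fit to disk.
squares.persist(StorageLevel.MEMORY_AND_DISK)

val total = squares.reduce(_ + _) // first action computes the RDD and stores its partitions
val count = squares.count()       // second action reuses the stored partitions
```

Calling cache() on an RDD is shorthand for persist(StorageLevel.MEMORY_ONLY).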
Transformations
RDD 1 -> RDD 2
Transformations are operations on RDDs that return new RDDs.
Transformations
InputRDD {1,2,3,4,5,6}
map(x => x + 1): InputRDD -> MapRDD {2,3,4,5,6,7}
filter(x => x != 4): InputRDD -> FilterRDD {1,2,3,5,6}
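The map and filter example from this slide might look as follows in Scala. This is a sketch assuming a local Spark installation; the variable names, application name and local[*] master are chosen for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimenting; inside spark-shell, getOrCreate returns the existing one.
val conf = new SparkConf().setAppName("transformations-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val inputRDD = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))

// Transformations are lazy: these two lines only describe new RDDs.
val mapRDD = inputRDD.map(x => x + 1)
val filterRDD = inputRDD.filter(x => x != 4)

// collect() is an action: it triggers the computation and returns the elements.
val mapped = mapRDD.collect().toList      // List(2, 3, 4, 5, 6, 7)
val filtered = filterRDD.collect().toList // List(1, 2, 3, 5, 6)
```

Both transformations read from inputRDD and leave it unchanged, since RDDs are immutable.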
Actions
Actions are operations on an RDD that return a final value or write the data to an external storage system.
Actions
InputRDD {1,2,3,4,5,6}
map(x => x + 1): InputRDD -> MapRDD {2,3,4,5,6,7}
filter(x => x != 4): InputRDD -> FilterRDD {1,2,3,5,6}
Actions: count() = 6, take(2) = {1,2}, saveAsTextFile()
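The three actions named on this slide can be tried on the input RDD as sketched below. The sketch assumes a local Spark installation; the output directory is a hypothetical temporary location created only for the example.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimenting; inside spark-shell, getOrCreate returns the existing one.
val conf = new SparkConf().setAppName("actions-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val inputRDD = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))

val n = inputRDD.count()        // number of elements, returned to the driver
val firstTwo = inputRDD.take(2) // first two elements as a local array

// saveAsTextFile writes one part file per partition into a new directory.
// The target path is hypothetical and must not exist before the call.
val outDir = Files.createTempDirectory("rdd-actions").resolve("output").toString
inputRDD.saveAsTextFile(outDir)
```

Each action starts a job of its own, so an RDD that feeds several actions is a natural candidate for persistence.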
Spark DAG
[Figure: RDD 1 to RDD 6 connected by transformations, grouped into stages, and ending with an action]
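The DAG that Spark records for an RDD can be inspected with toDebugString. The following sketch assumes a local Spark installation; the word-count data and variable names are made up for the example. It uses reduceByKey because that operation needs a shuffle, and Spark starts a new stage at every shuffle boundary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimenting; inside spark-shell, getOrCreate returns the existing one.
val conf = new SparkConf().setAppName("dag-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
val pairs = words.map(w => (w, 1))    // stays in the same stage as its parent
val counts = pairs.reduceByKey(_ + _) // needs a shuffle, so a new stage starts here

// Print the lineage (the chain of RDDs) recorded for counts.
println(counts.toDebugString)

// Nothing has been computed so far; the action below submits the job.
val result = counts.collect().toMap
```

The printed lineage lists the RDDs from the last one back to the first, with indentation marking the stage boundary.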
Spark DAG vs MapReduce DAG
[Figure: the Spark DAG (RDD 1 to RDD 6) shown next to the MapReduce DAG (jobs A to F)]
Programming languages
MapReduce: Java, Ruby, Perl, Python, PHP, R, C++
Spark: Java, Scala, Python
Ease of use
- Spark is easier to program and includes an interactive mode.
- Hadoop MapReduce is harder to program, but many tools are available to make it easier.
Performance : Sort Benchmark 2013
Performance : Sort Benchmark 2014
Costs
Costs : hardware recommendation
          Spark                     Hadoop MapReduce
Cores     8-16                      4
Memory    8 GB to hundreds of GB    24 GB
Disks     4-8                       4-6 one-TB disks
Network   10 Gigabit or more        1 Gigabit Ethernet
Source    Spark recommendation      Hortonworks recommendation
Costs : developers