Spark
A small presentation about Spark.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Presentation by Mário Almeida
Outline
● Motivation
● RDDs Overview
● Spark
● Data Sharing
● Example: Log Mining
● Fault Tolerance
● Example: Logistic Regression
● RDD Representation
● Evaluation
● Conclusion
Motivation
How to perform large-scale data analytics?
● MapReduce
● Dryad
Problems?
● How to reuse intermediate results? Writing them to a DFS adds overhead!
● Specialized frameworks (e.g. Pregel) provide no abstraction for general reuse!
● How to provide fault tolerance efficiently? Shared memory and key-value stores (e.g. Piccolo) rely on fine-grained updates, which are costly to replicate!
RDDs Overview
● Read-only, partitioned collection of records
● Created through transformations on data in stable storage or on other RDDs
● Carries information on the lineage of transformations used to build it
● Gives control over partitioning and persistence (e.g. non-serialized in-memory storage)
Spark
● Exposes RDDs through a language-integrated API.
● RDDs can be used in actions, which return a value or export it to a storage system (e.g. count, collect and save).
● The persist method indicates which RDDs to reuse (default: stored in memory).
Data Sharing in MapReduce
Overhead: replication, serialization, and disk I/O!
Data Sharing in Spark
In-memory data sharing is 10-100x faster than network and disk.
Example - Log Mining
Load error messages into memory and search for patterns.
Scans 1 TB in 5-7 s (170 s for on-disk data).
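The workflow on this slide can be sketched in plain Python (Spark would distribute the same pattern across a cluster): filter the error lines once, keep them in memory, then run repeated queries against the cached subset. The log lines below are invented for illustration.

```python
# Plain-Python sketch of the log-mining pattern (not distributed):
# filter once, cache the result, then run repeated queries on it.
lines = [
    "INFO  job started",
    "ERROR HDFS connection timed out",
    "INFO  heartbeat ok",
    "ERROR disk quota exceeded",
]

# like lines.filter(...).persist(): computed once, kept in memory
errors = [l for l in lines if l.startswith("ERROR")]

# repeated queries ("actions") then hit the in-memory subset
hdfs_errors = sum(1 for l in errors if "HDFS" in l)
print(hdfs_errors)    # 1
print(len(errors))    # 2
```

The speedup on the slide comes from the second step: once `errors` lives in memory, each new pattern search avoids rereading the full log from disk.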
Fault Tolerance
RDDs keep information of the transformations used to build them. This lineage can be used to recover lost data.
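Lineage-based recovery can be sketched with a hypothetical class (illustrative names, not Spark code): each dataset remembers the parent and the deterministic transformation that produced it, so lost in-memory data can be recomputed rather than restored from a replica.

```python
# Sketch of lineage-based recovery: a lost partition is rebuilt by
# re-applying the recorded transformation to its parent.
class LineageRDD:
    def __init__(self, parent, transform):
        self.parent = parent        # source data or another LineageRDD
        self.transform = transform  # deterministic function over the parent
        self.data = None            # in-memory copy; None means "lost"

    def materialize(self):
        parent_data = (self.parent.materialize()
                       if isinstance(self.parent, LineageRDD) else self.parent)
        self.data = self.transform(parent_data)
        return self.data

base = [1, 2, 3, 4, 5]
squares = LineageRDD(base, lambda xs: [x * x for x in xs])
evens = LineageRDD(squares, lambda xs: [x for x in xs if x % 2 == 0])

evens.materialize()
evens.data = None            # simulate losing the in-memory partition
print(evens.materialize())   # recomputed from lineage: [4, 16]
```

No replication was needed: the chain of transformations is enough to rebuild the lost data.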
Example - Logistic Regression
Many machine learning algorithms are iterative in nature, running repeated optimization steps over the same data.
With MapReduce, each step to calculate the gradient rereads the data from disk.
With Spark, the data is loaded into memory once and reused across iterations!
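The iterative pattern can be shown with a toy plain-Python version of logistic regression (invented 1-D data, not Spark code): the points are "loaded" once, and every gradient step reuses the same in-memory list, which is exactly what repeated MapReduce jobs cannot do.

```python
# Toy logistic regression: gradient descent over points held in memory.
import math

# dataset loaded once: (feature, label in {-1, +1})
points = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]

w = 0.0
for _ in range(100):            # every iteration reuses the cached points
    grad = sum(
        (1.0 / (1.0 + math.exp(-y * w * x)) - 1.0) * y * x
        for x, y in points
    )
    w -= 0.1 * grad             # gradient step on the logistic loss

# the learned weight separates the two classes
print(all((1 if w * x > 0 else -1) == y for x, y in points))   # True
```

In Spark the loop body would be a map/reduce over a persisted RDD of points; the key point is that only the first iteration pays the loading cost.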
Logistic Regression Performance
● 30 GB dataset
● 20 machines with 4 cores and 15 GB RAM each
● Hadoop: 127 s per iteration
● Spark: 174 s for the first iteration, 6 s for subsequent iterations
Representing RDDs
● Narrow dependencies allow pipelined execution.
● Wide dependencies require data from all parent partitions.
● Wide dependencies are harder to recover from after a failure!
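The difference can be sketched with a hypothetical two-partition example in plain Python (not Spark internals): a map is narrow because it runs within each partition, while a groupByKey-style step is wide because every child partition may need records from every parent partition.

```python
# Narrow vs. wide dependencies on two toy parent partitions.
parents = [
    [("a", 1), ("b", 2)],   # parent partition 0
    [("a", 3), ("b", 4)],   # parent partition 1
]

# narrow dependency: map runs per-partition, no data movement
mapped = [[(k, v * 10) for k, v in part] for part in parents]

# wide dependency: shuffle each record to a child partition by key
# (simple deterministic partitioner for illustration)
num_children = 2
children = [[] for _ in range(num_children)]
for part in mapped:
    for k, v in part:
        children[ord(k) % num_children].append((k, v))

# each child partition received records from BOTH parent partitions,
# so losing one child requires recomputing data from all parents
print(children)   # [[('b', 20), ('b', 40)], [('a', 10), ('a', 30)]]
```

This is why lineage recovery is cheap for narrow dependencies (recompute one parent partition) but expensive for wide ones (recompute all parents feeding the shuffle).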
Evaluation - Iteration times
● Hadoop runs an extra MapReduce job to convert the data to binary.
● Hadoop's heartbeat protocol adds per-job overhead.
● Differences shrink when the workload is computation-intensive.
Evaluation - number of machines
● Later iterations: Spark is 25.3x and 20.7x faster than Hadoop.
● First iteration: 1.9x and 3.2x faster.
Evaluation - Partitioning
PageRank algorithm on a 54 GB dataset, building a link graph of 4 million articles.
Evaluation - Failures
100 GB working set
Conclusion
Spark is up to 20x faster than Hadoop for iterative applications (by avoiding disk I/O and serialization).
Can interactively scan 1 TB with 5-7 s latency.
Quick recovery: only lost RDD partitions are rebuilt.
Pregel and HaLoop can be built on top of Spark.
Best suited to batch applications that apply the same operation to all elements of a dataset.
References
● Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
● SlideShare: /Hadoop_Summit/spark-and-shark