Spark
A small presentation about Spark.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Presentation by Mário Almeida
Outline
● Motivation
● RDDs Overview
● Spark
● Data Sharing
● Example: Log Mining
● Fault Tolerance
● Example: Logistic Regression
● RDD Representation
● Evaluation
● Conclusion
Motivation
How to perform large-scale data analytics?
● MapReduce
● Dryad
Problems?
● How to reuse intermediate results? Writing them to a DFS adds overhead!
● Specialized frameworks (e.g. Pregel) provide no abstraction for general reuse!
● How to provide fault tolerance efficiently? Shared memory and key-value stores (e.g. Piccolo) rely on fine-grained updates, which are costly to replicate!
RDDs Overview
● Read-only, partitioned collection of records
● Created through transformations on data in stable storage or on other RDDs
● Carries information on the lineage of transformations used to build it
● Gives control over partitioning and persistence (e.g. non-serialized in-memory storage)
Spark
● Exposes RDDs through a language-integrated API.
● RDDs can be used in actions, which return a value or export it to a storage system (e.g. count, collect and save).
● The persist method indicates which RDDs to reuse (default: stored in memory).
Data Sharing in MapReduce
Overhead: replication, serialization, and disk I/O!
Data Sharing in Spark
In-memory data sharing is 10-100x faster than network and disk.
Example - Log Mining
Load error messages into memory and search for patterns.
Scans 1 TB in 5-7 s (170 s for on-disk data).
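The workflow on this slide can be sketched in plain Python (Spark would distribute the same pattern across a cluster): filter the error lines once, keep them in memory, then run repeated queries against the cached subset. The log lines below are invented for illustration.

```python
# Plain-Python sketch of the log-mining pattern (not distributed):
# filter once, cache the result, then run repeated queries on it.
lines = [
    "INFO  job started",
    "ERROR HDFS connection timed out",
    "INFO  heartbeat ok",
    "ERROR disk quota exceeded",
]

# like lines.filter(...).persist(): computed once, kept in memory
errors = [l for l in lines if l.startswith("ERROR")]

# repeated queries ("actions") then hit the in-memory subset
hdfs_errors = sum(1 for l in errors if "HDFS" in l)
print(hdfs_errors)    # 1
print(len(errors))    # 2
```

The speedup on the slide comes from the second step: once `errors` lives in memory, each new pattern search avoids rereading the full log from disk.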
Fault Tolerance
RDDs keep information of the transformations used to build them. This lineage can be used to recover lost data.
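Lineage-based recovery can be sketched with a hypothetical class (illustrative names, not Spark code): each dataset remembers the parent and the deterministic transformation that produced it, so lost in-memory data can be recomputed rather than restored from a replica.

```python
# Sketch of lineage-based recovery: a lost partition is rebuilt by
# re-applying the recorded transformation to its parent.
class LineageRDD:
    def __init__(self, parent, transform):
        self.parent = parent        # source data or another LineageRDD
        self.transform = transform  # deterministic function over the parent
        self.data = None            # in-memory copy; None means "lost"

    def materialize(self):
        parent_data = (self.parent.materialize()
                       if isinstance(self.parent, LineageRDD) else self.parent)
        self.data = self.transform(parent_data)
        return self.data

base = [1, 2, 3, 4, 5]
squares = LineageRDD(base, lambda xs: [x * x for x in xs])
evens = LineageRDD(squares, lambda xs: [x for x in xs if x % 2 == 0])

evens.materialize()
evens.data = None            # simulate losing the in-memory partition
print(evens.materialize())   # recomputed from lineage: [4, 16]
```

No replication was needed: the chain of transformations is enough to rebuild the lost data.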
Example - Logistic Regression
Many machine learning algorithms are iterative in nature, running repeated optimization steps over the same data.
With MapReduce, each step to calculate the gradient rereads the data from disk.
With Spark, the data is loaded into memory once and reused across iterations!
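The iterative pattern can be shown with a toy plain-Python version of logistic regression (invented 1-D data, not Spark code): the points are "loaded" once, and every gradient step reuses the same in-memory list, which is exactly what repeated MapReduce jobs cannot do.

```python
# Toy logistic regression: gradient descent over points held in memory.
import math

# dataset loaded once: (feature, label in {-1, +1})
points = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]

w = 0.0
for _ in range(100):            # every iteration reuses the cached points
    grad = sum(
        (1.0 / (1.0 + math.exp(-y * w * x)) - 1.0) * y * x
        for x, y in points
    )
    w -= 0.1 * grad             # gradient step on the logistic loss

# the learned weight separates the two classes
print(all((1 if w * x > 0 else -1) == y for x, y in points))   # True
```

In Spark the loop body would be a map/reduce over a persisted RDD of points; the key point is that only the first iteration pays the loading cost.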
Logistic Regression Performance
● 30 GB dataset
● 20 machines with 4 cores and 15 GB RAM each
● Hadoop: 127 s per iteration
● Spark: 174 s for the first iteration, 6 s for subsequent iterations
Representing RDDs
● Narrow dependencies allow pipelined execution.
● Wide dependencies require data from all parent partitions.
● Wide dependencies are harder to recover from after a failure!
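The difference can be sketched with a hypothetical two-partition example in plain Python (not Spark internals): a map is narrow because it runs within each partition, while a groupByKey-style step is wide because every child partition may need records from every parent partition.

```python
# Narrow vs. wide dependencies on two toy parent partitions.
parents = [
    [("a", 1), ("b", 2)],   # parent partition 0
    [("a", 3), ("b", 4)],   # parent partition 1
]

# narrow dependency: map runs per-partition, no data movement
mapped = [[(k, v * 10) for k, v in part] for part in parents]

# wide dependency: shuffle each record to a child partition by key
# (simple deterministic partitioner for illustration)
num_children = 2
children = [[] for _ in range(num_children)]
for part in mapped:
    for k, v in part:
        children[ord(k) % num_children].append((k, v))

# each child partition received records from BOTH parent partitions,
# so losing one child requires recomputing data from all parents
print(children)   # [[('b', 20), ('b', 40)], [('a', 10), ('a', 30)]]
```

This is why lineage recovery is cheap for narrow dependencies (recompute one parent partition) but expensive for wide ones (recompute all parents feeding the shuffle).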
Evaluation - Iteration times
● Hadoop runs an extra MapReduce job to convert the data to binary.
● Hadoop's heartbeat protocol adds per-job overhead.
● Differences shrink when the workload is computation-intensive.
Evaluation - number of machines
● Later iterations: Spark is 25.3x and 20.7x faster than Hadoop.
● First iteration: 1.9x and 3.2x faster.
Evaluation - Partitioning
PageRank algorithm on a 54 GB dataset, building a link graph of 4 million articles.
Evaluation - Failures
100 GB working set
Conclusion
Spark is up to 20x faster than Hadoop for iterative applications (by avoiding disk I/O and serialization).
Can interactively scan 1 TB with 5-7 s latency.
Quick recovery: only lost RDD partitions are rebuilt.
Pregel and HaLoop can be built on top of Spark.
Best suited to batch applications that apply the same operation to all elements of a dataset.
References
● Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
● SlideShare: /Hadoop_Summit/spark-and-shark