Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Presentation by Mário Almeida

TRANSCRIPT

Page 1: Spark

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Presentation by Mário Almeida

Page 2: Spark

Outline
● Motivation
● RDDs Overview
● Spark
● Data Sharing
● Example: Log Mining
● Fault Tolerance
● Example: Logistic Regression
● RDD Representation
● Evaluation
● Conclusion

Page 3: Spark

Motivation

How to perform large-scale data analytics?
● MapReduce
● Dryad

Problems?
● Reusing intermediate results requires writing them to a distributed file system (DFS): replication, serialization and disk I/O add overhead!
● Specialized frameworks such as Pregel provide no abstraction for general reuse!
● How to provide fault tolerance efficiently? Shared memory and key-value stores such as Piccolo offer only fine-grained updates, which are expensive to make fault-tolerant!


Page 4: Spark

RDDs Overview

Read-only, partitioned collection of records

Created through transformations on data in stable storage or other RDDs

Has information on the lineage of transformations

Control over partitioning and persistence (e.g. non-serialized in-memory storage)
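A minimal sketch of these properties using Spark's Scala API (assuming an existing SparkContext `sc`; the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs://...")             // RDD from stable storage
    val pairs = lines.map(l => (l.split(",")(0), l))  // RDD derived from another RDD
    // Each RDD records the transformations (lineage) that produced it.
    // Persistence is controllable, e.g. non-serialized in-memory storage:
    pairs.persist(StorageLevel.MEMORY_ONLY)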


Page 5: Spark

Spark

Exposes RDDs through a language-integrated API.

RDDs can be used in actions, which return a value to the application or export data to a storage system (e.g. count, collect and save).

The persist method indicates which RDDs should be kept for reuse (by default, they are stored in memory).
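A hedged sketch of these actions (in the current Scala API the save action corresponds to saveAsTextFile; paths are placeholders):

    val words = sc.textFile("hdfs://...").flatMap(_.split(" "))
    words.persist()                     // reuse this RDD (default: in memory)
    val total = words.count()           // action: returns a value to the driver
    val all   = words.collect()         // action: returns all elements to the driver
    words.saveAsTextFile("hdfs://...")  // action: exports to a storage system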


Page 6: Spark

Data Sharing in MapReduce

Overhead: replication, serialization and disk I/O!


Page 7: Spark

Data Sharing in Spark

In-memory data sharing is 10-100x faster than network and disk.


Page 8: Spark

Example - Log Mining

Load error messages from a log into memory, then search interactively for patterns.

Scans 1 TB in 5-7 s (vs. 170 s for on-disk data).
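A sketch in the spirit of the example in the RDD paper (the log path and the tab-separated field layout are assumptions):

    val lines  = sc.textFile("hdfs://...")            // full log in stable storage
    val errors = lines.filter(_.startsWith("ERROR"))  // keep only error messages
    errors.persist()                                  // loaded into memory on first use
    errors.count()                                    // action: materializes the RDD
    errors.filter(_.contains("HDFS"))                 // search for a pattern
          .map(_.split('\t')(3))                      // extract, e.g., a time field
          .count()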


Page 9: Spark

Fault Tolerance

RDDs keep information about the transformations used to build them. This lineage can be used to recompute lost partitions.
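For example, if a partition of a filtered RDD is lost, Spark reapplies the filter only to the corresponding partition of the parent. The recorded lineage can be inspected with toDebugString (a method in the Spark Scala API; the path is a placeholder):

    val lines  = sc.textFile("hdfs://...")
    val errors = lines.filter(_.startsWith("ERROR"))
    // A lost partition of `errors` is rebuilt by re-running the filter
    // on the corresponding partition of `lines`:
    println(errors.toDebugString)       // prints the lineage graph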


Page 10: Spark

Example - Logistic Regression

Many machine learning algorithms are iterative in nature: they run optimization procedures, such as gradient descent, repeatedly over the same data!

In Hadoop, each gradient computation is a separate MapReduce step that reloads the data.

In Spark, the data is loaded into memory only once!
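A self-contained sketch of this pattern, modeled on the paper's logistic regression example (parsePoint, the input format and the iteration/dimension counts are assumptions):

    import scala.math.exp
    import scala.util.Random

    case class Point(x: Array[Double], y: Double)     // assumed data layout

    def parsePoint(line: String): Point = {           // hypothetical parser for
      val v = line.split(" ").map(_.toDouble)         // "label f1 f2 ..." lines
      Point(v.tail, v.head)
    }

    def dot(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (u, v) => u * v }.sum

    val iterations = 10                               // assumed
    val dimensions = 10                               // assumed
    val points = sc.textFile("hdfs://...").map(parsePoint).persist() // loaded once
    var w = Array.fill(dimensions)(Random.nextDouble) // random initial weights

    for (i <- 1 to iterations) {                      // every iteration reuses the
      val gradient = points.map { p =>                // in-memory `points` RDD
        val s = (1 / (1 + exp(-p.y * dot(w, p.x))) - 1) * p.y
        p.x.map(_ * s)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      w = w.zip(gradient).map { case (u, v) => u - v }
    }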


Page 11: Spark

Logistic Regression Performance

● 30 GB dataset
● 20 machines with 4 cores and 15 GB of RAM each
● Hadoop: 127 s per iteration
● Spark: 174 s for the first iteration, 6 s per iteration afterwards


Page 12: Spark

Representing RDDs

Narrow dependencies allow pipelined execution.

Wide dependencies require data from all parent partitions.

Wide dependencies are also harder to recover from!
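A small sketch of the difference (assuming a SparkContext `sc`): map and filter create narrow dependencies that can be pipelined, while groupByKey creates a wide dependency that shuffles data:

    val nums   = sc.parallelize(1 to 1000000)
    val pairs  = nums.map(n => (n % 100, n))  // narrow dependency: pipelined
    val evens  = pairs.filter(_._2 % 2 == 0)  // narrow dependency: pipelined
    val groups = evens.groupByKey()           // wide dependency: shuffles data
                                              // from all parent partitions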


Page 13: Spark

Evaluation - Iteration times

[Chart: iteration times. Annotations: extra MapReduce job to convert data to binary; heartbeat protocol; computation-intensive workload.]


Page 14: Spark

Evaluation - number of machines

[Chart: speedup vs. number of machines. 25.3x & 20.7x for logistic regression; 1.9x & 3.2x for k-means.]


Page 15: Spark

Evaluation - Partitioning

The PageRank algorithm on a 54 GB Wikipedia dataset, building a link graph of 4 million articles.
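A hedged sketch of the optimization evaluated here: hash-partitioning the link lists once, so that each iteration's join with the ranks requires no shuffling of the links (the input path, format and partition count are assumptions):

    import org.apache.spark.HashPartitioner

    val links = sc.textFile("hdfs://...")             // "url<TAB>neighbour" lines
      .map { line => val p = line.split("\t"); (p(0), p(1)) }
      .groupByKey()
      .partitionBy(new HashPartitioner(100))          // fixed partitioning
      .persist()
    var ranks = links.mapValues(_ => 1.0)             // inherits the partitioning
    // joins of `ranks` with `links` are now local to each partition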


Page 16: Spark

Evaluation - Failures

100 GB working set


Page 17: Spark

Conclusion

Spark is up to 20x faster than Hadoop for iterative applications, by avoiding I/O and serialization costs.

Can interactively scan 1 TB of data with 5-7 s latency.

Quick recovery: only lost RDD partitions are rebuilt.

Pregel/HaLoop can be built on top of Spark.

Good for batch applications that apply the same operation to all elements of a dataset.


Page 18: Spark

References

● M. Zaharia et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", NSDI 2012
● SlideShare: /Hadoop_Summit/spark-and-shark