Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

UC Berkeley
Motivation

MapReduce greatly simplified “big data” analysis on large, unreliable clusters. But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
» More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)
Motivation

Complex apps and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing. In MapReduce, the only way to share data across jobs is stable storage, which is slow!
Examples

[Figure: an iterative job reads its input from HDFS, then writes to and re-reads from HDFS between iterations 1, 2, …; separately, each of queries 1–3 re-reads the same input from HDFS to produce results 1–3]

Slow due to replication and disk I/O, but necessary for fault tolerance
Goal: In-Memory Data Sharing

[Figure: the same workloads with data shared in memory: iterations 1, 2, … pass data directly, and one-time processing of the input serves queries 1–3 from RAM]

10-100× faster than network/disk, but how do we get fault tolerance?
Challenge

How to design a distributed memory abstraction that is both fault-tolerant and efficient?
Challenge

Existing storage abstractions have interfaces based on fine-grained updates to mutable state:
» RAMCloud, databases, distributed memory, Piccolo

This requires replicating data or logs across nodes for fault tolerance:
» Costly for data-intensive apps
» 10-100× slower than memory write
Solution: Resilient Distributed Datasets (RDDs)

Restricted form of distributed shared memory:
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, …)

Efficient fault recovery using lineage:
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails
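As an aside, the lineage mechanism can be sketched in a few lines. The `RDD` class below is a hypothetical, single-machine Python illustration (not Spark's actual API): each dataset records only the coarse-grained operation and parent that built it, so a lost in-memory copy can be recomputed rather than restored from a replica.

```python
class RDD:
    """Toy immutable dataset that records its lineage (parent + operation)."""
    def __init__(self, compute, parent=None):
        self._compute = compute   # function: parent data -> this data
        self._parent = parent     # lineage pointer (None for a source)
        self._cache = None        # in-memory copy, droppable at any time

    def map(self, f):
        return RDD(lambda data: [f(x) for x in data], parent=self)

    def filter(self, p):
        return RDD(lambda data: [x for x in data if p(x)], parent=self)

    def collect(self):
        # If the cached copy was "lost", recompute it from lineage.
        if self._cache is None:
            parent_data = self._parent.collect() if self._parent else None
            self._cache = self._compute(parent_data)
        return self._cache

def source(data):
    return RDD(lambda _: list(data))

nums = source([1, 2, 3, 4])
evens = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(evens.collect())   # [20, 40]
evens._cache = None      # simulate losing the in-memory partition
print(evens.collect())   # [20, 40] again, rebuilt via lineage
```

Dropping `_cache` stands in for a node failure; because the operations are deterministic, the recomputed data is identical to what was lost.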
RDD Recovery

[Figure: the in-memory sharing diagrams from the previous slides (one-time processing feeding queries 1–3, and the iterative pipeline), illustrating recovery of RDDs]
Generality of RDDs

Despite their restrictions, RDDs can express surprisingly many parallel algorithms:
» These naturally apply the same operation to many items

Unify many current programming models:
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, …

Support new apps that these models don’t
Tradeoff Space

[Figure: granularity of updates (fine to coarse) vs. write throughput (low to high). Fine-grained systems (K-V stores, databases, RAMCloud) are best for transactional workloads and are limited by network bandwidth; coarse-grained systems are best for batch workloads, with HDFS at low throughput and RDDs at high throughput, limited by memory bandwidth]
Outline

» Spark programming interface
» Implementation
» Demo
» How people are using Spark
Spark Programming Interface

DryadLINQ-like API in the Scala language, usable interactively from the Scala interpreter. Provides:
» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new RDDs) and actions (compute and output results)
» Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)
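The transformation/action split can be illustrated with a small sketch. This is plain Python, not Spark's API; `text_file`, `filter_rdd`, and `count` are hypothetical stand-ins showing that transformations are lazy (they only build a plan) while actions force evaluation.

```python
# Plain-Python sketch (not Spark's API): transformations only build a
# description of the computation; actions are what actually run it.
evaluated = []

def text_file(lines):
    def scan():
        evaluated.append("scan")   # record when real work happens
        return list(lines)
    return scan

def filter_rdd(parent, pred):
    return lambda: [x for x in parent() if pred(x)]   # lazy: a new plan

def count(dataset):                                   # action: forces work
    return len(dataset())

lines = text_file(["ERROR disk", "INFO ok", "ERROR net"])
errors = filter_rdd(lines, lambda s: s.startswith("ERROR"))
assert evaluated == []       # nothing has run yet
assert count(errors) == 2    # the action triggers the whole pipeline
assert evaluated == ["scan"]
```

Since this sketch has no caching, each action would re-run the pipeline from the source, which is exactly why `persist()` matters in the log-mining example that follows.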
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

```scala
lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count      // action
messages.filter(_.contains("bar")).count
```

[Figure: the master sends tasks to three workers, each reading a block of the input; the workers cache the filtered messages (Msgs. 1–3) in memory and return results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
Fault Recovery

RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

```scala
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))
```

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
Fault Recovery Results

[Figure: iteration times (s) over 10 iterations: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59; a failure happens at iteration 6, raising its time to 81 s]
Example: PageRank

1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

```scala
links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
```
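The same loop can be mirrored on a single machine. The sketch below is illustrative Python with a hypothetical three-page link graph, not Spark code: the inner loops play the roles of join, flatMap, and reduceByKey.

```python
# Single-machine Python mirror of the PageRank loop above (illustrative;
# the three-page link graph is a made-up example).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {url: 1.0 for url in links}          # 1. start each rank at 1

ITERATIONS = 10
for _ in range(ITERATIONS):
    contribs = {}                            # (dest, contribution) pairs
    for url, neighbors in links.items():     # plays the role of join
        for dest in neighbors:               # ... and flatMap
            contribs[dest] = contribs.get(dest, 0.0) + ranks[url] / len(neighbors)
    ranks = contribs                         # ... and reduceByKey(_ + _)

print(ranks)
```

Note that the slide's simplified update rule has no damping factor, so each page redistributes its full rank and the total rank is conserved across iterations.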
Optimizing Placement

links and ranks are repeatedly joined. Co-partitioning them (e.g. hashing both on URL) avoids shuffles; app knowledge can also be exploited, e.g., hashing on DNS name.

```scala
links = links.partitionBy(new URLPartitioner())
```

[Figure: dataflow across iterations: Links (url, neighbors) is joined with Ranks0 (url, rank), contributions (Contribs0) are reduced into Ranks1, joined again to form Contribs2, reduced into Ranks2, …]
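Why co-partitioning removes the shuffle can be seen in a small sketch. This is illustrative Python, not Spark's Partitioner API; `partition_by` is a hypothetical helper that hashes keys into a fixed number of partitions.

```python
# Illustrative sketch (not Spark): when two datasets are partitioned by
# the same hash function on the join key, every key's records land in
# the same partition index of both datasets, so the join can run
# partition-by-partition with no shuffle.
NUM_PARTITIONS = 4

def partition_by(pairs, num=NUM_PARTITIONS):
    parts = [dict() for _ in range(num)]
    for key, value in pairs:
        parts[hash(key) % num][key] = value
    return parts

links = partition_by([("a", ["b"]), ("b", ["a"]), ("c", ["a", "b"])])
ranks = partition_by([("a", 1.0), ("b", 1.0), ("c", 1.0)])

# Local (per-partition) join: no data crosses partition boundaries.
joined = []
for lp, rp in zip(links, ranks):
    for url in lp:
        assert url in rp          # co-partitioning guarantees co-location
        joined.append((url, (lp[url], rp[url])))

print(sorted(joined))
```

Because both datasets use the same partitioning function, matching keys always fall in the same partition, so each join task reads only its own local pair of partitions.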
PageRank Performance

[Figure: time per iteration (s) for Hadoop (72), basic Spark (23), and Spark + controlled partitioning]
Implementation

Runs on Mesos [NSDI 11] to share clusters with Hadoop. Can read from any Hadoop input source (HDFS, S3, …).

[Figure: Spark, Hadoop, and MPI running side by side on Mesos across a set of nodes]

No changes to the Scala language or compiler:
» Reflection + bytecode analysis to correctly ship code

www.spark-project.org
Programming Models Implemented on Spark

RDDs can express many existing parallel models:
» MapReduce, DryadLINQ
» Pregel graph processing [200 LOC]
» Iterative MapReduce [200 LOC]
» SQL: Hive on Spark (Shark) [in progress]

All of these are based on coarse-grained operations, and apps can efficiently intermix these models.
Demo
Open Source Community

15 contributors, 5+ companies using Spark, 3+ application projects at Berkeley.

User applications:
» Data mining 40× faster than Hadoop (Conviva)
» Exploratory log analysis (Foursquare)
» Traffic prediction via EM (Mobile Millennium)
» Twitter spam classification (Monarch)
» DNA sequence analysis (SNAP)
» …
Related Work

RAMCloud, Piccolo, GraphLab, parallel DBs:
» Fine-grained writes requiring replication for resilience

Pregel, iterative MapReduce:
» Specialized models; can’t run arbitrary / ad-hoc queries

DryadLINQ, FlumeJava:
» Language-integrated “distributed dataset” API, but cannot share datasets efficiently across queries

Nectar [OSDI 10]:
» Automatic expression caching, but over a distributed FS

PacMan [NSDI 12]:
» Memory cache for HDFS, but writes still go to network/disk
Conclusion

RDDs offer a simple and efficient programming model for a broad range of applications. They leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery.

Try it out at www.spark-project.org
Behavior with Insufficient RAM

[Figure: iteration time (s) vs. percent of working set in memory: 0% → 68.8, 25% → 58.1, 50% → 40.7, 75% → 29.7, 100% → 11.5]
Scalability

[Figure: iteration time (s) vs. number of machines (25 / 50 / 100).
Logistic Regression: Hadoop 184 / 111 / 76, HadoopBinMem 116 / 80 / 62, Spark 15 / 6 / 3.
K-Means: Hadoop 274 / 157 / 106, HadoopBinMem 197 / 121 / 87, Spark 143 / 61 / 33]
Breaking Down the Speedup

[Figure: iteration time (s) for text input / binary input: in-memory HDFS 15.4 / 8.4, in-memory local file 13.1 / 6.9, Spark RDD 2.9 / 2.9]
Spark Operations

Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program): collect, reduce, count, save, lookupKey
Task Scheduler

Dryad-like DAGs. Pipelines functions within a stage. Locality & data reuse aware. Partitioning-aware to avoid shuffles.

[Figure: an example DAG over RDDs A–G with groupBy, map, union, and join operations, divided into three stages at shuffle boundaries; cached data partitions are marked and need not be recomputed]
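The "pipelines functions within a stage" point can be sketched as follows. This is illustrative Python, not the real scheduler: `pipeline` is a hypothetical helper that fuses narrow per-element operations (map, filter) into one pass over the data, while the trailing groupBy, which needs data from every record, marks the stage boundary.

```python
# Illustrative sketch (not the real scheduler): narrow operations like
# map and filter are pipelined within a stage (composed into a single
# pass per element), while a wide operation like groupBy ends the stage.
def pipeline(*fns):
    """Fuse per-element steps into one function; no intermediate lists.
    By convention here, a step returning None drops the element."""
    def fused(x):
        for fn in fns:
            x = fn(x)
            if x is None:
                return None
        return x
    return fused

stage = pipeline(
    lambda x: x * 2,                                  # map
    lambda x: x if x > 4 else None,                   # filter
    lambda x: ("even" if x % 4 == 0 else "odd", x),   # map to (key, value)
)

# Run the fused stage, then the stage-boundary groupBy (a shuffle).
records = [1, 2, 3, 4, 5]
shuffled = {}
for rec in records:
    out = stage(rec)
    if out is not None:
        shuffled.setdefault(out[0], []).append(out[1])

print(shuffled)
```

Fusing the three steps means each record is touched once and no intermediate collection is materialized, which is the point of pipelining within a stage.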