spark
DESCRIPTION
Spark: Fast, Interactive, Language-Integrated Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica. UC Berkeley.
TRANSCRIPT
![Page 1: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/1.jpg)
Spark: Fast, Interactive, Language-Integrated Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
UC Berkeley
![Page 2: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/2.jpg)
Background
MapReduce and its variants greatly simplified big data analytics by hiding scaling and faults
However, these systems provide a restricted programming model
Can we design similarly powerful abstractions for a broader class of applications?
![Page 3: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/3.jpg)
Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage
[Diagram: Input → Map tasks → Reduce tasks → Output]
![Page 4: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/4.jpg)
Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage
[Diagram: Input → Map tasks → Reduce tasks → Output]
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures
![Page 5: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/5.jpg)
Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query
![Page 6: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/6.jpg)
Spark Goal
Efficiently support apps with working sets
»Let them keep data in memory
Retain the attractive properties of MapReduce:
»Fault tolerance (for crashes & stragglers)
»Data locality
»Scalability
Solution: extend the data flow model with "resilient distributed datasets" (RDDs)
![Page 7: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/7.jpg)
Outline
Spark programming model
Applications
Implementation
Demo
![Page 8: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/8.jpg)
Programming Model
Resilient distributed datasets (RDDs)
»Immutable, partitioned collections of objects
»Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
»Can be cached for efficient reuse
Actions on RDDs
»count, reduce, collect, save, …
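The transformation/action split above can be sketched locally with plain Scala collections (no Spark needed): `view` keeps `filter` and `map` lazy, like RDD transformations, while `size` plays the role of an action that forces evaluation. The `RddSketch` object and its `countErrors` helper are illustrative names, not part of Spark.

```scala
object RddSketch {
  // "Transformations" stay lazy on the view; the "action" (size) forces them
  def countErrors(lines: Seq[String]): Int =
    lines.view
      .filter(_.startsWith("ERROR"))   // transformation: lazy, nothing runs yet
      .map(_.split('\t')(2))           // transformation: lazy
      .size                            // action: forces evaluation

  def main(args: Array[String]): Unit = {
    val lines = Seq("ERROR\tdisk\tfull", "INFO\tok\tfine", "ERROR\tnet\tdown")
    println(countErrors(lines))  // 2
  }
}
```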
![Page 9: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/9.jpg)
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to workers; each worker reads an HDFS block of the base RDD, caches its partition of the transformed RDD, and returns results for each action]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
![Page 10: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/10.jpg)
RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions

Ex:
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

[Lineage diagram: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD]
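A minimal local sketch of the lineage idea: each dataset records its parent and the function applied, so a lost partition can be rebuilt by replaying the chain from stable storage. The `Lineage`, `Source`, `Filtered`, and `Mapped` names below are illustrative, not Spark's actual internals.

```scala
sealed trait Lineage { def compute(): Seq[String] }

// Leaf of the lineage chain: re-read from stable storage
case class Source(data: Seq[String]) extends Lineage {
  def compute(): Seq[String] = data
}
// Each derived dataset remembers its parent and the function applied
case class Filtered(parent: Lineage, p: String => Boolean) extends Lineage {
  def compute(): Seq[String] = parent.compute().filter(p)
}
case class Mapped(parent: Lineage, f: String => String) extends Lineage {
  def compute(): Seq[String] = parent.compute().map(f)
}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val log = Source(Seq("error\tA\tx", "ok\tB\ty", "error\tC\tz"))
    val cached = Mapped(Filtered(log, _.contains("error")), _.split('\t')(2))
    // Suppose the cached partition is lost: just recompute from lineage
    println(cached.compute().mkString(","))  // x,z
  }
}
```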
![Page 11: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/11.jpg)
Example: Logistic Regression
Goal: find best line separating two sets of points

[Plot: two classes of points (+ and –), a random initial line, and the target separating line]
![Page 12: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/12.jpg)
Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
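The loop above can be reproduced as a self-contained plain-Scala program on a tiny hand-made dataset; `readPoint`, `Vector.random`, `D`, and the RDD from the slide are replaced by illustrative stand-ins, while the gradient formula itself is the same, computed with local `map`/`reduce`.

```scala
object LogReg {
  case class Point(x: Array[Double], y: Double)  // y is +1 or -1

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Same update as the slide: w -= sum over points of the gradient term
  def train(data: Seq[Point], w0: Array[Double], iterations: Int): Array[Double] = {
    var w = w0
    for (_ <- 1 to iterations) {
      val gradient = data.map { p =>
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Point(Array(1.0, 2.0), 1.0), Point(Array(2.0, 1.5), 1.0),
      Point(Array(-1.0, -2.0), -1.0), Point(Array(-2.0, -1.0), -1.0))
    val w = train(data, Array(0.1, -0.1), 10)  // fixed "random" init
    // all four points should now be on the correct side of the line
    println(data.forall(p => p.y * dot(w, p.x) > 0))  // true
  }
}
```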
![Page 13: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/13.jpg)
Logistic Regression Performance
[Chart: running time per iteration, Hadoop vs Spark]
Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s
![Page 14: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/14.jpg)
Spark Applications
Twitter spam classification (Monarch)
In-memory analytics on Hive data (Conviva)
Traffic prediction using EM (Mobile Millennium)
K-means clustering
Alternating Least Squares matrix factorization
Network simulation
![Page 15: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/15.jpg)
Conviva GeoReport
Aggregations on many group keys w/ the same WHERE clause
Gains come from:
» Not re-reading unused columns
» Not re-reading filtered records
» Not repeating decompression
» In-memory storage of deserialized objects
[Chart: query time in hours]
![Page 16: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/16.jpg)
Generality of RDDs
RDDs can efficiently express many proposed cluster programming models
»MapReduce => map and reduceByKey operations
»Dryad => Spark runs general DAGs of tasks
»Pregel iterative graph processing => Bagel
»SQL => Hive on Spark (Hark?)
Can also express apps that none of these can
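The first mapping above can be sketched locally: word count, MapReduce's canonical example, written as `map` plus a `reduceByKey` emulated on a local `Seq` via `groupBy`. The `reduceByKey` helper here is an illustrative stand-in, not Spark's API.

```scala
object WordCount {
  // Illustrative local stand-in for Spark's reduceByKey: group pairs by
  // key, then reduce each key's values with the given function
  def reduceByKey[K, V](pairs: Seq[(K, V)], f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    val words = "to be or not to be".split(" ").toSeq
    // The "map" phase emits (word, 1); the "reduce" phase sums per key
    val counts = reduceByKey(words.map(w => (w, 1)), (a: Int, b: Int) => a + b)
    println(counts("to"))  // 2
    println(counts("or"))  // 1
  }
}
```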
![Page 17: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/17.jpg)
ImplementationSpark runs on the Mesos cluster manager, letting it share resources with Hadoop & other apps
Can read from any Hadoop input source (e.g. HDFS)
[Diagram: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes]
~7000 lines of code; no changes to Scala compiler
![Page 18: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/18.jpg)
Language Integration
Scala closures are Serializable Java objects
»Serialize on master, load & run on workers
Not quite enough:
»Nested closures may reference entire outer scope
»May pull in non-Serializable variables not used inside
»Solution: bytecode analysis + reflection
Other tricks using custom serialized forms (e.g. "accumulators" as syntactic sugar for counters)
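A minimal local sketch of the accumulator idea: each worker adds into its own fresh copy, and the driver merges the per-worker results, which is what a custom serialized form makes possible. The `Accumulator` class and its API below are illustrative, not Spark's exact interface.

```scala
class Accumulator(private var value: Int = 0) {
  def +=(n: Int): Unit = value += n            // local add on a worker
  def merge(other: Accumulator): Unit = value += other.get  // driver-side merge
  def get: Int = value
}

object AccDemo {
  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2), Seq(3, 4, 5))
    // Each "worker" counts its partition into a fresh accumulator...
    val perWorker = partitions.map { part =>
      val acc = new Accumulator()
      part.foreach(_ => acc += 1)
      acc
    }
    // ...and the "driver" merges the per-worker results
    val total = new Accumulator()
    perWorker.foreach(total.merge)
    println(total.get)  // 5
  }
}
```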
![Page 19: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/19.jpg)
Interactive Spark
Modified the Scala interpreter to allow Spark to be used interactively from the command line
Required two changes:
»Modified wrapper code generation so that each line typed has references to objects for its dependencies
»Distribute generated classes over the network
Enables in-memory exploration of big data
![Page 20: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/20.jpg)
Demo
![Page 21: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/21.jpg)
Conclusion
Spark's resilient distributed datasets are a simple programming model for a wide range of apps
Download our open source release at
www.spark-project.org
![Page 22: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/22.jpg)
Related Work
DryadLINQ
»Build queries through language-integrated SQL operations on lazy datasets
»Cannot have a dataset persist across queries
Relational databases
»Lineage/provenance, logical logging, materialized views
Piccolo
»Parallel programs with shared distributed hash tables; similar to distributed shared memory
Iterative MapReduce (Twister and HaLoop)
»Cannot define multiple distributed datasets, run different map/reduce pairs on them, or query data interactively
![Page 23: Spark](https://reader033.vdocuments.us/reader033/viewer/2022051623/5681584e550346895dc5ab11/html5/thumbnails/23.jpg)
Related Work
Distributed shared memory (DSM)
»Very general model allowing random reads/writes, but hard to implement efficiently (needs logging or checkpointing)
RAMCloud
»In-memory storage system for web applications
»Allows random reads/writes and uses logging like DSM
Nectar
»Caching system for DryadLINQ programs that can reuse intermediate results across jobs
»Does not provide in-memory caching, explicit control over which data is cached, or control over partitioning
SMR (functional Scala API for Hadoop)