Distributed Computing with Spark and MapReduce
Source: rezab/dao/notes/Intro_Spark.pdf · 2018-05-17
TRANSCRIPT
Reza Zadeh
Distributed Computing with Spark and MapReduce
@Reza_Zadeh | http://reza-zadeh.com
Problem
Data growing faster than processing speeds
Only solution is to parallelize on large clusters
» Wide use in both enterprises and web industry
How do we program these things?
Outline
Data flow vs. traditional network programming
Limitations of MapReduce
Spark computing engine
Current state of the Spark ecosystem
Built-in libraries
Data flow vs. traditional network programming
Traditional Network Programming
Message passing between nodes (e.g. MPI)
Very difficult to do at scale:
» How to split the problem across nodes?
  • Must consider network & data locality
» How to deal with failures? (inevitable at scale)
» Even worse: stragglers (a node that has not failed, but is slow)
» Ethernet networking is not fast
» Have to write programs for each machine
Rarely used in commodity datacenters
Disk vs Memory
L1 cache reference:            0.5 ns
L2 cache reference:              7 ns
Mutex lock/unlock:             100 ns
Main memory reference:         100 ns
Disk seek:              10,000,000 ns
Network vs Local
Send 2K bytes over 1 Gbps network: 20,000 ns
Read 1 MB sequentially from memory: 250,000 ns
Round trip within same datacenter: 500,000 ns
Read 1 MB sequentially from network: 10,000,000 ns
Read 1 MB sequentially from disk: 30,000,000 ns
Send packet CA->Netherlands->CA: 150,000,000 ns
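These magnitudes are easier to internalize as ratios. A quick sketch in plain Python (the nanosecond figures are taken directly from the tables above; the dictionary keys are just my labels):

```python
# Back-of-envelope ratios from the latency tables above.
LATENCY_NS = {
    "main_memory_ref": 100,
    "read_1mb_memory": 250_000,
    "disk_seek": 10_000_000,
    "read_1mb_disk": 30_000_000,
}

def slowdown(slow, fast):
    """How many times slower one operation is than another."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]

print(slowdown("read_1mb_disk", "read_1mb_memory"))  # 120.0
print(slowdown("disk_seek", "main_memory_ref"))      # 100000.0
```

A sequential 1 MB read is "only" 120x slower from disk than from memory, but a random disk seek costs five orders of magnitude more than a memory reference — which is why shuffling state through disk between passes dominates iterative jobs.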
Data Flow Models
Restrict the programming interface so that the system can do more automatically
Express jobs as graphs of high-level operators
» System picks how to split each operator into tasks and where to run each task
» Runs parts twice for fault recovery
Biggest example: MapReduce
[diagram: three Map tasks feeding two Reduce tasks]
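As a minimal illustration of the model, here is the classic word count in plain Python standing in for a real cluster (`map_fn`, `reduce_fn`, and `mapreduce` are my own stand-in names, not a framework API):

```python
from collections import defaultdict

# Single-process sketch of the MapReduce model: map emits (key, value)
# pairs, the system groups values by key, reduce folds each group.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    return (key, sum(values))

def mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for rec in records:                  # "map" phase
        for k, v in map_fn(rec):
            groups[k].append(v)          # "shuffle": group by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # "reduce" phase

counts = mapreduce(["to be or not to be"], map_fn, reduce_fn)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

On a cluster, the map and reduce calls run as parallel tasks and the grouping step becomes a network shuffle; the program's logic is unchanged.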
MapReduce + GFS
Most of early Google infrastructure; tremendously successful
Replicate disk content 3 times, sometimes 8
Rewrite algorithms for MapReduce
Diagram of typical cluster http://insightdataengineering.com/blog/pipeline_map.html
Example MapReduce Algorithms
» Matrix-vector multiplication
» Power iteration (e.g. PageRank)
» Gradient descent methods
» Stochastic SVD
» Tall-skinny QR
Many others!
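Matrix-vector multiplication, for instance, fits the model directly: map each nonzero entry A[i][j] to a partial product keyed by row, then reduce by summing per row. A single-process sketch (function and variable names are illustrative):

```python
from collections import defaultdict

def matvec_mapreduce(entries, x):
    """entries: (i, j, a_ij) triples of a sparse matrix A; returns y = A @ x."""
    partials = defaultdict(float)
    # map: each entry emits (row i, a_ij * x_j); reduce: sum partials per row
    for i, j, a in entries:
        partials[i] += a * x[j]
    return dict(partials)

A = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]  # the matrix [[2, 1], [0, 3]]
x = [1.0, 2.0]
print(matvec_mapreduce(A, x))  # {0: 4.0, 1: 6.0}
```

Power iteration (and hence PageRank) is just this multiplication applied repeatedly — which is exactly where MapReduce's per-pass I/O cost, discussed below, starts to hurt.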
Why Use a Data Flow Engine?
Ease of programming
» High-level functions instead of message passing
Wide deployment
» More common than MPI, especially "near" the data
Scalability to the very largest clusters
» Even the HPC world is now concerned about resilience
Examples: Pig, Hive, Scalding, Storm
Limitations of MapReduce
Limitations of MapReduce
MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms
No efficient primitives for data sharing
» State between steps goes to the distributed file system
» Slow due to replication & disk storage
Example: Iterative Apps
[diagram: an iterative job — file system read → iter. 1 → file system write → file system read → iter. 2 → … ; an interactive job — queries 1–3 each re-read the same input from the file system to produce results 1–3]
Commonly spend 90% of time doing I/O
Example: PageRank
Repeatedly multiply a sparse matrix and a vector
Requires repeatedly hashing together the page adjacency lists and the rank vector
[diagram: Neighbors (id, edges) joined with Ranks (id, rank) at iterations 1, 2, 3 — the same file is grouped over and over]
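A minimal power-iteration sketch of the computation (plain Python, damping factor omitted for brevity; in MapReduce, each loop iteration becomes a full job that re-reads `links` from the file system):

```python
def pagerank(links, iters=10):
    """links: {page: [outgoing neighbors]}; undamped power iteration."""
    ranks = {p: 1.0 / len(links) for p in links}
    for _ in range(iters):
        contribs = {p: 0.0 for p in links}
        # "map"/join: each page sends rank / out-degree to its neighbors
        for page, outs in links.items():
            for n in outs:
                contribs[n] += ranks[page] / len(outs)
        ranks = contribs  # "reduce": contributions summed per page
    return ranks

links = {"a": ["b"], "b": ["a"]}
print(pagerank(links))  # {'a': 0.5, 'b': 0.5}
```

Note the structure: `links` never changes across iterations, yet a MapReduce implementation must reload and re-shuffle it every pass — the data-sharing gap that motivates keeping it in memory.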
Result While MapReduce is simple, it can require asymptotically more communication or I/O
Verdict
MapReduce algorithms research doesn't go to waste; it just gets sped up and made easier to use
Still useful to study as an algorithmic framework, but silly to use directly
Spark computing engine
Spark Computing Engine
Extends a programming language with a distributed collection data structure
» "Resilient Distributed Datasets" (RDDs)
Open source at Apache
» Most active community in big data, with 50+ companies contributing
Clean APIs in Java, Scala, Python, R
Resilient Distributed Datasets (RDDs)
Main idea: Resilient Distributed Datasets
» Immutable collections of objects, spread across a cluster
» Statically typed: RDD[T] has objects of type T

```scala
val sc = new SparkContext()
val lines = sc.textFile("log.txt") // RDD[String]

// Transform using standard collection operations
val errors = lines.filter(_.startsWith("ERROR")) // lazily evaluated
val messages = errors.map(_.split('\t')(2))      // lazily evaluated

messages.saveAsTextFile("errors.txt") // kicks off the computation
```
Key Idea
Resilient Distributed Datasets (RDDs)
» Collections of objects across a cluster with user-controlled partitioning & storage (memory, disk, ...)
» Built via parallel transformations (map, filter, ...)
» The world only lets you make RDDs such that they can be: automatically rebuilt on failure
Python, Java, Scala, R

```scala
// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
```

```java
// Java (better in Java 8!):
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
```
Fault Tolerance

```python
# (Python 2-style tuple parameter in the final lambda, as on the slide)
file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)
```

[diagram: input file → map → reduce → filter]
RDDs track lineage info to rebuild lost data
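A toy sketch of the lineage idea (my own minimal stand-in, not Spark's API): each dataset remembers only its parent and the transformation that produced it, so a lost result can be recomputed from the source instead of being replicated to disk:

```python
class ToyRDD:
    """Toy stand-in for an RDD: stores lineage, not materialized data."""
    def __init__(self, parent, transform):
        self.parent, self.transform = parent, transform

    def map(self, f):
        return ToyRDD(self, lambda data: [f(x) for x in data])

    def filter(self, f):
        return ToyRDD(self, lambda data: [x for x in data if f(x)])

    def compute(self):
        # Replay the lineage chain from the source -- this is how a
        # lost partition is rebuilt after a failure.
        base = self.parent.compute() if isinstance(self.parent, ToyRDD) else self.parent
        return self.transform(base)

source = ToyRDD([1, 2, 3, 4], lambda data: data)
result = source.map(lambda x: x * 10).filter(lambda x: x > 15)
print(result.compute())  # [20, 30, 40]
```

Real RDDs track lineage per partition and cache materialized results, so recomputation after a failure touches only the lost pieces.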
Partitioning

```python
file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda (type, count): count > 10)
```

[diagram: input file → map → reduce → filter; the reduceByKey output is known to be hash-partitioned, and the filter output's partitioning is also known]
RDDs know their partitioning functions
Spark in this class
Training distribution and data in the first homework, also on the class webpage
Databricks Cloud to try a real cluster: third week, handing out clusters to all of you
Download it yourself! spark.apache.org
Other Data-flow
» Graph computations: Pregel, GraphLab
» SQL-based engines: Hive, Pig, …
… jury still out?
Benefit for Users
Same engine performs data extraction, model training and interactive queries
[diagram: with separate engines, each stage (parse, train, query) is bracketed by a DFS read and a DFS write; with Spark, a single DFS read feeds parse, train, and query directly]
State of the Spark ecosystem
Spark Community
Most active open source community in big data
200+ developers, 50+ companies contributing
[chart: contributors in the past year — Spark vs. Giraph and Storm; axis runs 0–150]
Project Activity
[charts: activity in the past 6 months for MapReduce, YARN, HDFS, Storm, and Spark — commits (axis 0–1,600) and lines of code changed (axis 0–350,000)]
Continuing Growth
[chart: contributors per month to Spark; source: ohloh.net]
Built-in libraries
Standard Library for Big Data
Big data apps lack libraries of common algorithms
Spark's generality + support for multiple languages make it suitable to offer this
[diagram: SQL, ML, graph, … libraries on top of Core, with Python, Scala, Java, and R bindings]
Much of future activity will be in these libraries
A General Platform
Standard libraries included with Spark:
» Spark Core
» Spark Streaming (real-time)
» Spark SQL (structured data)
» GraphX (graph)
» MLlib (machine learning)
Machine Learning Library (MLlib)
40 contributors in past year

```python
points = context.sql("select latitude, longitude from tweets")
model = KMeans.train(points, 10)
```
MLlib algorithms
» classification: logistic regression, linear SVM, naïve Bayes, classification tree
» regression: generalized linear models (GLMs), regression tree
» collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
» clustering: k-means||
» decomposition: SVD, PCA
» optimization: stochastic gradient descent, L-BFGS
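To make the k-means entry concrete, here is one Lloyd iteration in plain Python on 1-D points (a sketch of the algorithm, not MLlib's implementation — MLlib's k-means|| additionally parallelizes this and uses a smarter initialization):

```python
def kmeans_step(points, centers):
    """One Lloyd iteration: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
        clusters[nearest].append(p)
    return [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = [0.0, 5.0]
for _ in range(5):
    centers = kmeans_step(points, centers)
print(centers)  # roughly [1.0, 9.0]
```

The assignment pass is a map over points and the center update is a reduce per cluster, so each iteration fits the data-flow model — and benefits from the input staying cached in memory across iterations.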
GraphX
General graph processing library
Build graphs using RDDs of nodes and edges
Large library of graph algorithms with composable steps
Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
[diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as an RDD and processes it using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
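The chopping step above can be sketched in a few lines of plain Python (stand-in names, not the Spark Streaming API): a generator cuts the "live" stream into fixed-size batches, and each batch is then processed with an ordinary collection operation.

```python
# Toy sketch of micro-batching: chop a stream into fixed-size batches
# and run an ordinary (RDD-style) operation on each batch.
def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

stream = iter(range(7))
results = [sum(b) for b in micro_batches(stream, 3)]  # one "job" per batch
print(results)  # [3, 12, 6]
```

In Spark Streaming the batch boundary is a time interval rather than a count, and each batch becomes an RDD, so the same fault-tolerance and scheduling machinery applies unchanged.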
Spark Streaming
• Batch sizes as low as ½ second, latency ~1 second
• Potential for combining batch processing and streaming processing in the same system
Spark SQL

```scala
// Run SQL statements
val teenagers = context.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are RDDs of Row objects
val names = teenagers.map(t => "Name: " + t(0)).collect()
```
Spark SQL
Enables loading & querying structured data in Spark

From Hive:
```python
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()
```

From JSON (tweets.json):
```
{"text": "hi", "user": {"name": "matei", "id": 123}}
```
```python
c.jsonFile("tweets.json").registerAsTable("tweets")
c.sql("select text, user.name from tweets")
```