High-Speed In-Memory Analytics over Hadoop and Hive Data
TRANSCRIPT
Source: poloclub.gatech.edu/cse6242/2015spring/slides/CSE6242-9-ScalingUp3-spark.pdf
Slides adapted from Matei Zaharia (MIT) and Oliver Vagner (Manheim, GT)
Spark & Spark SQL: High-Speed In-Memory Analytics over Hadoop and Hive Data
CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech
Instructor: Duen Horng (Polo) Chau
What is Spark? Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop
Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
http://spark.apache.org
What is Spark SQL? (Formerly called Shark)
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc)
Similar speedups of up to 40x
Project History [latest: v1.1]
Spark project started in 2009 at UC Berkeley AMP lab, open sourced in 2010
Became Apache Top-Level Project in Feb 2014
Shark/Spark SQL started summer 2011
Built by 250+ developers and people from 50 companies
Scales to 1000+ nodes in production
In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, …
http://en.wikipedia.org/wiki/Apache_Spark
Why a New Programming Model?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries
Both require faster data sharing across parallel jobs
Is MapReduce dead? Up for debate… as of 10/7/2014
http://www.reddit.com/r/compsci/comments/296aqr/on_the_death_of_mapreduce_at_google/
http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/
Data Sharing in MapReduce
[Diagram: iterative jobs write each iteration's output to HDFS and read it back for the next iteration; each ad-hoc query re-reads the input from HDFS before producing its result.]
Slow due to replication, serialization, and disk I/O
Data Sharing in Spark
[Diagram: the input is processed once into distributed memory; iterations and queries then share the in-memory data directly.]
10-100× faster than network and disk
Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure
Interface
» Clean language-integrated API in Scala
» Can be used interactively from the Scala or Python console
» Supported languages: Java, Scala, Python
Scala: http://www.scala-lang.org/old/faq/4
Java vs Scala: http://www.toptal.com/scala/why-should-i-learn-scala
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count

[Diagram: the driver ships tasks to three workers; each worker reads its HDFS block, applies the transformations, caches the resulting messages (Cache 1-3), and returns results. lines is the base RDD, errors and messages are transformed RDDs, and count is the action that triggers execution.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
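For readers without a cluster, the dataflow above can be mimicked in plain Python. This is only a sketch of the semantics, not Spark: the log lines are hypothetical stand-ins for the HDFS file, and list comprehensions play the role of filter/map while `sum` plays the count action.

```python
# Pure-Python sketch of the log-mining dataflow (no cluster, no Spark).
# LOG is a made-up stand-in for the HDFS input; each line is "LEVEL\ttime\tmessage".
LOG = [
    "ERROR\t12:00\tdisk full",
    "INFO\t12:01\tok",
    "ERROR\t12:02\tfoo failed",
    "ERROR\t12:03\tbar failed",
]

lines = LOG
errors = [l for l in lines if l.startswith("ERROR")]   # filter(_.startsWith("ERROR"))
messages = [l.split("\t")[2] for l in errors]          # map(_.split('\t')(2))
cached_msgs = messages  # in Spark, .cache() would pin this in cluster memory

foo_count = sum("foo" in m for m in cached_msgs)       # action: count
bar_count = sum("bar" in m for m in cached_msgs)       # reuses the cached data
print(foo_count, bar_count)
```

The point of the cache is that the second count reuses `cached_msgs` instead of re-reading and re-parsing the raw log, which is exactly what makes the repeated interactive queries on the slide fast.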
Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.: messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
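The idea can be sketched as a chain of replayable recipes. This toy Python class is invented for illustration (it is not Spark's API): each "RDD" stores its parent's recipe and the function applied to it, so lost data can always be rebuilt by recomputing from the source.

```python
# Toy sketch of RDD lineage: each node remembers how to recompute itself
# from its parent, so nothing needs to be replicated to survive a failure.
class ToyRDD:
    def __init__(self, compute):
        self.compute = compute  # a replayable recipe, not materialized data

    def filter(self, f):
        return ToyRDD(lambda: [x for x in self.compute() if f(x)])

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self.compute()])

source = ["ERROR\t1\tbad disk", "WARN\t2\tslow", "ERROR\t3\ttimeout"]
base = ToyRDD(lambda: list(source))                    # like HadoopRDD
messages = base.filter(lambda l: "ERROR" in l) \
               .map(lambda l: l.split("\t")[2])        # Filtered → Mapped

first = messages.compute()   # materialize the result
again = messages.compute()   # "recompute after failure": same recipe, same result
print(first)
```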
Example: Word Count (Python)
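The code on this slide is an image and did not survive transcription. As a stand-in, here is a pure-Python sketch of the standard flatMap → map → reduceByKey word-count pipeline; the input lines are made up, and the comments show the corresponding Spark operation.

```python
from collections import defaultdict

lines = ["to be or not to be", "to do"]               # hypothetical input file

words = [w for line in lines for w in line.split()]   # flatMap(lambda l: l.split())
pairs = [(w, 1) for w in words]                       # map(lambda w: (w, 1))

counts = defaultdict(int)                             # reduceByKey(lambda a, b: a + b)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```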
Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()  // load data in memory once
var w = Vector.random(D)                               // initial parameter vector
// Repeated MapReduce steps to do gradient descent:
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
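The same loop can be sketched in plain Python on a tiny made-up dataset, with `w dot p.x` written as an explicit sum. The gradient term mirrors the Scala expression on the slide; the data, dimensions, and iteration count are arbitrary choices for illustration.

```python
import math
import random

# Tiny hypothetical dataset: points (x, y) with label y in {-1, +1}.
DATA = [((1.0, 2.0), 1), ((2.0, 1.0), 1), ((-1.0, -1.5), -1), ((-2.0, -1.0), -1)]
D, ITERATIONS = 2, 50

random.seed(0)
w = [random.random() for _ in range(D)]          # var w = Vector.random(D)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

for _ in range(ITERATIONS):
    gradient = [0.0] * D
    for x, y in DATA:                            # data.map(...).reduce(_ + _)
        scale = (1 / (1 + math.exp(-y * dot(w, x))) - 1) * y
        for j in range(D):
            gradient[j] += scale * x[j]
    w = [wj - gj for wj, gj in zip(w, gradient)] # w -= gradient

print("Final w:", w)
```

In Spark, the inner sum over points is the only distributed step; because `data` is cached, each of the 50 passes reuses the in-memory dataset instead of re-reading it, which is where the speedup on the next slide comes from.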
Logistic Regression Performance
[Chart: running time (s), 0-4000, vs number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: 127 s / iteration. Spark: 174 s first iteration, 6 s further iterations.]
Supported Operators
map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
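Most of these have straightforward pure-Python analogues; here is a sketch of the semantics of a few of the less obvious key-value operators on hypothetical data. The function names mirror the operators but this is not Spark's API.

```python
from collections import defaultdict

a = [("x", 1), ("y", 2), ("x", 3)]
b = [("x", "left"), ("z", "right")]

def reduce_by_key(pairs, f):
    # reduceByKey: combine all values that share a key with f.
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

def cogroup(p, q):
    # cogroup: for every key, collect the values from each side.
    groups = defaultdict(lambda: ([], []))
    for k, v in p:
        groups[k][0].append(v)
    for k, v in q:
        groups[k][1].append(v)
    return dict(groups)

def join(p, q):
    # inner join: keep only keys that appear on both sides.
    return {k: (vs, ws) for k, (vs, ws) in cogroup(p, q).items() if vs and ws}

print(reduce_by_key(a, lambda u, v: u + v))  # {'x': 4, 'y': 2}
print(join(a, b))                            # {'x': ([1, 3], ['left'])}
```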
Spark Users
Use Cases
In-memory analytics & anomaly detection (Conviva)
Interactive queries on data streams (Quantifind)
Exploratory log analysis (Foursquare)
Traffic estimation w/ GPS data (Mobile Millennium)
Twitter spam classification (Monarch)
. . .
Conviva GeoReport
Group aggregations on many keys w/ same filter
40× gain over Hive; avoids repeated reading, deserialization, and filtering
[Chart: time in hours: Hive 20, Spark 0.5]
http://en.wikipedia.org/wiki/Conviva
![Page 44: High?Speed’In?Memory’Analytics overHadoop’and’Hive’Datapoloclub.gatech.edu/cse6242/2015spring/slides/CSE6242-9-ScalingUp3-spark.pdf · WhatisSpark? Not’a’modified’version’of’Hadoop’](https://reader034.vdocuments.us/reader034/viewer/2022042107/5e86489f0983cd60de1352da/html5/thumbnails/44.jpg)
Mobile Millennium Project
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
Estimates city traffic from crowdsourced GPS data
Iterative EM algorithm scaling to 160 nodes
Spark SQL: Hive on Spark
Motivation
Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes
Scala is good for programmers, but many data users only know SQL
Can we extend Hive to run on Spark?
Hive Architecture
[Diagram: Client (CLI, JDBC) → Driver (SQL Parser → Query Optimizer → Physical Plan → Execution) → MapReduce, with a Metastore alongside and data in HDFS.]
Spark SQL Architecture
[Diagram: same structure as Hive: Client (CLI, JDBC) → Driver (SQL Parser → Query Optimizer → Physical Plan → Execution), but with a Cache Manager added and execution on Spark instead of MapReduce.]
[Engle et al, SIGMOD 2012]
Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead
Instead, Spark SQL employs column-oriented storage using arrays of primitive types

Row storage: (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4)
Column storage: [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4]

Benefit: similarly compact size to serialized data, but >5x faster to access
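The trade-off can be sketched with Python's `array` module, which packs primitives unboxed into a flat buffer (analogous to Spark SQL's columnar cache) instead of one heap object per record. The three-row data is the slide's example.

```python
from array import array

# Row storage: one Python object (a tuple) per record; each field is boxed.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column storage: one flat primitive array per column.
ids    = array("i", [1, 2, 3])         # packed machine ints, no per-value boxing
names  = ["john", "mike", "sally"]
scores = array("d", [4.1, 3.5, 6.4])   # packed 64-bit doubles

# Scanning one column touches only that column's contiguous buffer.
avg_score = sum(scores) / len(scores)
print(round(avg_score, 2))
```

Aggregating `scores` never deserializes ids or names, which is the same reason the columnar cache avoids per-object overhead for analytic scans.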
Using Shark
CREATE TABLE mydata_cached AS SELECT …
Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported
Can also call from Scala to mix with Spark
Benchmark Query 1
SELECT * FROM grep WHERE field LIKE '%XYZ%';
Benchmark Query 2
SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;
What's Next?
Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)
Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.
Streaming Spark
Extends Spark to perform streaming computations
Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
Intermixes seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLower.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Diagram: the input stream is chopped into batches T=1, T=2, …; each batch flows through map, then reduceByWindow.]
[Zaharia et al, HotCloud 2012]
Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency
Spark Streaming

Creates and operates on RDDs from live data streams at set intervals

Data is divided into batches for processing

Streams may be combined as part of processing or analyzed with higher-level transforms
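To illustrate just the batching step (no Spark involved), here is a minimal Python sketch that slices a stream of hypothetical timestamped events into fixed-width batches, the way a stream is cut at each set interval:

```python
def batch_by_interval(events, interval):
    """Group (timestamp, value) events into fixed-width time batches."""
    batches = {}
    for t, value in events:
        batches.setdefault(int(t // interval), []).append(value)
    # Return batches in time order; each one would become an RDD.
    return [batches[k] for k in sorted(batches)]

# Made-up events: (seconds since start, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
print(batch_by_interval(events, interval=1.0))  # [['a', 'b'], ['c'], ['d']]
```

Each resulting batch is then processed by normal (batch) Spark operations, which is why streaming and ad-hoc queries intermix so easily.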
Behavior with Not Enough RAM

Iteration time (s) by % of working set in memory:
» Cache disabled: 68.8
» 25% cached: 58.1
» 50% cached: 40.7
» 75% cached: 29.7
» Fully cached: 11.5
SPARK PLATFORM

» Languages: Scala / Python / Java
» Libraries: Spark SQL / Shark, Spark Streaming, MLlib, GraphX
» Execution: RDD
» Resource Management: YARN / Spark standalone / Mesos
» Data Storage: Standard FS / HDFS / CFS / S3
MLlib

Scalable machine learning library

Interoperates with NumPy

Available algorithms in 1.0:
» Linear Support Vector Machine (SVM)
» Logistic Regression
» Linear Least Squares
» Decision Trees
» Naïve Bayes
» Collaborative Filtering with ALS
» K-means
» Singular Value Decomposition
» Principal Component Analysis
» Gradient Descent
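To make one of the listed techniques concrete, here is a toy plain-Python gradient descent fitting a one-variable least-squares model. This illustrates the algorithm only, not MLlib's API, and the data is made up:

```python
def gradient_descent(xs, ys, lr=0.01, steps=2000):
    """Fit y = w * x by minimizing mean squared error with gradient descent."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) is mean(2 * (w*x - y) * x).
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w

# Toy data generated from y = 2x, so the fitted weight should approach 2.0.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w = gradient_descent(xs, ys)
print(round(w, 3))  # 2.0
```

In MLlib the same loop runs distributed: each iteration computes partial gradients per partition and sums them, which is exactly the iterative workload that in-memory RDD caching speeds up.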
GraphX

Parallel graph processing

Extends RDD → Resilient Distributed Property Graph
» Directed multigraph with properties attached to each vertex and edge

Limited algorithms:
» PageRank
» Connected Components
» Triangle Counts

Alpha component
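Connected components can be sketched in a few lines. The label-propagation loop below is plain Python over a made-up graph, not GraphX's distributed implementation, but it shows the idea: every vertex repeatedly adopts the smallest vertex id reachable through its edges.

```python
def connected_components(edges, vertices):
    """Label propagation: endpoints of each edge take the smaller label, until stable."""
    label = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            low = min(label[a], label[b])
            if label[a] != low or label[b] != low:
                label[a] = label[b] = low
                changed = True
    return label

# Hypothetical graph: {1,2,3} form one component, {4,5} another.
vertices = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (4, 5)]
print(connected_components(edges, vertices))  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

GraphX runs the analogous computation as iterated message passing over a distributed property graph, with each iteration materialized on RDDs.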
Commercial Support

Databricks
» Not to be confused with DataStax
» Founded by members of the AMPLab
» Offerings:
  • Certification
  • Training
  • Support
  • Databricks Cloud