introduction to spark shannon quinn (with thanks to paco nathan and databricks)
TRANSCRIPT
![Page 1: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/1.jpg)
Introduction to SparkShannon Quinn
(with thanks to Paco Nathan and Databricks)
![Page 2: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/2.jpg)
Quick Demo
![Page 3: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/3.jpg)
Quick Demo
![Page 4: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/4.jpg)
API Hooks• Scala / Java– All Java libraries– *.jar– http://www.scala-lang.org
• Python– Anaconda: https://store.continuum.io/cshop/anaconda/
![Page 5: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/5.jpg)
Introduction
![Page 6: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/6.jpg)
Spark Structure• Start Spark on a cluster• Submit code to be run on it
![Page 7: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/7.jpg)
![Page 8: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/8.jpg)
![Page 9: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/9.jpg)
![Page 10: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/10.jpg)
![Page 11: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/11.jpg)
![Page 12: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/12.jpg)
![Page 13: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/13.jpg)
![Page 14: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/14.jpg)
![Page 15: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/15.jpg)
![Page 16: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/16.jpg)
Another Perspective
![Page 17: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/17.jpg)
Step by step
![Page 18: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/18.jpg)
Step by step
![Page 19: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/19.jpg)
Step by step
![Page 20: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/20.jpg)
Example: WordCount
![Page 21: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/21.jpg)
Example: WordCount
![Page 22: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/22.jpg)
Limitations of MapReduce• Performance bottlenecks—not all jobs can be cast as batch processes–Graphs?
• Programming in Hadoop is hard–Boilerplate boilerplate everywhere
![Page 23: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/23.jpg)
Initial Workaround: Specialization
![Page 24: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/24.jpg)
Along Came Spark• Spark’s goal was to generalize MapReduce to support new applications within the same engine• Two additions:–Fast data sharing–General DAGs (directed acyclic graphs)
• Best of both worlds: easy to program & more efficient engine in general
![Page 25: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/25.jpg)
Codebase Size
![Page 26: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/26.jpg)
More on Spark• More general–Supports map/reduce paradigm–Supports vertex-based paradigm–General compute engine (DAG)
• More API hooks–Scala, Java, and Python
• More interfaces–Batch (Hadoop), real-time (Storm), and interactive (???)
![Page 27: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/27.jpg)
Interactive Shells• Spark creates a
SparkContext object (cluster information)• For either shell: sc• External programs use a static constructor to instantiate the context
![Page 28: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/28.jpg)
Interactive Shells• spark-shell --master
![Page 29: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/29.jpg)
Interactive Shells• Master connects to the cluster manager, which allocates resources across applications• Acquires executors on cluster nodes: worker processes to run computations and store data• Sends app code to executors• Sends tasks for executors to run
![Page 30: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/30.jpg)
Resilient Distributed Datasets (RDDs)• Resilient Distributed Datasets (RDDs) are primary data abstraction in Spark–Fault-tolerant–Can be operated on in parallel1. Parallelized Collections2. Hadoop datasets
• Two types of RDD operations1. Transformations (lazy)2. Actions (immediate)
![Page 31: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/31.jpg)
Resilient Distributed Datasets (RDDs)
![Page 32: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/32.jpg)
Resilient Distributed Datasets (RDDs)• Can create RDDs from any file stored in HDFS–Local filesystem–Amazon S3–HBase
• Text files, SequenceFiles, or any other Hadoop InputFormat• Any directory or glob– /data/201414*
![Page 33: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/33.jpg)
Resilient Distributed Datasets (RDDs)• Transformations–Create a new RDD from an existing one–Lazily evaluated: results are not immediately computed• Pipeline of subsequent transformations can be optimized• Lost data partitions can be recovered
![Page 34: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/34.jpg)
Resilient Distributed Datasets (RDDs)
![Page 35: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/35.jpg)
Resilient Distributed Datasets (RDDs)
![Page 36: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/36.jpg)
Resilient Distributed Datasets (RDDs)
![Page 37: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/37.jpg)
Resilient Distributed Datasets (RDDs)
![Page 38: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/38.jpg)
Closures in Java
![Page 39: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/39.jpg)
Resilient Distributed Datasets (RDDs)• Actions–Create a new RDD from an existing one–Eagerly evaluated: results are immediately computed• Applies previous transformations• (cache results?)
![Page 40: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/40.jpg)
Resilient Distributed Datasets (RDDs)
![Page 41: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/41.jpg)
Resilient Distributed Datasets (RDDs)
![Page 42: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/42.jpg)
Resilient Distributed Datasets (RDDs)
![Page 43: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/43.jpg)
Resilient Distributed Datasets (RDDs)• Spark can persist / cache an RDD in memory across operations• Each slice is persisted in memory and reused in subsequent actions involving that RDD• Cache provides fault-tolerance: if partition is lost, it will be recomputed using transformations that created it
![Page 44: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/44.jpg)
Resilient Distributed Datasets (RDDs)
![Page 45: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/45.jpg)
Resilient Distributed Datasets (RDDs)
![Page 46: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/46.jpg)
Broadcast Variables• Spark’s version of Hadoop’s
DistributedCache• Read-only variable cached on each node• Spark [internally] distributed broadcast variables in such a way to minimize communication cost
![Page 47: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/47.jpg)
Broadcast Variables
![Page 48: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/48.jpg)
Accumulators• Spark’s version of Hadoop’s Counter• Variables that can only be added through an associative operation• Native support of numeric accumulator types and standard mutable collections–Users can extend to new types
• Only driver program can read accumulator value
![Page 49: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/49.jpg)
Accumulators
![Page 50: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/50.jpg)
Key/Value Pairs
![Page 51: Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks)](https://reader036.vdocuments.us/reader036/viewer/2022062407/56649cef5503460f949be509/html5/thumbnails/51.jpg)
Resources• Original slide deck: http://cdn.liber118.com/workshop/itas_workshop.pdf• Code samples:–https://gist.github.com/ceteri/f2c3486062c9610eac1d–https://gist.github.com/ceteri/8ae5b9509a08c08a1132–https://gist.github.com/ceteri/11381941