Apache Spark: A Unified Engine for Big Data Processing
Presented by: Huanyi Chen
Apache Spark: A Unified Engine for Big Data Processing
§ Engine?
  § Convert one form of data into other useful forms
§ Unified?
  § Multiple types of conversions
Apache Spark: A Unified Engine for Big Data Processing
§ What is Apache Spark? (Engine)
§ How can it make multiple types of conversions over big data? (Unified)
What is Apache Spark?
§ A framework like MapReduce
§ Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs)
§ An RDD is a read-only, partitioned collection of records
§ Transformations
  § create RDDs (map, filter, join, etc.)
§ Actions
  § return a value to the application
  § or export data to a storage system
§ Persistence
  § Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage).
§ Partitioning
  § Users can ask that an RDD’s elements be partitioned across machines based on a key in each record.
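The four ideas above (transformations, actions, persistence, partitioning) can be sketched in a few lines of plain Python. This is a toy, single-process model written for illustration only: `ToyRDD`, `partition_by_key`, and every method name here are hypothetical and are not Spark's API. The point it tries to show is that transformations only describe work, while an action actually runs it.

```python
from typing import Callable, Iterable, List, Optional, Tuple


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute              # recipe for (re)building the records
        self._cache: Optional[List] = None   # filled only if persist() was requested
        self._persist = False

    @staticmethod
    def from_list(records: List) -> "ToyRDD":
        return ToyRDD(lambda: list(records))

    # Transformations: lazy; each returns a new ToyRDD and runs nothing.
    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._records()))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._records() if pred(x)))

    # Persistence: the user marks an RDD they plan to reuse.
    def persist(self) -> "ToyRDD":
        self._persist = True
        return self

    def _records(self) -> Iterable:
        if self._cache is not None:
            return self._cache
        records = list(self._compute())
        if self._persist:
            self._cache = records            # "in-memory storage" strategy
        return records

    # Actions: trigger computation and return a value to the application.
    def collect(self) -> List:
        return list(self._records())

    def count(self) -> int:
        return len(self.collect())


def partition_by_key(pairs: List[Tuple], num_partitions: int) -> List[List[Tuple]]:
    """Partitioning: place (key, value) records into partitions by hashing the key."""
    partitions: List[List[Tuple]] = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


calls = []


def square(x):
    calls.append(x)                          # record evaluations so laziness is visible
    return x * x


squares = ToyRDD.from_list([1, 2, 3, 4]).map(square).persist()
evens = squares.filter(lambda x: x % 2 == 0)
ran_before_action = list(calls)              # still empty: only transformations so far
result = evens.collect()                     # action: the pipeline runs now
count = squares.count()                      # served from the persisted records
parts = partition_by_key([(1, "a"), (2, "b"), (3, "c"), (1, "d")], 2)
```

Because `squares` was persisted, the second action does not call `square` again; without `persist()` it would be recomputed from the source list.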
§ Lineage
  § An RDD has enough information about how it was derived from other datasets.
§ Narrow dependencies
  § each partition of the parent RDD is used by at most one partition of the child RDD
§ Wide dependencies
  § multiple child partitions may depend on each partition of the parent RDD
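A small sketch may make the narrow/wide distinction and its effect on recovery more concrete. It is plain Python with hypothetical helper names (nothing here is Spark's API), and it assumes a map-style operation for the narrow case and a hash-partitioned group-by-key for the wide case. Dependencies are written as a mapping from each child partition to the parent partitions it reads.

```python
from typing import Callable, Dict, List, Tuple


def narrow_deps(num_parent: int) -> Dict[int, List[int]]:
    """map/filter-style: child partition i reads only parent partition i."""
    return {i: [i] for i in range(num_parent)}


def wide_deps(num_parent: int, num_child: int) -> Dict[int, List[int]]:
    """group-by-key-style: every child partition may read every parent partition."""
    return {c: list(range(num_parent)) for c in range(num_child)}


def is_narrow(deps: Dict[int, List[int]]) -> bool:
    """Check the definition: each parent partition is used by at most one child."""
    uses: Dict[int, int] = {}
    for parents in deps.values():
        for p in parents:
            uses[p] = uses.get(p, 0) + 1
    return all(n <= 1 for n in uses.values())


def recompute_map_partition(parent: List[List[Tuple]], fn: Callable, lost: int) -> List:
    """Lineage-based recovery, narrow case: only one parent partition is needed."""
    return [fn(record) for record in parent[lost]]


def recompute_group_partition(parent: List[List[Tuple]], num_child: int, lost: int) -> Dict:
    """Lineage-based recovery, wide case: all parent partitions must be scanned."""
    groups: Dict = {}
    for partition in parent:
        for key, value in partition:
            if hash(key) % num_child == lost:
                groups.setdefault(key, []).append(value)
    return groups


# Two parent partitions of (key, value) records; integer keys keep hashing stable.
parent = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
narrow = narrow_deps(2)
wide = wide_deps(2, 2)
remapped = recompute_map_partition(parent, lambda kv: (kv[0], kv[1].upper()), 1)
regrouped = recompute_group_partition(parent, 2, 1)
```

The asymmetry is the design point: a lost partition behind a narrow dependency is rebuilt from a single parent partition, while one behind a wide dependency needs data from all of them.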
Example: run an action on RDD G
MapReduce vs Spark
Figure: MapReduce Ecosystem vs. Spark Ecosystem
Higher-Level Libraries
SQL and DataFrames
§ DataFrames = RDDs + Schema = Tables
§ Spark SQL’s DataFrame API supports inline definition of user-defined functions (UDFs), without the complicated packaging and registration process found in other database systems.
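As a rough illustration of both bullets, the sketch below models a DataFrame as rows plus a schema and applies a UDF written inline as an ordinary function. `ToyDataFrame`, `with_column`, and `select` are hypothetical names in a plain-Python toy, not Spark SQL's API; in PySpark the corresponding entry point is `pyspark.sql.functions.udf`.

```python
from typing import Callable, List, Tuple


class ToyDataFrame:
    """Rows (the "RDD" part) plus column names (the "schema" part). Toy class only."""

    def __init__(self, rows: List[Tuple], columns: List[str]):
        self.rows = [tuple(r) for r in rows]
        self.columns = list(columns)

    def with_column(self, name: str, fn: Callable, *inputs: str) -> "ToyDataFrame":
        """Append a column computed by an inline user-defined function."""
        idx = [self.columns.index(c) for c in inputs]   # schema lookup by name
        new_rows = [row + (fn(*(row[i] for i in idx)),) for row in self.rows]
        return ToyDataFrame(new_rows, self.columns + [name])

    def select(self, *names: str) -> List[Tuple]:
        """Project the named columns, again resolved through the schema."""
        idx = [self.columns.index(c) for c in names]
        return [tuple(row[i] for i in idx) for row in self.rows]


people = ToyDataFrame([("alice", 34), ("bob", 19)], ["name", "age"])
# The UDF is just a lambda defined at the call site: no packaging, no registration.
labeled = people.with_column("is_adult", lambda age: age >= 21, "age")
result = labeled.select("name", "is_adult")
```

The schema is what lets columns be addressed by name, which is the part a bare collection of records does not give you.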
UDF in MySQL
UDF in Spark SQL
Spark Streaming
Figure: continuous operator processing model vs. discretized stream processing model
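The discretized stream model can be sketched as follows: incoming events are cut into short fixed-length intervals, and each interval is processed as a small batch job whose state is carried into the next one. This is a plain-Python illustration under simplifying assumptions (events arrive as `(timestamp, word)` pairs, a one-second interval, a running word count as the job); the function names are hypothetical and this is not the Spark Streaming API.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def discretize(events: Iterable[Tuple[float, str]], batch_seconds: float) -> List[List[str]]:
    """Split (timestamp, word) events into consecutive fixed-length micro-batches."""
    batches: Dict[int, List[str]] = {}
    for timestamp, word in events:
        batches.setdefault(int(timestamp // batch_seconds), []).append(word)
    if not batches:
        return []
    # Keep empty intervals so that each list position is exactly one time slice.
    return [batches.get(i, []) for i in range(min(batches), max(batches) + 1)]


def running_word_count(batches: List[List[str]]) -> List[Dict[str, int]]:
    """Run one small batch job per interval and carry the state forward."""
    state: Counter = Counter()
    snapshots = []
    for batch in batches:
        state.update(batch)              # the per-interval "batch job"
        snapshots.append(dict(state))    # result visible after this interval
    return snapshots


events = [(0.2, "spark"), (0.7, "rdd"), (1.1, "spark"), (3.4, "rdd")]
batches = discretize(events, batch_seconds=1.0)
snapshots = running_word_count(batches)
```

In a continuous operator model, by contrast, each record would flow through long-lived operators one at a time instead of being grouped into intervals first.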
GraphX
§ Cannot beat specialized graph-parallel systems on graph computation alone
§ But outperforms them on end-to-end graph analytics pipelines
MLlib
§ More than 50 common algorithms for distributed model training
§ Supports pipeline construction on Spark
§ Integrates well with other Spark libraries
Why use Apache Spark?
§ Ecosystem
§ Competitive performance
§ Low cost in sharing data
§ Low latency of MapReduce steps
§ Control over bottleneck resources
Apache Spark in 2016
§ Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.
§ From 2010 to 2016, Apache Spark grew to 1,000 contributors and thousands of deployments.
Apache Spark Today
Apache Spark: A Unified Engine for Big Data Processing
§ What is Apache Spark?
  § Apache Spark = MapReduce + RDDs
§ How can it make multiple types of conversions over big data?
  § Higher-level libraries enable Apache Spark to handle different types of big data workloads
“Try Apache Spark if you are new to the big data processing world”
Huanyi Chen
Q&A
§ What issues does persisting data in memory cause? For example, garbage collection?
§ What are the Parallel Random Access Machine model and the Bulk Synchronous Parallel model? Can these two models express any computation in the distributed world?
§ Will optimizing one library cause other libraries to lose performance?
§ Is using memory as storage really the next generation of storage?