![Page 1: Structured Streams in Spark 2 - files.meetup.comfiles.meetup.com/19158234/Slides Long Version.pdf · Spark 2.0 is the ALPHA RELEASE of Structured Streaming Structured Streaming provides](https://reader030.vdocuments.us/reader030/viewer/2022040608/5ec9194277aaaf6b67455115/html5/thumbnails/1.jpg)
Structured Streams in Spark 2.0
Long Tran
Overview
• RDDs
• RDD / Scala API
• D-Streams (0.7)
• SQL (1.3)
• DataFrames API
• Catalyst
• Tungsten (1.4)
• Structured Streams (finally!) (2.0)
• File API - future APIs
• Exactly Once semantics
https://spark.apache.org/
RDD
Resilient Distributed Dataset
A parallelized, lazily evaluated, directed acyclic graph of computation.
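The "lazily evaluated graph of computation" idea can be sketched in plain Python. This is a toy model, not Spark code: `LazyDataset`, `_apply`, and `tokenize` are hypothetical names, the lineage is a simple linear chain (the simplest kind of DAG), and nothing here is parallelized. It only shows that transformations record steps, and that work happens when an action such as `collect` is called.

```python
def _apply(kind, fn, records):
    """Wrap an iterator in one more lazy step; nothing is consumed here."""
    if kind == "map":
        return (fn(r) for r in records)
    if kind == "flat_map":
        return (out for r in records for out in fn(r))
    if kind == "filter":
        return (r for r in records if fn(r))
    raise ValueError(f"unknown step: {kind}")


class LazyDataset:
    """Toy stand-in for an RDD: a source plus a recorded lineage of steps."""

    def __init__(self, source, lineage=()):
        self._source = source      # callable that returns an iterable of records
        self._lineage = lineage    # tuple of (kind, function), oldest step first

    def _then(self, kind, fn):
        # Transformations never run anything; they return a new dataset
        # whose lineage has one more step.
        return LazyDataset(self._source, self._lineage + ((kind, fn),))

    def map(self, fn):
        return self._then("map", fn)

    def flat_map(self, fn):
        return self._then("flat_map", fn)

    def filter(self, pred):
        return self._then("filter", pred)

    def collect(self):
        # The action: replay the whole lineage from the source.
        records = iter(self._source())
        for kind, fn in self._lineage:
            records = _apply(kind, fn, records)
        return list(records)


calls = []  # records every line the tokenizer actually sees


def tokenize(line):
    calls.append(line)
    return line.split()


lines = LazyDataset(lambda: ["to be or", "not to be"])
words = lines.flat_map(tokenize).filter(lambda w: w != "or")
calls_before_action = list(calls)  # still empty: building the lineage ran nothing
result = words.collect()           # ["to", "be", "not", "to", "be"]
```

Because each dataset keeps its full lineage, calling `collect` a second time recomputes everything from the source. The same recorded lineage is what would let a lost partition be rebuilt in the real system.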
RDD + Scala Word Count
SCALA
SPARK
Spark Streaming
http://spark.apache.org/docs/latest/streaming-programming-guide.html
DStream API
DStream Word Count
SQL & DataFrames
API for computing structured data
DataFrames and SQL Word Count
DataFrames
SQL
Catalyst
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Catalyst
Catalyst
Tungsten
www.slideshare.net/databricks/2015-0616-spark-summit
Memory Management and Binary Processing
Tungsten
Cache-Aware Computation
Spark 2.0 is the ALPHA RELEASE of Structured Streaming
Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.
Classic Streaming
Continuous Applications
https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png
Programming Model
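The stream-as-a-table model referenced above can be illustrated with a toy in plain Python. This is a sketch under assumptions, not Spark code: `UnboundedTable`, `append_batch`, and `batch_word_count` are hypothetical names. It treats the stream as a table that only grows, updates a word-count result incrementally as each batch of rows arrives, and checks that the result matches a batch query over every row seen so far.

```python
from collections import Counter


class UnboundedTable:
    """Toy model of a stream as an input table that only ever grows."""

    def __init__(self):
        self.rows = []             # every line received so far (the input table)
        self._counts = Counter()   # the result table, maintained incrementally

    def append_batch(self, lines):
        """Simulate one trigger: new rows arrive, the result table is updated."""
        self.rows.extend(lines)
        for line in lines:
            # Only the new rows are processed; earlier rows are not rescanned.
            self._counts.update(line.split())
        return dict(self._counts)


def batch_word_count(rows):
    """The equivalent batch query: scan the whole table from scratch."""
    counts = Counter()
    for line in rows:
        counts.update(line.split())
    return dict(counts)


table = UnboundedTable()
after_first = table.append_batch(["cat dog", "dog"])   # {"cat": 1, "dog": 2}
after_second = table.append_batch(["owl cat"])         # {"cat": 2, "dog": 2, "owl": 1}

# The incremental result equals the batch query over all rows so far.
same_as_batch = after_second == batch_word_count(table.rows)
```

The point of the model is that the query is written once, as if over a static table, and the engine is responsible for keeping the result up to date as rows are appended.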
Structured Stream Word Count
end-to-end exactly once guarantees
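One common way to get an end-to-end exactly-once result is to combine a replayable source, a log of the offset range for each batch, and an idempotent sink. The toy below sketches that combination in plain Python. All names (`ReplayableSource`, `IdempotentSink`, `run_batch`, `offset_log`) are hypothetical and are not Spark APIs; the slides do not show the engine internals, so treat this as an illustration of the idea only.

```python
class ReplayableSource:
    """Source that can re-read any offset range, like a file or a log."""

    def __init__(self, records):
        self._records = list(records)

    def latest_offset(self):
        return len(self._records)

    def read(self, start, end):
        return self._records[start:end]


class IdempotentSink:
    """Sink that ignores a batch id it has already committed."""

    def __init__(self):
        self.committed = {}  # batch_id -> rows written for that batch

    def write(self, batch_id, rows):
        if batch_id in self.committed:
            return False     # duplicate delivery after a retry: do nothing
        self.committed[batch_id] = list(rows)
        return True


def run_batch(source, sink, offset_log, batch_id, max_records=2):
    """Process one batch; safe to call again with the same batch_id."""
    if batch_id not in offset_log:
        # Decide the offset range once and log it before doing any work.
        start = offset_log[batch_id - 1][1] if batch_id > 0 else 0
        end = min(start + max_records, source.latest_offset())
        offset_log[batch_id] = (start, end)
    # On a retry the logged range is reused, so the same rows are re-read.
    start, end = offset_log[batch_id]
    return sink.write(batch_id, source.read(start, end))


source = ReplayableSource(["a", "b", "c"])
sink = IdempotentSink()
offset_log = {}

first = run_batch(source, sink, offset_log, 0)   # True: rows "a", "b" written
retry = run_batch(source, sink, offset_log, 0)   # False: replay after a simulated crash is ignored
second = run_batch(source, sink, offset_log, 1)  # True: row "c" written

delivered = [row for bid in sorted(sink.committed) for row in sink.committed[bid]]
```

Even though batch 0 is processed twice, each record appears once in `delivered`: the replay reads the same offsets, and the sink drops the duplicate write.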
Conclusions
• DataFrame and SQL for streaming
• Catalyst!
• Tungsten!
• Unified API for batch and streaming (+ ML + GraphFrames)
• BIs, DBAs, Data Scientists can now do streaming!
• Exactly once guarantees
• No need to reason about intervals
• Event Time primitives
Future
• Current support for reading file streams only
• Kafka Integration (2.1)
• Public API for sources and sinks
• Watermarks
• ML Integration - continuously updated models
@LooooongTran
Sources
• https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
• https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
• https://www.youtube.com/watch?v=rl8dIzTpxrI
• https://www.youtube.com/watch?v=fn3WeMZZcCk
• https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
• https://www.oreilly.com/learning/apache-spark-2-0--introduction-to-structured-streaming
• https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
• https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
• https://www.youtube.com/watch?v=1a4pgYzeFwE
• https://www.youtube.com/watch?v=5ajs8EIPWGI
• http://www.kdnuggets.com/2016/05/spark-tungsten-burns-brighter.html
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-whole-stage-codegen.html