introduction to structured streaming
TRANSCRIPT
© 2015 IBM Corporation 1
! Agenda
- Spark Streaming 1.X • Features • Areas for Improvement
- Spark Streaming 2.0 – Structured Streaming • Addressing the Improvement Areas • API • Fault Tolerance • Event Time • Managing Streaming queries
- Structured Streaming Examples https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming
- Summary thoughts
© 2015 IBM Corporation 2
Spark Streaming 1. X
! Features of Spark Streaming - High Level API (stateful, joins, aggregates, windows etc.)
• Overlap with RDD API (batch) - Fault – Tolerant (exactly once semantics achievable) - Back Pressure - Deep Integration with Spark Ecosystem (MLlib, SQL, GraphX etc.)
!
Apache Hadoop Day 2015
© 2015 IBM Corporation 3
Spark Streaming 1. X – Areas of improvement
! Fault-tolerance For end-2-end exactly once guarantees, user needs to do all the heavy lifting in the Sink
Can that be handled in a very simple way for the end-user ?
Apache Hadoop Day 2015
© 2015 IBM Corporation 4
Fault-Tolerant Semantics
Exactly Once, If Outputs are Idempotent or transac6onal
Exactly Once, as long as received data is not lost
Exactly Once needs re-‐playable sources (e.g. Ka?a Direct)
Source
Receiver
Transforming
Outputting
Sink
© 2015 IBM Corporation 5
Spark Streaming 1. X – Areas of improvement
! Fault-tolerance - For end-2-end exactly once guarantees, user needs to do all the heavy lifting in the Sink
! API - Request for more seamless API between Batch & Stream - Reduce complexities of streaming app *
! No Event Time support - Hard to support when processing time/batch time exposed in externals
! Streaming Query Management
! Micro-batch
!
Apache Hadoop Day 2015
© 2015 IBM Corporation 6
Spark Streaming 2.0 API
! Built on top of Spark SQL Engine
! Implicit Benefits - Extend the primary Batch API even to Streaming - Gain an Optimizer and all other enhancements done in SparkSQL.
! Challenge - Remove/Keep streaming complexities to minimum
!
© 2015 IBM Corporation 7
Lets Dive in
© 2015 IBM Corporation 8
SQL Batch vs SQL Streaming- Conceptually
© 2015 IBM Corporation 9
Batch vs Streaming - Programmatically
© 2015 IBM Corporation 10
Output Modes - Sink
! Defined as what gets written from the Result table to external storage (Sink)
! Output modes
- Complete – Entire updated Result table is written to external storage.
- Append – Only new rows added in the Result table since last incremental query execution is written to external storage.
- Update - Only the rows updated in the Result table since last incremental query execution is written to external storage.
Upto implementation of Storage connector to decide how to write. * Aggregate queries only support complete mode and non-aggregate queries append mode
© 2015 IBM Corporation 11
Supported Sinks & Modes in 2.0
*DEBUG ONLY
*DEBUG ONLY
© 2015 IBM Corporation 12
Windowing in Structured Streaming
© 2015 IBM Corporation 13
Window operations ! Continuous time based aggregations are most common in Streaming applications.
- Sliding window & Tumbling window E.g. Top x hashtags on Twitter in last half hour, every 5 minutes
! New function that treats windowing as a regular aggregation
! Used in a Group By clause Can be used in Batch as well
© 2015 IBM Corporation 14
Event Time Windows ! Event-Time is time embedded within the data itself
It is not the time Spark received the data
! What about processing time windows if you want them
© 2015 IBM Corporation 15
Handling Late Arrival in Event-Time
! Since the ‘Result’ table is updated by Spark, the late data is put in its correct window group
! Use a normal filter in the SQL ?
! Watermarks
© 2015 IBM Corporation 16
Fault Tolerance
! Why Care?
! Different guarantees for Data Loss ! Atleast Once ! Exactly Once
! What all can fail? ! Driver ! Executor
© 2015 IBM Corporation 17
Spark 1.x Best Fault tolerance - Kafka Direct API
• Simplified Parallelism
• Less Storage Need
• Exactly Once Semantics. source & processing
Benefits of this approach
© 2015 IBM Corporation 18
Fault Tolerance in Structured Streaming
Active Driver
Checkpoint to HDFS
! Structured Streaming Checkpointing Decided Offsets ranges for a trigger interval is logged to checkpoint Directory *before* any processing is started for that trigger
Nth record in log indicates data that is currently being processed N-1 entry in log indicates offsets idempotent written to Sink Log entries are monotonically increasing integers
! On Recovery Restart processing of nth entry in WAL
© 2015 IBM Corporation 19
Fault Tolerance in Structured Streaming
! End-to-End Exactly Once guarantees with - idempotent Sinks (built-in for commonly used sinks e.g. Files / JDBC)
- Built-in Sources will *mostly* be only ones that support replay https://issues.apache.org/jira/browse/SPARK-15842
© 2015 IBM Corporation 20
Managing Streaming Queries
! Streaming in 1.x was definetly lacking in
- Starting / Stopping individual Streaming Queries
- Changing the computation done in a Query.
- When a Streaming Query abnormally terminates handle more gracefully than app crash.
© 2015 IBM Corporation 21
Managing Streaming Queries
© 2015 IBM Corporation 22
Managing Streaming Queries
© 2015 IBM Corporation 23
Summary ! Overall has a good set of features
- Easier code share between Batch and Streaming (No different type hierarchies)
- Window not tied to Batch interval
- No Streaming context
- Optimizer now available for your queries.
! Getting started - Combining of 3 things (Output Mode & Sink Type & Query type) needs some time to wrap your head around *
And not much control over those. - Only get Runtime exceptions when you mess with above
! How does it compare to Apache Beam ?
© 2015 IBM Corporation 24
For Each Sink
© 2015 IBM Corporation 25
Thank YOU