Introduction to Structured Streaming

© 2015 IBM Corporation

Agenda

- Spark Streaming 1.X
  - Features
  - Areas for Improvement

- Spark Streaming 2.0 – Structured Streaming
  - Addressing the Improvement Areas
  - API
  - Fault Tolerance
  - Event Time
  - Managing Streaming Queries

- Structured Streaming Examples https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming

- Summary thoughts


Spark Streaming 1.X

Features of Spark Streaming

- High-level API (stateful operations, joins, aggregates, windows, etc.)
  - Overlap with the RDD API (batch)
- Fault tolerant (exactly-once semantics achievable)
- Back pressure
- Deep integration with the Spark ecosystem (MLlib, SQL, GraphX, etc.)

Apache Hadoop Day 2015


Spark Streaming 1.X – Areas of Improvement

- Fault tolerance: for end-to-end exactly-once guarantees, the user has to do all the heavy lifting in the sink.

Can that be handled in a simple way for the end user?



Fault-Tolerant Semantics

Exactly once, if outputs are idempotent or transactional

Exactly once, as long as received data is not lost

Exactly once needs replayable sources (e.g. Kafka Direct)

[Diagram: Source → Receiver → Transforming → Outputting → Sink]
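Why idempotent output matters can be shown with a small, Spark-free sketch. Both sink classes and `run_with_replay` below are hypothetical illustrations, not Spark APIs; the assumption is that a failure after a write causes the same batch to be replayed with the same batch ID.

```python
class IdempotentSink:
    """Hypothetical sink: rows are upserted under a deterministic key."""

    def __init__(self):
        self.rows = {}

    def write_batch(self, batch_id, records):
        # Keying by (batch_id, position) means a replayed batch overwrites
        # its earlier rows instead of adding duplicates.
        for position, value in enumerate(records):
            self.rows[(batch_id, position)] = value


class AppendOnlySink:
    """Hypothetical sink: every write is blindly appended."""

    def __init__(self):
        self.rows = []

    def write_batch(self, batch_id, records):
        self.rows.extend(records)


def run_with_replay(sink):
    """Write one batch, then replay it as if the job had failed after the write."""
    batch = ["a", "b", "c"]
    sink.write_batch(0, batch)
    sink.write_batch(0, batch)  # replay of the same batch after recovery
    return len(sink.rows)


idempotent_count = run_with_replay(IdempotentSink())
append_count = run_with_replay(AppendOnlySink())
print(idempotent_count, append_count)
```

With the keyed sink the replay leaves three rows; with the append-only sink the same replay doubles them. This is the "heavy lifting" the user has to do in the sink in 1.X.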


Spark Streaming 1.X – Areas of Improvement

- Fault tolerance
  - For end-to-end exactly-once guarantees, the user has to do all the heavy lifting in the sink
- API
  - Requests for a more seamless API between batch and stream
  - Reduce the complexity of streaming applications
- No event-time support
  - Hard to support while processing time / batch time is exposed in the external API
- Streaming query management
- Micro-batch



Spark Streaming 2.0 API

- Built on top of the Spark SQL engine
- Implicit benefits
  - Extends the primary batch API to streaming
  - Gains the optimizer and all other enhancements made in Spark SQL
- Challenge
  - Remove streaming complexities, or keep them to a minimum


Let's Dive In


SQL Batch vs SQL Streaming - Conceptually


Batch vs Streaming - Programmatically


Output Modes - Sink

Defined as what gets written from the Result table to external storage (the sink).

Output modes

- Complete: the entire updated Result table is written to external storage.
- Append: only the new rows added to the Result table since the last incremental query execution are written to external storage.
- Update: only the rows updated in the Result table since the last incremental query execution are written to external storage.

It is up to the storage connector implementation to decide how to write.

* Aggregate queries support only Complete mode, and non-aggregate queries only Append mode.
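The difference between the three modes can be sketched without Spark by comparing the Result table before and after one trigger. The function `rows_to_write` and the dict-based table are hypothetical; treating newly added rows as "updated" in Update mode is an assumption of this sketch.

```python
def rows_to_write(previous, current, mode):
    """Pick the rows of the Result table that would be handed to the sink.

    previous / current: dicts mapping a row key to its value, representing
    the Result table before and after one incremental execution.
    """
    if mode == "complete":
        # The whole updated Result table.
        return dict(current)
    if mode == "append":
        # Only rows whose key did not exist before this trigger.
        return {k: v for k, v in current.items() if k not in previous}
    if mode == "update":
        # Rows that are new or whose value changed (assumption: new counts as updated).
        return {k: v for k, v in current.items() if previous.get(k) != v}
    raise ValueError("unknown output mode: " + mode)


# Word counts before and after a trigger that received "spark" and "beam".
before = {"spark": 2, "kafka": 1}
after = {"spark": 3, "kafka": 1, "beam": 1}

complete_rows = rows_to_write(before, after, "complete")
append_rows = rows_to_write(before, after, "append")
update_rows = rows_to_write(before, after, "update")
print(complete_rows, append_rows, update_rows)
```

Note that Append alone would never deliver the changed count for "spark", which is one way to see why an aggregate query, whose existing rows keep changing, does not fit Append mode.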


Supported Sinks & Modes in 2.0

*DEBUG ONLY



Windowing in Structured Streaming


Window Operations

- Continuous time-based aggregations are the most common in streaming applications
  - Sliding windows and tumbling windows, e.g. top x hashtags on Twitter in the last half hour, every 5 minutes
- A new function treats windowing as a regular aggregation
- Used in a Group By clause; can be used in batch as well
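Treating the window as just another grouping key can be sketched in plain Python. `windows_for` is a hypothetical helper, not Spark's window function; times are in minutes, and windows are assumed to be half-open intervals [start, end) aligned to multiples of the slide.

```python
from collections import Counter


def windows_for(event_time, window_length, slide):
    """Return every [start, end) window that contains event_time."""
    start = (event_time // slide) * slide  # latest window start <= event_time
    windows = []
    while start > event_time - window_length:
        windows.append((start, start + window_length))
        start -= slide
    return sorted(windows)


# (event time in minutes, hashtag)
events = [(61, "#spark"), (62, "#spark"), (64, "#kafka"), (71, "#spark")]

# Tumbling 10-minute windows: group by (window, hashtag) and count.
tumbling_counts = Counter()
for event_time, tag in events:
    for window in windows_for(event_time, window_length=10, slide=10):
        tumbling_counts[(window, tag)] += 1

# Sliding windows: 30 minutes long, evaluated every 5 minutes.
sliding_windows = windows_for(62, window_length=30, slide=5)
print(tumbling_counts)
print(sliding_windows)
```

A tumbling window is the special case where the slide equals the window length, so each event lands in exactly one group; with a 30-minute window sliding every 5 minutes, each event contributes to six groups.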


Event Time Windows

- Event time is the time embedded within the data itself; it is not the time Spark received the data
- What about processing-time windows, if you want them?


Handling Late Arrival in Event-Time

- Since the Result table is updated by Spark, late data is put in its correct window group
- Use a normal filter in the SQL?
- Watermarks
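The general idea of a watermark can be sketched as follows. This is a plain-Python illustration, not Spark's API: `process` and `allowed_lateness` are hypothetical names, and the watermark is assumed to advance per event to "largest event time seen minus allowed lateness" (a real engine may update it less often).

```python
from collections import Counter


def process(events, window_length, allowed_lateness):
    """Count events per tumbling event-time window, dropping data that is too late.

    events: iterable of (event_time, key) in arrival order.
    """
    counts = Counter()
    dropped = []
    max_event_time = None
    for event_time, key in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_lateness
        window_start = (event_time // window_length) * window_length
        window_end = window_start + window_length
        if window_end <= watermark:
            # The window closed before this event arrived: too late to count.
            dropped.append((event_time, key))
            continue
        # Late or not, the event is counted in the window its own time belongs to.
        counts[(window_start, key)] += 1
    return counts, dropped


arrivals = [(12, "a"), (25, "a"), (14, "a"), (48, "a"), (13, "a")]
counts, dropped = process(arrivals, window_length=10, allowed_lateness=20)
print(dict(counts), dropped)
```

The event at time 14 arrives after the one at 25 but still updates the [10, 20) group; the event at time 13 arrives after the watermark has passed 20 and is discarded. Without a bound like this, state for every old window would have to be kept forever.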


Fault Tolerance

- Why care?
- Different guarantees against data loss
  - At least once
  - Exactly once
- What can fail?
  - Driver
  - Executor


Spark 1.x Best Fault Tolerance - Kafka Direct API

Benefits of this approach

- Simplified parallelism
- Less storage needed
- Exactly-once semantics for source and processing


Fault Tolerance in Structured Streaming

[Diagram: active driver checkpointing to HDFS]

Structured Streaming checkpointing

- The offset ranges decided for a trigger interval are logged to the checkpoint directory *before* any processing starts for that trigger
- The Nth entry in the log indicates the data currently being processed
- The (N-1)th entry in the log indicates offsets already written idempotently to the sink
- Log entries are numbered with monotonically increasing integers

On recovery

- Restart processing of the Nth entry in the write-ahead log (WAL)
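The log-then-process protocol can be sketched in a few lines. Everything here is a hypothetical in-memory stand-in (`OffsetLog`, `run_trigger`, `recover`, a list as the replayable source, a dict keyed by batch ID as the idempotent sink); it only illustrates the ordering of steps, not Spark's implementation.

```python
class OffsetLog:
    """In-memory stand-in for the offset log in the checkpoint directory."""

    def __init__(self):
        self.entries = []  # entries[n] = (start_offset, end_offset) of trigger n

    def append(self, offset_range):
        self.entries.append(offset_range)
        return len(self.entries) - 1  # monotonically increasing batch id


def run_trigger(log, source, sink, end_offset, crash_before_write=False):
    start = log.entries[-1][1] if log.entries else 0
    batch_id = log.append((start, end_offset))  # logged *before* processing
    if crash_before_write:
        raise RuntimeError("driver crashed during trigger %d" % batch_id)
    sink[batch_id] = source[start:end_offset]  # keyed write, safe to repeat


def recover(log, source, sink):
    """Re-run the latest logged entry; earlier entries are already in the sink."""
    if not log.entries:
        return
    batch_id = len(log.entries) - 1
    start, end = log.entries[batch_id]
    sink[batch_id] = source[start:end]  # replay from the replayable source


source = list(range(10))  # replayable source: offsets 0..9
sink = {}                 # idempotent sink keyed by batch id
log = OffsetLog()

run_trigger(log, source, sink, end_offset=4)
try:
    run_trigger(log, source, sink, end_offset=7, crash_before_write=True)
except RuntimeError:
    recover(log, source, sink)
run_trigger(log, source, sink, end_offset=10)
print(sink)
```

Because the offset range is fixed in the log before any work happens, the recovered run reprocesses exactly the same offsets, and the keyed write makes it harmless even if the crash had happened after the original write.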


Fault Tolerance in Structured Streaming

End-to-end exactly-once guarantees with

- Idempotent sinks (built in for commonly used sinks, e.g. files / JDBC)
- Built-in sources will *mostly* be only the ones that support replay: https://issues.apache.org/jira/browse/SPARK-15842


Managing Streaming Queries

Streaming in 1.x was definitely lacking in

- Starting / stopping individual streaming queries
- Changing the computation done in a query
- Handling an abnormally terminated streaming query more gracefully than an application crash


Summary

- Overall, a good set of features
  - Easier code sharing between batch and streaming (no different type hierarchies)
  - Windows not tied to the batch interval
  - No streaming context
  - Optimizer now available for your queries
- Getting started
  - The combination of three things (output mode, sink type, and query type) takes some time to wrap your head around, and there is not much control over them
  - You only get runtime exceptions when you get the combination wrong
- How does it compare to Apache Beam?


For Each Sink


Thank You
