Introduction to Structured Streaming


Upload: datamantra

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Introduction to Structured Streaming

© 2015 IBM Corporation 1

Agenda

- Spark Streaming 1.X
  • Features
  • Areas for Improvement

- Spark Streaming 2.0 – Structured Streaming
  • Addressing the Improvement Areas
  • API
  • Fault Tolerance
  • Event Time
  • Managing Streaming queries

- Structured Streaming Examples
  https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming

- Summary thoughts

Page 2: Introduction to Structured Streaming


Spark Streaming 1.X

Features of Spark Streaming:

- High Level API (stateful, joins, aggregates, windows etc.)
  • Overlap with RDD API (batch)
- Fault-tolerant (exactly-once semantics achievable)
- Back Pressure
- Deep Integration with Spark Ecosystem (MLlib, SQL, GraphX etc.)

Apache Hadoop Day 2015

Page 3: Introduction to Structured Streaming


Spark Streaming 1.X – Areas of improvement

- Fault-tolerance
  • For end-to-end exactly-once guarantees, the user needs to do all the heavy lifting in the Sink

Can that be handled in a very simple way for the end user?

Page 4: Introduction to Structured Streaming


Fault-Tolerant Semantics

- Exactly once, if outputs are idempotent or transactional
- Exactly once, as long as received data is not lost
- Exactly once needs re-playable sources (e.g. Kafka Direct)

[Diagram: Source, Receiver, Transforming, Outputting, Sink]

Page 5: Introduction to Structured Streaming


Spark Streaming 1.X – Areas of improvement

- Fault-tolerance
  • For end-to-end exactly-once guarantees, the user needs to do all the heavy lifting in the Sink
- API
  • Request for a more seamless API between Batch & Stream
  • Reduce complexities of streaming app
- No Event Time support
  • Hard to support when processing time/batch time is exposed in externals
- Streaming Query Management
- Micro-batch

Page 6: Introduction to Structured Streaming


Spark Streaming 2.0 API

- Built on top of the Spark SQL engine
- Implicit benefits
  • Extends the primary Batch API to Streaming
  • Gains an optimizer and all other enhancements done in Spark SQL
- Challenge
  • Remove streaming complexities, or keep them to a minimum

Page 7: Introduction to Structured Streaming


Let's dive in

Page 8: Introduction to Structured Streaming


SQL Batch vs SQL Streaming - Conceptually

Page 9: Introduction to Structured Streaming


Batch vs Streaming - Programmatically

Page 10: Introduction to Structured Streaming


Output Modes - Sink

Defined as what gets written from the Result table to external storage (Sink).

Output modes:

- Complete – The entire updated Result table is written to external storage.
- Append – Only the new rows added to the Result table since the last incremental query execution are written to external storage.
- Update – Only the rows updated in the Result table since the last incremental query execution are written to external storage.

It is up to the implementation of the storage connector to decide how to write.

* Aggregate queries support only Complete mode, and non-aggregate queries only Append mode.
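The three output modes above can be illustrated with a small plain-Python sketch. This is not Spark code: the Result table is modelled as a dictionary, and the function name `rows_to_write` and the sample data are made up for illustration. It assumes that "updated" rows include rows that are new to the table.

```python
def rows_to_write(previous, current, mode):
    """Return the rows a sink would receive for one trigger.

    previous: Result table after the last trigger (key -> value)
    current:  Result table after this trigger
    mode:     "complete", "append" or "update"
    """
    if mode == "complete":
        # Entire updated Result table.
        return dict(current)
    if mode == "append":
        # Only rows whose key did not exist after the previous trigger.
        return {k: v for k, v in current.items() if k not in previous}
    if mode == "update":
        # Rows that are new or whose value changed (assumption: new rows count as updated).
        return {k: v for k, v in current.items()
                if k not in previous or previous[k] != v}
    raise ValueError(f"unknown output mode: {mode}")


# Hypothetical Result table before and after one trigger.
previous = {"spark": 2, "kafka": 1}
current = {"spark": 3, "kafka": 1, "flink": 1}

for mode in ("complete", "append", "update"):
    print(mode, rows_to_write(previous, current, mode))
```

The sample also hints at why aggregates and Append mode do not mix well: the "spark" row changed after it was first written, which an append-only sink cannot express.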

Page 11: Introduction to Structured Streaming


Supported Sinks & Modes in 2.0

* Debug only

Page 12: Introduction to Structured Streaming


Windowing in Structured Streaming

Page 13: Introduction to Structured Streaming


Window operations

- Continuous time-based aggregations are most common in streaming applications
  • Sliding window & tumbling window
  • E.g. top x hashtags on Twitter in the last half hour, every 5 minutes
- New function that treats windowing as a regular aggregation
- Used in a Group By clause
- Can be used in Batch as well
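To make "windowing as a regular aggregation" concrete, here is a plain-Python sketch of the idea, not Spark's implementation. Each event is assigned to every window that contains its timestamp, and the window then acts as an ordinary grouping key. The helper names `windows_for` and `count_by_window` and the sample events are hypothetical; timestamps are integer seconds.

```python
from collections import Counter, defaultdict


def windows_for(ts, size, slide):
    """Return all [start, end) windows of length `size`, starting every
    `slide` seconds, that contain the timestamp `ts`."""
    last_start = ts - ts % slide                    # latest window start <= ts
    starts = range(last_start, ts - size, -slide)   # walk back while start > ts - size
    return [(start, start + size) for start in starts]


def count_by_window(events, size, slide):
    """Group (timestamp, hashtag) events by window, then count hashtags."""
    counts = defaultdict(Counter)
    for ts, tag in events:
        for window in windows_for(ts, size, slide):
            counts[window][tag] += 1
    return counts


# "Last half hour, every 5 minutes": size = 30 min, slide = 5 min.
SIZE, SLIDE = 30 * 60, 5 * 60
events = [(3600, "#spark"), (3900, "#spark"), (5400, "#kafka")]

counts = count_by_window(events, SIZE, SLIDE)
top = counts[(3600, 5400)].most_common(1)   # top 1 hashtag in that window
print(top)
```

A tumbling window is the special case where `slide == size`, so each event falls into exactly one window.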

Page 14: Introduction to Structured Streaming


Event Time Windows

- Event time is the time embedded within the data itself
  • It is not the time at which Spark received the data
- What about processing-time windows, if you want them?

Page 15: Introduction to Structured Streaming


Handling Late Arrival in Event-Time

- Since the ‘Result’ table is updated by Spark, late data is put in its correct window group
- Use a normal filter in the SQL?
- Watermarks
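The slide only names watermarks, so the following is a conceptual plain-Python sketch rather than Spark's behaviour. It assumes tumbling windows, a watermark defined as the maximum event time seen so far minus an allowed lateness, and a per-event watermark update (a real engine would typically advance it per trigger). The function name `process` and all values are hypothetical.

```python
from collections import Counter


def process(event_times, window_size, allowed_lateness):
    """Count events per tumbling window, dropping events whose window
    ended at or before the current watermark."""
    counts = Counter()      # Result table: window start -> count
    dropped = []            # events that arrived too late
    max_event_time = 0
    for event_time in event_times:          # iterate in arrival order
        watermark = max_event_time - allowed_lateness
        window_start = event_time - event_time % window_size
        if window_start + window_size <= watermark:
            dropped.append(event_time)      # window is considered final
        else:
            counts[window_start] += 1       # late or not, lands in its own window
        max_event_time = max(max_event_time, event_time)
    return counts, dropped


# Event times in arrival order; 3 and 8 arrive out of order.
counts, dropped = process([12, 15, 3, 27, 8, 21], window_size=10, allowed_lateness=10)
print(dict(counts), dropped)
```

Here the event at time 3 is late but still within the allowed lateness, so it updates window 0; the event at time 8 arrives after the watermark has passed the end of window 0 and is dropped.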

Page 16: Introduction to Structured Streaming


Fault Tolerance

- Why care?
- Different guarantees for data loss
  • At least once
  • Exactly once
- What all can fail?
  • Driver
  • Executor

Page 17: Introduction to Structured Streaming


Spark 1.x Best Fault tolerance - Kafka Direct API

Benefits of this approach:

•  Simplified parallelism
•  Less storage needed
•  Exactly-once semantics (source & processing)

Page 18: Introduction to Structured Streaming


Fault Tolerance in Structured Streaming

[Diagram: Active Driver, Checkpoint to HDFS]

Structured Streaming checkpointing:

- The offset ranges decided for a trigger interval are logged to the checkpoint directory *before* any processing is started for that trigger
- The Nth record in the log indicates the data that is currently being processed
- The (N-1)th entry in the log indicates offsets that have been idempotently written to the Sink
- Log entries are monotonically increasing integers

On recovery:

- Restart processing of the Nth entry in the WAL (write-ahead log)
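The checkpointing steps above can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's checkpoint format: the log is an in-memory dictionary, the source is a replayable list, the sink is a dictionary keyed by batch id (which makes rewrites idempotent), and the "query" just doubles each value. All names are hypothetical.

```python
class OffsetLog:
    """Stand-in for the checkpoint directory: batch id -> (start, end) offsets."""

    def __init__(self):
        self.entries = {}

    def latest(self):
        """Return the highest logged batch id, or None if the log is empty."""
        return max(self.entries) if self.entries else None


def run_batch(source, log, sink, batch_id, fail_before_write=False):
    """Run one trigger: log offsets first, then process, then write."""
    if batch_id not in log.entries:
        start = log.entries[batch_id - 1][1] if batch_id > 0 else 0
        # Offsets are logged *before* any processing starts.
        log.entries[batch_id] = (start, len(source))
    start, end = log.entries[batch_id]              # on replay, reuse logged offsets
    output = [value * 2 for value in source[start:end]]
    if fail_before_write:
        raise RuntimeError("simulated crash before writing to the sink")
    sink[batch_id] = output                         # idempotent: same key, same rows


source, log, sink = [1, 2, 3], OffsetLog(), {}
run_batch(source, log, sink, 0)

source += [4, 5]
try:
    run_batch(source, log, sink, 1, fail_before_write=True)
except RuntimeError:
    pass                                            # "driver" died mid-batch

source.append(6)                                    # data keeps arriving during the outage
run_batch(source, log, sink, log.latest())          # recovery: redo the Nth entry
run_batch(source, log, sink, 2)
print(sink)
```

Because batch 1 is replayed with the offsets recorded before the crash, the record that arrived during the outage is not pulled into it; it is picked up by batch 2 instead.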

Page 19: Introduction to Structured Streaming


Fault Tolerance in Structured Streaming

End-to-end exactly-once guarantees with:

- Idempotent Sinks (built in for commonly used sinks, e.g. Files / JDBC)
- Built-in Sources will *mostly* be the only ones that support replay
  https://issues.apache.org/jira/browse/SPARK-15842

Page 20: Introduction to Structured Streaming


Managing Streaming Queries

Streaming in 1.x was definitely lacking in:

- Starting / stopping individual streaming queries
- Changing the computation done in a query
- Handling abnormal termination of a streaming query more gracefully than an app crash

Page 21: Introduction to Structured Streaming


Managing Streaming Queries

Page 22: Introduction to Structured Streaming


Managing Streaming Queries

Page 23: Introduction to Structured Streaming


Summary

- Overall has a good set of features
  • Easier code sharing between Batch and Streaming (no different type hierarchies)
  • Window not tied to batch interval
  • No streaming context
  • Optimizer now available for your queries
- Getting started
  • The combination of 3 things (Output Mode, Sink type and Query type) takes some time to wrap your head around, and there is not much control over those
  • You only get runtime exceptions when you mess with the above
- How does it compare to Apache Beam?

Page 24: Introduction to Structured Streaming


For Each Sink

Page 25: Introduction to Structured Streaming


Thank you