Interactive Data Analysis in Spark Streaming
TRANSCRIPT
Interactive Data Analysis in Spark Streaming
Ways to interact with Spark Streaming applications
https://github.com/Shasidhar/dynamic-streaming
● Shashidhar E S
● Big data consultant and trainer at datamantra.io
● www.shashidhare.com
Agenda
● Big data applications
● Streaming applications
● Categories in streaming applications
● Interactive streaming applications
● Different strategies
● Zookeeper
● Apache Curator
● Curator cache types
● Spark streaming context dynamic switch
Big data applications
● Applications in big data are typically divided by their workloads
● The major divisions are
○ Batch applications
○ Streaming applications
● Most of the existing systems support both of these application types
● But a new category of applications is on the rise: interactive applications
Big data interactive applications
● Ability to manipulate data in an interactive way
● Exploratory in nature
● Combines batch and streaming data
● For development
○ Zeppelin, Jupyter Notebook
● For production
○ Batch - Datameer, Tellius, Zoomdata
○ Streaming - Stratio Engine, WSO2
Streaming Engines
● Ability to process data in real time
● Streaming process includes
○ Collecting data
○ Processing data
● Types of streaming engines
○ Real time
○ Near real time (micro batch)
● Spark allows near-real-time stream processing
Streaming application types/categories
● Streaming ETL processes
● Decision engines
○ Rule based
○ Machine learning based
■ Online learning
● Real time dashboards
● Root cause analysis engines
○ Multiple streams
○ Handling event times
Streaming applications in real world
● Static
○ Data scientist defines rules
○ Admin sets up the dashboard
○ Not able to modify the behaviour of the streaming application
● Dynamic
○ User can add/delete/modify rules and control the decisions
○ User can see and design charts
○ Ability to modify the behaviour of the streaming application
Generic Interactive Application Architecture
(Diagram: a streaming data source and a streaming config source feed Spark Streaming, which feeds downstream applications.)
How do we make the configuration dynamic?
Spark streaming introduction
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
Micro batch
● Spark Streaming is a fast batch processing system
● Spark Streaming collects stream data into small batches and runs batch processing on them
● A batch can be as small as 1s or as big as multiple hours
● Spark job creation and execution overhead is low enough that it can do all of this under a second
● These batches are called DStreams
Spark Streaming application
(Diagram: create streaming context → define input streams → define data processing → define data sink → start streaming context; data is processed in micro batches.)
● Options to change behaviour
○ Restart context
○ Without restarting context
■ Control configuration data
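This lifecycle can be sketched in a few lines of Scala; the socket source, 1-second batch interval and word-count logic are placeholders for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSkeleton {
  def main(args: Array[String]): Unit = {
    // Create streaming context with a 1 second micro-batch interval
    val conf = new SparkConf().setAppName("streaming-skeleton").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Define input stream (a socket source, as a placeholder)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Define data processing
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Define data sink
    counts.print()

    // Start the streaming context and wait
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Nothing runs until `ssc.start()`; everything before it only builds the DStream graph, which is why changing behaviour afterwards needs either a restart or external configuration.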
Interactive Streaming Application Strategies
Using Kafka as configuration source
(Diagram: a streaming data source and a Kafka config source feed Spark Streaming, which feeds downstream applications.)
Using Kafka as Configuration Source
● Easy to adopt, as Kafka is the de facto streaming store
● Streaming configuration source
○ A new stream to track the configuration changes
● Spark Streaming
○ Maintain configuration as state in memory and apply it
○ State needs to be checkpointed
○ Failure recovery strategies need to be taken care of
● Drawbacks
○ Hard to handle deletes/updates in state
○ Tricky to handle state if configurations are complex
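A minimal sketch of this strategy, assuming the Spark 1.x direct Kafka API and hypothetical `events` and `rules` topics; the latest value per rule key is kept as checkpointed state:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(
  new SparkConf().setAppName("kafka-config").setMaster("local[2]"), Seconds(1))
ssc.checkpoint("/tmp/checkpoint") // state must be checkpointed

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// One stream for data, one for configuration changes
val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))
val rules = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("rules"))

// Keep the latest rule per key as in-memory state
val ruleState = rules.updateStateByKey[String] {
  (newValues: Seq[String], old: Option[String]) => newValues.lastOption.orElse(old)
}
```

Deleting a rule is awkward here: the state function only ever sees new values, which is the "hard to handle deletes/updates" drawback above.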
Using Database as configuration source
(Diagram: a streaming data source feeds Spark Streaming; the workers read configuration from a distributed database; output goes to downstream applications.)
Interactive streaming application strategies contd.
● Easy to start with databases, as people are familiar with them
● Configuration source
○ Distributed data store
● Spark Streaming
○ Read configuration from the database and apply it - polling
○ The database needs to be consistent and fast
○ Configurations can be kept in a cache to avoid latencies
● Drawbacks
○ Achieving distributed cache consistency is tricky
○ May be an extra component if you have it only for this purpose
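The polling side can be sketched with the JDK scheduler alone; `loadConfigFromDb` is a hypothetical helper standing in for whatever query your database client exposes:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

object PolledConfig {
  // Local cache of the configuration; workers read from here
  val cache = new AtomicReference[Map[String, String]](Map.empty)

  // Hypothetical helper that queries the configuration table
  def loadConfigFromDb(): Map[String, String] = ???

  def startPolling(intervalSeconds: Long): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      def run(): Unit = cache.set(loadConfigFromDb())
    }, 0, intervalSeconds, TimeUnit.SECONDS)
  }
}
```

Each worker runs its own poll loop, so two workers can briefly hold different versions of the config - exactly the cache-consistency drawback noted above.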
Using Zookeeper as configuration source
(Diagram: a streaming data source feeds Spark Streaming, which reads configuration from Zookeeper and feeds downstream applications.)
Interactive streaming application Strategies contd.
● Readily available if Kafka is used in the system - no extra burden
● Configuration source - Zookeeper
● Spark Streaming
○ Ability to track configuration changes and take action - async callbacks
○ Suitable to store any type of configuration
○ Allows listeners to be attached for configuration changes
○ Ensures cache consistency by default
● Drawbacks
○ Streaming context restart is not suitable for all systems
Apache Zookeeper
“Zookeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers”
● Distributed coordination service
● Hierarchical file system
● Data is stored in ZNodes
● Can be thought of as a "distributed in-memory file system" with some limitations, such as the size of data; optimized for high reads and low writes
Zookeeper Architecture
Zookeeper data model
● Follows a hierarchical namespace
● Each node is called a ZNode
○ Data saved as bytes
○ Can have children
○ Only accessible through absolute paths
○ Data size limited to 1MB
● Follows global ordering
ZNode
● Types
○ Persistent nodes
■ Exist till explicitly deleted
■ Can have children
○ Ephemeral nodes
■ Exist as long as the session is active
■ Cannot have children
● Data can be secured at the ZNode level with ACLs
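With the raw ZooKeeper client, the two node types differ only in the `CreateMode` flag; a sketch, assuming a local ZooKeeper, illustrative paths, and that `/workers` already exists:

```scala
import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}

val zk = new ZooKeeper("localhost:2181", 5000, null)

// Persistent node: survives until explicitly deleted, can have children
zk.create("/config", "v1".getBytes,
  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)

// Ephemeral node: removed automatically when this session ends,
// cannot have children - useful for worker liveness
zk.create("/workers/worker-1", Array[Byte](),
  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)
```

`OPEN_ACL_UNSAFE` gives everyone full access; real deployments would use the ACL support from the previous bullet.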
Data consistency
● Reads are served from local servers
● Writes are synchronised through the leader
● Ensures sequential consistency
● Data is either read completely or the read fails
● All clients get the same result irrespective of the server they are connected to
● Updates are persisted, unless overridden by another client
Zookeeper Watches
● Available watches in Zookeeper
○ Node Children Changed
○ Node Created
○ Node Data Changed
○ Node Deleted
● Watches are one-time triggers
● The event is always received before the data
● The client can re-register the watch if needed again
Zookeeper Client example
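The code from this slide is not captured in the transcript; in its place, a minimal raw-client sketch showing a data watch that re-registers itself (connection string and path are illustrative):

```scala
import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}
import org.apache.zookeeper.Watcher.Event.EventType

object WatchExample {
  val zk = new ZooKeeper("localhost:2181", 5000, null)

  val watcher: Watcher = new Watcher {
    override def process(event: WatchedEvent): Unit = {
      if (event.getType == EventType.NodeDataChanged) {
        // Watches are one-time triggers: read the new data
        // and pass the watcher again to re-register
        val data = zk.getData(event.getPath, watcher, null)
        println(s"New config: ${new String(data)}")
      }
    }
  }

  // The initial read sets the first watch
  zk.getData("/config", watcher, null)
}
```

Note the event itself carries no data, only the path and type; the client must go back to ZooKeeper to fetch the new value.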
ZK API issues
● Making client code thread safe is tricky
● Hard for programmers
● Exception handling is bad
● Similar to the MapReduce API
The solution is "Apache Curator"
Apache Curator
● A Zookeeper keeper
● Main components
○ Client - wrapper for the ZK class; manages the Zookeeper connection
○ Framework - high-level API that encloses all ZK-related operations and handles all types of retries
○ Recipes - implementations of common Zookeeper "recipes" built on top of the Curator framework
● User friendly API
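A minimal Curator bootstrap, assuming a local ZooKeeper on the default port; the path is illustrative:

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

// The framework manages the connection and retries for us
val client = CuratorFrameworkFactory.newClient(
  "localhost:2181",
  new ExponentialBackoffRetry(1000, 3) // base sleep 1s, max 3 retries
)
client.start()

// Fluent API: create a node (with parents) and read it back
client.create().creatingParentsIfNeeded().forPath("/app/config", "v1".getBytes)
val value = new String(client.getData.forPath("/app/config"))
```

Compare this with the raw client: no explicit session handling, no manual retry loops, and exceptions only for genuinely unrecoverable cases.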
Curator Hands on - Basic Operations
Git branch: zookeeperexamples
Apache Curator caches
Three types of caches
● Node Cache
○ Monitors a single node
● Path Cache
○ Monitors a ZNode and its children
● Tree Cache
○ Monitors a ZK path by caching data locally
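A node-cache sketch, assuming a local ZooKeeper; the listener fires on every change to the single watched ZNode, and Curator re-registers the underlying watch for us:

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.cache.{NodeCache, NodeCacheListener}
import org.apache.curator.retry.ExponentialBackoffRetry

val client = CuratorFrameworkFactory.newClient(
  "localhost:2181", new ExponentialBackoffRetry(1000, 3))
client.start()

// Keeps a local copy of one ZNode and notifies on every change
val cache = new NodeCache(client, "/app/config")
cache.getListenable.addListener(new NodeCacheListener {
  override def nodeChanged(): Unit = {
    val data = cache.getCurrentData // null if the node was deleted
    if (data != null) println(s"Config changed: ${new String(data.getData)}")
  }
})
cache.start()
```

`PathChildrenCache` and `TreeCache` follow the same listener pattern, just over a set of children or a whole subtree.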
Curator Hands on - Node Cache
Git branch: zookeeperexamples
Path Cache
● Monitor a ZNode
● Using archaius - a dynamic property library
● Use a ConfigurationSource from archaius to track changes
● Pair the configuration source with an UpdateListener
● See it in action
(Diagram: Watched DataSource → Update Listener)
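A sketch of wiring archaius to ZooKeeper through Curator, following the archaius-zookeeper module referenced at the end of the deck; the config path and property name are illustrative:

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry
import com.netflix.config.{ConfigurationManager, DynamicPropertyFactory, DynamicWatchedConfiguration}
import com.netflix.config.source.ZooKeeperConfigurationSource

val client = CuratorFrameworkFactory.newClient(
  "localhost:2181", new ExponentialBackoffRetry(1000, 3))
client.start()

// Watched source: archaius tracks the children of this ZK path
// via Curator's path cache
val zkSource = new ZooKeeperConfigurationSource(client, "/app/config")
zkSource.start()
ConfigurationManager.install(new DynamicWatchedConfiguration(zkSource))

// A dynamic property that updates automatically when the ZNode changes
val batchSize = DynamicPropertyFactory.getInstance().getIntProperty("batch.size", 10)
```

Each property under `/app/config` becomes a ZNode child; calling `batchSize.get()` anywhere in the application always returns the latest value.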
Curator Hands on - Path Cache
Git branch: zookeeperlistener
Spark streaming dynamic restart
● Use the same WatchedSource to track any changes in configuration
● Track changes on Zookeeper with the path cache
● Control streaming context restart on ZK data change
(Diagram: Watched DataSource → Update Listener (restart context))
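A condensed sketch of the restart strategy; how `config` shapes the pipeline is elided (the context needs at least one output operation before it can start), and keeping the driver thread alive across restarts is handled crudely here:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.cache.{NodeCache, NodeCacheListener}
import org.apache.curator.retry.ExponentialBackoffRetry

object DynamicRestart {
  // Reuse one SparkContext across streaming context restarts
  val sc = new SparkContext(
    new SparkConf().setAppName("dynamic-streaming").setMaster("local[2]"))
  @volatile var ssc: StreamingContext = _

  def createContext(config: String): StreamingContext = {
    val context = new StreamingContext(sc, Seconds(1))
    // ... define input streams and processing driven by `config` here ...
    context
  }

  def main(args: Array[String]): Unit = {
    val client = CuratorFrameworkFactory.newClient(
      "localhost:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    val cache = new NodeCache(client, "/app/config")
    cache.getListenable.addListener(new NodeCacheListener {
      override def nodeChanged(): Unit = {
        // Stop only the streaming context, gracefully, then rebuild it
        ssc.stop(stopSparkContext = false, stopGracefully = true)
        ssc = createContext(new String(cache.getCurrentData.getData))
        ssc.start()
      }
    })
    cache.start()

    ssc = createContext("initial")
    ssc.start()
    // awaitTermination returns when the current context stops,
    // so poll instead of blocking on one context forever
    while (true) { ssc.awaitTerminationOrTimeout(1000) }
  }
}
```

The graceful stop lets in-flight batches finish before the new context takes over, which is why this strategy is not suitable for systems that cannot tolerate a brief processing pause.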
Hands on - Streaming Restart
Git branch: streaming-listener
Ways to control data loss
● Enable checkpointing
● Track Kafka topic offsets manually
● Better to use direct Kafka input streams
● Use Kafka monitoring tools to see the status of data processing
● Always create the Spark streaming context from the checkpoint directory
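The last bullet is the standard `StreamingContext.getOrCreate` pattern; the directory and batch interval here are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/streaming-checkpoint"

def functionToCreateContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("recoverable-app").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(1))
  // ... define streams and processing here, before checkpointing ...
  ssc.checkpoint(checkpointDir)
  ssc
}

// Recover the context from the checkpoint if one exists,
// otherwise build it fresh with the function above
val ssc = StreamingContext.getOrCreate(checkpointDir, functionToCreateContext _)
ssc.start()
ssc.awaitTermination()
```

On a clean start the function runs and writes the checkpoint; after a crash the context, including pending batches and Kafka offsets for direct streams, is rebuilt from the checkpoint instead.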
Next Steps
● Try to add some meaningful configurations
● Implement the same idea with Akka actors
References
● http://sysgears.com/articles/managing-configuration-of-distributed-system-with-apache-zookeeper/
● https://github.com/Netflix/archaius/wiki/ZooKeeper-Dynamic-Configuration