Fully Fault-tolerant Streaming Workflows at Scale using Apache Mesos & Spark Streaming
TRANSCRIPT
About Me and Sigmoid
● GitHub: github.com/akhld
● Twitter: @AkhlD
● Email: [email protected]
Our Customers
Overview
● Apache Spark
● Spark Streaming
● High Availability Mesos Cluster
● Running Spark Streaming over a High Availability Mesos Cluster
● Simple Fault-tolerant Streaming Pipeline
● Scaling the pipeline
Apache Spark
Spark Stack
Resilient Distributed Datasets (RDDs)
- Big collection of data which is:
- Immutable
- Distributed
- Lazily evaluated
- Type Inferred
- Cacheable
[Diagram: RDD1 → RDD2 → RDD3]
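The properties listed above can be seen in a short sketch against Spark's Scala API; this assumes an existing `SparkContext` named `sc` and is illustrative only:

```scala
// Distributed: the collection is partitioned across the cluster
val rdd1 = sc.parallelize(1 to 1000000)

// Immutable: transformations never modify rdd1, they return a new RDD.
// Type inferred: rdd2 is an RDD[Int] without any annotation.
val rdd2 = rdd1.map(_ * 2)

// Cacheable: mark the result to be kept in memory after first computation
val rdd3 = rdd2.filter(_ % 4 == 0).cache()

// Lazily evaluated: nothing runs until an action like count() is called
rdd3.count()
```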
Why Spark Streaming?
Many big-data applications need to process large data streams in near-real time
Monitoring Systems
Alert Systems
Computing Systems
What is Spark Streaming?
Framework for large scale stream processing
➔ Created at UC Berkeley by Tathagata Das (TD)
➔ Scales to 100s of nodes
➔ Can achieve second scale latencies
➔ Provides a simple batch-like API for implementing complex algorithms
➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis etc.
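As a sketch of ingesting one such source, the Kafka receiver from the spark-streaming-kafka package (Spark 1.x API) can be attached to a `StreamingContext`; the ZooKeeper quorum, group id, and topic name below are placeholders:

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// Assumes an existing StreamingContext `ssc`.
// Map("events" -> 1) means: consume topic "events" with 1 receiver thread.
val kafkaStream = KafkaUtils.createStream(
  ssc,
  "zk1:2181,zk2:2181",   // ZooKeeper quorum (placeholder)
  "my-consumer-group",   // Kafka consumer group id (placeholder)
  Map("events" -> 1))
```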
Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
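The three steps above can be sketched as a minimal word-count job; the socket source, host, and port are placeholders for whatever stream you actually consume:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchSketch")

// Chop the live stream into batches of X = 1 second
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999) // placeholder source

// Each batch is an RDD, processed with ordinary RDD operations
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// Results are emitted batch by batch
counts.print()

ssc.start()
ssc.awaitTermination()
```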
Simple Streaming Pipeline
[Diagram: Kafka Server → Spark Streaming on a Standalone Spark Cluster → Storage (HDFS/DB), with the standalone cluster marked as the point of failure]
Mesos High Availability Cluster
[Diagram: a quorum of Mesos masters (one leader, two standbys); frameworks such as a SparkStreamingJob (driver program + scheduler) and a HadoopJob receive resource offers from the masters and launch executor tasks on slaves 1..N]
Spark Streaming over a HA Mesos Cluster
● To use Mesos from Spark, you need a Spark binary package in a location accessible to Mesos (HTTP, S3, or HDFS), and a Spark driver program configured to connect to Mesos.
● Configuring the driver program to connect to Mesos:
val sconf = new SparkConf()
  .setMaster("mesos://zk://10.121.93.241:2181,10.181.2.12:2181,10.107.48.112:2181/mesos")
  .setAppName("MyStreamingApp")
  .set("spark.executor.uri", "hdfs://Sigmoid/executors/spark-1.3.0-bin-hadoop2.4.tgz")
  .set("spark.mesos.coarse", "true")
  .set("spark.cores.max", "30")
  .set("spark.executor.memory", "10g")

val sc = new SparkContext(sconf)
val ssc = new StreamingContext(sc, Seconds(1))
...
Spark Streaming Fault-tolerance
Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system.
● Spark and its RDD abstraction are designed to seamlessly handle failures of any worker node in the cluster.
● In Streaming, driver failure can be recovered from by checkpointing the application state.
● Write Ahead Logs (WAL) & acknowledgements can ensure zero data loss.
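A sketch of driver-failure recovery with checkpointing and the WAL: `StreamingContext.getOrCreate` rebuilds the context from the checkpoint directory if one exists, otherwise it calls the supplied factory function. The HDFS path and app name below are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode/checkpoints/myStreamingApp" // placeholder

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("MyStreamingApp")
    // Enable the Write Ahead Log so received data survives failures
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... define the streaming computation here ...
  ssc
}

// On a restart after driver failure, state is recovered from the checkpoint
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```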
Simple Fault-tolerant Streaming Infra
[Diagram: Kafka Cluster → SparkStreamingJob executor tasks running on the nodes of a High Availability Mesos Cluster → Storage (HDFS/DB)]
Scaling the pipeline
[Diagram: the same pipeline scaled out, with Spark Streaming and Storage (HDFS/DB) spread across the nodes of the High Availability Mesos Cluster]
Understanding the bottlenecks
- Network: 1 Gbps
- # Cores/Slave: 4
- Disk IO: 100 MB/s on SSD

Goal: Receive & process data at 1M events/second
Choosing the correct # Resources
- Since a single slave can handle up to 100 MB/s of network and disk IO, a minimum of 6 slaves gets us to ~600 MB/s (enough for 1M events/second assuming an average event size of roughly 600 bytes)
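One way to actually use those 6 slaves is to run one receiver per slave and union the resulting streams, since a single receiver is capped by one node's NIC. A hedged sketch, assuming an existing `StreamingContext` named `ssc` and placeholder Kafka/ZooKeeper details:

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// One receiver per slave so ingestion is not bottlenecked on a single NIC
val numReceivers = 6
val streams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("events" -> 1))
}

// Union the parallel streams and process them as a single DStream
val unified = ssc.union(streams)

// Repartition to spread processing over all cores (6 slaves x 4 cores)
val repartitioned = unified.repartition(24)
```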
Thank You & Queries?
Read more: https://www.sigmoid.com/fault-tolerant-streaming-workflows/