Spark Streaming with Apache Kafka


Every ad. Every sales channel. Every screen. One platform.

Spark Streaming with Apache Kafka

Vikas Gite

Principal Software Engineer, Big Data Analytics - PubMatic


Agenda

Spark Streaming 101
– What is an RDD
– What is a DStream

Spark Streaming architecture
Introduction to Kafka
Streaming ingestion with Kafka


Spark Streaming 101

RDD
– Immutable
– Partitioned
– Fault tolerant
– Lazily evaluated
– Can be persisted

[Figure: RDD lineage graph – First RDD → Filter → Second RDD → Map → Third RDD]
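A minimal sketch of this lineage in Scala (the variable names and the local master setting are illustrative, not from the deck):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local[*]"))

val firstRDD  = sc.parallelize(1 to 100)      // First RDD
val secondRDD = firstRDD.filter(_ % 2 == 0)   // Filter -> Second RDD
val thirdRDD  = secondRDD.map(_ * 10)         // Map -> Third RDD

// Nothing has run yet (lazy evaluation): Spark has only recorded the
// lineage graph, which is also what lets lost partitions be recomputed
// (fault tolerance). An action triggers the actual computation.
println(thirdRDD.toDebugString)               // prints the lineage graph
thirdRDD.collect()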

Spark Streaming 101


DStream
– Continuous sequence of RDDs
– Designed for stream processing
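A minimal DStream sketch, reusing the SparkContext sc from the sketch above and assuming a text source on localhost:9999 (for example, nc -lk 9999); the Seconds(5) batch interval is also what drives the micro-batching on the next slide:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Every 5-second micro-batch of input becomes one RDD in the DStream.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()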

Spark Streaming architecture: Micro-batching


Spark Streaming architecture: Dynamic load balancing


Note (Ashish Tadose): Stream processing is always costlier than batch processing, so make sure you bring only the required fields onto the stream processing platform.

Note (Ashish Tadose): Data ingestion should provide the capability to fork a smaller data stream (only the required fields) from the existing data flow and pass that stream on to a different destination (a message buffer/queue).

Spark Streaming architecture: Failure and recovery


Introduction to Kafka

– Kafka is a message queue (circular buffer)
– Retention is based on disk space or time
– Oldest messages are deleted to maintain size
– Split into topics and partitions
– Indexed only by offset
– Delivery semantics are your responsibility


High-level consumer
– Offsets are stored in ZooKeeper
– Offsets are stored per consumer group

Low-level consumer
– Offsets can be stored in any store
– Must handle broker leader changes


At-most-once
Save offsets !!! Possible failure !!! Save results

On failure, restart at the saved offset; messages are lost.
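A sketch of that ordering, assuming the direct Kafka stream (messages) created later in the deck and hypothetical saveOffsets/saveResults helpers against your own store:

import org.apache.spark.streaming.kafka.HasOffsetRanges

messages.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  saveOffsets(offsets)          // 1. offsets are committed first
  // !!! a crash here loses this batch: the offsets have already advanced !!!
  saveResults(rdd.collect())    // 2. results are saved second
}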

At-least-once
Save results !!! Possible failure !!! Save offsets

On failure, messages are repeated.
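The same sketch with the two steps reversed (same hypothetical helpers):

messages.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  saveResults(rdd.collect())    // 1. results are saved first
  // !!! a crash here replays this batch: the offsets never advanced !!!
  saveOffsets(offsets)          // 2. offsets are committed second
}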


Idempotent exactly-once
Save result with natural unique key !!! Possible failure !!! Save offset

The operation is safe to repeat.

Pros:
– Simple
– Works well with map transformations

Cons:
– Hard for aggregate transformations
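A sketch, assuming a hypothetical upsertByKey helper that writes each record under its natural unique key, so a replayed batch overwrites rows instead of duplicating them:

messages.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach { case (key, value) =>
      // Idempotent write: repeating the batch rewrites the same rows,
      // so the failure window between results and offsets is harmless.
      upsertByKey(key, value)
    }
  }
  saveOffsets(rdd.asInstanceOf[HasOffsetRanges].offsetRanges)
}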


Transactional exactly-once
– Begin transaction
– Save results
– Save offsets
– Ensure offsets are OK
– Commit transaction

On failure, roll back results and offsets.

Pros:
– Works for any transformation

Cons:
– More complex
– Requires a transactional data store
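A sketch against an assumed JDBC-style transactional store; jdbcUrl, saveResults, and saveOffsets are hypothetical, but the single commit is what makes results and offsets visible atomically:

import java.sql.DriverManager

messages.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = rdd.collect()
  val conn = DriverManager.getConnection(jdbcUrl)  // assumed transactional store
  try {
    conn.setAutoCommit(false)          // begin transaction
    saveResults(conn, results)         // save results
    saveOffsets(conn, offsets)         // save offsets, ensure they are OK
    conn.commit()                      // commit: both become visible together
  } catch {
    case e: Exception =>
      conn.rollback()                  // on failure, roll back results and offsets
      throw e
  } finally {
    conn.close()
  }
}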


Streaming ingestion with Kafka – Approach 1: Receiver-based Approach


Streaming ingestion with Kafka – Approach 1: Receiver-based Approach

Pros:
– The WAL design could work with a non-Kafka data store

Cons:
– Duplication of write operations
– Dependent on HDFS
– Must use the idempotent approach for exactly-once
– No access to offsets, so the transactional approach can't be used
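For comparison with the direct approach that follows, a minimal sketch of the receiver-based API (the ZooKeeper quorum, consumer group, and thread count are illustrative):

import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based stream: offsets are tracked in ZooKeeper and received
// data is written to the WAL before processing.
val stream = KafkaUtils.createStream(
  ssc,
  "zk1:2181,zk2:2181",       // ZooKeeper quorum
  "my-consumer-group",       // consumer group ID
  Map("my-topic" -> 2))      // topics and receiver thread counts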


Streaming ingestion with Kafka – Approach 2: Direct Approach (No Receivers)


Streaming ingestion with Kafka – Approach 2: Direct Approach (No Receivers)

Pros:
– Simplified parallelism: one-to-one mapping between Kafka partitions and RDD partitions
– Efficiency: reduces the WAL overhead
– Exactly-once semantics: via Spark checkpoints or atomic transactions


Streaming ingestion with Kafka – Approach 2: Direct Approach (How to use it)

// Kafka config params
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokers,
  "auto.offset.reset" -> "largest")

// DirectStream method call
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)
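The direct stream also exposes each batch's Kafka offsets through HasOffsetRanges, which is what makes the idempotent and transactional patterns above possible:

import org.apache.spark.streaming.kafka.HasOffsetRanges

messages.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}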


Streaming ingestion with Kafka – Where to store offsets

Easy – Spark checkpoints:
– No need to access the offsets; they are used automatically on restart
– Must be idempotent, not transactional
– Checkpoints may not be recoverable

Complex – Your own data store:
– Must access offsets, save them, and provide them on restart
– Idempotent or transactional
– Offsets are just as recoverable as your results
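A sketch of the "easy" checkpoint route (the checkpoint path is illustrative, and conf is your SparkConf); on restart, getOrCreate recovers the context, including offsets, from the checkpoint instead of calling the setup function:

import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint("hdfs:///checkpoints/my-app")  // assumed checkpoint directory
  // ... create the direct stream and transformations here ...
  ssc
}

// Rebuilds from the checkpoint if one exists, else calls createContext().
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/my-app", createContext _)
ssc.start()
ssc.awaitTermination()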

Our Scale
– 18B+ ad impressions served daily
– 10T bids processed monthly
– 22TB data processed daily
– 5PB data under management
– 6 data centers across geographies

Thank You

