Spark Streaming with Apache Kafka



TRANSCRIPT

Page 1: Spark streaming with apache kafka

Every ad. Every sales channel. Every screen. One platform.

Spark Streaming with Apache Kafka

Vikas Gite

Principal Software Engineer, Big Data Analytics - PubMatic

Page 2: Spark streaming with apache kafka


Agenda

– Spark streaming 101
  – What is RDD
  – What is DStream
– Spark streaming architecture
– Introduction to Kafka
– Streaming ingestion with Kafka

Page 3: Spark streaming with apache kafka


Spark streaming 101

RDD
– Immutable
– Partitioned
– Fault tolerant
– Lazily evaluated
– Can be persisted

[Diagram: lineage graph – First RDD → Filter → Second RDD → Map → Third RDD]
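Not on the slide, but as a minimal Scala sketch of the diagram above (names are illustrative): filter and map only extend the lineage graph, and nothing runs until an action such as collect() is invoked.

import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[2]"))

    val firstRDD  = sc.parallelize(1 to 100)      // First RDD
    val secondRDD = firstRDD.filter(_ % 2 == 0)   // Filter (lazy, just extends lineage)
    val thirdRDD  = secondRDD.map(_ * 10)         // Map (lazy)

    thirdRDD.cache()                              // can be persisted
    println(thirdRDD.toDebugString)               // prints the lineage graph
    println(thirdRDD.collect().length)            // action: triggers evaluation

    // If a partition of thirdRDD is lost, Spark recomputes it from this
    // lineage, which is what makes the RDD fault tolerant.
    sc.stop()
  }
}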

Page 4: Spark streaming with apache kafka

Spark streaming 101


DStream
– Continuous sequence of RDDs
– Designed for stream processing
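As a sketch (the socket source and batch interval are placeholders), a DStream is built from a StreamingContext; every batch interval yields one RDD, which foreachRDD exposes directly.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-101").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))     // one RDD every 5 seconds

    val lines = ssc.socketTextStream("localhost", 9999)  // DStream[String]
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    counts.foreachRDD { rdd =>                           // the RDD behind each batch
      rdd.take(10).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}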

Page 5: Spark streaming with apache kafka

Spark streaming architecture – Micro batching

(The input stream is chopped into small batches at a fixed interval; each micro batch is processed as one RDD, so streaming reuses the batch engine and its fault-tolerance model.)


Page 6: Spark streaming with apache kafka

Spark streaming architecture – Dynamic load balancing


Ashish Tadose (note): Stream processing is inherently costlier than batch processing, so make sure only the required fields are sent to the stream processing platform.

Ashish Tadose (note): Data ingestion should be able to fork a smaller data stream (only the required fields) off the existing data flow and pass that stream on to a different destination (a message buffer/queue).
Page 7: Spark streaming with apache kafka

Spark streaming architecture – Failure and recovery


Page 8: Spark streaming with apache kafka

Introduction to Kafka

– Kafka is a message queue (circular buffer)
– Bounded by disk space or time: the oldest messages are deleted to maintain size
– Split into topics and partitions
– Indexed only by offset
– Delivery semantics are your responsibility
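For concreteness (not on the slide), a minimal producer against the standard Kafka client API; the broker address and topic name are placeholders.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // The message lands in one partition of the topic and is addressed
    // only by its offset within that partition.
    producer.send(new ProducerRecord[String, String]("events", "key-1", "hello"))
    producer.close()
  }
}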


Page 9: Spark streaming with apache kafka

High level consumer
– Offsets are stored in ZooKeeper
– Offsets are tracked per consumer group

Low level consumer
– Offsets can be stored in any store
– Must handle broker leader changes


Page 10: Spark streaming with apache kafka

At most once
– Save offsets
– !!! Possible failure !!!
– Save results

On failure, restart at the saved offset; messages are lost.

At least once
– Save results
– !!! Possible failure !!!
– Save offsets

On failure, messages are repeated.
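A shape-only sketch of the two orderings inside a Spark Streaming batch; saveResults and saveOffsets are hypothetical helpers standing in for your sink and offset store.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object Semantics {
  // Hypothetical helpers for a real sink and offset store.
  def saveResults(batch: Array[(String, Int)]): Unit = ???
  def saveOffsets(rdd: RDD[(String, String)]): Unit = ???

  // At-least-once: results first, offsets second. A crash between the two
  // steps replays the batch (duplicates possible, nothing lost). Swapping
  // the two calls gives at-most-once: offsets first, so a crash loses data.
  def atLeastOnce(stream: DStream[(String, String)]): Unit =
    stream.foreachRDD { rdd =>
      val results = rdd.mapValues(_.length).collect()
      saveResults(results)
      saveOffsets(rdd)
    }
}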


Page 11: Spark streaming with apache kafka


Idempotent exactly once
– Save result with a natural unique key
– !!! Possible failure !!!
– Save offset

The operation is safe to repeat.

Pros:
– Simple
– Works well with map transformations

Cons:
– Hard for aggregate transformations
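A hedged sketch of the idempotent write, assuming a JDBC store with upsert support (the table, connection string, and Postgres-style ON CONFLICT syntax are placeholders).

import java.sql.DriverManager

object IdempotentSink {
  // Upsert keyed on the record's natural unique key: replaying the same
  // event after a failure overwrites instead of duplicating.
  def saveIdempotently(eventId: String, value: Long): Unit = {
    val conn = DriverManager.getConnection("jdbc:postgresql://localhost/demo")
    try {
      val stmt = conn.prepareStatement(
        "INSERT INTO results (event_id, value) VALUES (?, ?) " +
        "ON CONFLICT (event_id) DO UPDATE SET value = EXCLUDED.value")
      stmt.setString(1, eventId)
      stmt.setLong(2, value)
      stmt.executeUpdate()
    } finally conn.close()
  }
}

The offset is saved only after the write, so a restart simply repeats the same safe upsert.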

Page 12: Spark streaming with apache kafka


Transactional exactly once
– Begin transaction
– Save results
– Save offsets
– Ensure offsets are OK
– Commit transaction

On failure, roll back results and offsets.

Pros:
– Works for any transformation

Cons:
– More complex
– Requires a transactional data store
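A hedged sketch of the sequence above over JDBC (tables and connection details are placeholders): results and offsets either commit together or roll back together.

import java.sql.DriverManager

object TransactionalSink {
  def saveTransactionally(key: String, value: Long,
                          topic: String, partition: Int, offset: Long): Unit = {
    val conn = DriverManager.getConnection("jdbc:postgresql://localhost/demo")
    try {
      conn.setAutoCommit(false)                       // begin transaction

      val r = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
      r.setString(1, key); r.setLong(2, value)
      r.executeUpdate()                               // save results

      val o = conn.prepareStatement(
        "UPDATE offsets SET off = ? WHERE topic = ? AND part = ? AND off < ?")
      o.setLong(1, offset); o.setString(2, topic)
      o.setInt(3, partition); o.setLong(4, offset)
      // Ensure offsets are OK: if the stored offset already moved past
      // this batch, another writer got here first, so abort.
      if (o.executeUpdate() != 1)
        throw new IllegalStateException("offset check failed")

      conn.commit()                                   // results + offsets land atomically
    } catch {
      case e: Exception => conn.rollback(); throw e   // on failure, roll back both
    } finally conn.close()
  }
}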


Page 13: Spark streaming with apache kafka

Streaming ingestion with Kafka – Approach 1: Receiver-based Approach


Page 14: Spark streaming with apache kafka

Streaming ingestion with Kafka – Approach 1: Receiver-based Approach

Pros:
– The WAL design could work with non-Kafka data stores

Cons:
– Duplicates write operations
– Dependent on HDFS
– Must use the idempotent approach for exactly-once
– No access to offsets, so the transactional approach can't be used
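For reference, the receiver-based API from the spark-streaming-kafka 1.x line (host names, group, and topic are placeholders); the WAL is enabled through configuration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverApproach {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("receiver-approach")
      // The WAL (and hence the HDFS dependency) is switched on here.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(5))

    // topic -> number of receiver threads; offsets are tracked in
    // ZooKeeper under the consumer group.
    val stream = KafkaUtils.createStream(
      ssc, "zk1:2181", "my-consumer-group", Map("events" -> 1))

    stream.map(_._2).print()                          // values only
    ssc.start()
    ssc.awaitTermination()
  }
}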


Page 15: Spark streaming with apache kafka

Streaming ingestion with Kafka – Approach 2: Direct Approach (No Receivers)


Page 16: Spark streaming with apache kafka


Streaming ingestion with Kafka – Approach 2: Direct Approach (No Receivers)

Pros:
– Simplified parallelism: one-to-one mapping between Kafka partitions and RDD partitions
– Efficiency: no receiver WAL overhead
– Exactly-once semantics, via:
  – Spark checkpoints
  – Atomic transactions

Page 17: Spark streaming with apache kafka


Streaming ingestion with Kafka – Approach 2: Direct Approach (How to use it)

// Kafka config params
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokers,
  "auto.offset.reset" -> "largest")

// DirectStream method call
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)
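Because there is no receiver, each batch RDD carries the exact Kafka offset ranges it read, which is what makes the transactional approach possible:

import org.apache.spark.streaming.kafka.HasOffsetRanges

messages.foreachRDD { rdd =>
  // The direct stream's RDDs expose the offsets backing this batch.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} -> ${o.untilOffset}")
  }
  // Process the batch here, then persist offsetRanges to your own store
  // if you are not relying on Spark checkpoints.
}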

Page 18: Spark streaming with apache kafka


Streaming ingestion with Kafka – Where to store offsets

Easy – Spark checkpoints:
– No need to access the offsets; they are used automatically on restart
– Must be idempotent; the transactional approach is not possible
– Checkpoints may not be recoverable (e.g. after upgrading the application code)

Complex – your own data store:
– Must access the offsets, save them, and provide them on restart
– Idempotent or transactional
– Offsets are just as recoverable as your results
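For the "your own data store" option, the direct API accepts explicit starting offsets on restart; loadOffsets() below is a hypothetical helper that reads your store.

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical: read the last committed offset per partition from your store.
def loadOffsets(): Map[TopicAndPartition, Long] = ???

val fromOffsets = loadOffsets()
val restarted = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))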

Page 19: Spark streaming with apache kafka

Our Scale

– 18B+ ad impressions served daily
– 10T bids processed monthly
– 22TB data processed daily
– 5PB data under management
– 6 data centers across geographies

Page 20: Spark streaming with apache kafka

Thank You
