Spark Streaming with Apache Kafka


Every ad. Every sales channel. Every screen. One platform.

Spark Streaming with Apache Kafka

Vikas Gite

Principal Software Engineer, Big Data Analytics - PubMatic


Agenda

Spark Streaming 101
– What is an RDD
– What is a DStream

Spark Streaming architecture
Introduction to Kafka
Streaming ingestion with Kafka


Spark Streaming 101

RDD
– Immutable
– Partitioned
– Fault tolerant
– Lazily evaluated
– Can be persisted

[Figure: RDD lineage graph – First RDD → Filter → Second RDD → Map → Third RDD]
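A minimal sketch of this lineage in Scala (the variable names and the local master setting are illustrative, not from the deck):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local[*]"))

val firstRDD  = sc.parallelize(1 to 100)      // First RDD
val secondRDD = firstRDD.filter(_ % 2 == 0)   // Filter -> Second RDD
val thirdRDD  = secondRDD.map(_ * 10)         // Map -> Third RDD

// Nothing has run yet (lazy evaluation): Spark has only recorded the
// lineage graph, which is also what lets lost partitions be recomputed
// (fault tolerance). An action triggers the actual computation.
println(thirdRDD.toDebugString)               // prints the lineage graph
thirdRDD.collect()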

Spark Streaming 101


DStream
– Continuous sequence of RDDs
– Designed for stream processing
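A minimal DStream sketch, reusing the SparkContext sc from the sketch above and assuming a text source on localhost:9999 (for example, nc -lk 9999); the Seconds(5) batch interval is also what drives the micro-batching on the next slide:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Every 5-second micro-batch of input becomes one RDD in the DStream.
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()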

Spark Streaming architecture: Micro-batching


Spark Streaming architecture: Dynamic load balancing


Note (Ashish Tadose): Stream processing is always costlier than batch processing, so make sure you bring only the required fields onto the stream processing platform.

Note (Ashish Tadose): Data ingestion should provide the capability to fork a smaller data stream (only the required fields) from the existing data flow and pass that stream on to a different destination (a message buffer/queue).

Spark Streaming architecture: Failure and recovery


Introduction to Kafka

– Kafka is a message queue (circular buffer)
– Retention is based on disk space or time
– Oldest messages are deleted to maintain size
– Split into topics and partitions
– Indexed only by offset
– Delivery semantics are your responsibility


High-level consumer
– Offsets are stored in ZooKeeper
– Offsets are stored per consumer group

Low-level consumer
– Offsets can be stored in any store
– Must handle broker leader changes


At-most-once
Save offsets !!! Possible failure !!! Save results

On failure, restart at the saved offset; messages are lost.
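A sketch of that ordering, assuming the direct Kafka stream (messages) created later in the deck and hypothetical saveOffsets/saveResults helpers against your own store:

import org.apache.spark.streaming.kafka.HasOffsetRanges

messages.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  saveOffsets(offsets)          // 1. offsets are committed first
  // !!! a crash here loses this batch: the offsets have already advanced !!!
  saveResults(rdd.collect())    // 2. results are saved second
}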

At-least-once
Save results !!! Possible failure !!! Save offsets

On failure, messages are repeated.
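The same sketch with the two steps reversed (same hypothetical helpers):

messages.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  saveResults(rdd.collect())    // 1. results are saved first
  // !!! a crash here replays this batch: the offsets never advanced !!!
  saveOffsets(offsets)          // 2. offsets are committed second
}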


Idempotent exactly-once
Save result with natural unique key !!! Possible failure !!! Save offset

The operation is safe to repeat.

Pros:
– Simple
– Works well with map transformations

Cons:
– Hard for aggregate transformations
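A sketch, assuming a hypothetical upsertByKey helper that writes each record under its natural unique key, so a replayed batch overwrites rows instead of duplicating them:

messages.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach { case (key, value) =>
      // Idempotent write: repeating the batch rewrites the same rows,
      // so the failure window between results and offsets is harmless.
      upsertByKey(key, value)
    }
  }
  saveOffsets(rdd.asInstanceOf[HasOffsetRanges].offsetRanges)
}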


Transactional exactly-once
– Begin transaction
– Save results
– Save offsets
– Ensure offsets are OK
– Commit transaction

On failure, roll back results and offsets.

Pros:
– Works for any transformation

Cons:
– More complex
– Requires a transactional data store
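A sketch against an assumed JDBC-style transactional store; jdbcUrl, saveResults, and saveOffsets are hypothetical, but the single commit is what makes results and offsets visible atomically:

import java.sql.DriverManager

messages.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = rdd.collect()
  val conn = DriverManager.getConnection(jdbcUrl)  // assumed transactional store
  try {
    conn.setAutoCommit(false)          // begin transaction
    saveResults(conn, results)         // save results
    saveOffsets(conn, offsets)         // save offsets, ensure they are OK
    conn.commit()                      // commit: both become visible together
  } catch {
    case e: Exception =>
      conn.rollback()                  // on failure, roll back results and offsets
      throw e
  } finally {
    conn.close()
  }
}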


Streaming ingestion with Kafka – Approach 1: Receiver-based Approach


Streaming ingestion with Kafka – Approach 1: Receiver-based Approach

Pros:
– The WAL design could work with a non-Kafka data store

Cons:
– Duplication of write operations
– Dependent on HDFS
– Must use the idempotent approach for exactly-once
– No access to offsets, so the transactional approach can't be used
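For comparison with the direct approach that follows, a minimal sketch of the receiver-based API (the ZooKeeper quorum, consumer group, and thread count are illustrative):

import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based stream: offsets are tracked in ZooKeeper and received
// data is written to the WAL before processing.
val stream = KafkaUtils.createStream(
  ssc,
  "zk1:2181,zk2:2181",       // ZooKeeper quorum
  "my-consumer-group",       // consumer group ID
  Map("my-topic" -> 2))      // topics and receiver thread counts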


Streaming ingestion with Kafka – Approach 2: Direct Approach (No Receivers)


Streaming ingestion with Kafka – Approach 2: Direct Approach (No Receivers)

Pros:
– Simplified parallelism: one-to-one mapping between Kafka partitions and RDD partitions
– Efficiency: reduces the WAL overhead
– Exactly-once semantics: via Spark checkpoints or atomic transactions


Streaming ingestion with Kafka – Approach 2: Direct Approach (How to use it)

// Kafka config params
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokers,
  "auto.offset.reset" -> "largest")

// DirectStream method call
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)
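The direct stream also exposes each batch's Kafka offsets through HasOffsetRanges, which is what makes the idempotent and transactional patterns above possible:

import org.apache.spark.streaming.kafka.HasOffsetRanges

messages.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}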


Streaming ingestion with Kafka – Where to store offsets

Easy – Spark checkpoints:
– No need to access the offsets; they are used automatically on restart
– Must be idempotent, not transactional
– Checkpoints may not be recoverable

Complex – Your own data store:
– Must access offsets, save them, and provide them on restart
– Idempotent or transactional
– Offsets are just as recoverable as your results
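A sketch of the "easy" checkpoint route (the checkpoint path is illustrative, and conf is your SparkConf); on restart, getOrCreate recovers the context, including offsets, from the checkpoint instead of calling the setup function:

import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint("hdfs:///checkpoints/my-app")  // assumed checkpoint directory
  // ... create the direct stream and transformations here ...
  ssc
}

// Rebuilds from the checkpoint if one exists, else calls createContext().
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/my-app", createContext _)
ssc.start()
ssc.awaitTermination()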

Our Scale
– 18B+ ad impressions served daily
– 10T bids processed monthly
– 22TB data processed daily
– 5PB data under management
– 6 data centers across geographies

Thank You

