Webinar: How to Achieve High Throughput for Real-Time Applications with SMACK, Apache Kafka and Spark Streaming


TRANSCRIPT

Page 1

High Throughput for Real-Time Applications with SMACK, Kafka and Spark Streaming
Ryan Knight – Solution Engineer, DataStax
@knight_cloud

Page 2

• Spark
• Mesos
• Akka
• Cassandra
• Kafka

Page 3

[Diagram: the SMACK components (Kafka, Spark, Akka, Cassandra) arranged as Organize / Process / Store, with multiple Kafka, Spark, Akka and Cassandra nodes scaled out on Mesos]

Page 4

Data Pipelines

Page 5

Move from Proactive to Predictive Analytics
• Real-time analytics of streaming data
• Common use cases: fraud detection, login analysis, web traffic analysis, marketing data
• High-quality data pipeline = high-quality data science
• It is difficult to deal with the scale and volume of data flowing through enterprises today

Page 6

Spark Streaming – Predictive Analytics at Scale

• Kafka + Spark Streaming – ideal tools for handling massive volumes of data

• Built to scale – easy to parallelize and distribute

• Resilient and fault tolerant – ensures data is not lost


Page 7

How do we Scale for Load and Traffic?


Page 8

Spark Streaming Micro Batches


Page 9

5 Keys to Scaling Spark Streaming w/ Kafka

1 Use Event Sourcing / Append Only Data Model

2 Avoid Network Shuffles

3 Tune Spark Streaming Processing Time

4 Use Kafka Direct API

5 Size Spark Streaming Batch Sizes

Page 10

5 Keys to Scaling Spark Streaming w/ Kafka – Key 1: Use Event Sourcing / Append Only Data Model

Page 11

Data Modeling using Event Sourcing
• Append-Only Logging
• Database of Facts
• Snapshots or Roll-Ups
• Why delete data any more?
• Replay Events
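To make the append-only idea concrete, here is a minimal sketch using the Spark Cassandra Connector's saveToCassandra. The movie_db.rating_by_movie table is borrowed from the demo later in this deck; the RatingEvent case class and its fields are hypothetical stand-ins for a real event type.

// Hedged sketch: event-sourced writes are inserts only, never updates.
// RatingEvent and its fields are illustrative, not from the deck.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

case class RatingEvent(movieId: Int, userId: Int, rating: Double, eventTime: Long)

object AppendOnlyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("append-only-ratings"))

    val events = sc.parallelize(Seq(
      RatingEvent(1, 100, 4.5, System.currentTimeMillis()),
      RatingEvent(1, 101, 3.0, System.currentTimeMillis())))

    // A corrected rating is just a newer event with a later eventTime;
    // old facts stay in place, so history can be replayed or rolled up.
    events.saveToCassandra("movie_db", "rating_by_movie")
  }
}

Under this model a "delete" or "update" is itself a new event, which is exactly what makes snapshots, roll-ups and replay possible downstream.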

Page 12

5 Keys to Scaling Spark Streaming w/ Kafka – Key 2: Avoid Network Shuffles

Page 13

Avoid Network Shuffles
• Common use case: joining streaming data with lookup tables
• Broadcast Joins
• joinWithCassandraTable
• Use Data Frames to leverage the Catalyst optimizer
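As a rough sketch of one of these techniques: joinWithCassandraTable (from the Spark Cassandra Connector) fetches only the Cassandra rows that match each key in the stream, so the join happens locally on each partition instead of shuffling data across the network. The movie_db.movies lookup table and the MovieKey case class are assumed here for illustration.

// Hedged sketch: shuffle-free lookup join against Cassandra.
// movie_db.movies and MovieKey are hypothetical.
import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

case class MovieKey(movie_id: Int)

def enrichWithMovies(movieIds: RDD[MovieKey]) =
  // Each partition queries Cassandra directly for its own keys,
  // so the streamed data never has to be shuffled across the network.
  movieIds.joinWithCassandraTable("movie_db", "movies")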

Page 14

5 Keys to Scaling Spark Streaming w/ Kafka – Key 3: Tune Spark Streaming Processing Time

Page 15

Tuning Spark Streaming
• Keep Processing Time < Batch Duration
• Otherwise the Total Delay grows unbounded, which ends in Out Of Memory errors

Page 16

Batch Interval Gone Wrong


• Scheduling Delay of 41 Minutes!

Page 17

Setting the Right Batch Interval

100 Kafka/Spark partitions, maxRatePerPartition = 100k, batchInterval = 5s

• Processing Time is consistently below our Batch Interval

• A good approach is to test with a conservative batch interval (e.g. 5-10 seconds) and a low data rate

• If the Total Delay stays consistently under the Batch Interval, the system is stable
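A minimal sketch of how these two knobs are set, assuming the Spark 1.x streaming API this deck demonstrates; the application name and values are illustrative.

// Hedged sketch: cap per-partition ingest rate and pick a batch interval.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ratings-stream")
  // At most 100k messages per second per Kafka partition,
  // so a burst in Kafka cannot produce an unboundedly large batch.
  .set("spark.streaming.kafka.maxRatePerPartition", "100000")

// 5 second batches: processing time must stay consistently below this.
val ssc = new StreamingContext(conf, Seconds(5))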

Page 18

5 Keys to Scaling Spark Streaming w/ Kafka – Key 4: Use Kafka Direct API

Page 19

Kafka High-level Review

Anatomy of a Topic
[Diagram: writes are appended to a topic's partitions (Partition 0, Partition 1, Partition 2); each partition is an ordered sequence of offsets (0, 1, 2, ...), with old data at the low offsets and new data appended at the end]
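To make the diagram concrete, a small sketch with the standard Kafka producer client: records that share a key are hashed to the same partition, and each append receives the next offset in that partition. The topic name and key are illustrative.

// Hedged sketch: keyed appends preserve per-key ordering within a partition.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Both records hash to the same partition because they share key "movie-1",
// so consumers read them back in the order they were appended.
producer.send(new ProducerRecord[String, String]("ratings", "movie-1", "4.5"))
producer.send(new ProducerRecord[String, String]("ratings", "movie-1", "3.0"))
producer.close()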

Page 20

Advantage of Kafka Direct API


Page 21

Advantages of the Kafka Direct API
• Number of partitions per Kafka Topic = degree of parallelism
• Simplifies parallelism
• Efficiency – a single copy of data on read
• Easier to work with
• Resiliency without copying data

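A minimal sketch of the direct stream setup, assuming the spark-streaming-kafka connector for Kafka 0.8 that was current with this deck; the broker address and topic are illustrative. The key property is that the resulting stream has exactly one Spark partition per Kafka partition.

// Hedged sketch: Kafka Direct API (Spark 1.x, Kafka 0.8 connector).
// No receivers and no second copy of the data: each batch reads its own
// offset range straight from the brokers, one RDD partition per Kafka partition.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream"), Seconds(5))

val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
val topics = Set("ratings")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)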

Page 22

5 Keys to Scaling Spark Streaming w/ Kafka – Key 5: Size Spark Streaming Batch Sizes

Page 23

Reduce Processing Time by Increasing Parallelism


1 Kafka/Spark partition vs. 100 Kafka/Spark partitions (maxRatePerPartition = 100k, batchInterval = 5s in both cases)

Page 24

Sizing Data Pipeline

• Look at the data flow for the entire pipeline
• Benchmarking is key!
• Calculate the number of messages a single Spark Streaming server can handle
• Calculate the number of messages flowing into Kafka

Page 25

Sizing Spark Streaming
• Number of CPU cores is the max number of parallel tasks
• An RDD (Spark data type) is internally divided into partitions based on data set size
• A data transformation on one partition is a task
• Each CPU core can process one task
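As a hedged illustration of that core/partition relationship: each partition becomes one task, so an RDD with fewer partitions than available cores leaves cores idle. The factor of 2 below is a common rule of thumb, not a number from the deck.

// Hedged sketch: size partitions to the cores actually available.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("sizing"))
val raw = sc.textFile("ratings.csv")

println(s"partitions = ${raw.partitions.length}, cores (default parallelism) = ${sc.defaultParallelism}")

// One task per partition per stage; repartition so every core has work.
val sized = raw.repartition(sc.defaultParallelism * 2)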

Page 26

Monitoring Kafka's JVM with JConsole


Page 27

Formula for Sizing Spark Streaming

Total Servers = (# of Kafka Messages) / (# of Messages a Streaming Server can Process)

Example: 100K / 20K = minimum of 5 servers

Page 28

Example Architecture

Page 29

Spark at Scale

[Architecture diagram: a Web Service and Legacy Systems feeding into the DataStax Enterprise Platform]

https://github.com/retroryan/sparkatscale

Page 30

Akka Feeder – Simulates Messages

// Schedule a recurring tick that tells this actor to emit the next line
val feederTick = context.system.scheduler.schedule(
  Duration.Zero, tickInterval, self, SendNextLine)
...
case SendNextLine =>
  // Wrap the next rating in a Kafka record and send it asynchronously
  val record = new ProducerRecord[String, String](
    feederExtension.kafkaTopic, key, nxtRating.toString)
  val future = feederExtension.producer.send(record, new Callback {
    ...
  })

Page 31

Spark Streaming – Reading the Messages

val rawRatingsStream = KafkaUtils.createDirectStream ...
...
ratingsStream.foreachRDD { (message: RDD[Rating], batchTime: Time) =>
  // convert each RDD from the batch into a Ratings DataFrame
  val ratingDF = message.toDF()

  // save the DataFrame to Cassandra
  // Note: Cassandra has been initialized through dse spark-submit,
  // so we don't have to explicitly set the connection
  ratingDF.write.format("org.apache.spark.sql.cassandra")
    .mode(SaveMode.Append)
    .options(Map("keyspace" -> "movie_db", "table" -> "rating_by_movie"))
    .save()
}
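Note that the write uses SaveMode.Append, which lines up with Key 1: each micro-batch only appends new rating facts to rating_by_movie rather than updating rows in place.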

Page 32

Coming Soon!
• June 1: Building Data Pipelines with SMACK: Storage Strategy using Cassandra and DSE
• July 6: Building Data Pipelines with SMACK: Analyzing Data with Spark
• For the latest schedule of webinars, check out our Webinars page: http://www.datastax.com/resources/webinars

Page 33

Get your SMACK on!

Thank you!

Follow me on Twitter: @knight_cloud
