cassandra + spark (you’ve got the lighter, let’s start a fire)

You’ve got the lighter, let’s start a fire:Spark and CassandraRobert StuppSolutions Architect @ DataStax, Committer to Apache Cassandra@snazy

A journey throughDistributed Computing

KillrPowr Application

Show current power consumption in Germany

in real time

• Power consumption of all cities on a map

• Graph of data for one city

One word…

Apache CassandraDistributed & resilient database

Cassandra Use-Cases

• Time Series• Personalization• (or a mix of both)

Cassandra?

• Joining data … hmm• Analyzing the whole data set … hmm• Ad-Hoc Queries (SQL) … hmm

Apache SparkDistributed & resilient analytics

“Spark lets you run programs up to 100x faster in memory,or 10x faster on disk, than Hadoop.”

Spark : Simplify the…

challenging and compute-intensive task of processinghigh volumes of real-time or archived data,

both structured and unstructured,

seamlessly integrating relevantcomplex capabilities

such as machine learning andgraph algorithms.

Source: http://www.toptal.com/spark/introduction-to-apache-spark

Spark Use Cases

• Event detection• Behavior / anomalies detection• Fraud & intrusion detection• Targeted advertising• E-commerce transaction processing• Recommendations

Spark Use Cases

• Data migration• Ad hoc queries• Business Intelligence

Spark Architecture

Spark RDDs

Resilient Distributed Datasets

• Fault-tolerant collection of elements• Can be operated on in parallel• Immutable• In-Memory (as much as possible)

Spark RDD Partitioning

Spark RDD Resiliency

Spark Transformations

Create a new data set from an existing one

Spark Actions

Return a value after running a computation

Spark DAG

Spark Example (word count)

Given is a text file with movies and their categories.We want to know the occurences of each genre.

Pirates of the Carribbean,Adventure,Action,FantasyStar Trek I,SciFi,Adventure,ActionDie Hard,Action,Thriller

Why Scala

Scala is a functional language

Analytics is a functional “problem“

So – use a functional language

sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)

sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)Anonymous Scala Function

Flat the collection

sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)Pair RDD

Values of TWO Pair RDDs

Cassandra-Spark-Driver

Spark-Cassandra-Connector

• Scala, Java• Open Source (ASF2)• Developed by DataStax and community guys

https://github.com/datastax/spark-cassandra-connector

Spark Batch Processing

Other REALLY useful stuff like

• Re-partition data & processing• Allows efficient use of Cassandra

SQL and Data Frames

Spark SQL

csc.sql(" SELECT COUNT(*) AS total " +" FROM killr_video.movies_by_actor " +" WHERE actor = 'Johnny Depp' ")

+-----+|total|+-----+| 54|+-----+

Spark Data Frames

val movies = csc.read.format("org.apache.spark.sql.cassandra").options(Map( "keyspace" -> "killr_video",

"table" -> "movies_by_actor" )).load

movies.filter("actor = 'Johnny Depp'").agg(Map("*" -> "count")).withColumnRenamed("COUNT(1)", "total").show

Same as:

SELECT COUNT(*) AS totalFROM killr_video.movies_by_actorWHERE actor = 'Johnny Depp'

Spark Data Frames

Data frame is RDD that has named columns

• Has schema• Has rows• Has columns• Has data types

Spark Straming

Spark Streaming

Continuous Micro Batching

• Collect data of time slides• Compute on that data

Spark Streaming

Spark Streaming Architecture

Spark Streaming

Spark Streaming Windows

Slide Slide Slide Slide Slide Slide Slide Slide Slide Slide Slide

00:00:00 00:00:01 00:00:02 00:00:03 00:00:04 00:00:05 00:00:06 00:00:07 00:00:08 00:00:09 00:00:10

Window

Process data for this slide –i.e. 1 second in this example

Time window ~=Grouping data by time

Apache KafkaDistributed & resilient messaging

Publish-subscribe messagingrethought as a distributed commit log

Kafka ClusterProducer

Producer

Consumer

Kafka Node

Kafka Use-Cases

• Messaging• Website activity tracking• Metrics• Log aggregation• Event sourcing• Stream processing

Common Characteristics

Kafka, Spark & Cassandra are

• Scalable• Distributed• Partitioned• Replicated

KillrPowr

Show current power consumption in Germanyin real time

• Power consumption of all cities on a map• Graph of data for a selected city

KillrPowr – Data modeling 101

1. Collect UI Requirements2. Model data (flow) per web UI request3. Build Spark Streaming code to generate that data4. Connect sensors and Spark Streaming using Kafka

KillrPowr – Tech Stack

Power Sensor

Kafka Spark Streaming

Cassandra

Web Site

Queue dataAggregate data

and savefor web UI

KillrPowr – Tech Stack

Techs used in the demo

• Spray (HTTP for Akka Actors)• AngularJS• Kafka• Spark Streaming• Cassandra

• Push down predicates to Cassandra

• Partition your data “right”• Think distributed

Cassandra + Spark in DSE

DSE integratesSpark + Cassandra

https://academy.datastax.com/courses/getting-started-apache-spark

KillrPowr Live Demo

You’ve got the lighter, let’s start a fire:Spark and CassandraRobert StuppSolutions Architect @ DataStax, Committer to Apache Cassandra@snazy

Thank You

BACKUP SLIDES

Confidential 56

BACKUP - DDL

BACKUP – Spark Streaming

cassandra + spark (you’ve got the lighter, let’s start a fire)

Software

cassandra summit 2014: performance tuning cassandra in aws

cassandra community webinar: apache cassandra internals

cassandra community webinar | cassandra 2.0 - better,...

cassandra summit 2014: cassandra at instagram 2014

apache cassandra and spark. you got the the lighter, let's...

solr & cassandra: searching cassandra with datastax...

a guide to stress testing kafka, spark and cassandra … ·...

running cassandra on amazon’s ecs -...

apache cassandra at target - cassandra summit 2014

apache cassandra in bangalore - cassandra internals and...

cassandra day atlanta 2015: troubleshooting with apache...

cassandra summit eu 2014 - testing cassandra applications

analytics with cassandra, spark & mllib - cassandra...

storage on ec2 (& cassandra), cassandra workshop, berlin...

cassandra summit 2014: cassandra compute cloud: an elastic...

cabs, cassandra, and hailo (at cassandra eu)

paris cassandra meetup - cassandra for developers

apache cassandra in action - o'reilly...

cassandra community webinar - august 22 2013 - cassandra...

cassandra sf 2015 - repeatable, scalable, reliable,...