cassandra + spark (you’ve got the lighter, let’s start a fire)

Post on 14-Feb-2017

369 Views

Category:

Software

4 Downloads

Preview:

Click to see full reader

TRANSCRIPT

You’ve got the lighter, let’s start a fire:Spark and CassandraRobert StuppSolutions Architect @ DataStax, Committer to Apache Cassandra@snazy

A journey throughDistributed Computing

© 2016 DataStax, All Rights Reserved. 2

KillrPowr Application

© 2016 DataStax, All Rights Reserved. 3

Show current power consumption in Germany

in real time

• Power consumption of all cities on a map

• Graph of data for one city

One word…

© 2016 DataStax, All Rights Reserved. 4

Apache CassandraDistributed & resilient database

Cassandra Use-Cases

© 2016 DataStax, All Rights Reserved. 6

• Time Series• Personalization• (or a mix of both)

Cassandra?

© 2016 DataStax, All Rights Reserved. 7

• Joining data … hmm• Analyzing the whole data set … hmm• Ad-Hoc Queries (SQL) … hmm

Apache SparkDistributed & resilient analytics

“Spark lets you run programs up to 100x faster in memory,or 10x faster on disk, than Hadoop.”

Spark : Simplify the…

© 2016 DataStax, All Rights Reserved. 9

challenging and compute-intensive task of processinghigh volumes of real-time or archived data,

both structured and unstructured,

seamlessly integrating relevantcomplex capabilities

such as machine learning andgraph algorithms.

Source: http://www.toptal.com/spark/introduction-to-apache-spark

Spark Use Cases

© 2016 DataStax, All Rights Reserved. 10

• Event detection• Behavior / anomalies detection• Fraud & intrusion detection• Targeted advertising• E-commerce transaction processing• Recommendations

Spark Use Cases

© 2016 DataStax, All Rights Reserved. 11

• Data migration• Ad hoc queries• Business Intelligence

Spark

© 2016 DataStax, All Rights Reserved. 12

Spark Architecture

© 2016 DataStax, All Rights Reserved. 13

Spark RDDs

© 2016 DataStax, All Rights Reserved. 14

Resilient Distributed Datasets

• Fault-tolerant collection of elements• Can be operated on in parallel• Immutable• In-Memory (as much as possible)

Spark RDD Partitioning

© 2016 DataStax, All Rights Reserved. 15

Spark RDD Resiliency

© 2016 DataStax, All Rights Reserved. 16

Spark Transformations

© 2016 DataStax, All Rights Reserved. 17

Create a new data set from an existing one

Spark Actions

© 2016 DataStax, All Rights Reserved. 18

Return a value after running a computation

Spark DAG

© 2016 DataStax, All Rights Reserved. 19

Spark Example (word count)

© 2016 DataStax, All Rights Reserved. 20

Given is a text file with movies and their categories.We want to know the occurences of each genre.

Pirates of the Carribbean,Adventure,Action,FantasyStar Trek I,SciFi,Adventure,ActionDie Hard,Action,Thriller

Why Scala

© 2016 DataStax, All Rights Reserved. 21

Scala is a functional language

Analytics is a functional “problem“

So – use a functional language

Spark Example (word count)

© 2016 DataStax, All Rights Reserved. 22

sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)

Spark Example (word count)

© 2016 DataStax, All Rights Reserved. 23

sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)Anonymous Scala Function

Spark Example (word count)

© 2016 DataStax, All Rights Reserved. 24

sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)

Flat the collection

Spark Example (word count)

© 2016 DataStax, All Rights Reserved. 25

sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)Pair RDD

Spark Example (word count)

© 2016 DataStax, All Rights Reserved. 26

sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)

Values of TWO Pair RDDs

Spark Example (word count)

© 2016 DataStax, All Rights Reserved. 27

sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)

Spark Example (word count)

© 2016 DataStax, All Rights Reserved. 28

sc.textFile("file:///home/videos.csv").flatMap(line => line.split(",").drop(1)).map(genre => (genre,1)).reduceByKey{case (x,y) => x + y}.collect().foreach(println)

Cassandra-Spark-Driver

Spark-Cassandra-Connector

© 2016 DataStax, All Rights Reserved. 30

• Scala, Java• Open Source (ASF2)• Developed by DataStax and community guys

https://github.com/datastax/spark-cassandra-connector

Spark Batch Processing

© 2016 DataStax, All Rights Reserved. 31

Other REALLY useful stuff like

• Re-partition data & processing• Allows efficient use of Cassandra

SQL and Data Frames

Spark SQL

© 2016 DataStax, All Rights Reserved. 33

csc.sql(" SELECT COUNT(*) AS total " +" FROM killr_video.movies_by_actor " +" WHERE actor = 'Johnny Depp' ")

.show

+-----+|total|+-----+| 54|+-----+

Spark Data Frames

© 2016 DataStax, All Rights Reserved. 34

val movies = csc.read.format("org.apache.spark.sql.cassandra").options(Map( "keyspace" -> "killr_video",

"table" -> "movies_by_actor" )).load

movies.filter("actor = 'Johnny Depp'").agg(Map("*" -> "count")).withColumnRenamed("COUNT(1)", "total").show

Same as:

SELECT COUNT(*) AS totalFROM killr_video.movies_by_actorWHERE actor = 'Johnny Depp'

Spark Data Frames

© 2016 DataStax, All Rights Reserved. 35

Data frame is RDD that has named columns

• Has schema• Has rows• Has columns• Has data types

Spark Straming

Spark Streaming

© 2016 DataStax, All Rights Reserved. 37

Continuous Micro Batching

• Collect data of time slides• Compute on that data

Spark Streaming

© 2016 DataStax, All Rights Reserved. 38

Spark Streaming Architecture

Confid© 2016 DataStax, All Rights Reserved.ential

39

Spark Streaming

© 2016 DataStax, All Rights Reserved. 40

Spark Streaming Windows

© 2016 DataStax, All Rights Reserved. 41

time

Slide Slide Slide Slide Slide Slide Slide Slide Slide Slide Slide

00:00:00 00:00:01 00:00:02 00:00:03 00:00:04 00:00:05 00:00:06 00:00:07 00:00:08 00:00:09 00:00:10

Window

Window

Window

Window

Window

Window

Window

Window

Window

Process data for this slide –i.e. 1 second in this example

Process data for this slide –i.e. 1 second in this example

Process data for this slide –i.e. 1 second in this example

Time window ~=Grouping data by time

Apache KafkaDistributed & resilient messaging

Kafka

© 2016 DataStax, All Rights Reserved. 43

Publish-subscribe messagingrethought as a distributed commit log

Kafka

© 2016 DataStax, All Rights Reserved. 44

Kafka ClusterProducer

Producer

Producer

Consumer

Consumer

Kafka Node

Kafka Node

Kafka Node

Kafka Use-Cases

© 2016 DataStax, All Rights Reserved. 45

• Messaging• Website activity tracking• Metrics• Log aggregation• Event sourcing• Stream processing

Common Characteristics

© 2016 DataStax, All Rights Reserved. 46

Kafka, Spark & Cassandra are

• Scalable• Distributed• Partitioned• Replicated

KillrPowr

© 2016 DataStax, All Rights Reserved. 47

Show current power consumption in Germanyin real time

• Power consumption of all cities on a map• Graph of data for a selected city

KillrPowr – Data modeling 101

© 2016 DataStax, All Rights Reserved. 48

1. Collect UI Requirements2. Model data (flow) per web UI request3. Build Spark Streaming code to generate that data4. Connect sensors and Spark Streaming using Kafka

KillrPowr – Tech Stack

© 2016 DataStax, All Rights Reserved. 49

Power Sensor

Power Sensor

Kafka Spark Streaming

Cassandra

Web Site

Queue dataAggregate data

and savefor web UI

KillrPowr – Tech Stack

© 2016 DataStax, All Rights Reserved. 50

Techs used in the demo

• Spray (HTTP for Akka Actors)• AngularJS• Kafka• Spark Streaming• Cassandra

Tips

© 2016 DataStax, All Rights Reserved. 51

• Push down predicates to Cassandra

• Partition your data “right”• Think distributed

Cassandra + Spark in DSE

© 2016 DataStax, All Rights Reserved. 52

DSE integratesSpark + Cassandra

https://academy.datastax.com/courses/getting-started-apache-spark

KillrPowr Live Demo

© 2016 DataStax, All Rights Reserved. 53

You’ve got the lighter, let’s start a fire:Spark and CassandraRobert StuppSolutions Architect @ DataStax, Committer to Apache Cassandra@snazy

Thank You

© 2016 DataStax, All Rights Reserved. 55

BACKUP SLIDES

Confidential 56

BACKUP - DDL

© 2014 DataStax, All Rights Reserved. Company Confidential 57

BACKUP - DDL

© 2014 DataStax, All Rights Reserved. Company Confidential 58

BACKUP - DDL

© 2014 DataStax, All Rights Reserved. Company Confidential 59

BACKUP - DDL

© 2014 DataStax, All Rights Reserved. Company Confidential 60

BACKUP – Spark Streaming

© 2014 DataStax, All Rights Reserved. Company Confidential 61

BACKUP – Spark Streaming

© 2014 DataStax, All Rights Reserved. Company Confidential 62

BACKUP – Spark Streaming

© 2014 DataStax, All Rights Reserved. Company Confidential 63

BACKUP – Spark Streaming

© 2014 DataStax, All Rights Reserved. Company Confidential 64

BACKUP – Spark Streaming

© 2014 DataStax, All Rights Reserved. Company Confidential 65

BACKUP – Spark Streaming

© 2014 DataStax, All Rights Reserved. Company Confidential 66

© 2014 DataStax, All Rights Reserved. Company Confidential 67

top related