Real-Time Analytics with Apache Cassandra and Apache Spark



Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz

Guido Schmutz

• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Author of several books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: guido.schmutz@trivadis.com
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz

Agenda

1. Introduction
2. Apache Spark
3. Apache Cassandra
4. Combining Spark & Cassandra
5. Summary

Big Data Definition (4 Vs)

+ Time to action? – Big Data + Real-Time = Stream Processing

Characteristics of Big Data: its Volume, Velocity and Variety in combination

What is Real-Time Analytics?

What is it? Why do we need it?

How does it work?
• Collect real-time data
• Process data as it flows in
• Data in Motion over Data at Rest
• Reports and dashboards access the processed data

[Timeline: Events → Analyze → Respond]

Short time to analyze & respond:
• Required for new business models
• Desired for competitive advantage

Real-Time Analytics Use Cases

• Algorithmic Trading

• Online Fraud Detection

• Geo Fencing

• Proximity/Location Tracking

• Intrusion detection systems

• Traffic Management

• Recommendations

• Churn detection

• Internet of Things (IoT) / Intelligent Sensors

• Social Media/Data Analytics

• Gaming Data Feed

• …

Apache Spark

Motivation – Why Apache Spark?

Hadoop MapReduce: Data Sharing on Disk

Spark: Speed up processing by using Memory instead of Disks

[Diagram: Hadoop MapReduce reads from and writes back to HDFS between every map/reduce step; Spark chains operations (op1, op2, …) in memory from input to output.]
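To make the difference concrete, here is a minimal Scala sketch (the input file name and app name are made up for illustration): an intermediate RDD is cached in memory and reused by two actions, instead of being re-read from disk each time.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InMemorySharing {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("in-memory-sharing").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // "events.log" is a placeholder input path
    val errors = sc.textFile("events.log")
      .filter(_.contains("ERROR"))
      .cache()                              // keep the filtered data in memory

    // Both actions reuse the cached RDD -- no second pass over the input file
    println(s"error count:   ${errors.count()}")
    println(s"distinct msgs: ${errors.map(_.split(":").last).distinct().count()}")

    sc.stop()
  }
}
```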

Apache Spark

Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed in 2009 in UC Berkeley's AMPLab
• Based on the 2007 Microsoft Dryad paper
• Written in Scala, supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
• Open sourced in 2010 – part of the Apache Software Foundation since 2014

Apache Spark

Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib / SparkR (Machine Learning), GraphX (Graph Processing)

Core Runtime: Spark Core API and Execution Model

Cluster Resource Managers: Spark Standalone, Mesos, YARN

Data Stores: HDFS, Elasticsearch, NoSQL, S3

Resilient Distributed Dataset (RDD)

RDDs are:
• Immutable
• Re-computable
• Fault tolerant
• Reusable

RDDs have transformations:
• Produce a new RDD
• Rich set of transformations available
• filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ...

RDDs have actions:
• Start cluster computing operations
• Rich set of actions available
• collect(), count(), fold(), reduce(), …

Input sources: file, database, stream, collection

[Diagram: an input source is turned into an RDD; transformations produce new RDDs, and an action such as .count() returns a result (e.g. 100).]
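As a small illustration of the lazy/eager split (assuming an existing SparkContext named sc): the transformations below only describe new RDDs, and nothing runs until the action is called.

```scala
val words = sc.parallelize(Seq("spark", "cassandra", "kafka", "spark"))

val upper    = words.map(_.toUpperCase)     // transformation: nothing runs yet
val distinct = upper.distinct()             // transformation: still nothing runs

val n = distinct.count()                    // action: the job is executed now
println(s"distinct words: $n")              // -> 3
```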

Partitions

[Diagram, repeated over three slides: the data of an RDD is split into partitions (Partition 0–9) that are distributed across the servers of the cluster (Server 1–5); when a server drops out, only the remaining servers (Server 2–5) hold the partitions.]

Spark Workflow

[Diagram: an HDFS input file is read with sc.hadoopFile() into a HadoopRDD; flatMap() and map() produce MappedRDDs, reduceByKey() produces a ShuffledRDD, and saveAsTextFile() writes the text-file output. The transformations are lazy; the action executes them. On the Master, the DAGScheduler splits the lineage over the partitions (P0, P1, P3) into stages: Stage 1 – flatMap() + map(), Stage 2 – reduceByKey().]
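The workflow above is essentially word count. A minimal Scala sketch of it, assuming an existing SparkContext sc (the HDFS paths are placeholders):

```scala
// flatMap(), map() and reduceByKey() are lazy transformations; saveAsTextFile()
// is the action that makes the DAGScheduler build and run the stages
// (flatMap/map in one stage, reduceByKey after the shuffle).
val counts = sc.textFile("hdfs:///input/words.txt")
  .flatMap(_.split("\\s+"))                 // split lines into words
  .map(word => (word, 1))                   // pair each word with a count of 1
  .reduceByKey(_ + _)                       // shuffle + sum counts per word

counts.saveAsTextFile("hdfs:///output/word-counts")   // action: triggers execution
```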

Spark Workflow (two inputs)

[Diagram: HDFS input file 1 is read with SparkContext.hadoopFile() into a HadoopRDD, then filter() and map() are applied; HDFS input file 2 is read with SparkContext.hadoopFile() and map() is applied; join() combines the two lineages into a ShuffledRDD, and saveAsHadoopFile() writes the HDFS output. Again, the transformations are lazy and the action executes them.]

Spark Execution Model

[Diagram: on each server, a Worker process starts one or more Executors next to the data storage; the Master coordinates the Workers.]

Spark Execution Model – Stage 1: narrow transformations

[Diagram: narrow transformations such as filter(), map(), sample() and flatMap() work on the partitions of the RDD (P0, P1, P3) locally on each Worker/Executor next to its data storage, without moving data between nodes; the Master only coordinates.]

Spark Execution Model – Stage 2: wide transformations

[Diagram: wide transformations such as join(), reduceByKey(), union() and groupByKey() need data from multiple partitions and therefore trigger a shuffle across the Workers/Executors; the Master coordinates the stages.]
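One way to see the stage boundary in practice (assuming an existing SparkContext sc) is the lineage printed by toDebugString: narrow transformations stay in one stage, the wide reduceByKey() starts a new one.

```scala
val pairs = sc.parallelize(Seq("a b", "a c", "b c"))
  .flatMap(_.split(" "))        // narrow: stays in the same stage
  .map((_, 1))                  // narrow: stays in the same stage
  .reduceByKey(_ + _)           // wide: introduces a shuffle -> new stage

println(pairs.toDebugString)    // shows the ShuffledRDD on top of the mapped lineage
```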

Batch vs. Real-Time Processing

Batch processing works on petabytes of data; real-time processing handles gigabytes per second.

Various Input Sources

Apache Kafka

• A distributed publish-subscribe messaging system
• Designed for processing of real-time activity stream data (logs, metrics collection, social media streams, …)
• Initially developed at LinkedIn, now part of Apache
• Does not use the JMS API and standards
• Kafka maintains feeds of messages in topics

[Diagram: multiple Producers write into a Kafka Cluster; multiple Consumers read from it.]
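As a rough sketch of the producer side, using the plain Kafka Java client from Scala (topic name, key and broker address are made up for illustration); messages with the same key always land in the same partition:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TemperatureProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // key = weather station id, value = temperature reading
    producer.send(new ProducerRecord[String, String]("temperature", "station-42", "21.5"))
    producer.close()
  }
}
```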

Apache Kafka

[Diagram: a Weather Station produces messages into the Temperature topic and the Rainfall topic on a Kafka Broker (messages 1–6 each); a Temperature Processor and a Rainfall Processor consume from their respective topics.]

Apache Kafka

[Diagram: the Temperature topic is now split into Partition 0 and Partition 1, each with its own message sequence, so two Temperature Processor instances can consume in parallel; the Rainfall topic keeps a single partition.]

Apache Kafka

[Diagram: with two Kafka Brokers, the partitions of the Temperature topic (P0, P1) and the Rainfall topic (P0) are spread and replicated across the brokers; the Weather Stations keep producing and the Temperature and Rainfall Processors keep consuming against the cluster.]

Discretized Stream (DStream)

[Diagram, built up over four slides: Weather Stations send individual events into Kafka; Spark Streaming discretizes the incoming events by time into micro-batches – each batch of the DStream is an RDD ("DStream = RDD").]

Discretized Stream (DStream)

[Diagram: every X seconds a new batch (RDD) is formed; a transformation such as .map, .join, .reduceByKey() or .countByValue() turns one DStream into another DStream.]

Discretized Stream (DStream)

[Diagram: at each batch interval (time1, time2, time3, …, timeN) the received messages (message 1 … message n) form an RDD (RDD@time1, RDD@time2, …); applying map() with a function f(message) to the event DStream produces a mapped DStream whose RDDs contain the results (result 1 … result n); an output operation such as saveAsHadoopFiles() is the action that triggers the Spark jobs. The DStream lineage of transformations grows along the time axis.
Adapted from Chris Fregly: http://slidesha.re/11PP7FV]

Apache Spark Streaming – Core concepts

Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDDs

Input DStreams
• Represent the stream of raw data received from streaming sources
• Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP sockets, Akka actors, etc.
• Custom sources can easily be written for custom data sources

Operations
• Same as Spark Core, plus additional stateful transformations (window, reduceByWindow)
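Putting the pieces together, a minimal Spark Streaming sketch that consumes a hypothetical temperature topic from Kafka (assumes the spark-streaming-kafka 0.8 artifact and a local ZooKeeper/Kafka setup; topic, group id and addresses are made up):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object TemperatureStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("temperature-streaming").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))      // 10-second micro-batches

    // DStream of (stationId, reading) pairs from the "temperature" topic
    val readings = KafkaUtils.createStream(ssc, "localhost:2181", "spark-group",
                                           Map("temperature" -> 1))

    // Count readings per station in every 10-second batch
    readings.map { case (stationId, _) => (stationId, 1L) }
            .reduceByKey(_ + _)
            .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```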

Apache Cassandra

Apache Cassandra

Apache Cassandra™ is a free

• Distributed…

• High performance…

• Extremely scalable…

• Fault tolerant (i.e. no single point of failure)…

post-relational database solution

Optimized for high write throughput

Apache Cassandra – History: Bigtable + Dynamo

Motivation - Why NoSQL Databases?

• Dynamo Paper (2007)

• How to build a data store that is

• Reliable

• Performant

• “Always On”

• Nothing new and shiny
• 24 other papers cited

• Evolutionary

Motivation - Why NoSQL Databases?

• Google Big Table (2006)

• Richer data model

• 1 key and lots of values

• Fast sequential access

• 38 other papers cited

Motivation - Why NoSQL Databases?

• Cassandra Paper (2008)

• Distributed features of Dynamo

• Data Model and storage from BigTable

• February 2010: graduated to a top-level Apache project

Apache Cassandra – More than one server

• All nodes participate in a cluster
• Shared nothing
• Add or remove nodes as needed
• More capacity? Add more servers
• A node is the basic unit inside a cluster
• Each node owns a range of partitions – Consistent Hashing

[Diagram: a four-node ring where each node owns one token range – Node 1: [0-25], Node 2: [26-50], Node 3: [51-75], Node 4: [76-100] – and the ranges are also replicated to neighbouring nodes.]
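A deliberately simplified sketch of the idea, not Cassandra's real Murmur3 partitioner: a partition key is hashed into a toy token space of 0–100 and the node owning the matching range stores the row.

```scala
val ranges = Map(
  "Node1" -> (0 to 25),
  "Node2" -> (26 to 50),
  "Node3" -> (51 to 75),
  "Node4" -> (76 to 100)
)

def token(partitionKey: String): Int =
  math.abs(partitionKey.hashCode) % 101          // toy token in [0, 100]

def ownerOf(partitionKey: String): String =
  ranges.collectFirst { case (node, r) if r.contains(token(partitionKey)) => node }.get

println(ownerOf("station-42"))   // the same key always maps to the same node
```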

Apache Cassandra – Fully Replicated

• Client writes locally
• Data syncs across the WAN
• Replication per data center

[Diagram: a client writes to the four nodes of the West data center; the data is replicated across the WAN to the four nodes of the East data center.]

Apache Cassandra

What is Cassandra NOT?

• A Data Ocean
• A Data Lake
• A Data Pond

• An In-Memory Database

• A Key-Value Store

• Not for Data Warehousing

What are good use cases?

• Product Catalog / Playlists

• Personalization (Ads, Recommendations)

• Fraud Detection

• Time Series (Finance, Smart Meter)

• IoT / Sensor Data

• Graph / Network data

How Cassandra stores data

• Model brought from Google Bigtable
• Row Key and a lot of columns
• Column names are sorted (UTF8, Int, Timestamp, etc.)

[Diagram: each Row Key maps to up to 2 billion columns, each column holding a name, a value, a timestamp and a TTL; a table can hold billions of rows.]
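A hedged sketch of this wide-row model in CQL terms, executed through the DataStax Java driver (3.x) from Scala; keyspace, table and address are made up, and the demo keyspace is assumed to exist. The partition key plays the role of the row key, the clustering column becomes the sorted column name:

```scala
import com.datastax.driver.core.Cluster

object CreateTimeSeries {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()

    // assumes the "demo" keyspace already exists
    session.execute(
      """CREATE TABLE IF NOT EXISTS demo.temperature (
        |  station_id text,           -- partition (row) key
        |  reading_time timestamp,    -- clustering column: columns sorted by time
        |  value double,
        |  PRIMARY KEY (station_id, reading_time)
        |)""".stripMargin)

    session.execute(
      "INSERT INTO demo.temperature (station_id, reading_time, value) " +
      "VALUES ('station-42', toTimestamp(now()), 21.5) USING TTL 86400")

    cluster.close()
  }
}
```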

Combining Spark & Cassandra

Spark and Cassandra Architecture – Great Combo

Spark is good at analyzing a huge amount of data; Cassandra is good at storing a huge amount of data.

Spark and Cassandra Architecture

[Diagram, built up over two slides: Spark Streaming (near real-time), Spark SQL (structured data), MLlib (machine learning) and GraphX (graph analysis) sit on top of Spark; the Spark Connector links Spark to Cassandra, while Weather Stations feed events into the system.]

Spark and Cassandra Architecture

• Single node running Cassandra
• Spark Worker is really small
• Spark Master lives outside the node
• Spark Worker starts Spark Executors in separate JVMs
• Node local

[Diagram: on the server, the Worker starts several Executors next to Cassandra; the Master lives outside.]

Spark and Cassandra Architecture

• Each node runs Spark and Cassandra
• Spark Master can make decisions based on Token Ranges
• Spark likes to work on small partitions of data across a large cluster
• Cassandra likes to spread out data in a large cluster

[Diagram: a Master and four Workers; each Worker owns one token range (0-25, 26-50, 51-75, 76-100) and will only have to analyze 25% of the data!]

Spark and Cassandra Architecture

[Diagram: the Spark Master distributes work according to the token ranges 0-25, 26-50, 51-75 and 76-100, which map onto the four co-located Worker/Cassandra nodes; the cluster serves both the transactional (Cassandra) and the analytical (Spark) workload.]

Cassandra and Spark

                           Cassandra   Cassandra & Spark
Joins and Unions           No          Yes
Transformations            Limited     Yes
Outside Data Integration   No          Yes
Aggregations               Limited     Yes
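A minimal sketch with the DataStax Spark Cassandra connector (assumes the spark-cassandra-connector dependency and the hypothetical demo.temperature table from the sketch above): read a Cassandra table as an RDD, aggregate per station, and write the result back to another hypothetical table.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object AverageTemperature {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("avg-temperature")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read the Cassandra table as an RDD and compute the average per station
    val averages = sc.cassandraTable("demo", "temperature")
      .map(row => (row.getString("station_id"), (row.getDouble("value"), 1)))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }

    // Write the result back to another (hypothetical) table
    averages.saveToCassandra("demo", "avg_temperature",
                             SomeColumns("station_id", "avg_value"))

    sc.stop()
  }
}
```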

Summary

Summary

Kafka
• Topics store information broken into partitions
• Brokers store partitions
• Partitions are replicated for data resilience

Cassandra
• The goals of Apache Cassandra are all about staying online and performant
• Best for applications close to your users
• Partitions are similar data grouped by a partition key

Spark
• Replacement for Hadoop MapReduce
• In memory
• More operations than just Map and Reduce
• Makes data analysis easier
• Spark Streaming can take a variety of sources

Spark + Cassandra
• Cassandra acts as the storage layer for Spark
• Deploy in a mixed cluster configuration
• Spark executors access Cassandra using the DataStax connector

Lambda Architecture with Spark/Cassandra

[Diagram: Data Sources and Channels (social, …) feed Data Collection and Messaging; an (analytical) Batch Data Processing path runs batch compute over the Raw Data (Reservoir), and an (analytical) Real-Time Data Processing path runs stream/event processing; both write Computed Information into Result Stores; a Query Engine and the Data Access layer serve Reports, Services, Analytic Tools and Alerting Tools.]


Guido Schmutz – Technology Manager

guido.schmutz@trivadis.com
