Real-Time Analytics with Apache Cassandra and Apache Spark

Guido Schmutz

• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Author of several books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: [email protected]
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz

Agenda

1. Introduction
2. Apache Spark
3. Apache Cassandra
4. Combining Spark & Cassandra
5. Summary

Big Data Definition (4 Vs)

Characteristics of Big Data: its Volume, Velocity and Variety in combination

+ Time to action? – Big Data + Real-Time = Stream Processing

What is Real-Time Analytics?

What is it? Why do we need it?

How does it work?
• Collect real-time data
• Process data as it flows in
• Data in Motion over Data at Rest
• Reports and dashboards access the processed data

[Figure: timeline from events to analyze and respond]

Short time to analyze & respond
• Required - for new business models
• Desired - for competitive advantage

Real Time Analytics Use Cases

• Algorithmic Trading

• Online Fraud Detection

• Geo Fencing

• Proximity/Location Tracking

• Intrusion detection systems

• Traffic Management

• Recommendations

• Churn detection

• Internet of Things (IoT) / Intelligent Sensors

• Social Media/Data Analytics

• Gaming Data Feed

• …

Apache Spark

Motivation – Why Apache Spark?

Hadoop MapReduce: Data Sharing on Disk

Spark: Speed up processing by using Memory instead of Disks

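[Figure: with Hadoop MapReduce, every map and reduce step between input and output reads from HDFS and writes back to HDFS; with Spark, the operations op1, op2, ... between input and output share data in memory]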

Apache Spark

Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed in 2009 at UC Berkeley’s AMPLab
• Based on the 2007 Microsoft Dryad paper
• Written in Scala, supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• One of the largest OSS communities in big data with over 200 contributors in 50+ organizations
• Open sourced in 2010, a top-level Apache Software Foundation project since 2014

Apache Spark

Libraries
• Spark SQL (Batch Processing)
• BlinkDB (Approximate Querying)
• Spark Streaming (Real-Time)
• MLlib, SparkR (Machine Learning)
• GraphX (Graph Processing)

Core Runtime
• Spark Core API and Execution Model

Cluster Resource Managers
• Spark Standalone, Mesos, YARN

Data Stores
• HDFS, Elasticsearch, NoSQL, S3

Resilient Distributed Dataset (RDD)

Are
• Immutable
• Re-computable
• Fault tolerant
• Reusable

Have Transformations
• Produce a new RDD
• Rich set of transformations available
• filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ...

Have Actions
• Start cluster computing operations
• Rich set of actions available
• collect(), count(), fold(), reduce(), …
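The split between lazy transformations and actions can be illustrated without a cluster. What follows is a minimal pure-Python sketch, not Spark code: `MiniRDD` is a hypothetical class invented for this illustration, in which `map()` and `filter()` only record work and `collect()` or `count()` actually run it.

```python
class MiniRDD:
    """Toy stand-in for an RDD: immutable, lazy, re-computable (illustration only)."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)   # the input data is never modified
        self._steps = tuple(steps)     # recorded transformations (the lineage)

    # Transformations: return a new MiniRDD and compute nothing.
    def map(self, fn):
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._steps + (("filter", pred),))

    # Actions: replay the recorded lineage over the source and return a result.
    def collect(self):
        items = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:  # "filter"
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())


calls = []

def square(x):
    calls.append(x)   # record every invocation to make laziness visible
    return x * x

numbers = MiniRDD(range(1, 11))
squares = numbers.map(square)                        # nothing runs yet
even_squares = squares.filter(lambda x: x % 2 == 0)  # still nothing runs
calls_before_action = len(calls)
result = even_squares.collect()                      # the action runs the pipeline
```

Because every `MiniRDD` keeps its source and its recorded steps, a lost result can always be recomputed from the lineage, which is the idea behind "re-computable" and "fault tolerant" in the list above.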

[Figure: an input source (file, database, stream or collection) is loaded into an RDD, a transformation produces a new RDD, and the action .count() returns 100]

Partitions RDD

[Figure: the data of an RDD is split into ten partitions (Partition 0 to Partition 9), which are distributed across five servers (Server 1 to Server 5); a second version of the figure shows the same ten partitions on Server 2 to Server 5 only]

Spark Workflow

• sc.hadoopFile(): input HDFS file → HadoopRDD
• flatMap(), map(): HadoopRDD → MappedRDD (Stage 1, partitions P0, P1, P3)
• reduceByKey(): MappedRDD → ShuffledRDD (Stage 2)
• saveAsTextFile(): ShuffledRDD → text file output

Transformations are lazy; the action executes the transformations. The DAG Scheduler on the master splits the job into stages.

Spark Workflow

• SparkContext.hadoopFile(): HDFS file input 1 → HadoopRDD; filter() → FilteredRDD; map() → MappedRDD
• SparkContext.hadoopFile(): HDFS file input 2 → HadoopRDD; map() → MappedRDD
• join(): both MappedRDDs → ShuffledRDD
• SparkContext.saveAsHadoopFile(): ShuffledRDD → HDFS file output

Transformations are lazy; the action executes the transformations.

Spark Execution Model

[Figure: a master coordinates several servers; each server holds data storage and a worker, and each worker starts executors]

Stage 1 – flatMap() + map()
• Narrow transformation: filter(), map(), sample(), flatMap()
• The partitions of the RDD (P0, P1, P3) are processed by the executors next to the data storage that holds them

Stage 2 – reduceByKey()
• Wide transformation: join(), reduceByKey(), union(), groupByKey()
• Shuffle! Data moves between the workers
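The difference between the two stages can be sketched in plain Python. This is an illustration under assumptions, not Spark's implementation: the helper names `narrow`, `shuffle_by_key` and `reduce_by_key` are hypothetical, partitions are plain lists, and `zlib.crc32` stands in for whatever partitioner a real cluster would use.

```python
import zlib

def narrow(partitions, fn):
    """Narrow transformation: each output partition depends on one input partition."""
    return [fn(part) for part in partitions]

def shuffle_by_key(partitions, num_out):
    """Wide step: route every (key, value) pair to the partition chosen by its key."""
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            # crc32 is stable across processes, unlike Python's built-in hash() on str
            out[zlib.crc32(key.encode("utf-8")) % num_out].append((key, value))
    return out

def reduce_by_key(partition, fn):
    """Combine all values of the same key inside one partition."""
    acc = {}
    for key, value in partition:
        acc[key] = fn(acc[key], value) if key in acc else value
    return sorted(acc.items())

# Three input partitions, as if stored on three servers.
lines = [["spark and cassandra"], ["spark streaming"], ["cassandra and kafka"]]

# Stage 1: flatMap() + map(), done independently per partition.
words = narrow(lines, lambda part: [w for line in part for w in line.split()])
pairs = narrow(words, lambda part: [(w, 1) for w in part])

# Stage 2: reduceByKey() needs all pairs of a key in one place, hence the shuffle.
shuffled = shuffle_by_key(pairs, num_out=2)
counts = narrow(shuffled, lambda part: reduce_by_key(part, lambda a, b: a + b))

word_counts = dict(kv for part in counts for kv in part)
```

Stage 1 never looks outside its own partition, so it can run where the data lives. Only the shuffle moves pairs between partitions, which is why a stage boundary sits in front of reduceByKey().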

Batch vs. Real-Time Processing

• Batch: petabytes of data
• Real-time: gigabytes per second

Various Input Sources

Apache Kafka

Distributed publish-subscribe messaging system

Designed for processing of real-time activity stream data (logs, metrics collections, social media streams, …)

Initially developed at LinkedIn, now part of Apache

Does not use JMS API and standards

Kafka maintains feeds of messages in topics

[Figure: producers publish to the Kafka cluster, consumers read from it]

Apache Kafka

[Figure: a weather station publishes to a Temperature topic and a Rainfall topic on a Kafka broker; each topic is an ordered sequence of messages (1 to 6), read by a temperature processor and a rainfall processor]

[Figure: the Temperature topic is split into Partition 0 and Partition 1, the Rainfall topic has a single Partition 0]

[Figure: with two Kafka brokers, the partitions (P0, P1) of the Temperature and Rainfall topics are spread across the brokers]
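The topic and partition layout in these figures can be modelled in a few lines. The sketch below is a toy in-memory model, not the Kafka client API: the `Topic` class, the station names and the readings are made up for illustration, and `zlib.crc32` is an assumed stand-in for Kafka's own key hashing.

```python
import zlib

class Topic:
    """Toy model of a Kafka topic: a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Messages with the same key always land in the same partition,
        # so their relative order is preserved.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        offset = len(self.partitions[index]) - 1
        return index, offset

    def read(self, partition, offset):
        """Return all messages of one partition starting at the given offset."""
        return self.partitions[partition][offset:]


temperature = Topic("TemperatureTopic", num_partitions=2)
rainfall = Topic("RainfallTopic", num_partitions=1)

for station, celsius in [("station-1", 21.5), ("station-2", 19.0), ("station-1", 22.1)]:
    temperature.publish(station, celsius)
rainfall.publish("station-1", 0.4)

# Reading from offset 0 replays the partition from the start; a consumer
# would instead remember the last offset it processed and resume there.
partition, _ = temperature.publish("station-1", 22.4)
station_1_readings = [v for k, v in temperature.read(partition, 0) if k == "station-1"]
```

Ordering is only guaranteed inside a partition, which is why the sketch keys messages by station: all readings of one station stay in sequence even when the topic has several partitions.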

Discretized Stream (DStream)

[Figure: weather stations send individual events through Kafka; the stream is made discrete by time, so the events received every X seconds form one RDD of the DStream. Transformations such as .countByValue(), .reduceByKey(), .join and .map turn one DStream into another]

Discretized Stream (DStream)

[Figure: time increases from time 1 to time n. At each batch time, the messages of the input stream (message 1 … message n) form one RDD of the event DStream; map() applies f to every message and yields the RDD of the mapped DStream for that time; saveAsHadoopFiles() writes result 1 … result n. The DStream transformations form the lineage, and actions trigger Spark jobs]

Adapted from Chris Fregly: http://slidesha.re/11PP7FV

Apache Spark Streaming – Core concepts

Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDD

Input DStreams
• Represent the stream of raw data received from streaming sources
• Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP sockets, Akka actors, etc.
• Custom receivers can be written for custom data sources

Operations
• Same as Spark Core, plus additional stateful transformations (window, reduceByWindow)
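Micro-batching and windowing can be sketched in plain Python. This is not the Spark Streaming API: `discretize`, `count_by_value` and `windowed_counts` are hypothetical helpers, the events are invented, and the event timestamp stands in for the arrival time that would decide the batch in a real stream.

```python
from collections import Counter, defaultdict

def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive micro-batches.

    Returns the batches ordered by time; intervals without events are empty lists.
    """
    if not events:
        return []
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // batch_seconds)].append(value)
    first, last = min(buckets), max(buckets)
    return [buckets.get(i, []) for i in range(first, last + 1)]

def count_by_value(batch):
    """Per-batch operation, comparable to countByValue() on a single RDD."""
    return Counter(batch)

def windowed_counts(batches, window_len):
    """Window operation: combine the counts of the last window_len batches."""
    result = []
    for end in range(len(batches)):
        window = batches[max(0, end - window_len + 1): end + 1]
        total = Counter()
        for batch in window:
            total.update(batch)
        result.append(total)
    return result

# (timestamp in seconds, station id); values are made up for illustration.
events = [(0.2, "A"), (0.9, "B"), (1.1, "A"), (1.5, "A"), (3.4, "B")]

batches = discretize(events, batch_seconds=1)
per_batch = [count_by_value(b) for b in batches]
sliding = windowed_counts(batches, window_len=2)
```

Each entry of `batches` plays the role of one RDD of the DStream. The per-batch counts only see one interval, while the windowed counts carry information across two intervals, which is what makes window operations stateful.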

Apache Cassandra

Apache Cassandra

Apache Cassandra™ is a free

• Distributed…

• High performance…

• Extremely scalable…

• Fault tolerant (i.e. no single point of failure)…

post-relational database solution

Optimized for high write throughput

Apache Cassandra - History

Motivation - Why NoSQL Databases?

• Dynamo Paper (2007)
  • How to build a data store that is
    • Reliable
    • Performant
    • “Always On”
  • Nothing new and shiny
    • 24 other papers cited
  • Evolutionary

• Google Bigtable (2006)
  • Richer data model
  • 1 key and lots of values
  • Fast sequential access
  • 38 other papers cited

• Cassandra Paper (2008)
  • Distributed features of Dynamo
  • Data model and storage from Bigtable
  • February 2010: graduated to a top-level Apache project

Apache Cassandra – More than one server

• All nodes participate in a cluster
• Shared nothing
• Add or remove as needed
• More capacity? Add more servers
• Node is a basic unit inside a cluster
• Each node owns a range of partitions
• Consistent Hashing

[Figure: ring of four nodes (Node 1 to Node 4); the ranges [0-25], [26-50], [51-75] and [76-100] are each owned by one node and additionally stored as replicas on other nodes]
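The ring in the figure can be approximated with a small sketch. Several things here are assumptions for illustration: the `TokenRing` class is hypothetical, the assignment of ranges to Node1 to Node4 is chosen arbitrarily, the token space is the 0 to 100 range of the figure (real Cassandra uses a far larger token space), and `zlib.crc32` stands in for Cassandra's partitioner.

```python
import bisect
import zlib

class TokenRing:
    """Toy token ring: tokens 0..100, each node owns the range ending at its token."""

    def __init__(self, node_tokens, replication_factor=3):
        # node_tokens maps node name -> highest token of the range it owns
        self.ring = sorted((token, node) for node, token in node_tokens.items())
        self.tokens = [token for token, _ in self.ring]
        self.rf = replication_factor

    @staticmethod
    def token_for(partition_key):
        # Hypothetical partitioner: a stable hash folded into 0..100.
        return zlib.crc32(partition_key.encode("utf-8")) % 101

    def owner_index(self, token):
        # First node whose token is >= the given token; wrap around at the end.
        return bisect.bisect_left(self.tokens, token) % len(self.ring)

    def replicas_for_token(self, token):
        """Owner of the token plus the next nodes clockwise, rf nodes in total."""
        start = self.owner_index(token)
        count = min(self.rf, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def replicas(self, partition_key):
        return self.replicas_for_token(self.token_for(partition_key))


ring = TokenRing({"Node1": 25, "Node2": 50, "Node3": 75, "Node4": 100})
owner_of_30 = ring.replicas_for_token(30)[0]
replicas_of_80 = ring.replicas_for_token(80)
```

With a replication factor of 3, every node stores its own range plus two ranges owned by its neighbours, which matches the figure's nodes holding more than one range. Adding a node only splits one range instead of reshuffling all keys.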

Apache Cassandra – Fully Replicated

• Client writes local
• Data syncs across WAN
• Replication per Data Center

[Figure: two data centers, West and East, each with four nodes (Node 1 to Node 4); the client writes to its local data center]

Apache Cassandra

What is Cassandra NOT?

• A Data Ocean
• A Data Lake
• A Data Pond

• An In-Memory Database

• A Key-Value Store

• A Data Warehousing solution

What are good use cases?

• Product Catalog / Playlists

• Personalization (Ads, Recommendations)

• Fraud Detection

• Time Series (Finance, Smart Meter)

• IoT / Sensor Data

• Graph / Network data

How Cassandra stores data

• Model brought from Google Bigtable
• Row Key and a lot of columns
• Column names sorted (UTF8, Int, Timestamp, etc.)

[Figure: each row is identified by a Row Key and holds from 1 up to 2 billion columns; every column stores a column name, a column value, a timestamp and a TTL; there can be billions of rows]
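The row layout in the figure can be mimicked with a small in-memory structure. The sketch below is a simplification, not Cassandra's storage engine: the `Row` class is hypothetical, the column names and values are invented, timestamps are plain integers, and expiry is checked only at read time.

```python
class Row:
    """Toy wide row: column name -> (value, write timestamp, optional TTL in seconds)."""

    def __init__(self, row_key):
        self.row_key = row_key
        self._cells = {}

    def write(self, name, value, timestamp, ttl=None):
        current = self._cells.get(name)
        # Last write wins: keep the cell with the highest timestamp.
        if current is None or timestamp >= current[1]:
            self._cells[name] = (value, timestamp, ttl)

    def read(self, now):
        """Return the live (name, value) pairs, sorted by column name."""
        live = []
        for name in sorted(self._cells):
            value, timestamp, ttl = self._cells[name]
            if ttl is None or now < timestamp + ttl:
                live.append((name, value))
        return live


row = Row("station-1")
row.write("2016-01-01T10:00", 21.5, timestamp=100)
row.write("2016-01-01T09:00", 20.1, timestamp=90)
row.write("2016-01-01T10:00", 21.7, timestamp=110)   # newer write replaces the cell
row.write("2016-01-01T10:00", 0.0, timestamp=105)    # late, older write is ignored
row.write("2016-01-01T11:00", 22.0, timestamp=120, ttl=60)

before_expiry = row.read(now=150)
after_expiry = row.read(now=180)
```

Keeping the columns sorted by name is what makes a row a natural fit for time series: with timestamps as column names, a time range is a contiguous slice of the row.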

Combining Spark & Cassandra

Spark and Cassandra Architecture – Great Combo

• Spark: good at analyzing a huge amount of data
• Cassandra: good at storing a huge amount of data

Spark and Cassandra Architecture

[Figure: weather stations feed Spark Streaming (Near Real-Time); Spark SQL (Structured Data), MLlib (Machine Learning) and GraphX (Graph Analysis) run on Spark, which is linked to Cassandra through the Spark Connector]


• Single Node running Cassandra

• Spark Worker is really small

• Spark Master lives outside a node

• Spark Worker starts Spark Executor in a separate JVM

• Node local

[Figure: a master outside the nodes coordinates one worker per node; each worker starts executors on its server]

• Each node runs Spark and Cassandra

• Spark Master can make decisions based on Token Ranges

• Spark likes to work on small partitions of data across a large cluster

• Cassandra likes to spread out data in a large cluster

[Figure: four nodes, each running a Spark worker next to the Cassandra data of one token range (0-25, 26-50, 51-75, 76-100); the master hands out work by token range]

Will only have to analyze 25% of data!
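One reading of the 25% remark is that each worker only processes the token range stored on its own node. The sketch below illustrates that reading; it does not use the connector's API, and `plan_tasks`, the node names and the worker names are hypothetical.

```python
from collections import Counter

def plan_tasks(token_ranges, workers_by_node):
    """Assign one task per token range to the Spark worker on the node owning it."""
    return {rng: workers_by_node[node] for rng, node in token_ranges.items()}

# Token ranges as in the figure; the node and worker names are made up.
token_ranges = {
    (0, 25): "node1",
    (26, 50): "node2",
    (51, 75): "node3",
    (76, 100): "node4",
}
workers_by_node = {node: f"worker@{node}" for node in token_ranges.values()}

plan = plan_tasks(token_ranges, workers_by_node)

# Share of all token ranges that each worker has to process.
tasks_per_worker = Counter(plan.values())
share_per_worker = {w: n / len(plan) for w, n in tasks_per_worker.items()}
```

Because every task runs next to the data it reads, no range has to cross the network before the analysis starts, and each of the four workers handles a quarter of the ranges.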

Transactional Analytics

Cassandra and Spark

                           Cassandra   Cassandra & Spark
Joins and Unions           No          Yes
Transformations            Limited     Yes
Outside Data Integration   No          Yes
Aggregations               Limited     Yes

Summary

Summary

Kafka
• Topics store information broken into partitions
• Brokers store partitions
• Partitions are replicated for data resilience

Cassandra
• Goals of Apache Cassandra are all about staying online and performant
• Best for applications close to your users
• Partitions are similar data grouped by a partition key

Spark
• Replacement for Hadoop MapReduce
• In memory
• More operations than just Map and Reduce
• Makes data analysis easier
• Spark Streaming can take a variety of sources

Spark + Cassandra
• Cassandra acts as the storage layer for Spark
• Deploy in a mixed cluster configuration
• Spark executors access Cassandra using the DataStax connector

Lambda Architecture with Spark/Cassandra

[Figure: data sources (e.g. social) pass through data collection (channel, messaging) into a raw data reservoir. (Analytical) batch data processing runs batch compute and writes computed information to a result store; (analytical) real-time data processing runs stream/event processing and writes to its own result store. Data access (query engine, service) serves reports, analytic tools and alerting tools from the result stores]

Guido Schmutz
Technology Manager

[email protected]
