real-time analytics with apache cassandra and apache spark,

57
BÂLE BERNE BRUGG DUSSELDORF FRANCFORT S.M. FRIBOURG E.BR. GENÈVE HAMBOURG COPENHAGUE LAUSANNE MUNICH STUTTGART VIENNE ZURICH Real - Time Analytics with Apache Cassandra and Apache Spark Guido Schmutz

Upload: swiss-data-forum-swiss-data-forum

Post on 26-Jan-2017

189 views

Category:

Data & Analytics


3 download

TRANSCRIPT

Page 1: Real-Time Analytics with Apache Cassandra and Apache Spark,

BÂLE BERNE BRUGG DUSSELDORF FRANCFORT S.M. FRIBOURG E.BR. GENÈVE HAMBOURG COPENHAGUE LAUSANNE MUNICH STUTTGART VIENNE ZURICH

Real-Time Analytics with Apache Cassandra and Apache SparkGuido Schmutz

Page 2: Real-Time Analytics with Apache Cassandra and Apache Spark,

Guido Schmutz

• Working for Trivadis for more than 18 years• Oracle ACE Director for Fusion Middleware and SOA• Author of different books• Consultant, Trainer Software Architect for Java, Oracle, SOA and

Big Data / Fast Data• Technology Manager @ Trivadis

• More than 25 years of software development experience

• Contact: [email protected]• Blog: http://guidoschmutz.wordpress.com• Twitter: gschmutz

Page 3: Real-Time Analytics with Apache Cassandra and Apache Spark,

Agenda

1. Introduction2. Apache Spark3. Apache Cassandra4. Combining Spark & Cassandra5. Summary

Page 4: Real-Time Analytics with Apache Cassandra and Apache Spark,

Big Data Definition (4 Vs)

+Timetoaction?– BigData+Real-Time=StreamProcessing

CharacteristicsofBigData:ItsVolume,VelocityandVarietyincombination

Page 5: Real-Time Analytics with Apache Cassandra and Apache Spark,

What is Real-Time Analytics?

What is it? Why do we need it?

How does it work?• Collect real-time data• Process data as it flows in• Data in Motion over Data at

Rest• Reports and Dashboard

access processed data TimeEvents RespondAnalyze

Shorttimetoanalyze&respond

§ Required - fornewbusinessmodels

§ Desired - forcompetitiveadvantage

Page 6: Real-Time Analytics with Apache Cassandra and Apache Spark,

Real Time Analytics Use Cases

• Algorithmic Trading

• Online Fraud Detection

• Geo Fencing

• Proximity/Location Tracking

• Intrusion detection systems

• Traffic Management

• Recommendations

• Churn detection

• Internet of Things (IoT) / Intelligence

Sensors

• Social Media/Data Analytics

• Gaming Data Feed

• …

Page 7: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Spark

Page 8: Real-Time Analytics with Apache Cassandra and Apache Spark,

Motivation – Why Apache Spark?

Hadoop MapReduce: Data Sharing on Disk

Spark: Speed up processing by using Memory instead of Disks

map reduce . . .Input

HDFSread

HDFSwrite

HDFSread

HDFSwrite

op1 op2 . . .Input

Output

Output

Page 9: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Spark

Apache Spark is a fast and general engine for large-scale data processing• The hot trend in Big Data!• Originally developed 2009 in UC Berkley’s AMPLab• Based on 2007 Microsoft Dryad paper• Written in Scala, supports Java, Python, SQL and R• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x

faster on disk• One of the largest OSS communities in big data with over 200 contributors in 50+

organizations• Open Sourced in 2010 – since 2014 part of Apache Software foundation

Page 10: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Spark

SparkSQL(BatchProcessing)

BlinkDB(ApproximateQuerying)

SparkStreaming(Real-Time)

MLlib,SparkR(MachineLearning)

GraphX(GraphProcessing)

SparkCoreAPIandExecutionModel

SparkStandalone MESOS YARN HDFS Elastic

Search NoSQL S3

Libraries

CoreRuntime

ClusterResourceManagers DataStores

Page 11: Real-Time Analytics with Apache Cassandra and Apache Spark,

Resilient Distributed Dataset (RDD)

Are• Immutable• Re-computable• Fault tolerant• Reusable

Have Transformations• Produce new RDD• Rich set of transformation available

• filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ...

Have Actions• Start cluster computing operations• Rich set of action available

• collect(), count(), fold(), reduce(), count(), …

Page 12: Real-Time Analytics with Apache Cassandra and Apache Spark,

RDD RDD

Input Source

• File• Database• Stream• Collection

.count() ->100

Data

Page 13: Real-Time Analytics with Apache Cassandra and Apache Spark,

Partitions RDD

Data

Partition0

Partition1

Partition2

Partition3

Partition4

Partition5

Partition6

Partition7

Partition8

Partition9

Server1

Server2

Server3

Server4

Server5

Page 14: Real-Time Analytics with Apache Cassandra and Apache Spark,

Partitions RDD

Data

Partition0

Partition1

Partition2

Partition3

Partition4

Partition5

Partition6

Partition7

Partition8

Partition9

Server1

Server2

Server3

Server4

Server5

Page 15: Real-Time Analytics with Apache Cassandra and Apache Spark,

Partitions RDD

Data

Partition0

Partition1

Partition2

Partition3

Partition4

Partition5

Partition6

Partition7

Partition8

Partition9

Server2

Server3

Server4

Server5

Page 16: Real-Time Analytics with Apache Cassandra and Apache Spark,

Stage 1 – reduceByKey()

Stage 1 – flatMap() + map()

Spark Workflow InputHDFSFile

HadoopRDD

MappedRDD

ShuffledRDD

TextFileOutput

sc.hapoopFile()

map()

reduceByKey()

sc.saveAsTextFile()

Transformations(Lazy)

Action(Execute

Transformations)

Master

MappedRDD

P0

P1

P3

ShuffledRDD

P0

MappedRDD

flatMap()

DAGScheduler

Page 17: Real-Time Analytics with Apache Cassandra and Apache Spark,

Spark Workflow HDFSFileInput1

HadoopRDD

FilteredRDD

MappedRDD

ShuffledRDD

HDFSFileOutput

HadoopRDD

MappedRDD

HDFSFileInput2SparkContext.hadoopFile()

SparkContext.hadoopFile()filter()

map() map()

join()

SparkContext.saveAsHadoopFile()

Transformations(Lazy)

Action(ExecuteTransformations)

Page 18: Real-Time Analytics with Apache Cassandra and Apache Spark,

Spark Execution Model

DataStorage

Worker

Master

Executer

Executer

Server

Executer

Page 19: Real-Time Analytics with Apache Cassandra and Apache Spark,

Stage 1 – flatMap() + map()

Spark Execution Model

DataStorage

Worker

Master

Executer

DataStorage

Worker

Executer

DataStorage

Worker

Executer

RDD

P0

P1

P3

NarrowTransformationMaster

filter()map()sample()flatMap()

DataStorage

Worker

Executer

Page 20: Real-Time Analytics with Apache Cassandra and Apache Spark,

Stage 2 – reduceByKey()

Spark Execution Model

DataStorage

Worker

Executer

DataStorage

Worker

Executer

RDD

P0

WideTransformation

Master

join()reduceByKey()union()groupByKey()

Shuffle!

DataStorage

Worker

Executer

DataStorage

Worker

Executer

Page 21: Real-Time Analytics with Apache Cassandra and Apache Spark,

Batch vs. Real-Time Processing

PetabytesofData

Gigaby

tes

PerS

econ

d

Page 22: Real-Time Analytics with Apache Cassandra and Apache Spark,

Various Input Sources

Page 23: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Kafka

distributed publish-subscribe messaging system

Designed for processing of real time activity stream data (logs, metrics collections,

social media streams, …)

Initially developed at LinkedIn, now part of Apache

Does not use JMS API and standards

Kafka maintains feeds of messages in topics Kafka Cluster

Consumer Consumer Consumer

Producer Producer Producer

Page 24: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Kafka

Kafka Broker

Temperature Processor

TemperatureTopic

RainfallTopic

1 2 3 4 5 6

RainfallProcessor1 2 3 4 5 6

WeatherStation

Page 25: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Kafka

Kafka Broker

Temperature Processor

TemperatureTopic

RainfallTopic

1 2 3 4 5 6

RainfallProcessor

Partition0

1 2 3 4 5 6Partition0

1 2 3 4 5 6Partition1 Temperature

ProcessorWeatherStation

Page 26: Real-Time Analytics with Apache Cassandra and Apache Spark,

ApacheKafka

Kafka Broker

Temperature Processor

WeatherStation

TemperatureTopic

RainfallTopic

RainfallProcessor

P0

Temperature Processor

1 2 3 4 5

P1 1 2 3 4 5

Kafka BrokerTemperatureTopic

RainfallTopic

P0 1 2 3 4 5

P1 1 2 3 4 5

P0 1 2 3 4 5

P0 1 2 3 4 5

Page 27: Real-Time Analytics with Apache Cassandra and Apache Spark,

Discretized Stream (DStream)

Kafka

WeatherStation

WeatherStation

WeatherStation

Page 28: Real-Time Analytics with Apache Cassandra and Apache Spark,

Discretized Stream (DStream)

Kafka

WeatherStation

WeatherStation

WeatherStation

Page 29: Real-Time Analytics with Apache Cassandra and Apache Spark,

Discretized Stream (DStream)

Kafka

WeatherStation

WeatherStation

WeatherStation

Page 30: Real-Time Analytics with Apache Cassandra and Apache Spark,

Discretized Stream (DStream)

Kafka

WeatherStation

WeatherStation

WeatherStation Discretebytime

IndividualEvent

DStream =RDD

Page 31: Real-Time Analytics with Apache Cassandra and Apache Spark,

Discretized Stream (DStream)

DStream DStream

XSeconds

Transform

.countByValue()

.reduceByKey()

.join

.map

Page 32: Real-Time Analytics with Apache Cassandra and Apache Spark,

Discretized Stream (DStream)time1 time2 time3

message

timen….

f(message 1)RDD@time1

f(message 2)

f(message n)

….

message 1RDD@time1

message 2

message n

….

result 1

result 2

result n

….

message message message

f(message 1)RDD@time2

f(message 2)

f(message n)

….

message 1RDD@time2

message 2

message n

….

result 1

result 2

result n

….

f(message 1)RDD@time3

f(message 2)

f(message n)

….

message 1RDD@time3

message 2

message n

….

result 1

result 2

result n

….

f(message 1)RDD@timen

f(message 2)

f(message n)

….

message 1RDD@timen

message 2

message n

….

result 1

result 2

result n

….

InputStream

EventDStream

MappedDStreammap()

saveAsHadoopFiles()

Time Increasing

DStream

Transformation

Lineage

Actio

nsTrig

ger

SparkJobs

Adapted fromChrisFregly: http://slidesha.re/11PP7FV

Page 33: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Spark Streaming – Core concepts

Discretized Stream (DStream)• Core Spark Streaming abstraction

• micro batches of RDD’s

• Operations similar to RDD

Input DStreams• Represents the stream of raw data received

from streaming sources

• Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP Socket, Akka actors, etc.

• Custom Sources can be easily written for custom data sources

Operations• Same as Spark Core + Additional Stateful

transformations (window, reduceByWindow)

Page 34: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Cassandra

Page 35: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Cassandra

Apache Cassandra™ is a free

• Distributed…

• High performance…

• Extremely scalable…

• Fault tolerant (i.e. no single point of failure)…

post-relational database solution

Optimized for high write throughput

Page 36: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Cassandra - HistoryBigtable Dynamo

Page 37: Real-Time Analytics with Apache Cassandra and Apache Spark,

Motivation - Why NoSQL Databases?

aaa • Dynamo Paper (2007)

• How to build a data store that is

• Reliable

• Performant

• “Always On”

• Nothing new and shiny• 24 other papers cited

• Evolutionary

Page 38: Real-Time Analytics with Apache Cassandra and Apache Spark,

Motivation - Why NoSQL Databases?

• Google Big Table (2006)

• Richer data model

• 1 key and lot’s of values

• Fast sequential access

• 38 other papers cited

Page 39: Real-Time Analytics with Apache Cassandra and Apache Spark,

Motivation - Why NoSQL Databases?

• Cassandra Paper (2008)

• Distributed features of Dynamo

• Data Model and storage from BigTable

• February 2010 graduated to a top-level Apache

Project

Page 40: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Cassandra – More than one server

All nodes participate in a clusterShared nothingAdd or remove as neededMore capacity? Add more servers

Node is a basic unit inside a cluster

Each node owns a range of partitionsConsistent Hashing

Node1

Node2

Node3

Node4 [26-50]

[0-25]

[51-75]

[76-100] [0-25]

[0-25][26-50]

[26-50][51-75]

[51-75][76-100]

[76-100]

Page 41: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Cassandra – Fully Replicated

Client writes localData syncs across WANReplication per Data Center

Node1

Node2

Node3

Node4

Node1

Node2

Node3

Node4

WestEastClient

Page 42: Real-Time Analytics with Apache Cassandra and Apache Spark,

Apache Cassandra

What is Cassandra NOT?

• A Data Ocean• A Data Lake• A Data Pond

• An In-Memory Database

• A Key-Value Store

• Not for Data Warehousing

What are good use cases?

• Product Catalog / Playlists

• Personalization (Ads, Recommendations)

• Fraud Detection

• Time Series (Finance, Smart Meter)

• IoT / Sensor Data

• Graph / Network data

Page 43: Real-Time Analytics with Apache Cassandra and Apache Spark,

How Cassandra stores data

• Model brought from Google Bigtable• Row Key and a lot of columns• Column names sorted (UTF8, Int, Timestamp, etc.)

ColumnName … Column Name

ColumnValue ColumnValue

Timestamp Timestamp

TTL TTL

RowKey

1 2Billion

Billion

ofR

ows

Page 44: Real-Time Analytics with Apache Cassandra and Apache Spark,

Combining Spark & Cassandra

Page 45: Real-Time Analytics with Apache Cassandra and Apache Spark,

Spark and Cassandra Architecture – Great Combo

Goodatanalyzingahugeamountofdata

Goodatstoringahugeamountofdata

Page 46: Real-Time Analytics with Apache Cassandra and Apache Spark,

Spark and Cassandra Architecture

SparkStreaming(NearReal-Time)

SparkSQL(StructuredData)

MLlib(MachineLearning)

GraphX(GraphAnalysis)

Page 47: Real-Time Analytics with Apache Cassandra and Apache Spark,

Spark and Cassandra Architecture

SparkConnector

WeatherStation

SparkStreaming(NearReal-Time)

SparkSQL(StructuredData)

MLlib(MachineLearning)

GraphX(GraphAnalysis)

WeatherStation

WeatherStation

WeatherStation

WeatherStation

Page 48: Real-Time Analytics with Apache Cassandra and Apache Spark,

Spark and Cassandra Architecture

• Single Node running Cassandra

• Spark Worker is really small

• Spark Master lives outside a node

• Spark Worker starts Spark Executer in separate JVM

• Node local

Worker

Master

Executer

Executer

Server

Executer

Page 49: Real-Time Analytics with Apache Cassandra and Apache Spark,

Spark and Cassandra Architecture

Worker

Worker

Worker

Master

Worker

• Each node runs Spark and Cassandra

• Spark Master can make decisions based on Token Ranges

• Spark likes to work on small partitions of data across a large cluster

• Cassandra likes to spread out data in a large cluster

0-25

26-50

51-75

76-100

Willonly havetoanalyze25%

ofdata!

Page 50: Real-Time Analytics with Apache Cassandra and Apache Spark,

Spark and Cassandra Architecture

Master0-25

26-50

51-75

76-100

Worker

Worker

WorkerWorker

0-25

26-50

51-75

76-100

Transactional Analytics

Page 51: Real-Time Analytics with Apache Cassandra and Apache Spark,

Cassandra and Spark

Cassandra Cassandra&Spark

JoinsandUnions No Yes

Transformations Limited Yes

OutsideDataIntegration No Yes

Aggregations Limited Yes

Page 52: Real-Time Analytics with Apache Cassandra and Apache Spark,

Summary

Page 53: Real-Time Analytics with Apache Cassandra and Apache Spark,

Summary

Kafka• Topics store information broken into

partitions• Brokers store partitions• Partitions are replicated for data

resilience

Cassandra• Goals of Apache Cassandra are all

about staying online and performant• Best for applications close to your users• Partitions are similar data grouped by a

partition key

Spark• Replacement for Hadoop Map Reduce• In memory• More operations than just Map and Reduce• Makes data analysis easier• Spark Streaming can take a variety of sources

Spark + Cassandra• Cassandra acts as the storage layer for Spark• Deploy in a mixed cluster configuration• Spark executors access Cassandra using the

DataStax connector

Page 54: Real-Time Analytics with Apache Cassandra and Apache Spark,

Lambda Architecture with Spark/Cassandra

DataCollection

(Analytical)BatchDataProcessing

Batchcompute

ResultStoreDataSources

Channel

DataAccess

Reports

Service

AnalyticTools

AlertingTools

Social

(Analytical)Real-TimeDataProcessing

Stream/EventProcessing

Batchcompute

Messaging

ResultStore

QueryEngine

ResultStore

ComputedInformation

RawData(Reservoir)

Page 55: Real-Time Analytics with Apache Cassandra and Apache Spark,

Lambda Architecture with Spark/Cassandra

DataCollection

(Analytical)BatchDataProcessing

Batchcompute

ResultStoreDataSources

Channel

DataAccess

Reports

Service

AnalyticTools

AlertingTools

Social

(Analytical)Real-TimeDataProcessing

Stream/EventProcessing

Batchcompute

Messaging

ResultStore

QueryEngine

ResultStore

ComputedInformation

RawData(Reservoir)

Page 56: Real-Time Analytics with Apache Cassandra and Apache Spark,
Page 57: Real-Time Analytics with Apache Cassandra and Apache Spark,

Guido SchmutzTechnology Manager

[email protected]