Posted on 23-Jan-2018
Big Data-Driven Applications
with Cassandra and Spark
Artem Chebotko, Ph.D.
Solution Architect
1 Modern Big Data and Cloud Applications
2 Cassandra and Spark Highlights
3 Architecture Overview
4 Languages and APIs
5 Live Demo
Modern Application Requirements
• Numerous Endpoints
• Geographically Distributed
• Continuously Available
• Instantaneously Responsive
• Immediately Decisive
• Predictably Scalable
Applications by Response Time and Workload
[Quadrant chart spanning Operational (OLTP) to Analytical (OLAP) workloads:]
• Real-time transactions: web and IoT apps, financial transactions
• Real-time analytics: recommendations
• Streaming analytics: fraud prevention
• Batch analytics: predictive models, fraud detection
Roles of Cassandra and Spark
[Same quadrant chart, with Cassandra and Spark mapped onto the workloads, addressing the modern application requirements:]
• Numerous Endpoints
• Geographically Distributed
• Continuously Available
• Instantaneously Responsive
• Immediately Decisive
• Predictably Scalable
2 Cassandra and Spark Highlights
Cassandra – Operational Database
• Millions of concurrent users
• Millisecond response time
• Linear scalability
• Always on
Spark – Analytics Platform
• Real-time, streaming and batch analytics
• Up to 100x faster than Hadoop
• Scalability and fault tolerance
• Versatile and rich API
Spark – Analytics Platform
[Stack diagram: Spark core with the SQL, Streaming, MLlib, and GraphX libraries on top, running on a cluster manager: Standalone, YARN, or Mesos]
Spark-Cassandra Connector
Open-Source Package for Spark
• Routine Spark-Cassandra interactions
– Read from and write into Cassandra
• Profound optimizations
– Predicate pushdown
– Data locality
– Cassandra-optimized joins
– Cassandra-aware partitioning
– Shuffle-free grouping
3 Architecture Overview
C*: Distributed, Shared Nothing, Peer-to-Peer
[Ring diagram: C* clients connect through drivers to any node of a peer-to-peer Cassandra cluster; the nodes share the token range -2^63 .. +2^63-1, and every node can coordinate transactions]
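A rough sketch of the ring idea above in plain Scala (no cluster needed; the hash function is a toy stand-in for Cassandra's Murmur3 partitioner, used here only for illustration):

```scala
// The ring spans the full signed 64-bit token range: -2^63 .. +2^63-1.
val minToken = Long.MinValue   // -2^63
val maxToken = Long.MaxValue   // +2^63 - 1

// Toy stand-in for the Murmur3 partitioner: map a partition key to a token.
def token(partitionKey: String): Long =
  partitionKey.hashCode.toLong * 2654435761L   // illustration only, not Murmur3

// Every key lands somewhere on the ring; the nodes owning that token range
// serve as the key's replicas.
val t = token("Alice")
```

The point of the shared-nothing design is that this mapping is computable by any node, so there is no special coordinator or master for routing.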
C*: Partitioning and Replication
[Diagram: a client's write request carries a partition key; the coordinator's partitioner maps the key of the table partition to its replicas (RF=3: replicas 1-3), forwards the write, and returns an acknowledgment once CL=QUORUM replicas have responded]
C*: Partitioning and Replication
[Diagram: a read request carries a partition key; the coordinator's partitioner locates the replicas (RF=3) and returns the result after CL=ONE replica responds]
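The RF/CL bookkeeping in the two request flows above reduces to simple arithmetic; a plain-Scala sketch (no cluster required):

```scala
// With replication factor 3, QUORUM means a majority of replicas.
val rf = 3
val quorum = rf / 2 + 1   // = 2 replicas for RF=3

// A QUORUM write followed by a QUORUM read always overlaps in at least one
// replica, which is what makes that combination strongly consistent.
val writeCl = quorum
val readCl  = quorum
val overlap = writeCl + readCl > rf   // true for RF=3

// CL=ONE, as in the read slide, trades that guarantee for lower latency.
val readOne = 1
```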
Spark: Master-Worker, Failover Masters
[Diagram: the Spark client runs a Driver with a SparkContext; the Driver registers with a Master (standby masters provide failover), and the Master allocates Executors on the Worker nodes]
Spark: Computation Scheduling
[Diagram: the Driver's SparkContext turns each job into a DAG of stages (Job 0: Stages 0-2, Job 1: Stages 3-5); each stage is a set of tasks, and tasks are scheduled onto Executors, each of which also holds a cache]
Spark-Cassandra Connector
[Diagram: the Spark Master and Workers run alongside the Cassandra nodes, and the Spark-Cassandra Connector runs inside every Executor]
[Diagram: on each cluster node, the Master, Worker, and Executors run in separate JVMs with Connector.jar on the classpath, next to the node's C* JVM]
Multi-DC Deployment and Workload Separation
[Diagram: two data centers linked by Cassandra data replication: an Operations DC, where C* clients run real-time transactions against Cassandra, and an Analytics DC, where a Spark Master, Workers, and Executors run interactive and batch analytics against their own Cassandra replicas]
4 Languages and APIs
Getting Started with Cassandra and Spark Applications
• Data Model and Cassandra Query Language
• Core Spark and Spark-Cassandra Connector
Keyspace and Replication
CREATE KEYSPACE iot
WITH replication = {'class': 'NetworkTopologyStrategy',
'DC-Kyiv-Operations' : 3,
'DC-Houston-Analytics': 2};
USE iot;
Table with Single-Row Partitions

CREATE TABLE users (
  username TEXT,
  age INT,
  address TEXT,
  PRIMARY KEY (username)
);

users:
username | age | address
---------|-----|----------------
Alice    | 28  | Santa Clara, CA
Alex     | 37  | Austin, TX

SELECT * FROM users WHERE username = ?;
Table with Single-Row Partitions

CREATE TABLE sensors (
  id INT,
  type TEXT,
  settings MAP<TEXT,TEXT>,
  owner TEXT,
  PRIMARY KEY (id)
);

sensors:
id | type       | settings                   | owner
---|------------|----------------------------|------
1  | phone      | {gps ⇒ on, pedometer ⇒ on} | Alice
2  | wristband  | {heart rate ⇒ on, …}       | Alice
3  | thermostat | {temp ⇒ 75, …}             | Alice
4  | security   | {…}                        | Alex
5  | phone      | {…}                        | Alex

SELECT * FROM sensors WHERE id = ?;
Table with Multi-Row Partitions

sensors_by_user (age and address are static columns, stored once per partition; rows ordered by id ASC):
username | id | type       | settings             | age | address
---------|----|------------|----------------------|-----|----------------
Alice    | 1  | phone      | {gps ⇒ on, …}        | 28  | Santa Clara, CA
Alice    | 2  | wristband  | {heart rate ⇒ on, …} | 28  | Santa Clara, CA
Alice    | 3  | thermostat | {temp ⇒ 75, …}       | 28  | Santa Clara, CA
Alex     | 4  | security   | {…}                  | 37  | Austin, TX
Alex     | 5  | phone      | {…}                  | 37  | Austin, TX
Table with Multi-Row Partitions
CREATE TABLE sensors_by_user (
  username TEXT, age INT STATIC, address TEXT STATIC,
  id INT, type TEXT, settings MAP<TEXT,TEXT>,
  PRIMARY KEY (username, id)
) WITH CLUSTERING ORDER BY (id ASC);

SELECT * FROM sensors_by_user WHERE username = ?;
SELECT * FROM sensors_by_user WHERE username = ? AND id = ?;
SELECT * FROM sensors_by_user WHERE username = ? AND id > ? ORDER BY id DESC;
Retrieving Data from C*
• SparkContext, RDD, Connector
val rdd = sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
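The snippets in this section assume a SparkContext wired to the cluster; a minimal setup sketch (the host and app name are placeholders, and spark-cassandra-connector must be on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds cassandraTable, saveToCassandra, ...

val conf = new SparkConf()
  .setAppName("iot-demo")                                // placeholder name
  .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder address
val sc = new SparkContext(conf)

// With this in scope, reads like the one above become available:
// val rdd = sc.cassandraTable("iot", "sensors_by_user")
```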
Predicate Pushdown
sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
.filter(row => row.getString("username") == "Alice")
• Suboptimal code
Predicate Pushdown
sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
.filter(row => row.getString("username") == "Alice")
.where("username = 'Alice'")
• Predicate pushed down to C*
Data Locality
[Diagram: a Spark node co-located with a Cassandra node reads its local token ranges]
Connector read settings (defaults): input.split.size_in_mb (64), input.consistency.level (LOCAL_ONE), input.fetch.size_in_rows (1000)
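These read settings can be overridden per application; a sketch assuming the connector's spark.cassandra.* property naming (the values shown are the defaults listed above):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.input.split.size_in_mb", "64")         // target C* data per Spark partition
  .set("spark.cassandra.input.consistency.level", "LOCAL_ONE") // CL used for reads
  .set("spark.cassandra.input.fetch.size_in_rows", "1000")     // rows fetched per page
```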
Cassandra-Optimized Joins
• Standard Spark join = shuffle + shuffle
val s = sc.cassandraTable("iot","sensors")
.keyBy(row => row.getString("owner"))
val u = sc.cassandraTable("iot","users")
.keyBy(row => row.getString("username"))
s.join(u)
Cassandra-Optimized Joins
[Diagram of a shuffle: each map task writes its input partition's records into per-key buckets, spilling shuffle writes from memory to disk; reduce tasks then read their buckets across the cluster and run the aggregation per output partition]
Cassandra-Optimized Joins
• Connector join = no shuffle + no data locality
sc.cassandraTable("iot","sensors")
.select("id","type","owner".as("username"))
.joinWithCassandraTable("iot","users")
.on(SomeColumns("username"))
Cassandra-Optimized Joins
[Diagram: each sensors row is paired with the matching users row fetched from Cassandra by partition key (Alice's sensors 1-3 with Alice's user row, Alex's sensors 4-5 with Alex's); no shuffle is needed, but there is no guarantee the lookups are local]
Cassandra-Aware Partitioning
• Connector join + Cassandra-aware partitioning (CAP) = shuffle + data locality
sc.cassandraTable("iot","sensors")
.select("id","type","owner".as("username"))
.repartitionByCassandraReplica("iot","users")
.joinWithCassandraTable("iot","users")
.on(SomeColumns("username"))
Cassandra-Aware Partitioning
[Diagram: after repartitionByCassandraReplica, Alice's sensor rows (1-3) land on the replica holding Alice's user row and Alex's (4-5) on Alex's replica, so the subsequent join runs with data locality]
Shuffle-Free Grouping
• Suboptimal code
sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
.as((u:String,i:Int,t:String)=>(u,(i,t)))
.groupByKey
Shuffle-Free Grouping
sc.cassandraTable("iot","sensors_by_user")
  .select("username","id","type")
  .as((u:String,i:Int,t:String)=>(u,(i,t)))
  .spanByKey
• Shuffling eliminated at no extra cost
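Why spanByKey is free while groupByKey shuffles: rows of one Cassandra partition arrive contiguously and already ordered by partition key, so consecutive equal keys can be grouped in a single local pass. A plain-Scala sketch of that idea (not the connector's actual implementation):

```scala
// Group consecutive equal keys from an already partition-ordered iterator,
// the way connector data arrives from Cassandra; one pass, no shuffle.
def spanByKey[K, V](rows: Iterator[(K, V)]): Iterator[(K, List[V])] = {
  val buf = rows.buffered
  new Iterator[(K, List[V])] {
    def hasNext: Boolean = buf.hasNext
    def next(): (K, List[V]) = {
      val key = buf.head._1
      val vs = scala.collection.mutable.ListBuffer.empty[V]
      while (buf.hasNext && buf.head._1 == key) vs += buf.next()._2
      (key, vs.toList)
    }
  }
}

val grouped = spanByKey(Iterator(
  ("Alice", 1), ("Alice", 2), ("Alice", 3), ("Alex", 4), ("Alex", 5)
)).toList
// grouped == List(("Alice", List(1, 2, 3)), ("Alex", List(4, 5)))
```

If the input were not ordered by key, this one-pass grouping would emit a key more than once, which is exactly why groupByKey must shuffle in the general case.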
Saving Data to C*
rdd.saveToCassandra("iot","users",
SomeColumns("username", "age"))
Connector write settings (defaults):
• output.consistency.level (LOCAL_QUORUM)
• output.batch.grouping.key (Partition)
• output.batch.size.bytes (1024)
• output.batch.grouping.buffer.size (1000)
• output.concurrent.writes (5)
5 Live Demo
Artem Chebotko
achebotko@datastax.com
www.linkedin.com/in/artemchebotko