Glassbeam: Ad Hoc Analytics on Internet of Complex Things with Apache Cassandra and Apache Spark
© Copyright 2015 Glassbeam Inc.
Ad Hoc Analytics on Internet of Complex Things with Spark and Cassandra
Mohammed Guller
September 2015
Let’s Take a Quick Poll
Familiar with IoT
Data modelling experience in C*
Familiar with Spark
Hands-on experience with Spark
About Me
Principal Architect at Glassbeam
Author of an upcoming book
– “Big Data Analytics with Spark”
Founded two startups
Passionate about building new products, big data analytics, and Machine Learning
Berkeley Graduate
LinkedIn: www.linkedin.com/in/mohammedguller
Twitter: @MohammedGuller
Internet of Things (IoT)
Network of objects embedded with software for collecting and exchanging data over the Internet
Internet of Complex Things (IoCT)
Data Center Devices
– Server, storage, controller
Medical Devices
– X-Ray, MRI scan, CT scan
Manufacturing Systems
Cars
Electric Vehicle Chargers
Other Complex Devices
Glassbeam target market is focused on driving operational & business analytics value for connected product companies in the Industrial IoT market
Glassbeam
IT & Networks
Medical & Healthcare
EV Chargers & Smart Grid
Industrial & Mfg
Transportation
Advanced and Predictive Analytics for Connected Product Companies
Analytics on Operational Data
Operational Data to Powerful Insights
High-level Architecture
[Architecture diagram]
Layers: Data Ingestion, Data Transformation, Data Stores, Middleware, Applications
Components: Logs (Streams/Docs), SPL Library, SCALAR, INFOSERVER, LogVault, Explorer, Workbench, Standard Apps, Custom Apps, Rules & Alerts, Direct Access, Glassbeam Studio, Cloud Enablement & Automation, Event Processing & Rules Engine
Data Stores: Amazon S3 (Raw Logs), Cassandra (Processed Data), SolrCloud (Index)
Analytics and Machine Learning: Spark SQL, Spark Streaming, MLlib
End-to-end cloud-based architecture built on modern technologies to handle any machine, any data, any cloud
* SPL (Semiotic Parsing Language) and SCALAR are patent-pending technology inventions of Glassbeam
Key Properties of IoCT Data
Volume – Terabytes of Data
Variety – Multi-structured Data
Velocity – Fast-Paced Batch Data, Streaming Data
Why We Chose C*
Volume – Linear Scalability: Economically Scale from Gigabytes to Terabytes of Data
Variety – Dynamic Schema: Store Multi-structured Data
Velocity – Fast Writes: Fast Ingest of New Data, Quick Reload of Old Data
Modeling Data in C*
Different from Modeling Data in RDBMS
Queries Drive Table and Primary Key Definitions
– Primary Key Definition Limits the Kind of Queries You Can Run
– C* Does Not Support Joins
A Simple Table for Storing Event Data in C*
CREATE TABLE event (
    sys_id text,
    dt timestamp,
    ts timestamp,
    severity text,
    module text,
    message text,
    PRIMARY KEY ((sys_id, dt), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
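One way to see why the primary key limits the queries: the sketch below is a toy, in-memory Python stand-in for the event table above, not how C* is implemented. It assumes a dictionary keyed by the partition key (sys_id, dt) with rows kept in ts-descending order. The sample rows and the helper names insert, by_partition, and by_severity are made up for illustration.

```python
from collections import defaultdict

# Toy stand-in for the event table: partition key (sys_id, dt),
# rows inside each partition kept sorted by ts descending.
# All sample rows below are invented for illustration.
partitions = defaultdict(list)


def insert(sys_id, dt, ts, severity, module, message):
    rows = partitions[(sys_id, dt)]
    rows.append({"ts": ts, "severity": severity,
                 "module": module, "message": message})
    # Mimics CLUSTERING ORDER BY (ts DESC)
    rows.sort(key=lambda r: r["ts"], reverse=True)


def by_partition(sys_id, dt):
    """Served by the primary key: one direct lookup, rows already ordered."""
    return partitions.get((sys_id, dt), [])


def by_severity(severity):
    """Not served by the primary key: every partition has to be scanned."""
    return [r for rows in partitions.values()
            for r in rows if r["severity"] == severity]


insert("s1", "2015-09-01", 1, "INFO", "m1", "boot")
insert("s1", "2015-09-01", 3, "ERROR", "m2", "disk failure")
insert("s1", "2015-09-02", 2, "ERROR", "m1", "timeout")
insert("s2", "2015-09-01", 5, "WARN", "m1", "high load")

latest_first = [r["ts"] for r in by_partition("s1", "2015-09-01")]
errors = sorted(r["message"] for r in by_severity("ERROR"))
print(latest_first)
print(errors)
```

A lookup by (sys_id, dt) touches a single partition, while filtering by severity walks all of them. Avoiding that scan is the reason for the query-specific tables on the following slides.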
Another Table to Filter Events by Severity
CREATE TABLE event_by_severity (
    sys_id text,
    dt timestamp,
    ts timestamp,
    severity text,
    module text,
    message text,
    PRIMARY KEY ((sys_id, dt), severity, ts)
) WITH CLUSTERING ORDER BY (severity ASC, ts DESC);
Yet Another Table to Filter Events by Module
CREATE TABLE event_by_module (
    sys_id text,
    dt timestamp,
    ts timestamp,
    severity text,
    module text,
    message text,
    PRIMARY KEY ((sys_id, dt), module, ts)
) WITH CLUSTERING ORDER BY (module ASC, ts DESC);
Ad Hoc Analytics with C*
Oxymoron
All Queries Must Be Known Upfront
Workaround Possible but Intractable
Sys_id | Model | Age | OS | City | State | Country
• sys_by_model
• sys_by_os
• sys_by_age
• sys_by_state
• sys_by_state_age
• sys_by_age_state
• sys_by_model_age
• sys_by_age_model
• sys_by_age_model_state
• sys_by_model_state_age
• sys_by_model_state_os
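To put a rough number on the workaround, the Python sketch below enumerates table names under one assumption: a separate table for every ordered, non-empty choice of the six filterable columns. Order is treated as significant because the slide lists both sys_by_state_age and sys_by_age_state. The counting rule and the helper query_tables are illustrative only; a real workload would need just the combinations it actually queries.

```python
from itertools import permutations

# Filterable columns from the slide (sys_id is left out as the row identifier).
columns = ["model", "age", "os", "city", "state", "country"]


def query_tables(cols):
    """Yield one hypothetical table name per ordered, non-empty column choice."""
    for k in range(1, len(cols) + 1):
        for combo in permutations(cols, k):
            yield "sys_by_" + "_".join(combo)


tables = list(query_tables(columns))
total = len(tables)
print(total)
```

Under that assumption, six columns already call for 1,956 tables, and every one of them would have to be written on each update.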
Other Barriers to Ad Hoc Queries
No Aggregation
No Group By
No Joins
What Do I Do Now?
Spark
Fast and General-purpose Cluster Computing Framework for Processing Large Datasets
API in Scala, Java, Python, SQL, and R
Integrated Libraries for a Variety of Tasks
Spark Core
Spark SQL
GraphX
Spark Streaming
MLlib & Spark ML
One Minor Problem!
Spark Does Not Have Built-in Support for C*
Built-in Support for HDFS, S3, and JDBC-compliant Databases
Spark Cassandra Connector
Open Source Library for Integrating Spark with C*
Enables a Spark Application to Process Data in C* Just Like Data from the Built-in Data Sources
Spark with C*
Enables Ad Hoc Analytics
CQL Limitations No Longer Apply
Query Data Using SQL/HiveQL
– Filter on Any Column
– Aggregations
– Group By
Ad Hoc Analytics in Spark Shell
Launch the Spark Shell
/path/to/spark/bin/spark-shell \
    --master spark://host:7077 \
    --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0
Create a DataFrame
val events = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "keyspace" -> "test",
    "table" -> "event"))
  .load()
Fire Queries
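// Cache the DataFrame so repeated ad hoc queries avoid re-reading C*
events.cache()
// Filter first, then project, so the filter column is still available
events.where($"severity" === "ERROR").select("ts", "module", "message").show
events.where($"module" === "m1").select("ts", "severity", "message").show
events.where($"severity" === "ERROR" && $"module" === "m1")
  .select("ts", "message").show
// Aggregate: number of events per severity
events.groupBy("severity").count().show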
Spark SQL JDBC/ODBC Server
Analyze data in C* with just SQL/HiveQL
Command Line Shell – Beeline
Graphical SQL Client – SQuirreL
Data Visualization Applications
– Tableau
– ZoomData
– Qlik
Ad Hoc Analytics with Spark SQL JDBC/ODBC Server
Start the Spark SQL JDBC Server
/path/to/spark/sbin/start-thriftserver.sh \
    --master spark://hostname:7077 \
    --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0
Launch Beeline From a Terminal
/path/to/spark/bin/beeline
Connect to the Spark SQL JDBC Server
beeline> !connect jdbc:hive2://localhost:10000
Create a Temporary Table
0: jdbc:hive2://localhost:10000> CREATE TEMPORARY TABLE event
. . . . . . . . . . . . . . . .> USING org.apache.spark.sql.cassandra
. . . . . . . . . . . . . . . .> OPTIONS (
. . . . . . . . . . . . . . . .> keyspace "test",
. . . . . . . . . . . . . . . .> table "event"
. . . . . . . . . . . . . . . .> );
Query Data with SQL/HiveQL
...> CACHE TABLE event;
...> SELECT severity, count(1) as total FROM event GROUP BY severity;
...> SELECT module, severity, count(1) FROM event GROUP BY module, severity;
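For readers without a cluster at hand, the shape of these results can be previewed locally. The sketch below runs the same two GROUP BY statements against an in-memory SQLite table from Python. SQLite is only a stand-in here, not Spark SQL or C*; the sample rows are invented, and an ORDER BY is added purely to make the output order deterministic.

```python
import sqlite3

# In-memory SQLite table with the same columns as the event table.
# SQLite stands in for Spark SQL here; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (sys_id TEXT, dt TEXT, ts INTEGER, "
    "severity TEXT, module TEXT, message TEXT)"
)
conn.executemany(
    "INSERT INTO event VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("s1", "2015-09-01", 1, "INFO", "m1", "boot"),
        ("s1", "2015-09-01", 3, "ERROR", "m2", "disk failure"),
        ("s1", "2015-09-02", 2, "ERROR", "m1", "timeout"),
        ("s2", "2015-09-01", 5, "WARN", "m1", "high load"),
    ],
)

# Same statements as on the slide, plus ORDER BY for a stable row order.
by_severity = conn.execute(
    "SELECT severity, count(1) AS total FROM event "
    "GROUP BY severity ORDER BY severity"
).fetchall()
by_module_severity = conn.execute(
    "SELECT module, severity, count(1) FROM event "
    "GROUP BY module, severity ORDER BY module, severity"
).fetchall()

print(by_severity)
print(by_module_severity)
conn.close()
```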
Caveats
Latency
Spark Query May Require Expensive Table Scan
– Reads Every Row
– Disk I/O Is Slow
Reduce the Impact of Slow Disk I/O
Cache Tables
Replace HDD with SSD
Add More Nodes
Recommendations
Known Queries Requiring Sub-second Response Time
– Query C* Directly
– Create Query-specific Tables
– Pre-aggregate Data
Ad Hoc Queries
– Spark
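The pre-aggregation recommendation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Glassbeam's implementation: a counter keyed by (sys_id, dt, severity) is bumped on every write, so the known query becomes a single keyed read. The names severity_counts, record_event, count_by_key, and count_by_scan, and the sample events, are hypothetical.

```python
from collections import Counter

# Hypothetical pre-aggregated counters keyed by (sys_id, dt, severity).
severity_counts = Counter()
raw_events = []


def record_event(sys_id, dt, severity, message):
    """Write path: store the raw event and bump the matching counter."""
    raw_events.append((sys_id, dt, severity, message))
    severity_counts[(sys_id, dt, severity)] += 1


def count_by_key(sys_id, dt, severity):
    """Known query: a single keyed read, no scan."""
    return severity_counts[(sys_id, dt, severity)]


def count_by_scan(sys_id, dt, severity):
    """Same answer computed the ad hoc way, by scanning every raw event."""
    return sum(1 for e in raw_events if e[:3] == (sys_id, dt, severity))


record_event("s1", "2015-09-01", "ERROR", "disk failure")
record_event("s1", "2015-09-01", "ERROR", "timeout")
record_event("s1", "2015-09-01", "INFO", "boot")
record_event("s2", "2015-09-01", "ERROR", "fan failure")

fast = count_by_key("s1", "2015-09-01", "ERROR")
slow = count_by_scan("s1", "2015-09-01", "ERROR")
print(fast, slow)
```

Both paths return the same count. The keyed read does constant work per query at the cost of extra work on every write, which suits queries known in advance, while the scan accepts any filter but reads every event, which is the trade-off the recommendations describe.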