Cassandra & Spark for IoT
Posted on 07-Jan-2017
TRANSCRIPT
Cassandra & Spark for IoT_
Matthias Niehoff
Cassandra
•Distributed database
•Highly Available
•Horizontally & Linearly Scalable
•Multi Datacenter Support
•No Single Point Of Failure
•Chooses Availability Over Strong Consistency
Cassandra for IoT_
[Diagram: four nodes on a token ring, each owning one token range: 1-25, 26-50, 51-75, 76-0]
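The ring above can be expressed in a few lines of code. This is a toy illustration of token-range partitioning using the slide's 1-100 token space; real Cassandra hashes partition keys with the Murmur3Partitioner over a 64-bit range, and the node names and hash function here are made up:

```scala
// Toy token ring: four nodes, each owning one range of the 1..100 space.
object TokenRing {
  // (node name, inclusive upper bound of its token range)
  val ranges = Seq(("Node 1", 25), ("Node 2", 50), ("Node 3", 75), ("Node 4", 100))

  // Toy hash mapping any partition key into the 1..100 token space.
  def token(key: String): Int = math.abs(key.hashCode % 100) + 1

  // The node owning the range that contains the key's token.
  def nodeFor(key: String): String =
    ranges.find { case (_, upper) => token(key) <= upper }.get._1
}
```

Because the key-to-token mapping is deterministic, every client computes the same owner for a given key, and adding nodes just splits the ranges further, which is what makes the ring linearly scalable.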
Great for Time Series Data_
CREATE TABLE sensors (
  sensorId uuid,
  time timeuuid,
  metricName text,
  metricValue double,
  PRIMARY KEY (sensorId, time)
);
Row layout: id | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 | t10 | t11 (stored sequentially on disk)
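With that layout, a range query on one sensor stays on a single partition and reads sequentially. A minimal sketch of such a query (the UUID and time bounds are made up; minTimeuuid/maxTimeuuid are standard CQL functions for converting timestamps to timeuuid bounds):

```sql
-- All metrics for one sensor within a time window, in clustering (time)
-- order; this touches a single partition.
SELECT time, metricName, metricValue
FROM sensors
WHERE sensorId = 123e4567-e89b-12d3-a456-426655440000
  AND time > minTimeuuid('2017-01-01 00:00+0000')
  AND time < maxTimeuuid('2017-01-07 00:00+0000');
```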
Spark
•Open source since 2010, Apache project since 2013
•Data processing Framework • Batch processing • Stream processing
What Is Apache Spark_
•Fast • up to 100 times faster than Hadoop • lots of in-memory processing • linearly scalable by adding nodes
•Easy • Scala, Java and Python APIs • clean code (e.g. with lambdas in Java 8) • rich API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count
•Fault-tolerant • easily reproducible
Why Use Spark_
•RDDs - Resilient Distributed Datasets • Read-only description of a collection of objects • Partitioned for distribution • Defined through transformations • Allows automatic rebuild on failure
•Operations • Transformations (map, filter, reduce, ...) -> new RDD • Actions (count, collect, save)
•Only actions start processing!
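The "only actions start processing" point can be demonstrated without Spark at all. Below is a pure-Scala sketch (names made up) using a LazyList as a stand-in for an RDD: the transformations only describe the computation, and nothing runs until a terminal operation forces it.

```scala
// Lazy "RDD" sketch: map/filter build a description, sum plays the action.
object LazyDemo {
  var evaluations = 0  // counts how many elements were actually processed

  val data = LazyList.range(1, 11)                          // the "RDD": 1..10
  val doubled = data.map { x => evaluations += 1; x * 2 }   // transformation: runs nothing
  val big = doubled.filter(_ > 10)                          // still nothing evaluated

  def run(): Int = {
    val before = evaluations  // 0: no work has been done yet
    val result = big.sum      // the "action": forces the whole pipeline
    require(before == 0 && evaluations == 10)
    result                    // 12 + 14 + 16 + 18 + 20 = 80
  }
}
```

In Spark the same lineage of transformations is what allows a lost partition to be rebuilt: the description can simply be replayed on the source data.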
Easily Reproducible?_
RDD Example_
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

scala> linesWithSpark.count()
res0: Long = 126
Spark & Cassandra
•Spark Cassandra Connector by DataStax • https://github.com/datastax/spark-cassandra-connector
• Cassandra tables as Spark RDDs (read & write)
• Mapping of C* tables and rows onto Java/Scala objects
• Server-side filtering ("where")
• Included as a Maven / SBT dependency in your application
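Taken together, those features look roughly like this in application code. A sketch only: it needs a running Spark cluster and a reachable Cassandra node, and the keyspace iot, the target table, the host and the sensor UUID are assumptions, while cassandraTable, where and saveToCassandra are the connector's documented entry points:

```scala
import java.util.UUID
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object ConnectorSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("iot-demo")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host
    val sc = new SparkContext(conf)

    val someSensor = UUID.fromString("123e4567-e89b-12d3-a456-426655440000")

    // Cassandra table as a Spark RDD, filtered server-side ("where")
    val readings = sc.cassandraTable("iot", "sensors")
      .where("sensorId = ?", someSensor)

    // map rows onto plain values and write them back to another table
    readings
      .map(row => (row.getUUID("sensorid"), row.getDouble("metricvalue")))
      .saveToCassandra("iot", "latest_values", SomeColumns("sensorid", "metricvalue"))
  }
}
```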
Connecting Spark With Cassandra_
Two Datacenter - Two Purposes_
[Diagram: DC1 - Online serves application traffic with Cassandra nodes only; DC2 - Analytics replicates the same data, with a Spark worker (WN) co-located on each Cassandra node plus a separate Spark Master]
Spark Streaming
• Real Time Processing using micro batches
• Supported sources: Files, TCP, MQTT, Kafka, Twitter,..
• Data as Discretized Stream (DStream)
• Same programming model as for batches
• All operations of Spark Core, SQL and MLlib
• Stateful Operations & Sliding Windows
Stream Processing With Spark Streaming_
val ssc = new StreamingContext(sc, Milliseconds(500))
val lines = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "foo", StorageLevel.MEMORY_ONLY_SER_2)

val data = lines.map(input => input.toLowerCase)
data.foreachRDD(_.saveToCassandra("mqtt", "sensors"))

ssc.start()
// await manual termination or error
ssc.awaitTermination()
// manual termination
ssc.stop()
Spark Streaming - MQTT Example_
Use Cases
•Spark Streaming • Continuous data streams • MQTT, Kafka, ZeroMQ... • Easily reliable
• Spark Core • Existing data • SQL Databases, CSV, Json...
• Use the same programming model or even the same code!
Use Cases for Spark and Cassandra in IoT_
Ingestion
• Real-Time Analysis • React on events • Join with existing data • Apply events on ML models
• Batch Analysis • Scheduled jobs • Analytics on the data • Train ML models
Use Cases for Spark and Cassandra in IoT_
Analyses
Demo
Questions?
Matthias Niehoff, IT-Consultant
codecentric AG Zeppelinstraße 2 76185 Karlsruhe, Germany
mobile: +49 (0) 172.1702676 matthias.niehoff@codecentric.de
www.codecentric.de blog.codecentric.de
matthiasniehoff